A Deep Learning Approach for the Romanized Tunisian Dialect Identification

Jihene Younes¹, Hadhemi Achour¹, Emna Souissi², and Ahmed Ferchichi¹

¹ Université de Tunis, ISGT, Tunisia

² Université de Tunis, ENSIT, Tunisia

Abstract: Language identification is an important task in natural language processing that consists of determining the language of a given text. It has increasingly piqued the interest of researchers over the past few years, especially for informal, code-switched textual content. In this paper, we focus on identifying the Romanized, user-generated Tunisian dialect on the social web. We segment and annotate a corpus extracted from social media and propose a deep learning approach for the identification task. We use a Bidirectional Long Short-Term Memory neural network with Conditional Random Field decoding (BLSTM-CRF). For word embeddings, we combine word-character BLSTM vector representations with FastText embeddings, which take character n-gram features into account. The overall accuracy obtained is 98.65%.

Keywords: Tunisian dialect, language identification, deep learning, BLSTM, CRF, natural language processing.
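
The character n-gram features mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It only shows the general FastText idea of wrapping a word in boundary markers and sliding a window over it. The function name `char_ngrams`, the n-gram range, and the example word are assumptions chosen for illustration. Real FastText additionally keeps the whole word as its own token and hashes the n-grams into buckets, which this sketch omits.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return FastText-style character n-grams of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    (Hypothetical helper: name and defaults are illustrative only.)
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# Illustrative Romanized token; any string works the same way.
print(char_ngrams("barcha", 3, 3))
```

Because spelling in Romanized dialect text is not standardized, variants of the same word tend to share many of these sub-word units, which is one plausible reason sub-word embeddings are useful for this kind of input.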

Received August 25, 2019; accepted April 28, 2020

https://doi.org/10.34028/iajit/17/6/12