MASHHOOR: A WEB-BASED FAMILIARITY DICTIONARY FOR ARABIC

Authors: Attia Youseif, Khaled Elghamry

Keywords: computational linguistics, corpus linguistics, lexicography, web as corpus, Arabic

Abstract: Word familiarity is needed in language learning, conversational agents, translation and content localization, word perception and readability studies (Seraye 2016, Al-Khalifa et al. 2010). Existing resources for Arabic are either subjective (Hasmam et al. 2016) or based on simple frequency dictionaries that do not take into account the distribution of words across different types of corpora and different Arabic-speaking countries (Buckwalter and Parkinson 2014). This paper presents Mashhoor (an Arabic word for ‘well-known’), a dictionary that provides words with their corpus-based familiarity scores (à la Nusbaum et al. 1984). Our suggested familiarity score is a function of different aspects of word frequency and regional distribution in a large corpus, and of the changes in these frequencies over time. This paper is part of a larger ongoing project, “Thamaraat” (an Arabic word for ‘fruits’), concerned with developing automated methods and tools for constructing lexical resources that reduce the time and effort spent on syllabus design and meet the everyday needs of both learners and teachers of Arabic. Our suggested methods and tools can be readily used to build similar resources for other languages. This paper uses a corpus crawled from 623 Arabic-language websites, selected so that the corpus covers all Arab countries. Preprocessing of the corpus was limited to removing encodings and punctuation. The corpus contains 28.5 million documents and about 6.9 billion tokens. It was used to compute a familiarity score for each word form, using a function based on the following values: [a] the overall frequency of the word in the corpus, [b] its frequency in the part of the corpus from a given country, weighted proportionally to the population size of that country, and [c] its frequency over time.
To measure the different weights for words, we used TF-IDF (term frequency-inverse document frequency), a statistic used in information retrieval that is intended to reflect how important a word is to a document in a corpus. The output of applying our equation to the corpus is a dictionary with two main familiarity scores for each word: the first indicates the familiarity of the word within a given Arab country, and the second reflects its familiarity among Arabic speakers in general. For example, our familiarity scores reveal that the word “صاحب” (fs: 0.97) is more familiar than the word “صديق” (fs: 0.86), both of which are Arabic synonyms for “friend”. One possible application of this dictionary is to help familiarize learners of Arabic with words that are common in different cultural contexts as well as in different regions of the Arab world. It will also help teachers design syllabi based on a clear measure of word commonality and familiarity, making it easier to align lexical difficulty with the appropriate proficiency level. Our plan is to extend our corpus and familiarity computation method to include data from Facebook and Twitter from the last 10 years, on the assumption that the frequency of words in these channels is a good indicator of their overall familiarity.
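The abstract does not give the scoring equation itself, so the following Python sketch is only an illustration of the ingredients it names: a textbook TF-IDF weight, a per-country score, and a population-weighted overall score. The function names, the log-scaled normalization to the range 0 to 1, and the toy counts and populations are all assumptions made for this example, not the authors' method; the time component [c] is left out.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Textbook TF-IDF for a list of token lists.

    tf = count / document length; idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def country_scores(word, counts_by_country):
    """Hypothetical per-country score in [0, 1].

    Assumption: log-scaled count of the word relative to the log-scaled
    count of the most frequent word in that country's sub-corpus.
    """
    scores = {}
    for country, counts in counts_by_country.items():
        top = max(counts.values())
        scores[country] = math.log1p(counts.get(word, 0)) / math.log1p(top)
    return scores


def overall_score(word, counts_by_country, population):
    """Hypothetical overall score: population-weighted mean of country scores."""
    per_country = country_scores(word, counts_by_country)
    total_pop = sum(population[c] for c in per_country)
    return sum(per_country[c] * population[c] for c in per_country) / total_pop


# Toy data, invented purely for illustration.
counts_by_country = {
    "country_a": Counter({"word_a": 90, "word_b": 10}),
    "country_b": Counter({"word_a": 20, "word_b": 80}),
}
population = {"country_a": 3.0, "country_b": 1.0}

for w in ("word_a", "word_b"):
    print(w, country_scores(w, counts_by_country),
          round(overall_score(w, counts_by_country, population), 3))
```

In this toy setup, a word that dominates the more populous country ends up with the higher overall score even though it is less frequent in the smaller one, which is the kind of effect population weighting is meant to capture.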