Frequency of Short Vowels in Arabic Words

Short vowels in Arabic words are normally not written. Learners of Arabic cannot tell whether an omitted vowel is a, u, or i unless they already know the word, or are reading a book, such as a textbook, in which the words are marked with short vowel diacritics (ḥarakāt) or spelled in the Latin alphabet. If a word has been learned but is not firmly committed to memory, or was learned through Latin transliteration (romanization), the learner will have some difficulty recognizing it in Arabic script, mainly because of the unwritten short vowels. One strategy for "restoring" these vowels is to blindly guess a first; if the resulting sound doesn't ring a bell, guess i, and if that fails too, guess u. What follows is empirical evidence that the short vowels do occur in the frequency order a > i > u, so that guessing a first, as commonly practiced, is justified.

1. Counting short vowels in Latin transliteration

The easiest way to do this frequency counting is to use a list of words written in Arabic script along with Latin script.note1 The Wiktionary Arabic Frequency List from Quran 1-1000 does exactly that. We save the web page to a text file, then use a text editor such as vi to remove everything except the Latin transliteration, i.e., on each line, the string after the opening parenthesis but before the comma (for example, change the line "1. مِن‎ (min, preposition), 3226" to only "min"). Next, split each word so that each character is on its own line (for example, "min" becomes m, i, and n, each on one line). Save the file and name it, say, ArabicWordsSplitIntoLetters.txt. Count the lines that contain a, i, or u (and, as a check, e or o) with the grep command:note2

D:\temp>grep -c a ArabicWordsSplitIntoLetters.txt
D:\temp>grep -c i ArabicWordsSplitIntoLetters.txt
D:\temp>grep -c u ArabicWordsSplitIntoLetters.txt
D:\temp>grep -c e ArabicWordsSplitIntoLetters.txt
D:\temp>grep -c o ArabicWordsSplitIntoLetters.txt

The counts show that the short vowel a indeed occurs much more frequently than i and u: 828, 270, and 169 times, or 65%, 21%, and 13%, respectively, in the 1000 most frequent words in the Holy Quran. Incidentally, e and o do not appear in the transliteration of any word (although phonetically some words are pronounced with sounds like [e] or [o] or their allophones). Other word lists will certainly give slightly different counts, but the frequency trend, a > i > u, is unlikely to change. This confirms the advantage of guessing a first as the short vowel, since it occurs more often than i and u combined.
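The same count can be done without splitting the words into one letter per line. The following is only a minimal Python sketch, not the procedure used above: the function names and the sample words other than "min" are made up for illustration. It assumes, as note1 requires, that long vowels are written with non-ASCII letters such as ā, so that a plain a, i, or u is always a short vowel.

```python
def count_short_vowels(words):
    """Count plain a, i, u in transliterated words (long vowels like ā are skipped)."""
    counts = {"a": 0, "i": 0, "u": 0}
    for word in words:
        for ch in word:
            if ch in counts:
                counts[ch] += 1
    return counts


def percentages(counts):
    """Turn raw counts into whole-number percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {k: 0 for k in counts}
    return {k: round(100 * v / total) for k, v in counts.items()}


# Hypothetical sample; only "min" is taken from the word list quoted above.
sample = ["min", "kataba", "kutub", "qāla"]
print(count_short_vowels(sample))

# The counts reported in this section for the 1000-word list.
print(percentages({"a": 828, "i": 270, "u": 169}))  # {'a': 65, 'i': 21, 'u': 13}
```

To run it on the real list, one transliterated word per line in a file would be enough; the file would be read and split into words before calling count_short_vowels.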

Note that this method does not take into account the zero-vowel (sukūn), which is omitted in Latin transliteration. A more sophisticated method is needed to count zero-vowels.

2. Counting short vowels in Unicode dump

We again begin by saving "Arabic Frequency List from Quran 1-1000" to a text file. This time we use an editor to remove everything except the Arabic words (for example, change the line "1. مِن‎ (min, preposition), 3226" to only "مِن‎"), and save the file as, say, ArabicWords.txt, in Unicode encoding, which is an option in Windows Notepad; do not save it in any other encoding. (Notepad's "Unicode" option writes UTF-16, so each character occupies one 16-bit field in the dump.) We do a hexadecimal dump of the file with the od program,note2 and save all the dumped fields except the line headers to, say, ArabicWordsUnicodeLetters.txt. We then use an editor such as gvim to replace all spaces with line breaks and save the result to, say, ArabicWordsUnicodeLettersEachOnOneLine.txt. Finally we count the occurrences of the Arabic fatha (Unicode 064e), which appends a short vowel a to the preceding consonant; of kasra (Unicode 0650), which gives a short vowel i; of damma (Unicode 064f), which gives a short vowel u; and of sukun (Unicode 0652), which marks a zero-vowel. The following are the commands to run, with my comments after the <- arrows.

D:\temp>od -x ArabicWords.txt | gawk "{print $2,$3,$4,$5,$6,$7,$8,$9}" > ArabicWordsUnicodeLetters.txt
D:\temp>gvim ArabicWordsUnicodeLetters.txt <- save as ArabicWordsUnicodeLettersEachOnOneLine.txt
D:\temp>grep -c 064e ArabicWordsUnicodeLettersEachOnOneLine.txt <- fatha for 'a' sound
D:\temp>grep -c 0650 ArabicWordsUnicodeLettersEachOnOneLine.txt <- kasra for 'i' sound
D:\temp>grep -c 064f ArabicWordsUnicodeLettersEachOnOneLine.txt <- damma for 'u' sound
D:\temp>grep -c 0652 ArabicWordsUnicodeLettersEachOnOneLine.txt <- sukun for zero-vowel
D:\temp>grep -c 0651 ArabicWordsUnicodeLettersEachOnOneLine.txt <- shadda for double consonant

The last command above also shows how many times shadda occurs in the document. In fact, you can count any character by its Unicode code point. In any case, we can see that the short vowel a still occurs more often than i and u combined; the relative frequencies of a, i, and u are 62%, 25%, and 13%. With this method, we can also count the occurrences of sukun, or zero-vowel, which is more frequent than u but less frequent than i. If we include sukun, an unwritten short vowel (or no vowel at all) is distributed among a, i, u, and sukun as 52%, 21%, 11%, and 16%. This means that in guessing a short vowel, while it still makes sense to guess a first and i second, it may be better to leave the consonant "bare" next, without a u sound, to see if the constructed word is right, before you try u.
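The hexadecimal dump can also be bypassed by counting the diacritics directly as characters. The following is a minimal Python sketch under stated assumptions, not the od/gawk/grep procedure above: the names DIACRITICS, count_diacritics, and shares are invented for this example, and the sample string is made up apart from the word مِن quoted earlier.

```python
# Code points of the diacritics counted in this section.
DIACRITICS = {
    "fatha": "\u064e",   # short vowel a
    "kasra": "\u0650",   # short vowel i
    "damma": "\u064f",   # short vowel u
    "sukun": "\u0652",   # zero-vowel
    "shadda": "\u0651",  # doubled consonant
}


def count_diacritics(text):
    """Count how many times each diacritic occurs in the text."""
    return {name: text.count(mark) for name, mark in DIACRITICS.items()}


def shares(counts, names):
    """Whole-number percentage of each named diacritic among the named ones."""
    total = sum(counts[n] for n in names)
    if total == 0:
        return {n: 0 for n in names}
    return {n: round(100 * counts[n] / total) for n in names}


# Hypothetical sample: three short vocalized words written with escapes
# (the first one is the word "min" from the list; the others are made up).
sample = "\u0645\u0650\u0646 \u0631\u064e\u0628\u0651 \u0642\u064f\u0644\u0652"
counts = count_diacritics(sample)
print(counts)
print(shares(counts, ["fatha", "kasra", "damma", "sukun"]))
```

To try this on the real word list, the text could be read with open("ArabicWords.txt", encoding="utf-16").read(), assuming the file was saved with Notepad's Unicode option as described above. Passing only fatha, kasra, and damma to shares gives the three-way split, and adding sukun gives the four-way split discussed in this section.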

note1 For our method to work, the Latin transliteration must not use ASCII characters (with codes less than 128) to represent long vowels; otherwise the letters of a long vowel would be counted as short vowels. For example, aleph ا can be transliterated as ā but not as aa, which is what some books do, such as J. Smart et al., Teach Yourself Gulf Arabic, and K. Brustad et al., Alif Baa.
note2 The grep, od, and gawk programs are not natively available on Windows but you can download and install them, or do this work on Linux. If you only want to count the vowels in Latin transliteration, you can use the Windows find command instead of grep, e.g. find /c "a" ArabicWordsSplitIntoLetters.txt.

