Word Usage Frequency

In learning a foreign language, one has to devote a large amount of time to memorizing vocabulary. There are many lists of the most frequently used words in a given language on the Internet. Almost all of them list the words in alphabetical order, optionally grouped into categories (nouns vs. verbs, words for school vs. the workplace, etc.). Although that sort order makes lookup easy, a learner should ideally study the words in descending order of real-life usage. Sorting the words by usage frequency therefore has pedagogical significance for students, teachers, and textbook authors.

The following is based on my earlier work that created Chinese Character Usage Frequency. Please refer to that article for additional details. In this article, I apply the same approach to various human languages. First, I find a list of commonly used words (and occasionally phrases) on the Internet. Each word is submitted to a well-known search engine such as Yahoo, and its approximate hit count is recorded. After all words have been searched and their hit counts obtained, the words are sorted by hit count. This is a new method of generating word usage frequency for a human language. It's simple, fast, and easy to apply to any language. However, the method has obvious shortcomings, which are discussed below taking Spanish as an example. Also refer to my earlier work for peer review and critique.
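The search-then-sort idea described above can be sketched in a few lines of Python. This is only an illustration, not the script used for this article: the function name rank_by_hit_count and the lookup callable get_hit_count are hypothetical, and the counts in FAKE_COUNTS are made up so that the example runs without contacting any search engine.

```python
from typing import Callable, Iterable


def rank_by_hit_count(
    words: Iterable[str],
    get_hit_count: Callable[[str], int],
) -> list[tuple[str, int]]:
    """Look up each word once and return (word, hits) pairs, most frequent first."""
    counts = [(word, get_hit_count(word)) for word in words]
    # sorted() is stable, so words with equal counts keep their input order.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up counts for illustration only; these are not real search results.
FAKE_COUNTS = {"de": 900, "casa": 120, "absoluto": 45}

ranking = rank_by_hit_count(["casa", "absoluto", "de"], FAKE_COUNTS.__getitem__)
for word, hits in ranking:
    print(f"{word}\t{hits}")
```

In real use, get_hit_count would be replaced by whatever code retrieves the approximate hit count from the chosen search engine.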

The Spanish words are taken from Quizlet's 1000 Top Used Spanish Words. Inflected word endings are omitted because spelling out the inflected forms in full would require human intervention; e.g., only "absoluto" and "acción" are taken from "absoluto/a" and "acción\ciones", respectively. Dropping the inflected spellings is one source of negative error in the final result, because hits for the omitted forms are not counted. Another minor problem is that some search engines cap the hit count of very common words; for instance, Yahoo caps it at 2147483647 and Baidu caps it at 100000000. Fortunately, these errors affect only a small number of words, and the word frequency thus compiled generally reflects usage on the Internet. It should be noted that using the Internet as the text corpus in this study has disadvantages as well as advantages compared with other types of corpus. This is discussed in the peer review section of my earlier article.
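As a rough illustration of the two points above, the following Python sketch keeps only the part of a list entry before the first "/" or "\" separator and flags hit counts that have reached an engine's cap. The helper names base_form and is_capped are hypothetical, and the assumption that every inflected ending follows one of those two separators is mine; the cap values are the ones quoted in the text.

```python
import re

YAHOO_CAP = 2147483647  # cap quoted in the text for Yahoo
BAIDU_CAP = 100000000   # cap quoted in the text for Baidu


def base_form(entry: str) -> str:
    """Keep the part of a list entry before the first '/' or '\\' separator."""
    return re.split(r"[/\\]", entry, maxsplit=1)[0].strip()


def is_capped(hit_count: int, cap: int) -> bool:
    """True when a count has reached the cap, so the real value is unknown."""
    return hit_count >= cap


print(base_form("absoluto/a"))
print(base_form("acción\\ciones"))
print(is_capped(2147483647, YAHOO_CAP))
```

Words flagged by is_capped cannot be ranked reliably against each other, since their true counts are hidden behind the same limit.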

Result

Frequency List                       Word Source                                       Word Count
Spanish (Yahoo), Spanish (Google)    Quizlet                                           1000
French (Yahoo)                       Wiktionary:French frequency lists (first 1000)    1000

Appendix

Detailed procedure to generate the word usage frequency list

1. Find a list of the words in the target language, e.g. Spanish, from any source.
2. Save the words in a text file, one word (or phrase) per line.
3. Globally replace spaces with plus (+) signs.
4. Download yahoo_word.php and modify WORDFILE in it as needed.
5. Download and install PHP (actually only php.exe and php5ts.dll are needed on Windows).
6. Install UnxUtils to get its sort.exe and put it at, say, d:\systools.
7. Go to the command console and run

Search on Yahoo:
php -n yahoo_word.php > result.txt
Sort by hit count (note that the character between the quotes after -t is a tab):
d:\systools\sort -t "	" -nr -k 2 result.txt > SpanishWordFrequencyY.txt
Final cosmetic editing of SpanishWordFrequencyY.txt, such as changing the plus signs back to spaces, can be done in Notepad. If you search on Google instead, use google_word.php. Warning: For obvious reasons, it's better not to run this program in a corporate environment.
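For readers without UnxUtils, the sort step and the plus-to-space cleanup can be approximated in Python. This is a minimal sketch, assuming each line of result.txt has the form word, tab, hit count (which is what the sort options above imply); the function name sort_result_lines is hypothetical, the sample lines are made up, and ties may come out in a different order than with sort.exe.

```python
def sort_result_lines(lines):
    """Sort 'word<TAB>count' lines by count, highest first, and restore spaces."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, count = line.rsplit("\t", 1)
        rows.append((word.replace("+", " "), int(count)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [f"{word}\t{count}" for word, count in rows]


# Made-up sample lines; a real run would read them from result.txt.
sample = ["casa\t120\n", "por+favor\t300\n", "absoluto\t45\n"]
for line in sort_result_lines(sample):
    print(line)
```

To process a real file, pass an open file object instead of the sample list, e.g. sort_result_lines(open("result.txt", encoding="utf-8")), and write the returned lines to SpanishWordFrequencyY.txt.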

Links

Wiktionary:Frequency lists/Spanish1000: Excellent Spanish word frequency list generated from subtitles of movies and television series
Chinese Character Usage Frequency
Contact me
