Word Usage Frequency
Learning a foreign language requires devoting a large amount of time to memorizing vocabulary. There are many lists of the most frequently used words in a given language on the Internet. Almost all of them list the words in alphabetical order, optionally grouped into categories (nouns vs. verbs, words for school vs. the workplace, etc.). Although that sort order makes lookup easy, a learner should ideally learn the words in descending order of real-life usage. Sorting the words by usage frequency therefore has pedagogical value for students, teachers, and textbook authors.
The following is based on my earlier work, Chinese Character Usage Frequency. Please refer to that article for additional details. In this article, I apply the same approach to various other languages. First, I find a list of commonly used words (and occasionally phrases) on the Internet. Each word is submitted to a well-known search engine such as Yahoo, and its approximate hit count is recorded. After all words have been searched and their hit counts obtained, the words are sorted by hit count. This is a new method of generating word usage frequencies for human languages. It is simple, fast, and easy to apply to any language. However, the method has obvious shortcomings, which are discussed below using Spanish as an example. See also my earlier work for peer review and critique.
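The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `lookup` is a hypothetical stand-in for whatever returns a search engine's approximate hit count (no real search API is called here), and the counts in the usage lines are made up purely to show the ordering step.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_hit_count(
    words: Iterable[str], lookup: Callable[[str], int]
) -> List[Tuple[str, int]]:
    """Return (word, hit_count) pairs sorted by descending hit count.

    `lookup` is a hypothetical callable that returns the approximate hit
    count for one word; how it talks to a search engine is left open.
    """
    counted = [(word, lookup(word)) for word in words]
    # Python's sort is stable, so ties keep the order of the input list.
    return sorted(counted, key=lambda pair: pair[1], reverse=True)


# Made-up counts, only to demonstrate the ordering step.
fake_counts = {"absoluto": 120, "acción": 450, "de": 9000}
ranking = rank_by_hit_count(fake_counts, fake_counts.get)
```

Passing the lookup in as a parameter keeps the ranking logic independent of any particular search engine, which matches the idea of repeating the same procedure on Yahoo, Google, or another engine.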
The Spanish words are taken from Quizlet's 1000 Top Used Spanish Words. Inflected endings are omitted because spelling out every form would require human intervention; e.g., only "absoluto" and "acción" are taken from "absoluto/a" and "acción\ciones", respectively. Dropping the inflected forms is one source of negative error in the final result, since hits for those forms are not counted. Another minor problem is that some search engines cap the hit count of very common words; for instance, Yahoo caps it at 2147483647 (the largest signed 32-bit integer) and Baidu at 100000000. Fortunately, these errors affect only a small number of words, and the word frequency list thus compiled generally reflects usage on the Internet. It should be noted that using the Internet as the text corpus has disadvantages as well as advantages compared to other types of corpus. This is discussed in the peer review section of my earlier article.
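The two sources of error mentioned above can be made explicit in code. The sketch below is an assumption-laden illustration, not the script actually used: `base_form` and `is_capped` are hypothetical helper names, the separator handling covers only the two entry styles quoted above, and the cap values are the ones stated in the text.

```python
import re

# Caps quoted in the text; a count at the cap is only a lower bound.
HIT_COUNT_CAPS = {"yahoo": 2147483647, "baidu": 100000000}


def base_form(entry: str) -> str:
    """Keep only the part of a list entry before the first slash or backslash."""
    return re.split(r"[/\\]", entry, maxsplit=1)[0].strip()


def is_capped(hit_count: int, engine: str) -> bool:
    """True when the count has reached the ceiling known for that engine.

    Engines without a known cap in HIT_COUNT_CAPS are never reported as capped.
    """
    cap = HIT_COUNT_CAPS.get(engine)
    return cap is not None and hit_count >= cap
```

Flagging capped words rather than dropping them keeps them in the list while making clear that their relative order among themselves is not meaningful.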
| Frequency List | Word Source | Word Count |
|---|---|---|
| Spanish (Yahoo), Spanish (Google) | Quizlet | 1000 |
| French (Yahoo) | Wiktionary:French frequency lists (first 1000) | 1000 |
Detailed procedure to generate the word usage frequency list
1. Find a list of words in the target language, e.g. Spanish, from any source.
2. Save the words in a text file, one word (or phrase) per line.
3. Globally replace spaces with plus (+) signs.
4. Download yahoo_word.php and modify WORDFILE in it as needed.
5. Download and install PHP (on Windows, only php.exe and php5ts.dll are actually needed).
6. Install UnxUtils to get its sort.exe and put it in, say, d:\systools.
7. Open a command console and run:
    php -n yahoo_word.php > result.txt
    d:\systools\sort -t " " -nr -k 2 result.txt > SpanishWordFrequencyY.txt

The first command runs the search on Yahoo. In the second command, note that the character between the quotes after -t is a tab, not a space.

Final cosmetic editing of SpanishWordFrequencyY.txt, such as changing the plus signs back to spaces, can be done in Notepad. If you search on Google instead, use google_word.php.

Warning: For obvious reasons, it is better not to run this program in a corporate environment.
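For readers without UnxUtils, the sorting and the cosmetic clean-up could also be done in Python. This sketch assumes, based on the sort options in the procedure (tab separator, numeric key in the second field), that each line of result.txt is a word, a tab, and a hit count; the actual output format of yahoo_word.php is not shown here, so treat that layout as an assumption. The function names are illustrative, and the sample counts in the usage note are made up.

```python
def to_query_term(phrase):
    """Prepare a word or phrase for the word file: spaces become '+'."""
    return phrase.strip().replace(" ", "+")


def sort_results(lines):
    """Sort 'word<TAB>count' lines by count, descending, and undo the '+'.

    Lines without a tab or without a numeric count are skipped.
    """
    rows = []
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep or not count.strip().isdigit():
            continue
        rows.append((word.replace("+", " "), int(count)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [f"{word}\t{count}" for word, count in rows]


def rewrite_file(src_path, dst_path):
    """Read a result file, sort it, and write the cleaned frequency list."""
    with open(src_path, encoding="utf-8") as src:
        ranked = sort_results(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(ranked) + "\n")
```

With the file names used in the procedure, the call would be `rewrite_file("result.txt", "SpanishWordFrequencyY.txt")`. For example, `sort_results(["por+favor\t300\n", "de\t9000\n"])` puts the "de" line first and restores the space in "por favor".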
Links
Chinese Character Usage Frequency