Word Frequency List

Both of my books, Learning Spanish Words Through Etymology and Mnemonics and Learning French Words Through Etymology and Mnemonics, order their headwords by word usage frequency. While the frequency of word occurrences in a corpus (a collection of written texts) is a simple concept, it raises some interesting issues. For example: (1) Should the various conjugated forms of verbs and declined forms of nouns be included, or only the lemmas (canonical forms, i.e. infinitives of verbs, singular nominative masculine forms of nouns, etc.)? (2) What are the advantages and disadvantages of a corpus drawn mostly from books vs. one drawn from movie subtitles? (3) How is the frequency found?

Learning Spanish Words follows the frequency list of the RAE, the Real Academia Española (Royal Spanish Academy). There's no doubt about the authority of this prestigious institution. Unfortunately, their frequency list counts every conjugated form and plural as a separate entry, so my book suffers from ambiguity in frequency ranking after lemmatization (finding the lemma from an inflected word form). I only slowly came to realize this difficulty as I went on writing the book. For example, I'm pretty sure queda (a conjugated form of quedar "to remain", and also a noun meaning "curfew") should not have been included as a separate headword for the meaning "curfew" and given a high frequency in my book; its high position in the list presumably comes mostly from the verb form. It was a mistake, and my planned revision of the book will either delete it or move it to a much later page with a much lower frequency.
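To make the lemmatization problem concrete, here is a minimal Python sketch of how counts for inflected forms could be folded into their lemmas before ranking. The counts and the form-to-lemma mapping below are invented for illustration and are not taken from the RAE list; a real mapping would come from a lemmatizer or a dictionary.

```python
from collections import defaultdict

def lemma_frequencies(form_counts, form_to_lemma):
    """Sum the counts of inflected forms under their lemma and rank the lemmas.

    form_counts:   {word form: occurrence count}
    form_to_lemma: {word form: lemma}; forms not listed are treated as lemmas.
    """
    totals = defaultdict(int)
    for form, count in form_counts.items():
        lemma = form_to_lemma.get(form, form)
        totals[lemma] += count
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical counts, for illustration only.
form_counts = {"queda": 120, "quedan": 60, "quedar": 40, "casa": 150, "casas": 50}
# Every "queda" is assigned to the verb here. A form-level list cannot tell
# how many of those occurrences were really the noun meaning "curfew".
form_to_lemma = {"queda": "quedar", "quedan": "quedar", "casas": "casa"}

ranked = lemma_frequencies(form_counts, form_to_lemma)
print(ranked)
```

The sketch also shows where the ambiguity enters: a form such as queda has to be assigned to one lemma or split between two, and a list of raw word forms gives no information on how to split it.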

My Learning French Words suffers from a problem related to question (2). It uses the Lexique frequency list and follows its freqlemlivres order, i.e. la fréquence du lemme selon le corpus de livres (lemma frequency according to the corpus of books). A few months after I started on this book project, I posted a question to their web forum, asking why some words appear in frequency positions quite different from what common sense dictates. The forum moderator, who is probably also one of the owners of Lexique, told me freqlemlivres is not as good as freqlemfilms (lemma frequency according to movie subtitles), which he recommends. (The web forum has been decommissioned and old messages are gone, even from archive.org, so I can't reference his words.) If I could start over, I might re-order according to freqlemfilms. After all, books mostly record the written form of a language. To capture both written and spoken language, movie subtitles serve as a better source.
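One way to see how much the choice of column matters is to rank the same lemmas by each column and look at the words whose positions move the most. The sketch below assumes the data has already been loaded as a list of dictionaries with the keys lemme, freqlemlivres and freqlemfilms; the key lemme and all the numbers are assumptions made up for illustration, not values from Lexique.

```python
def rank_by(rows, column):
    """Return {lemma: rank}, where rank 1 is the highest value in `column`."""
    ordered = sorted(rows, key=lambda r: r[column], reverse=True)
    return {r["lemme"]: i for i, r in enumerate(ordered, start=1)}

def rank_shifts(rows, col_a="freqlemlivres", col_b="freqlemfilms"):
    """List (lemma, rank in col_a, rank in col_b), largest rank change first."""
    a = rank_by(rows, col_a)
    b = rank_by(rows, col_b)
    return sorted(
        ((lemma, a[lemma], b[lemma]) for lemma in a),
        key=lambda t: -abs(t[1] - t[2]),
    )

# Invented frequencies, for illustration only.
rows = [
    {"lemme": "bonjour", "freqlemlivres": 50.0, "freqlemfilms": 900.0},
    {"lemme": "âme", "freqlemlivres": 300.0, "freqlemfilms": 80.0},
    {"lemme": "maison", "freqlemlivres": 400.0, "freqlemfilms": 500.0},
]

shifts = rank_shifts(rows)
for lemma, book_rank, film_rank in shifts:
    print(lemma, book_rank, film_rank)
```

With the real list, the words at the top of such a comparison would be the ones worth checking by hand before deciding which order to follow.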

Now let's consider question (3). The traditional way to get word frequency is to collect a large number of books, magazines, newspapers, and movie and theater scripts, count the occurrences of each word, and sort the words by count. Well-known lists that fall into this category are the Wiktionary lists and the various frequency dictionaries on Amazon. But nowadays there are other ways. For example, many years ago I did something probably no one else in the world had attempted or will attempt: submit each word to Google or another search website such as Yahoo or Baidu, record the approximate hit count given by the website, and sort the words by those counts. (See Word Usage Frequency and Chinese Character Usage Frequency.) The frequency values are implicitly given by the search sites; my script simply collects them. The result reflects word frequency on the Internet, or rather, on the portion of the Internet indexed by the search engine.
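Both approaches come down to attaching a number to each word and sorting. The following Python sketch shows the idea under simple assumptions: words are runs of letters, case is ignored, and the hit counts are supplied by a caller-provided function. The function fake_hits is a hypothetical stand-in with made-up numbers; it is not the script mentioned above and does not query any search engine.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count word occurrences across a corpus given as an iterable of strings."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts.most_common()  # (word, count) pairs, highest count first

def rank_by_hits(words, hit_count):
    """Sort words by an externally supplied count, highest first.

    hit_count is any function mapping a word to a number, for example a
    lookup of approximate hit counts previously recorded from a search site.
    """
    return sorted(words, key=hit_count, reverse=True)

# A tiny corpus, for illustration only.
corpus = ["La casa es grande.", "La casa azul."]
freqs = word_frequencies(corpus)
print(freqs)

# Hypothetical recorded hit counts; the numbers are made up.
recorded = {"casa": 5000, "azul": 1200, "grande": 3100}

def fake_hits(word):
    return recorded.get(word, 0)

print(rank_by_hits(["azul", "grande", "casa"], fake_hits))
```

The second function makes the difference between the two methods visible: the sorting step is the same, and only the source of the numbers changes.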

And yet there is one more way to create a frequency list. Linguee is "an online bilingual concordance", says Wikipedia. But hidden in their web pages is a frequency list for various languages. You just have to go to a URL in this format, www.linguee.com/language-english/toplanguage/start#-end#.html, where language is the name of the language and start# and end# give the frequency range of the words to show. For example, https://www.linguee.com/spanish-english/topspanish/1-200.html shows the 200 most frequent Spanish words (start# 1, end# 200). Clicking a word gives you the dictionary entry for that word, unless the range spans more than 1000. What's new about this list? The webpage says "Most common Spanish queries, 1 to 200" (my bold text). It means these words are the ones most searched for by Linguee website visitors, not the ones occurring most frequently in a large corpus. This is a great innovation or enhancement over the traditional occurrence-based frequency lists, because language learners do not spend the same amount of effort on every word in a group with about the same occurrence-based frequency; some of these words are definitely harder to understand and use than others. A list based on search frequency adjusts the traditional frequency according to such varying difficulty.
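The URL format is regular enough to generate. Here is a small Python sketch that only builds the URLs and does not download anything. The function names and the default step of 200 are my own choices, and it assumes every language follows the same pattern as the Spanish example.

```python
def linguee_top_url(language, start, end):
    """Build a list URL in the format described above, e.g. for "spanish"."""
    return f"https://www.linguee.com/{language}-english/top{language}/{start}-{end}.html"

def linguee_top_urls(language, total, step=200):
    """Yield URLs covering ranks 1..total in consecutive blocks of `step` words."""
    for start in range(1, total + 1, step):
        end = min(start + step - 1, total)
        yield linguee_top_url(language, start, end)

first_page = linguee_top_url("spanish", 1, 200)
print(first_page)

for url in linguee_top_urls("spanish", 600):
    print(url)
```

Keeping each block at 1000 words or fewer matches the remark above that the words stop linking to their dictionary entries when the range is larger.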

I'm considering revising or rewriting my Learning Spanish Words book. I may opt for the Linguee frequency list because of its practical advantages. As a bonus, the list contains frequently used short phrases and abbreviations, and almost all words in it are clean lemmas.

2020-06
(reposted from Goodreads blog)
