Chinese Character Usage Frequency

Chinese is considered one of the most difficult living languages to learn, primarily because there is little to no association between written characters and their pronunciation. To reach reading literacy as quickly as possible, a student should learn the most commonly used characters first. Fortunately, lists of commonly used characters are available on the Internet, such as 现代汉语常用字表 (List of Commonly Used Characters in Modern Chinese; also available here), which contains the 2500 most frequently used characters. The problem with this official list is that its characters are sorted by stroke count (笔画). Ideally, the student would see the list sorted by real-life usage frequency (使用频率).

Earlier attempts to create such a usage frequency list include the 频度表 (frequency table) dated 1985 or 1987, cited in 关于发布《现代汉语常用字表》的联合通知 (the joint notice announcing the List of Commonly Used Characters in Modern Chinese), and 部件构字数降序排列表 (a table of components in descending order of the number of characters they form). Before the Internet, linguists resorted to laborious literature searches and manual compilation to generate a character or word usage frequency list. The Internet is to this work what the calculator is to accounting: it not only makes the task much faster but also turns it into fun. More recent efforts include Mats' work (description), Dylan Sung's Frequency of Characters (description) and Jun Da's Frequency statistics. Although all of these rely on computer technology, and some crawl the Internet, none directly uses the results of an existing Internet search engine. The work described in this article takes a different approach, using a popular Web search engine to achieve the same goal and thus bringing this previously daunting task within reach of a hobbyist at home. The computer program used to generate the character usage table can be re-run at any time. Full source code is available.

The method I use is to submit each Chinese character to a search engine and record the result count. For instance, as of this writing, searching for "一" on Google returns "about 37,900,000" results. This gives an approximate count for each character, and the characters are then sorted by count in descending order. The result is 常用汉字使用频率表 (Google), a usage frequency table of common Chinese characters. The detailed procedure is in the Appendix.
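The counting-and-sorting step can be sketched as follows. This is a minimal Python illustration, not the author's google.php: the function names parse_count and rank_characters are hypothetical, and it assumes the search engine reports totals in the form "about 37,900,000". The sample figures are the Google counts quoted in this article.

```python
import re

def parse_count(text):
    """Extract an integer from a result string such as 'about 37,900,000'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no count found in {text!r}")
    return int(match.group(0).replace(",", ""))

def rank_characters(raw_counts):
    """Return (character, count) pairs sorted by count, highest first."""
    parsed = [(ch, parse_count(text)) for ch, text in raw_counts.items()]
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)

# Result strings as they might be scraped (hypothetical input format).
raw = {"一": "about 37,900,000", "等": "about 58,800,000", "人": "about 55,500,000"}
ranking = rank_characters(raw)
for ch, count in ranking:
    print(ch, count)
```

Fetching the result pages is deliberately left out here, since the page layout and the expression needed to locate the count depend on the search engine.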

One problem with this Google version is that Google sometimes returns wrong results; for instance, the top result of a search for "二" has no "二" anywhere in the page, not even in the Google-cached copy. Baidu and Yahoo China don't have this problem. The other problem is that some characters appear in a counter-intuitive order; in real life, "等" is unlikely to be used more often than "人", yet Google's result counts, whether from my program or from a manual search, are 58,800,000 and 55,500,000, respectively. A search on Baidu or Yahoo China agrees with common sense: "人" occurs more often than "等" on the Internet. I can easily change my program to search sites geared more toward the Chinese language, but from the US, where I am now, the program may run much longer than the 22 minutes my Google version takes.

This work has been subjected to peer review. The identified limitations are as follows

Appendix

Detailed procedure to generate the Chinese character usage frequency list

The first step is to create a text file of the characters. I start by copying the characters from this page, then remove all unnecessary text, including the stroke-count headings "一画" (one stroke), "二画" (two strokes), etc. Save the file as t.txt in UTF-8 encoding. I do all of this on Windows. Assuming you have installed vim, PHP and UnxUtils and the executables are in %path%, the subsequent steps are:

gvim t.txt
 %s/ //g	remove all spaces
 %g/^$/d	remove blank lines
php -n -r "$f=fopen('t.txt','r'); while (!feof($f)) {$c=fread($f,1); if ($c!=\"\r\" and $c!=\"\n\") print $c;}" > nonl.txt	remove carriage returns and line feeds	
php -n google.php > result.txt	search on Google
d:\systools\sort -nr -k 2,2 result.txt > ChineseCharFrequencyG.txt	sort by usage frequency; Windows sort at c:\windows\system32 won't sort numerically
notepad ChineseCharFrequencyG.txt	final cosmetic editing
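For readers without vim or UnxUtils, the cleanup and sorting steps above might be reproduced in Python roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and result.txt is assumed to hold one character and its count per line, separated by whitespace, which is what the sort -k 2,2 step implies.

```python
def clean_characters(text):
    """Drop the byte order mark, spaces, tabs, CRs and LFs, keeping only the characters."""
    return "".join(ch for ch in text if not ch.isspace() and ch != "\ufeff")

def sort_by_count(lines):
    """Sort 'character count' lines by the numeric second field, highest first."""
    rows = [line.split() for line in lines if line.strip()]
    rows.sort(key=lambda fields: int(fields[1]), reverse=True)
    return [f"{fields[0]} {fields[1]}" for fields in rows]

# A small made-up sample standing in for the contents of t.txt.
sample = "\ufeff一 乙\r\n\r\n二 十\r\n"
cleaned = clean_characters(sample)
print(cleaned)

# Counts are the Google figures quoted in the article.
results = ["人 55500000", "一 37900000", "等 58800000"]
for line in sort_by_count(results):
    print(line)
```

Converting the second field with int() is what makes the sort numeric, which is the same reason the procedure uses the UnxUtils sort instead of the Windows one.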

The real work is done by the program google.php, which you can modify to suit your needs. For instance, one reviewer points out that the search should be limited to documents in the Chinese language, excluding Japanese or Korean documents that use some Chinese characters. If you manually submit a search restricted to simplified Chinese pages at Google Advanced Search, you'll see that the result URL has an additional "lr=lang_zh-CN" parameter. Read the source code to see how to add this option. You can restrict the search by any option Google allows, such as file format or document date, or by the unofficial options described in Google Hacks.
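As an illustration of the language restriction, a query URL carrying the "lr=lang_zh-CN" parameter could be built like this. It is a hedged Python sketch, not an excerpt from google.php: the base address, the "q" parameter name and the function build_query_url are assumptions, so compare the output with the URL your browser shows after a manual search.

```python
from urllib.parse import urlencode

# Assumed base address of the search page; not taken from google.php.
SEARCH_URL = "https://www.google.com/search"

def build_query_url(character, language=None):
    """Build a search URL for one character, optionally restricted by language."""
    params = {"q": character}          # "q" as the query parameter is an assumption
    if language is not None:
        params["lr"] = language        # e.g. "lang_zh-CN", as seen in the result URL
    return SEARCH_URL + "?" + urlencode(params)

url = build_query_url("人", "lang_zh-CN")
print(url)
```

urlencode percent-encodes the character as its UTF-8 bytes, so the three bytes of each character appear in the URL as three %XX groups.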

Make very sure the character file has no carriage returns, line feeds or other unneeded characters. I use od -x nonl.txt to check it. (od -x prints two-byte words, so on a little-endian machine each pair of bytes appears swapped. If you're not used to reading that, binary ftp the file to a big-endian machine such as a Sparc and run od -x there, or use od -t x1, which prints single bytes in file order.) Each Chinese character takes 3 bytes in the file, usually starting with e4, e5, etc. The very first 3 bytes of my character file are not a Chinese character, so google.php skips them. The heart of the process is running the PHP program, which takes 22 minutes to search for 2500 characters on Google, a rate of close to 2 searches per second. Obviously, you can write this program in any general-purpose language. If you decide to use other search engines, feel free to modify google.php, particularly the regular expression part.
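The byte-level checks described above can also be automated. The Python sketch below is an illustration under assumptions (the function name is hypothetical): it strips a leading UTF-8 byte order mark if one is present, which is one plausible explanation for three non-character bytes at the start of a file saved on Windows, then confirms that the data contains no CR or LF bytes and splits cleanly into three-byte characters. The last line reproduces the search rate from the figures in the text.

```python
import codecs

def check_character_bytes(data):
    """Return the characters in data, or raise ValueError if the bytes look wrong."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]   # skip the 3-byte mark EF BB BF
    if b"\r" in data or b"\n" in data:
        raise ValueError("file still contains CR or LF bytes")
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3 bytes")
    chars = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Three-byte UTF-8 sequences start with a byte in the range 0xE0-0xEF.
        if not 0xE0 <= chunk[0] <= 0xEF:
            raise ValueError(f"unexpected lead byte {chunk[0]:#04x} at offset {i}")
        chars.append(chunk.decode("utf-8"))
    return chars

sample = codecs.BOM_UTF8 + "一人等".encode("utf-8")
print(check_character_bytes(sample))

# Rate implied by the text: 2500 searches in 22 minutes.
print(round(2500 / (22 * 60), 2))
```

To check the real file, the same function could be called as check_character_bytes(open("nonl.txt", "rb").read()).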



