Chinese Character Usage Frequency

Chinese is considered one of the most difficult living languages to learn, primarily because there is little to no association between written characters and their pronunciation. To reach reading literacy as quickly as possible, a student should learn the most commonly used characters first. Fortunately, lists of commonly used characters are available on the Internet, such as 现代汉语常用字表 (List of Commonly Used Characters in Modern Chinese), which contains the 2500 most frequently used characters. The problem with this official list is that its characters are sorted by stroke count (笔画). Ideally, the student would like to see the list sorted by real-life usage frequency (使用频率).

Earlier attempts to create such a usage frequency list include the 频度表 (frequency table) dated 1985 or 1987 and cited in 关于发布《现代汉语常用字表》的联合通知 (the joint notice announcing the official list), and 部件构字数降序排列表 (a table of components in descending order of the number of characters they form). Before the Internet, linguists resorted to laborious literature searches and manual compilation to generate a character or word usage frequency list. The Internet is to this work what calculators are to accounting: it not only makes the task much faster but also turns it into fun. In recent years, there have been Mats' work (description), Dylan Sung's Frequency of Characters (description), and Jun Da's Frequency Statistics. Although their research all relies on computer technology, and some of it crawls the Internet, none directly uses the results of an existing Internet search engine. The work described here takes a different approach, using a popular Web search engine to achieve this goal and thus bringing a previously daunting task within reach of a hobbyist at home. The computer program used to generate the character usage table can be re-run at any time. Full source code is available. (The code may be somewhat outdated. The procedure that prepares the Chinese characters for searching and sorting can be simplified, because I did the work before I learned more about Unicode processing. The code can also be used to search and sort words of other languages.)

The method I use is to submit each Chinese character to a search engine and record the search count. For instance, searching for "一" on Google returns "about 37,900,000" results. Each character yields an approximate search count like this. At the end, all the counts are sorted in descending order. The result is 常用汉字使用频率表 (usage frequency table of commonly used Chinese characters) (Google) (completed in 2005). Similarly, the result for Yahoo search is 常用汉字使用频率表 (Yahoo), and that for Baidu is 常用汉字使用频率表 (Baidu) (both completed in 2009). Judged by common sense, the frequency lists are quite good.
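The count-and-sort idea can be sketched in a few lines. This is a minimal illustration in Python, not the original PHP program: the function names are hypothetical, and the input is a hand-typed dictionary of the Google figures quoted in this article rather than live search results.

```python
import re

def parse_hit_count(text):
    """Extract the integer from a phrase such as 'about 37,900,000'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no count found in {text!r}")
    return int(match.group(0).replace(",", ""))

def rank_by_count(raw_counts):
    """Return (character, count) pairs sorted by count, highest first."""
    parsed = [(ch, parse_hit_count(text)) for ch, text in raw_counts.items()]
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)

# Figures quoted in this article; the dictionary itself is only illustrative.
sample = {
    "人": "about 55,500,000",
    "一": "about 37,900,000",
    "等": "about 58,800,000",
}
for ch, count in rank_by_count(sample):
    print(ch, count)
```

Parsing the count into an integer before sorting matters: sorting the raw strings would order them character by character instead of numerically.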

One of the problems with the Google version is that some characters appear in a counter-intuitive order of frequency. For example, in real life, "等" is unlikely to be used more often than "人", and yet Google's result counts, whether from my program or from a manual search, are 58,800,000 and 55,500,000, respectively. A search on Baidu or Yahoo is consistent with common sense: "人" occurs more often than "等" on the Internet.

In the Baidu result, the first 108 characters have the same frequency, 100,000,000. It's likely that Baidu caps the search hit count at that number. Among these 108 characters, some, such as "县", are unlikely to be among the hundred or so most commonly used. In fact, if you quickly glance through the entire 2500-character list, you'll see that a large number of characters appear in a counter-intuitive order of frequency. Baidu is heavily biased toward mainland Chinese web sites. I had high expectations for the usefulness of the Baidu character frequency list, so its abnormalities came as a surprise. It's unknown whether they are due to Baidu's gross approximation of search hit counts or to the special characteristics of most Chinese web sites (repeated publication of the same articles, etc.).
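A capped hit count like this is easy to spot mechanically: many characters share the single highest value. The following Python sketch shows one way to check for it. The function name and the sample data are hypothetical, and the check only suggests a cap; it cannot prove one.

```python
def find_cap(counts):
    """Return (highest_count, number_of_characters_sharing_it)."""
    highest = max(counts.values())
    tied = sum(1 for value in counts.values() if value == highest)
    return highest, tied

# Made-up counts for illustration; only the 100,000,000 ceiling comes from the text.
sample = {"甲": 100_000_000, "乙": 100_000_000, "丙": 100_000_000, "丁": 42_000_000}
highest, tied = find_cap(sample)
print(f"{tied} characters share the top count {highest:,}")
```

When many characters tie at the top, their relative order within the tie carries no information.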

This work has been subjected to peer review. The identified limitations are as follows.

While I consider the Internet a near-perfect corpus, one reviewer argues that Internet content does not necessarily reflect language usage in people's everyday lives, although this may be less of a problem for the first 2000 or 3000 characters. A statistical comparison of the search engine-based character frequencies with frequencies based on manually selected representative samples may be needed to reveal the extent of the discrepancy.
One document may be duplicated in many places on the Internet, but a search engine does not exclude duplicates from its count, which artificially increases the weight of some characters. Although duplication partially reflects the importance of those documents, some of them, such as ancient classical works, may be overweighted.
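As one possible way to quantify the discrepancy between two frequency lists, the sketch below computes Spearman's rank correlation between two orderings of characters. The choice of statistic is my assumption, not something the reviewers specified, and the function name is hypothetical. It assumes each list contains no repeated characters and is already sorted from most to least frequent.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation over the characters present in both lists."""
    shared = set(order_a) & set(order_b)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared characters")
    # Re-rank within the shared characters so both rankings run 0..n-1.
    ranks_a = {ch: i for i, ch in enumerate(c for c in order_a if c in shared)}
    ranks_b = {ch: i for i, ch in enumerate(c for c in order_b if c in shared)}
    d_squared = sum((ranks_a[ch] - ranks_b[ch]) ** 2 for ch in shared)
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Identical orderings give 1.0; exactly reversed orderings give -1.0.
print(spearman_rho(["的", "一", "是"], ["的", "一", "是"]))
print(spearman_rho(["的", "一", "是"], ["是", "一", "的"]))
```

Because ranks are taken from list positions, tied counts (such as a capped block of identical values) are ranked in whatever order the list happens to give them, which weakens the measure for such data.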

(The program I wrote to submit the 2500 characters to search engine web sites undoubtedly reduces human labor, but there is an annoyance. Both Google and Yahoo, though not Baidu, at least back then, are smart enough to detect that search actions with such a highly regular pattern, repeated once per 2 seconds, must come from a "robot", not a human. The search is then interrupted by the web site prompting for captcha verification. Restarting from where it stopped works for a while until the program is caught again. If you do this from a big network with a single outgoing IP address, as is the case at many companies, the entire company suffers: for a few hours, every employee trying a Google search has to enter the captcha word. Since I have a programmer friend working at Yahoo, I asked him whether he or his coworkers had any suggestions for me. After all, I'm not a bad guy. They recommended certain programming routines published by Yahoo, but I think they're overkill for my job, and I stopped running my program against these web sites.)
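Restarting from where a run stopped amounts to skipping the characters that already have a result. A minimal Python sketch of that bookkeeping follows. It assumes each result line holds a character followed by whitespace and its count, which is my reading of the output format rather than a documented fact, and the function name is hypothetical.

```python
def remaining_characters(all_chars, result_lines):
    """Return the characters that have no line in the partial results yet."""
    done = {line.split()[0] for line in result_lines if line.strip()}
    return [ch for ch in all_chars if ch not in done]

# A run that stopped after two of four characters (counts are made up).
partial = ["一 37900000", "二 12000000", ""]
print(remaining_characters("一二三四", partial))
```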

Appendix

Detailed procedure to generate the Chinese character usage frequency list

The first step is to create a text file of the characters. I start by copying the characters on this page (only the 2500 常用字, the commonly used characters). Remove all unnecessary characters, including the stroke-count headings "一画", "二画", etc. Save the file as t.txt in UTF-8 encoding. I do all of this on Windows XP. Assuming you have installed vim, PHP (actually only php.exe and php5ts.dll are needed) and UnxUtils (only sort.exe is absolutely needed), and the executables are in %path%, the subsequent steps are:


1. gvim t.txt    (vim works too; then, inside the editor)
     %s/ //g    remove all white spaces
     %g/^$/d    remove blank lines
   php -n -r "$f=fopen('t.txt','r'); while (!feof($f)) {$c=fread($f,1); if ($c!=\"\r\" and $c!=\"\n\") print $c;}" > nonl.txt    remove carriage returns and line feeds

2. php -n google.php > result.txt    search on Google
   d:\systools\sort -nr -k 2,2 result.txt > ChineseCharFrequencyG.txt    sort by usage frequency; the Windows sort at c:\windows\system32 won't sort numerically

3. notepad ChineseCharFrequencyG.txt    final cosmetic editing
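The cleanup part of the procedure (removing spaces, blank lines, carriage returns and line feeds) can also be done in one pass. The Python sketch below is an alternative to the vim and PHP commands, not a transcription of them, and the function name is hypothetical. Unlike the original pipeline, which leaves the UTF-8 marker at the start of the file for the search script to skip, this version removes it.

```python
def clean_character_text(raw):
    """Drop the byte order mark and all whitespace, keeping only the characters."""
    return "".join(ch for ch in raw if ch != "\ufeff" and not ch.isspace())

# str.isspace() also covers the full-width ideographic space (U+3000).
messy = "\ufeff一 二\r\n\r\n三\u3000四\n"
print(clean_character_text(messy))
```

To apply it to the file from the first step, read t.txt with encoding="utf-8", pass the text through the function, and write the result to nonl.txt.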

When I search on Yahoo instead, I replace the Google search command with
php -n yahoo.php > result.txt

To search on Baidu, there's a little more work before the search step, because Baidu accepts GB2312 or GBK encoded characters rather than the UTF-8 that Google and Yahoo accept. There are many ways to convert our UTF-8 encoded nonl.txt to a GB2312 file. The easiest is to open the text file in Internet Explorer (for IE6, you have to drag the file into IE rather than go through File | Open | Browse) and Save As, choosing GB2312 encoding. Or you can use my iconv.php. Or binary ftp the file to a UNIX/Linux box, run iconv, and ftp the output file back. After that, the search command is
php -n baidu.php > result.txt
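The UTF-8 to GB2312 conversion mentioned above is also a few lines in a language with built-in codecs. This Python sketch is one more alternative to the IE, iconv.php and iconv routes, and the function name is hypothetical. It uses Python's standard "utf-8-sig" and "gb2312" codecs. Note that it drops a leading UTF-8 marker, and that encoding raises an error for any character GB2312 cannot represent.

```python
def utf8_to_gb2312(data):
    """Convert UTF-8 bytes (with or without a leading BOM) to GB2312 bytes."""
    text = data.decode("utf-8-sig")  # "utf-8-sig" strips the BOM if present
    return text.encode("gb2312")     # raises UnicodeEncodeError if unrepresentable

utf8_bytes = b"\xef\xbb\xbf" + "一二".encode("utf-8")
gb_bytes = utf8_to_gb2312(utf8_bytes)
print(len(utf8_bytes), len(gb_bytes))  # 3-byte BOM plus 3 bytes each, versus 2 bytes each
```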

The real work is done by the program google.php, yahoo.php, or baidu.php. You can modify the program to suit your needs. For instance, one reviewer points out that the search should be limited to documents in the Chinese language, excluding Japanese or Korean documents that use some Chinese characters. If you manually submit a simplified Chinese language page search at Google Advanced Search, you'll see that the result URL has an additional "lr=lang_zh-CN" parameter. Read the source code for how to add this option. You can restrict the search by any option Google allows, such as file format or document date, or by some of the undocumented options described in Google Hacks.
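To show what adding such a parameter looks like, here is a small Python sketch that builds a query URL with the "lr=lang_zh-CN" restriction. It is not taken from google.php. The endpoint and the "q" parameter name are my assumptions about Google's usual URL form, the function name is hypothetical, and only the "lr" value comes from the text above.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the URL your own manual search produces.
SEARCH_ENDPOINT = "https://www.google.com/search"

def build_search_url(character, restrict_to_simplified_chinese=True):
    """Build a search URL for one character, optionally limited by language."""
    params = {"q": character}
    if restrict_to_simplified_chinese:
        params["lr"] = "lang_zh-CN"  # the parameter observed in Advanced Search
    return SEARCH_ENDPOINT + "?" + urlencode(params)

print(build_search_url("一"))
```

urlencode percent-encodes the character as its UTF-8 bytes, which matches the encoding the text says Google accepts.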

Make very sure the character file has no carriage returns, line feeds, or other unneeded characters. I use od -x nonl.txt to check it. (If you're not used to reading an od -x dump on a little-endian machine, binary ftp the file to a big-endian machine such as a Sparc and run od -x there.) Each Chinese character takes 3 bytes in the UTF-8 encoded file, with a first byte of e4, e5, etc. The very first 3 bytes of the character file are the UTF-8 marker (the byte order mark), so the script skips them. In a GB2312 encoded file, each character takes 2 bytes instead.

The heart of the process is running the PHP program, which takes me 22 minutes to search 2500 characters on Google, a rate of close to 2 searches per second. You can monitor the progress by opening another command console and typing tail -f result.txt. (Interestingly, the Yahoo search is frequently interrupted, probably because Yahoo detects this continuous single-character search as suspicious activity and throws the error "Sorry, Unable to process request at this time -- error 999"! So you must monitor the Yahoo search and restart some time later from where it failed.) Obviously, you can use any general-purpose language to write this program. If you decide to search on other search engines, feel free to modify the PHP script. Generally, the exact HTTP GET command and the regular expression in the script need to be changed after you research how the search site uses them.
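The byte-level check of the character file described above can be automated instead of read from an od dump. The Python sketch below is one way to do it, with a hypothetical function name. It relies on two facts about UTF-8: the marker is the byte sequence EF BB BF, and every 3-byte character starts with a byte between E0 and EF. It assumes, as the text does, that every character in the file is a 3-byte one.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def check_character_bytes(data):
    """Return a list of problems; an empty list means the data looks clean."""
    problems = []
    body = data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
    if len(body) % 3 != 0:
        problems.append("length is not a multiple of 3 bytes")
    # Walk the body in 3-byte steps and test each expected lead byte.
    for offset in range(0, len(body) - len(body) % 3, 3):
        lead = body[offset]
        if not 0xE0 <= lead <= 0xEF:
            problems.append(f"unexpected lead byte {lead:#04x} at offset {offset}")
            break  # everything after a stray byte is misaligned anyway
    return problems

clean = UTF8_BOM + "一二".encode("utf-8")
dirty = "一\n二".encode("utf-8")
print(check_character_bytes(clean))
print(check_character_bytes(dirty))
```

A stray carriage return or line feed shifts every later character off the 3-byte grid, so the check reports only the first offending offset.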

Links

Developing orthographic awareness among beginning Chinese language learners (a dissertation citing this work)
Natural Language Word Frequency