Herb Name Frequency

The classical herb catalog 本草纲目 (Compendium of Materia Medica) lists more than 1800 herbs. To start, a student of herbal medicine needs to learn at least the 100 most commonly used. The teaching syllabus 新世纪全国高等中医药院校教材(新一版)《中药学》教学大纲 lists a recommended 133. The herbs in that list were selected by experienced teachers and Chinese medicine practitioners. Students who memorize those herbs are expected to gain a preliminary proficiency in the subject.

This article investigates how frequently herb names occur on the Internet. The idea is that if an herb name appears frequently on Web pages, the herb is assumed to be commonly used as well, so herbs can be ranked by usage frequency. Everything else being equal, studying the herbs in this order maximizes study efficiency.

The method I use is to submit each herb name to a search engine and record the result count. For instance, as of this writing, searching for "人参" (ginseng) on Google returns "about 1,340,000" results. Each herb gets an approximate count in this way, and the herbs are then sorted by count in descending order. The result is 常用草药名频率表 (Google), a frequency table of common herb names. The detailed procedure is given in the appendix below.
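The counting-and-sorting step can be sketched in a few lines. This is only an illustration in Python, not the author's program (which is written in PHP). The function names are hypothetical, and only the 人参 figure is quoted in the form Google reports it; the other two values are the counts for 板蓝根 and 南板蓝根 cited further down, written in the same style for the sake of the example.

```python
import re

def parse_count(text):
    """Extract the integer from a result-count string such as 'about 1,340,000'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return 0  # no digits found; treat as zero results
    return int(match.group(0).replace(",", ""))

def rank(counts):
    """Return (name, count) pairs sorted by count, highest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Illustrative input: herb name -> count string as a search engine might show it.
raw = {
    "板蓝根": "about 180,000",
    "人参": "about 1,340,000",
    "南板蓝根": "about 25,000",
}

ranked = rank({name: parse_count(text) for name, text in raw.items()})
for name, count in ranked:
    print(name, count)
```

The counts are kept as integers until the very end so that the ordering is numeric rather than alphabetical.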

By no means does my work replace an herb name list compiled by a human expert. Instead, it serves as a reference for textbook writers, Chinese medicine teachers and students, and offers them a new and different perspective. It is also an entertaining exercise in itself.

The limitations of this Web search are obvious. For instance, the 常用草药名频率表 (Google) lists 玫瑰花 (rose flower) as more "commonly used" than 人参. This is because 玫瑰花 is also an everyday word used outside the context of herbal medicine, and it is very difficult, if not impossible, for me to limit the search to the herbal medicine context. Another problem is partial name overlap. 板蓝根 and 南板蓝根 have frequency counts of 180000 and 25000, respectively. Although the former is indeed more common, the difference is exaggerated: a page that mentions only 南板蓝根 presumably also matches a search for 板蓝根, because I'm not submitting to Google the search string "板蓝根 -南板蓝根" (i.e. 板蓝根 excluding 南板蓝根). Other limitations are repeated counting of the same document published in different places on the Internet, omission of documents published on paper but not on the Internet, the very rough nature of Google's result count estimates, and (not necessarily a limitation) a bias toward documents in only one language.
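One possible way to reduce the overlap problem is to generate the exclusion string automatically for every herb whose name is contained in a longer herb name. The sketch below is in Python and is not part of the author's program. The function name is hypothetical, and whether the minus operator actually yields more accurate counts has not been tested here.

```python
def exclusion_query(name, all_names):
    """Build a query for `name` that excludes longer herb names containing it."""
    longer = [other for other in all_names if other != name and name in other]
    return " ".join([name] + ["-" + other for other in sorted(longer)])

herbs = ["板蓝根", "南板蓝根", "人参"]
for herb in herbs:
    # 板蓝根 gets an exclusion term; the other two are returned unchanged.
    print(exclusion_query(herb, herbs))
```

Names that are not substrings of any other name in the list pass through untouched, so the same function can be applied to the whole herb list.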


Appendix: Detailed procedure to generate the herb name frequency list

First, I get a list of herb names from a Web site. (I couldn't find an authoritative list on the Internet.) Edit the file to keep just the herb names, one per line, and save it as Herbs.txt in UTF-8 encoding. Assuming PHP is installed, open a DOS command prompt and type the following commands:

php -n google_herbs.php > result.txt
    (search on Google)
d:\systools\sort -nr -k 2,2 result.txt > HerbFrequencyG.txt
    (sort by usage frequency; the Windows sort at c:\windows\system32 won't sort numerically)
notepad HerbFrequencyG.txt
    (final cosmetic editing)
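If a Unix-style sort is not at hand, the numeric sort can be done in a few lines of any scripting language. The Python sketch below assumes that each line of result.txt holds an herb name followed by whitespace and a plain integer count. That layout is inferred from the "-k 2,2" option above and is not confirmed by the program's source. The function name is hypothetical.

```python
def sort_by_count(lines):
    """Sort 'name count' lines by the numeric second field, largest first."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((fields[0], int(fields[1])))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [f"{name}\t{count}" for name, count in rows]

sample = ["南板蓝根 25000", "人参 1340000", "板蓝根 180000"]
print("\n".join(sort_by_count(sample)))

# A plain text sort compares character by character, so "25000" ranks
# above "1340000". This is why a non-numeric sort gives the wrong order.
text_sorted = sorted(["25000", "1340000", "180000"], reverse=True)
print(text_sorted)
```

To apply it to the real file, read the lines with open("result.txt", encoding="utf-8") and write the returned list to HerbFrequencyG.txt.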

The real work is done by the program google_herbs.php. You can modify the program to suit your needs. For instance, the current program searches Google for documents in simplified Chinese only, but you can make it search in traditional Chinese instead. If you manually submit a search for traditional Chinese pages at Google Advanced Search, you'll see that the result URL has an "lr=lang_zh-TW" parameter. See the source code for how to change this option. You can restrict the search by any option Google allows, such as file format or document date, or by some of the unofficial options described in Google Hacks. Running the PHP program to search 540 herbs on Google takes me four and a half hours, a rate of about 2 searches per minute. Obviously, you can use any general-purpose language to write this program. If you decide to use another search engine, feel free to modify google_herbs.php, particularly the regular expression part.
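As a rough illustration of the language option, the sketch below builds a search URL with the "lr" parameter, using Python's standard library. It does not reproduce google_herbs.php. The endpoint "https://www.google.com/search", the "q" parameter and the default value "lang_zh-CN" for simplified Chinese are assumptions; only "lr=lang_zh-TW" is mentioned above. The function name is hypothetical, and the code only constructs the URL without sending any request.

```python
from urllib.parse import urlencode

def search_url(herb, language="lang_zh-CN"):
    """Build a search URL for one herb name, restricted to one language.

    The endpoint and the 'q' parameter are assumptions for illustration.
    """
    params = {"q": herb, "lr": language}
    # urlencode percent-encodes the herb name as UTF-8.
    return "https://www.google.com/search?" + urlencode(params)

# Simplified Chinese (assumed default) versus traditional Chinese.
print(search_url("人参"))
print(search_url("人参", language="lang_zh-TW"))
```

Keeping the language as a function argument means switching between simplified and traditional Chinese takes a single change at the call site.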
