Random Thoughts about Search Engines
[for Desktop Search Engines, scroll down]

* A search engine can be used to help study a human language, particularly English. Many people notice the suggested spelling when they misspell a word (searching for "mariuna" on Google returns 776 results, below the line "Did you mean: marijuana", because Google has 43 million results for "marijuana"). But there is another way a search engine can help you study English. Say you want to know whether you should say "keep an eye on" or "keep eyes on". You can search for both phrases on Google. Make very sure (it's important!) to include the double quotes. As of this writing, "keep an eye on" returns 1,990,000 results while "keep eyes on" returns only 55,500. That does not mean that only 55,500/(1,990,000+55,500) x 100% = 2.7% of people use both eyes when they want to watch something closely. It's just the way people say it. So follow the majority and say "keep an eye on".

  Another example. English beginners sometimes use a plural form of "information", "informations". On Google, the first returns 14 billion results and the second returns less than half a billion, most of which are from foreign language sites. So you know "information" is correct.

  Do more people say "not only ... but" or "not only ... but also"? "'not only' 'but also'" returns 2.2 billion, and "'not only' but" returns about 3 billion. (Please preserve the quotes as shown here when you do the searches!) Since the latter includes the former, you can conclude that "not only ... but" (not followed by "also") would return about 800 million. But of course this is not accurate, because Google allows any number of strings between "not only" and "but", possibly including a period (end of sentence). Support for regular expressions would help.

* Why did Google replace AltaVista as the preferred search engine for computer geeks (nerds)? The primary reason is technical.
But a minor non-technical reason may be that "altavista" takes too long to type and people are less familiar with the Spanish words "alta" and "vista". Typing "google" requires moving fingers to 4 keys while "altavista" requires 9. Why 4 for "google"? When you type the second "g" you still have a finger above "g", so there's no move. But typing each "a" in "altavista" requires a finger move, and so does each "t". Yahoo was the primary search engine for non-techies and has largely been replaced by Google for technical reasons. But Yahoo is still used by many, partly because you only move fingers to 4 keys. In fact, it's slightly more efficient to press the 4 keys in "yahoo" than the 4 keys in "google". Besides, "yahoo" is one letter shorter than "google".

* People have thought of all kinds of improvements to search engines: indexing images by pattern matching, cataloging sound tapes, ... But no search engine can currently do two things:

  (A) literal text search
  (B) regular expression search

  (A) is actually very easy to implement and should be as fast as or faster than the current keyword search. So the lack of this capability is surprising. Take "user$" as an example. It's the name of a table inside the Oracle database. But if you search for "user$" in any search engine (Google, Yahoo, AltaVista, MSN, AllTheWeb, Excite, MetaCrawler), the search results are exactly the same as if you searched for "user". Searching for "$$" returns nothing. (It's a variable commonly used in many scripting languages and UNIX shells.) Surprisingly, Google allows you to search for "$_" because it gives special treatment to this Perl variable. But in general, the lack of this functionality limits the usefulness of search engines for IT professionals.

  (B) is missing probably for a good reason: security. Say you want to search for the string "'net profit exceeds \S+ dollars'", where \S+ represents one or more non-space characters such as digits. Currently you can only search for "'net profit exceeds' dollars", and the results would be more than you want. Unfortunately, allowing regular expressions such as \S+ would put too much burden on the search engine programmers to prevent abuse by hackers.

Random Thoughts about Desktop Search Engines (DSE)

* I had high expectations for Google Desktop Search. But I uninstalled it because I couldn't figure out how to tell it to index only the files in certain directories; it apparently indexed more than I wanted. Nor could I make it index only certain file types, most notably PDF[note], whose text Windows Explorer can't search. Even if these can be tolerated, one thing cannot. I made sure I was not using the Advanced features and didn't choose anything that Google said could connect to their servers to retrieve information, but the Google DSE process still constantly makes connections to the following web servers according to TCPView (or "netstat -ano"):

    ed-in-f99.google.com
    in-in-f99.google.com
    ed-in-f104.google.com
    in-in-f99.google.com
    in-in-f99.google.com
    in-in-f99.google.com

  So I added them to c:\windows\system32\drivers\etc\hosts and associated them with 127.0.0.1. That didn't stop it. Even if I could stop that, Google DSE still relies on local port 4664 being open when you use it. Whenever you see a process unnecessarily accessing some hosts, particularly hosts on the Internet, you should be on the alert, no matter how prestigious the company providing the software is.

* I started to play with the Indexing Service that comes with Windows XP. Start the service in Computer Management (compmgmt.msc or services.msc), and possibly set it to start automatically on boot. Go to the control for this service (in compmgmt.msc, Services and Applications -> Indexing Service, or ciadv.msc for short). Stop indexing the directories you don't want, such as "C:\Documents and Settings". Add those you want. Wait a few minutes and start to use "Query the Catalog".
The good thing about this built-in indexing service is that it does not make any network connection to any host, not even localhost. Its limitations are serious, though: you can't limit it to certain file types; it can't index documents other than regular files, such as Outlook email messages; and like all other DSEs and powerful search web sites, it can't index literal strings such as "user$", "v$" etc.

* Finally I settled on Microsoft Windows Desktop Search. See KB917013 at www.microsoft.com/downloads/details.aspx?FamilyID=4982072f-7660-492f-b96c-e42b4f5ab4aa
  It allows indexing Outlook email messages and choosing file types, and it makes no network connections. One cool trick few people use: if you think it takes too much CPU, lower the priority of the searchindexer.exe and WindowsSearch.exe processes. However, since the service runs as Local System by default, you need the privilege to change the priority of searchindexer.exe, the process for the service. To get it, either run psexec -s -i -d taskmgr (psexec is from sysinternals.com) or run the service as a real user.

_____________________
[note] If I don't use a DSE, I have to use this command to find PDF documents that contain a certain string:

    for %i in (*.pdf) do pdftotext "%i" - 2>nul: | grep -l -i "searchstring"

where pdftotext.exe is from www.foolabs.com/xpdf/ and grep is available in many places (it can be replaced with findstr /m /i).
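
The command in the note can also be written as a short script. Below is a minimal Python sketch, not a tested tool. It assumes pdftotext is on the PATH and writes the extracted text to standard output when given "-" as the output file, as in the command above. The function names (text_matches, pdf_text, find_pdfs) are made up for illustration. Unlike grep, which treats its pattern as a regular expression by default, this sketch compares the string literally (ignoring case), so a string such as "user$" is matched as typed.

```python
import subprocess
from pathlib import Path


def text_matches(text, needle):
    """Case-insensitive literal substring test (no regex interpretation)."""
    return needle.lower() in text.lower()


def pdf_text(path):
    """Extract the text of one PDF by running `pdftotext FILE -`.

    Assumes the pdftotext executable is on the PATH; error output is
    discarded, like `2>nul` in the original command.
    """
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_pdfs(directory, needle, extract=pdf_text):
    """Return the names of PDFs in `directory` whose text contains `needle`.

    `extract` is a hypothetical hook so another text extractor can be
    swapped in for pdftotext.
    """
    return [
        p.name
        for p in sorted(Path(directory).glob("*.pdf"))
        if text_matches(extract(p), needle)
    ]
```

For example, find_pdfs(".", "searchstring") would list the matching PDF files in the current directory, one name per list element.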