Search engines, recalls and ratios

Just for fun, I tried three separate search engines, AltaVista, Google and Lycos, to see how their recalls differ. I chose those three since Wikipedia indicates that they index and search the interwebs independently of each other. (The figures below seem to substantiate this.)

I’m not interested in the actual number of hits, but rather in ratios. Hence I searched for words in pairs (one word, two spellings) such as "keyboard" vs "kyeboard" (the latter being a typo), "occurring" vs "occuring" (common misspelling), "organisation" vs "organization" (variant spelling), and a handful of others.

Ideally such recalls should tell me how much more common one construction is compared to another. As long as the indexing of the interwebs is comprehensive and/or sufficiently random, then each search engine should give me roughly equal ratios irrespective of the actual number of hits involved. However, the figures below indicate something different.

TYPOS & MISSPELLINGS

             keyboard         kyeboard        Ratio
AltaVista    483,000,000      23,000          21,000:1
Lycos        26,162,417       1,158           22,593:1
Google       91,000,000       30,300          3,003:1     ⬅ more

             occurring        occuring        Ratio
AltaVista    149,000,000      12,400,000      12:1
Lycos        8,235,538        675,116         12:1
Google       45,500,000       14,400,000      3:1         ⬅ more

             episode          epsiode         Ratio
AltaVista    844,000,000      1,360,000       621:1
Lycos        45,426,313       53,927          842:1
Google       397,000,000      306,000         1,297:1     ⬅ less

VARIANT SPELLINGS & FORMS

             organization     organisation    Ratio
AltaVista    1,620,000,000    811,000,000     2.0:1
Lycos        467,042,950      45,292,899      10.3:1      ⬅ less
Google       248,000,000      142,000,000     1.8:1

             isn’t            ain’t           Ratio
AltaVista    368,000,000      82,900,000      4.5:1
Lycos        80,699,040       13,814,999      5.8:1       ⬅ less
Google       223,000,000      52,500,000      4.2:1

             "he isn’t"       "he ain’t"      Ratio
AltaVista    298,000          97,000          3.1:1       ⬅ more
Lycos        1,912,390        306,715         6.2:1       ⬅ less
Google       5,090,000        11,900,000      1:2.3       ⬅ !!!

             "than I"         "than me"       Ratio
AltaVista    216,000,000      69,000,000      3.1:1
Lycos        11,704,123       3,402,830       3.4:1
Google       54,700,000       15,300,000      3.6:1
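
For anyone who wants to double-check the ratio column, here is a minimal Python sketch that recomputes a couple of the ratios from the raw hit counts above. The names `HITS`, `ratio` and `ratios_for` are made up for this illustration, and only two of the word pairs are included to keep it short.

```python
# Hit counts copied from the table above: (first form, second form).
# Only two word pairs are included here; the rest follow the same pattern.
HITS = {
    "keyboard/kyeboard": {
        "AltaVista": (483_000_000, 23_000),
        "Lycos": (26_162_417, 1_158),
        "Google": (91_000_000, 30_300),
    },
    "than I/than me": {
        "AltaVista": (216_000_000, 69_000_000),
        "Lycos": (11_704_123, 3_402_830),
        "Google": (54_700_000, 15_300_000),
    },
}


def ratio(first: int, second: int) -> float:
    """Return how many times more hits the first form has than the second."""
    if second == 0:
        raise ValueError("second form has no hits; the ratio is undefined")
    return first / second


def ratios_for(pair: str) -> dict[str, float]:
    """Compute the first:second ratio for each engine for one word pair."""
    return {engine: ratio(a, b) for engine, (a, b) in HITS[pair].items()}


if __name__ == "__main__":
    for pair in HITS:
        for engine, r in ratios_for(pair).items():
            print(f"{pair:20s} {engine:10s} {r:12,.1f}:1")
```

The point of dividing is that the absolute size of each engine's index cancels out, so a small index like Lycos's can still be compared with a large one like AltaVista's.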

Taken at face value, it would appear that Google disagrees with the other two search engines when it comes to typos and misspellings, although the disagreement is not consistently in any one direction. When it comes to variant spellings and forms, there seem to be no general tendencies. In one case ("than I" vs "than me") they all agree; in another ("he isn’t" vs "he ain’t") they all disagree. In the other two cases, AltaVista and Google agree, while Lycos does not.
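
One rough way to put a number on "agree" and "disagree" is to look at how far apart the three ratios are for a given word pair. The sketch below is only one possible yardstick, not an established method: `spread` divides the largest ratio by the smallest, and `odd_one_out` picks the engine whose ratio lies furthest from the median on a log scale. Both function names are hypothetical, and the ratios are the rounded figures from the table (with Google's 1:2.3 written as 1/2.3).

```python
import math
from statistics import median

# Ratios (first form : second form) taken from the table, three pairs only.
RATIOS = {
    "keyboard/kyeboard": {"AltaVista": 21_000, "Lycos": 22_593, "Google": 3_003},
    "he isn't/he ain't": {"AltaVista": 3.1, "Lycos": 6.2, "Google": 1 / 2.3},
    "than I/than me": {"AltaVista": 3.1, "Lycos": 3.4, "Google": 3.6},
}


def spread(ratios: dict[str, float]) -> float:
    """Largest ratio divided by smallest; 1.0 would mean perfect agreement."""
    return max(ratios.values()) / min(ratios.values())


def odd_one_out(ratios: dict[str, float]) -> str:
    """Engine whose ratio is furthest from the median, measured in log10 units.

    Logs are used so that 'ten times too high' and 'ten times too low'
    count as equally far off.
    """
    logs = {engine: math.log10(r) for engine, r in ratios.items()}
    mid = median(logs.values())
    return max(logs, key=lambda engine: abs(logs[engine] - mid))


if __name__ == "__main__":
    for pair, ratios in RATIOS.items():
        print(f"{pair:20s} spread x{spread(ratios):5.1f}  "
              f"furthest from median: {odd_one_out(ratios)}")
```

With only three engines there is always some engine furthest from the median, so the spread is the more telling figure: a spread near 1 means the label hardly matters.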

To be honest, I’m not entirely sure what this means. The fact that the differences are there ought to raise some alarm bells before trusting any figures provided by any of the search engines. Perhaps there’s a simple technical reason for all this. Idiosyncratic rounding? Invisible spell-checkers? Biased indexing of the web? Biased recall procedures? Unfortunately, I’m too ignorant about how exactly search engines work. One thing is clear, however. Different search engines do it differently, which leads to the obvious question: should we trust any of them?

A corpus like COCA (i.e. the Corpus of Contemporary American English) is more tailor-made for linguistics, and is therefore also more suited for linguistic queries. On the other hand, COCA doesn’t give us all aspects of actual language usage. The written-language part of the corpus contains texts drawn from published sources, and is thus composed of edited texts in which typos and non-standard usages have been weeded out. Typos like "epsiode" and "kyeboard", for instance, give no hits at all in COCA, while the "than I"/"than me" ratio is 6.5:1 in COCA (compared to the roughly 3.5:1 in the tables above).

Search engines like AltaVista, Google, Lycos, and others index and search people’s unedited language usage out there "in the wild", warts, typos and all. Therefore their recalls should be more representative of actual usage. The trouble is, they give different results, as the above little exercise demonstrates.

At any rate, this isn’t a very comprehensive survey, being based on only a few searches. All I know at this point is that one obviously needs to be very cautious about interpreting numbers extracted from search engines. Most people seem to trust results offered by Google without even blinking. Indeed, many people use nothing *but* Google. Admittedly I do, too, normally. We might want to re-think our faith in Google. I know I will.
