Just for fun, I tried three separate search engines: AltaVista, Google, and Lycos, to see how their recalls differ. I chose those three since Wikipedia indicates that they index and search the interwebs independently of each other. (The figures below seem to substantiate this.)
I’m not interested in the actual number of hits, but rather in ratios. Hence I searched for words in pairs (one word, two spellings) such as "keyboard" vs "kyeboard" (the latter being a typo), "occurring" vs "occuring" (a common misspelling), "organisation" vs "organization" (a variant spelling), and a handful of others.
Ideally such recalls should tell me how much more common one construction is than another. As long as the indexing of the interwebs is comprehensive and/or sufficiently random, each search engine should give me roughly equal ratios irrespective of the actual number of hits involved. However, the figures below indicate something different.
| "he isn’t" | "he ain’t" | Ratio |
|---|---|---|

| "than I" | "than me" | Ratio |
|---|---|---|
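For concreteness, the pairwise comparison can be sketched as a tiny script. The engine names and hit counts below are made up purely for illustration; the point is only the logic, namely that the *ratios*, not the raw counts, should agree across engines if indexing is unbiased:

```python
def hit_ratio(hits_a: int, hits_b: int) -> float:
    """Ratio of hit counts for two rival forms, e.g. "than I" vs "than me"."""
    return hits_a / hits_b

# Hypothetical counts, not real figures from any search engine.
counts = {
    "EngineA": {'"than I"': 700_000, '"than me"': 200_000},
    "EngineB": {'"than I"': 3_500_000, '"than me"': 1_000_000},
}

for engine, pair in counts.items():
    (form1, n1), (form2, n2) = pair.items()
    # Despite very different absolute counts, both engines here
    # yield the same 3.5:1 ratio.
    print(f"{engine}: {form1} vs {form2} = {hit_ratio(n1, n2):.1f}:1")
```

In these invented figures the two engines agree perfectly; the tables above show that the real engines often do not.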
Taken at face value, it would appear that Google disagrees with the other two search engines when it comes to typos and misspellings, although the disagreement does not appear to be consistently in any one direction. When it comes to variant spellings and forms, there seem to be no general tendencies. In one case ("than I" vs "than me"), they all agree, in another ("he isn’t" vs "he ain’t"), they all disagree. In the other two cases, AltaVista and Google agree, while Lycos does not.
To be honest, I’m not entirely sure what this means. The fact that the differences are there ought to raise some alarm bells before we trust any figures provided by any of the search engines. Perhaps there’s a simple technical reason for all this. Idiosyncratic rounding? Invisible spell-checkers? Biased indexing of the web? Biased recall procedures? Unfortunately, I’m too ignorant about how exactly search engines work to say. One thing is clear, however. Different search engines do it differently, which leads to the obvious question: should we trust any of them?
A corpus like COCA (the Corpus of Contemporary American English) is tailor-made for linguistics, and is therefore also better suited for linguistic queries. On the other hand, COCA doesn’t give us all aspects of actual language usage. The written-language part of the corpus contains texts drawn from published sources, and is thus composed of edited texts in which typos and non-standard usages have been weeded out. Typos like "epsiode" and "kyeboard", for instance, return no hits at all in COCA, while the "than I"/"than me" ratio is 6.5:1 in COCA (compared to the roughly 3.5:1 in the tables above).
Search engines like AltaVista, Google, Lycos, and others index and search people’s unedited language usage out there "in the wild", warts, typos, and all. Therefore their recalls should be more representative of actual usage. The trouble is, they give different results, as the little exercise above demonstrates.
At any rate, this isn’t a very comprehensive survey, being based on only a few searches. All I know at this point is that one obviously needs to be very cautious about interpreting numbers extracted from search engines. Most people seem to trust results offered by Google without even blinking. Indeed, many people use nothing *but* Google. Admittedly I do, too, normally. We might want to re-think our faith in Google. I know I will.