August 2010 – Blogala Maho

Just for fun, I tried three separate search engines, AltaVista, Google and Lycos, to see how their recalls differ. I chose those three since Wikipedia indicates that they index and search the interwebs independently of each other. (The figures below seem to substantiate this.)

I’m not interested in the actual number of hits, but rather in ratios. Hence I searched for words in pairs (one word, two spellings) such as "keyboard" vs "kyeboard" (the latter being a typo), "occurring" vs "occuring" (common misspelling), "organisation" vs "organization" (variant spelling), and a handful of others.

Ideally such recalls should tell me how much more common one construction is compared to another. As long as the indexing of the interwebs is comprehensive and/or sufficiently random, each search engine should give me roughly equal ratios irrespective of the actual number of hits involved. However, the figures below indicate something different.

TYPOS & MISSPELLINGS		keyboard	kyeboard	Ratio
	AltaVista	483,000,000	23,000	21,000:1
	Lycos	26,162,417	1,158	22,593:1
	Google	91,000,000	30,300	3,003:1	⬅ more
		occurring	occuring	Ratio
	AltaVista	149,000,000	12,400,000	12:1
	Lycos	8,235,538	675,116	12:1
	Google	45,500,000	14,400,000	3:1	⬅ more
		episode	epsiode	Ratio
	AltaVista	844,000,000	1,360,000	621:1
	Lycos	45,426,313	53,927	842:1
	Google	397,000,000	306,000	1,297:1	⬅ less
VARIANT SPELLINGS & FORMS		organization	organisation	Ratio
	AltaVista	1,620,000,000	811,000,000	2.0:1
	Lycos	467,042,950	45,292,899	10.3:1	⬅ less
	Google	248,000,000	142,000,000	1.8:1
		isn’t	ain’t	Ratio
	AltaVista	368,000,000	82,900,000	4.5:1
	Lycos	80,699,040	13,814,999	5.8:1	⬅ less
	Google	223,000,000	52,500,000	4.2:1
		"he isn’t"	"he ain’t"	Ratio
	AltaVista	298,000	97,000	3.1:1	⬅ more
	Lycos	1,912,390	306,715	6.2:1	⬅ less
	Google	5,090,000	11,900,000	1:2.3	⬅ !!!
		"than I"	"than me"	Ratio
	AltaVista	216,000,000	69,000,000	3.1:1
	Lycos	11,704,123	3,402,830	3.4:1
	Google	54,700,000	15,300,000	3.6:1
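The ratios in the table are simply the first hit count divided by the second. As a minimal sketch of that arithmetic in Python: the function name `format_ratio` and the rounding rule (one decimal below 10, whole numbers above) are my own assumptions for illustration, so a figure may differ from the table in its last digit.

```python
def format_ratio(hits_a, hits_b):
    """Format hits_a : hits_b as 'N:1', or as '1:N' when hits_b is larger.

    The rounding thresholds are an assumption chosen for readability.
    """
    r = hits_a / hits_b
    if r >= 1:
        # Large ratios as whole numbers, small ones with one decimal.
        return f"{r:,.0f}:1" if r >= 10 else f"{r:.1f}:1"
    # The second form is the more common one, so flip the ratio.
    return f"1:{hits_b / hits_a:.1f}"


# Hit counts copied from the table above.
print(format_ratio(26_162_417, 1_158))        # Lycos, keyboard vs kyeboard
print(format_ratio(216_000_000, 69_000_000))  # AltaVista, "than I" vs "than me"
print(format_ratio(5_090_000, 11_900_000))    # Google, "he isn't" vs "he ain't"
```

Flipping the ratio when the second form wins keeps the Google "he ain’t" case readable instead of printing something like 0.4:1.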

Taken at face value, it would appear that Google disagrees with the other two search engines when it comes to typos and misspellings, although the disagreement does not appear to be consistently in any one direction. When it comes to variant spellings and forms, there seem to be no general tendencies. In one case ("than I" vs "than me"), they all agree; in another ("he isn’t" vs "he ain’t"), they all disagree. In the other two cases, AltaVista and Google agree, while Lycos does not.
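One rough way to put a number on "agree" and "disagree" is to divide the largest ratio for a word pair by the smallest one across the three engines. This is only a sketch: the `spread` function and the idea of using max/min as a disagreement measure are my own assumptions, not an established statistic. A value near 1 means the engines agree; the further above 1, the more they differ.

```python
# Hit counts (first form, second form) copied from the table above.
hits = {
    "keyboard/kyeboard": {
        "AltaVista": (483_000_000, 23_000),
        "Lycos": (26_162_417, 1_158),
        "Google": (91_000_000, 30_300),
    },
    "he isn't/he ain't": {
        "AltaVista": (298_000, 97_000),
        "Lycos": (1_912_390, 306_715),
        "Google": (5_090_000, 11_900_000),
    },
    "than I/than me": {
        "AltaVista": (216_000_000, 69_000_000),
        "Lycos": (11_704_123, 3_402_830),
        "Google": (54_700_000, 15_300_000),
    },
}


def spread(pair_hits):
    """Largest engine ratio divided by the smallest (1.0 = full agreement)."""
    ratios = [a / b for a, b in pair_hits.values()]
    return max(ratios) / min(ratios)


for pair, pair_hits in hits.items():
    print(f"{pair}: spread {spread(pair_hits):.1f}")
```

On these figures "than I" vs "than me" comes out close to 1, while "keyboard" vs "kyeboard" and "he isn’t" vs "he ain’t" come out far above it, which matches the impression from reading the table.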

To be honest, I’m not entirely sure what this means. The fact that the differences are there ought to raise some alarm bells before we trust any figures provided by any of the search engines. Perhaps there’s a simple technical reason for all this. Idiosyncratic rounding? Invisible spell-checkers? Biased indexing of the web? Biased recall procedures? Unfortunately, I’m too ignorant about how exactly search engines work to say. One thing is clear, however. Different search engines do it differently, which leads to the obvious question: should we trust any of them?

A corpus like COCA (the Corpus of Contemporary American English) is more tailor-made for linguistics, and is therefore also more suited for linguistic queries. On the other hand, COCA doesn’t give us all aspects of actual language usage. The written-language part of the corpus contains texts drawn from published sources, and is thus composed of edited texts in which typos and non-standard usages have been weeded out. Typos like "epsiode" and "kyeboard", for instance, give no hits at all in COCA, while the "than I"/"than me" ratio is 6.5:1 in COCA (compared to the roughly 3.1:1 to 3.6:1 in the tables above).

Search engines like AltaVista, Google, Lycos, and others index and search people’s unedited language usage out there "in the wild", warts, typos and all. Therefore their recalls should be more representative of actual usage. The trouble is, they give different results, as the above little exercise demonstrates.

At any rate, this isn’t a very comprehensive survey, being based on only a few searches. All I know at this point is that one obviously needs to be very cautious about interpreting numbers extracted from search engines. Most people seem to trust results offered by Google without even blinking. Indeed, many people use nothing *but* Google. Admittedly I do, too, normally. We might want to re-think our faith in Google. I know I will.

Search engines, recalls and ratios