Google OR Ask OR Gigablast OR Dogpile: A Comparison of Web-search Engines

Robert F. Musco

Southern Connecticut State University

Introduction

Web search engines are currently the primary method used to find information available on the web, but few people are aware of how they work, or which are better suited to different needs. The purpose of this paper is to compare the search tools of four different search engines, and to conduct a sample search, analyzing each site's results in terms of quantity, overlap, and relevance. The first three search engines discussed, Google, Ask, and Gigablast, were chosen because they are popular tools that each use their own proprietary software. The fourth, Dogpile, was selected because it is a popular example of a metasearch engine, which compiles information from four other search engines: Google, Ask, MSN Live Search, and Yahoo.

As the first step in understanding how these search engines operate, the documentation found at each web site was evaluated. The following section gives an overview of the search tools offered by each engine, and these findings are schematized in the comparative table in Appendix 2. It should be noted that Google's and Ask's documentation was more complete than that of Gigablast, while very little documentation was found on Dogpile. As a result, this overview is based on the information each company provided about its search functions. If a function was not mentioned, however, one cannot assume it is not available, since spot tests carried out throughout the investigation showed that some important functions not listed were indeed operational. The instances in which this occurred are mentioned in the paper. The actual operation of each engine in test searches will be discussed in the third section.

Search Engine Overview: Comparison of Search Tools

Although Google is often viewed as the leader of "user-friendly" search engines, finding and compiling basic information about how its search function works required visiting several sections of the website. Beginning in the Advanced Search, one sees that queries can be limited by "all these words, one or more of these words, this exact word or phrase, (none) of these unwanted words", language, format, site name, exact phrases, date created, usage rights, location of key words, region, and numeric range (Google, 2009a). Two additional limiters give results that are "similar to the page" specified, and "link(ed) to the page" specified.

More specific information is found in the Basic Search Help, such as advice about search strategies, and basic tips on how the engine operates (Google, 2009c). For example, Google ignores most punctuation, is case-insensitive, and generally counts all words, though it may ignore a word if it considers it irrelevant. The "More Search Help" area is the most complete guide to search terms (Google, 2009d). Boolean operators are permitted in the main search box, and a full list of additional operators, such as wildcard and synonym symbols, is given. Google searches can also be limited to a series of subject headings, or performed over the entire web.

Google's main search box uses the AND operator by default when a space is left between two terms, though a quick test shows that if AND is typed explicitly, the number of hits may change, particularly if the search terms are common words, such as [cat AND dog]. These results indicate that unknown criteria come into play when using Boolean operators. Perhaps this is an example of Google's intelligent search overriding some operator commands in an attempt to get at the user's "obvious" intent, as mentioned in the documentation.

Google's "Technology Overview" page, reached through its "Corporate Information" section, gives a clear, though simple, explanation of the theory behind its search technology (Google, 2009b). Since the explanation does not enter into technical detail, it would be inadequate for a searcher with a great deal of technical expertise, but its discussion of strategies for optimizing searches and using special operators is sufficient for the general user. Google explains that its robot, which crawls the web on a regular basis, is fully automated, meaning that the company cannot adjust page rankings. In fact, Google states that it does not accept payment for inclusion or placement in its ranking.

The overview also helps to explain why search results may not exactly match search terms. Google uses its "PageRank™ algorithm", a proprietary ranking methodology, to weigh a number of factors, including the appearance and position of text on the page, in determining the relative relevance of web pages. The algorithm calculates a page's importance relative to similar pages by counting the number of linked pages that point to it. What is innovative about this method is that the "pointer" pages which link to the retrieved page are themselves weighed for their relevance by the number of pages that point to them in turn. Pages being evaluated are penalized if they contain links to "link farms", which are sites created purely for the purpose of raising other pages' relevance. Sometimes search results may include pages that do not actually contain the search terms, but are reached through links in other pages described by text that contains the terms.
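The recursive link-counting idea described above can be illustrated with a short sketch. This is the simplified textbook form of the calculation, not Google's proprietary implementation: the damping factor of 0.85, the number of iterations, the function name, and the five-page toy web are all assumptions made for illustration, and the sketch ignores the text-based factors and link-farm penalties mentioned in the documentation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a score per page; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page receives a small baseline score (assumed damping model).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            # A page with no outgoing links shares its score with all pages.
            receivers = targets if targets else pages
            share = damping * rank[page] / len(receivers)
            for target in receivers:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical five-page web: A, B, and D point to C; C and E point to D.
toy_web = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["C"], "E": ["D"]}
scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web, C receives the highest score because three pages point to it, and D follows closely even though only two pages point to it, because one of those pointer pages is the highly ranked C. Pages A, B, and E, which no page links to, keep only the baseline score.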

Google is undeniably one of the leaders in enhanced web search features that operate with shortcuts directly from the search box. These "bells and whistles" are small applications that return real-time information when limited search terms are entered, such as a stock quote in response to a stock symbol, or FedEx tracking information in response to a tracking number (Google, 2009f).

Ask was founded in 1996. Like Google, it allows searching from various subject categories, and has features similar to Google's in its Advanced Search page, which allows a query to be defined by "all the words, at least one of the words, the exact phrase, none of the words", language, specific domain, exact phrases, date modified, location of key words, and region (Ask, 2009a).

Searches are case-insensitive, word order matters, and spelling is automatically corrected (Ask, 2009c). Though some of the operators listed in Google are not specifically listed in Ask's Advanced Search Tips, Ask has developed even more Boolean-like operators than Google, which make it possible to limit searches not only to specific URLs, but also to specific date ranges, or to pages with the search terms in the titles or in hyperlinked text (Ask, 2009b). Oddly, it is never mentioned that AND is the default operator in Ask, though spot-testing shows that it is. Like Google, Ask produces variable results when a blank space between terms is compared to use of the AND operator, especially when the two terms are very common words.

The Site Features section in Ask also lists a number of enhanced shortcut features, though their operation can be confusing, since some are reached through menu categories, while others are activated with a keyword in the search terms (Ask, 2009e).

Ask also uses a proprietary algorithm, here called ExpertRank™, that relies on a "clustering concept of subject-specific popularity", which ranks hits based on the number of pages that link to a site, weighing which of those pointer pages are more authoritative (Ask, 2009d). The method is not explained in more detail, though the description sounds very similar to Google's PageRank™ technology.

Gigablast offers a stripped-down search engine whose appearance is less commercial than Google's. Gigablast does not include any of the automatic shortcut functions for weather, stock market, etc., found on Google or Ask. It does offer subject directories, though they were not functional at the end of February and the beginning of March, while this paper was being researched.

Gigablast's Advanced Search can handle searches restricting queries to "all these words, any of these words, this exact phrase, none of these words" (Gigablast, 2009a). Searches can also be limited to a specific URL, a specific site, or pages linked to a specific URL, and one can choose to enable site clustering.

The "Query Syntax" section of Gigablast explains the Boolean operators permitted, which are mostly comparable to those in Google and Ask (Gigablast, 2009b). The AND operator is the default, but it is applied in a very specific way. For example, with two terms, preference is given to pages in which both terms occur next to each other. If one wants to avoid giving preference to both terms together, the operator [term .. term] can be used.

An OR operator actually gives preference to hits with both terms, which appears to be a strategy to increase relevance. Parentheses are said to be optional, and indeed, a test shows that both AND and a blank, when used without parentheses, will nest the two AND terms before applying another operator, such as OR, afterwards. For example, a search with [soup AND shoes OR train] yielded results whose top ten were devoted to national train companies, with no instances of "soup" or "shoes", since the algorithm apparently gave greater weight to the OR term it considered most important. When parentheses are used, however, they define the order in which the operators are applied, instead of the default left-to-right "AND-first" logic that appears to be used here. Thus, a sample of the hits from [soup AND (shoes OR train)] included pages with "soup" and "shoes", or "soup" and "train", and even some hits with only the word "soup".
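The difference between the two groupings can be made concrete with a small sketch that applies strict Boolean logic to sets of matching pages. The five-page collection and the helper name matching are hypothetical, and the sketch models only which pages match, not how Gigablast ranks them.

```python
# Hypothetical collection: each page is reduced to the set of words it contains.
docs = {
    "d1": {"soup", "shoes"},
    "d2": {"soup", "train"},
    "d3": {"train"},
    "d4": {"soup"},
    "d5": {"shoes"},
}


def matching(term):
    """Return the ids of the pages that contain the given term."""
    return {doc_id for doc_id, words in docs.items() if term in words}


# [soup AND shoes OR train] read with "AND-first" logic: (soup AND shoes) OR train
and_first = (matching("soup") & matching("shoes")) | matching("train")

# [soup AND (shoes OR train)]: the parentheses force OR to be applied first
parenthesized = matching("soup") & (matching("shoes") | matching("train"))

print(sorted(and_first))
print(sorted(parenthesized))
```

Under the AND-first reading, any page containing "train" qualifies on its own, which is consistent with the train-dominated results observed. Under strict Boolean logic the parenthesized query would exclude a page such as d4, which contains only "soup", so the soup-only hits seen in the test suggest that the engine treats the operators as ranking preferences rather than as strict filters.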

Gigablast does not provide operators for limiting documents by date as Ask does, but one can restrict hits by format, such as .doc or .xml. There is one unusual operator worthy of mention, which searches first by a primary term, then ranks all hits by a second term. Gigablast offers no explanation of how its algorithm functions.
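Since Gigablast does not document how this two-stage operator is implemented, the following is only a sketch of the general idea, in which hits are selected by one term and ordered by another. The function name, the sample pages, and the use of a simple word count as the ranking signal are all assumptions.

```python
def search_then_rank(pages, primary, secondary):
    """Keep pages containing the primary term, ordered by secondary-term count."""
    hits = [page for page in pages if primary in page.lower().split()]
    # sorted() is stable, so pages with equal counts keep their original order.
    return sorted(
        hits,
        key=lambda page: page.lower().split().count(secondary),
        reverse=True,
    )


# Hypothetical page texts used only for illustration.
sample_pages = [
    "soup recipes for winter",
    "train schedules",
    "soup served on the train train",
    "soup on a train",
]
ranked = search_then_rank(sample_pages, "soup", "train")
for page in ranked:
    print(page)
```

The page about train schedules is excluded because it lacks the primary term, while the remaining pages are ordered by how often the second term appears.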

Dogpile is a metasearch engine that links results from Google, Yahoo, MSN Live Search, and Ask. Dogpile includes sponsored links mixed among the results, though each is labeled as such. Dogpile's Advanced Search shows search boxes which handle the operators "all these words, any of these words, the exact phrase, none of these words", and which can limit a search to a specific domain and a specific language (Dogpile, 2009a). The main search page provides tabbed categories for limiting searches to specific formats (images or music), or subjects (news, yellow pages, and white pages).

The "Metasearch 101" section explains the rationale for a metasearch engine (Dogpile, 2009c). A link can be found to a self-study, carried out in collaboration with the University of Pittsburgh and the Pennsylvania State University in 2007, that found that less than one percent of first-page results on a given search query overlapped among the four major search engines (Dogpile, 2008). The implication is that with so little overlap, any claims of highly relevant results by competitors are suspect. The "InfoSpace" section mentions Dogpile's InfoSpace proprietary technology as the software behind the search, but does not explain how the search engine works, or how it ranks results (Dogpile, 2009b). A quick test shows that the AND function is at least partly a default, but in contrast to the other engines discussed here, Dogpile does not appear to allow Boolean operators, and only provides refined searches through its advanced search page. Indeed, using AND as an explicit operator with two terms gives inconsistent results, since the AND appears bolded in the results, suggesting that it is interpreted not as an operator, but at least sometimes as an actual search term. Using OR in a search with two terms actually reduces hits, suggesting that OR is not a Boolean operator. When a search was performed from the advanced search box, however, it was possible to see that an exact phrase then appears in the general search box with apostrophes placed around it, and a "none of these words" term appears with a minus sign before it. Subsequent testing shows that apostrophes and minus signs are indeed active Boolean operators.
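The overlap figure cited from the Dogpile study, and the general idea of combining several engines' results, can be sketched as follows. Dogpile does not explain how it merges or ranks results, so the round-robin merge shown here is purely a hypothetical stand-in, and the engine names, URLs, and function names are invented for illustration.

```python
from itertools import zip_longest


def overlap_share(results_by_engine):
    """Fraction of all distinct URLs that appear in every engine's list."""
    url_sets = [set(urls) for urls in results_by_engine.values()]
    if not url_sets:
        return 0.0
    all_urls = set().union(*url_sets)
    shared = set.intersection(*url_sets)
    return len(shared) / len(all_urls) if all_urls else 0.0


def merge_round_robin(results_by_engine):
    """Interleave the engines' lists rank by rank, dropping duplicate URLs."""
    merged, seen = [], set()
    for tier in zip_longest(*results_by_engine.values()):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical first-page results from four unnamed engines.
first_page = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4", "u5"],
    "engine_c": ["u2", "u3", "u6"],
    "engine_d": ["u2", "u7", "u8"],
}
share = overlap_share(first_page)
merged = merge_round_robin(first_page)
print(f"shared by all engines: {share:.1%}")
print(merged)
```

A measure of this kind depends heavily on how overlap is defined. Counting only URLs returned by all four engines, as here, gives a much lower figure than counting URLs shared by any two engines, so the definition should be checked before drawing conclusions from a reported percentage.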

Dogpile is not consistently case-insensitive, in contrast to the other three engines, and does not always ignore "stop words". The most notable difference seen when using Dogpile is that the total number of hits in a search is not indicated on the page, and that the number of hits returned can frequently be several orders of magnitude lower than the number returned by products like Google.

Comparison of Search Results among Search Engines

To analyze how the different search engines handle queries, a fairly limited topic containing multiple terms was chosen. The goal of the query was to find out whether research shows a correlation between playing violent video games and academic achievement among high school students. The search was begun with general terms, and refined in four steps. Each separate search is indicated by a title giving the search query, which is set off by brackets to show exactly what was entered in the search boxes. These brackets are not to be confused with nesting parentheses.

Search Query: [student AND achievement]

Searching was begun with general terms, to give an idea of how each engine handled the AND operator [student AND achievement]. Google produced more than 38,000,000 hits, covering broad areas such as technology and student achievement, analysis of student achievement, and public policy related to student achievement. Of the top 20 hits, almost all were relevant in that they discussed the general issue of student achievement. Several documents even addressed the specific issue of factors related to student achievement, such as teacher quality, poverty, class size, and library use. Most of the sampled results were from the domains .org, .gov, and .edu, though a few were from .com sites, such as journal databases. Hits included pages
