


Returning Multiple Pages as Individual Search Results

Rohan Angrish (angrish@stanford.edu)

Anisha Malhotra (anisha@cs.stanford.edu)

Section 1:

Motivation

When users submit queries to most search engines, each result they get back is a single page that contains all of the tokens in the query string. In this project we explore a different strategy that relaxes this restriction and can surface potentially interesting results. Our main idea is to look at multiple pages while trying to match the tokens of a search string: instead of limiting results to single pages that contain every token, we look at groups of “logically” linked documents and return such a group as an individual result if all of the tokens in the search string are present somewhere in the group. This idea of answering queries with groups of documents, in the hope of finding useful results that other search engines would miss, is the main motivation behind our effort.

Section 2:

Related Work:

The idea of using text other than that present on a single page while searching is not new. For instance, the method described in [1] uses a similar idea: it discusses the use of what is known as “anchor text” when ranking a page. Anchor text is the text of the links that point to a page. To quote the paper, “…anchors are very useful. First, anchors are often better descriptions of Web pages than the pages themselves. They frequently state precisely what is significant about the Web page. Second, they are often written by people other than the author of the Web page, so they are more resistant to malicious tampering to move a page to the top of the search results for commercial gain... Finally, anchors allow a search engine to return a Web page even without crawling it.”

In this sense, that work is related to our idea: it uses data from linking pages, and not just the text on a single page, when searching for results to a query. We have yet to find work that explicitly treats groups of pages as a single logical unit when answering a query. We feel that this concept of answering a query through a set of linked pages is a potentially interesting one.

Section 3:

Work Space:

The repository of pages we use to test the algorithm is the one constructed by WebBase [2]. The language of implementation is Java. We used a number of routines provided to our class by the course instructors, including WebBase routines and the Lucene routines used to create indexes and to search for results to a query.

Section 4:

Method:

This section outlines the method we developed to achieve the effect described in Section 1. To restate the objective: the aim is to come up with a method that searches for tokens across logically “linked” documents and returns a group of documents as a single result if all of the query tokens are present in the group. In a sense, the method still enforces the conjunctive nature of query satisfaction, but extends the usual unit of matching from a single page to a group of pages.

We consider two types of relationships between documents that could bind them into a single logical unit. According to the first binding criterion, two documents can be said to belong to the same logical unit if they share a “sub-directory” parent-child relationship; for instance, stanford.edu and stanford.edu/dept/ share such a relationship. The second binding criterion says that two documents are logically bound if there is a hyperlink from one to the other. These are not the only relationships that could be used, but they are the two we have implemented in order to test our method.

There are two essential parameters on which the performance of our algorithm depends: (a) MAX_LEVEL and (b) MAX_LINK. MAX_LEVEL determines how many levels the algorithm examines when the parent-child relationship is the binding criterion, and MAX_LINK determines how many levels it examines when the hyperlink relationship is the binding criterion.

Having put the problem in perspective, we now describe the algorithm we use to solve it, using a generic query to make the explanation concrete.

The first step is to create an index over the available repository of pages, for which we use Lucene. Lucene also provides a routine that takes a search string and returns the pages that contain all of the tokens in that string; in particular, it can be used to return all the pages that contain a particular token. In addition, Lucene supports concepts like stemming and stop-words.
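To make this concrete, the following sketch shows the kind of single-token look-up we rely on. It is for illustration only: it assumes the classic (pre-3.0) Lucene API, and the field names "contents" and "url" are placeholders chosen here, not the names used in the actual project code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class SingleTermSearch {

    // Return the URLs of all indexed pages that contain the given token.
    public static List<String> urlsForToken(IndexSearcher searcher, String token)
            throws IOException {
        Hits hits = searcher.search(new TermQuery(new Term("contents", token)));
        List<String> urls = new ArrayList<String>();
        for (int i = 0; i < hits.length(); i++) {
            urls.add(hits.doc(i).get("url"));   // "url" is assumed to be a stored field
        }
        return urls;
    }
}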

Now consider a query posed by a user: “A B C D E”. The letters A, B, C, D and E are merely placeholders for the tokens of a search string. We describe in detail the working of the part of the algorithm that uses the parent-child relationship as the binding criterion; the part that uses hyperlinks as the binding criterion works in a very similar manner.

Having read the above search string from the user, we query Lucene with ‘A’ as the query string. Let RA be the set of URLs returned for this query; similarly we obtain RB, RC, RD and RE. We store the results of each such query in a hash-table whose entries contain only the URL of a page, hashed on the URL itself. This makes it very cheap to check whether a given URL contains the token ‘A’. We call this hash-table HA, and in the same manner we create HB, HC, HD and HE.
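Concretely, the per-token tables can be built as one hash set of URLs per query token, so that “does this URL contain token t?” becomes a constant-time membership test. The sketch below is again for illustration only and reuses the hypothetical urlsForToken routine from the previous sketch.

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.search.IndexSearcher;

public class TokenTables {

    // Build one hash-table of URLs per query token,
    // corresponding to HA, HB, HC, HD and HE in the text.
    public static Map<String, Set<String>> build(IndexSearcher searcher, String[] tokens)
            throws IOException {
        Map<String, Set<String>> tables = new HashMap<String, Set<String>>();
        for (String token : tokens) {
            tables.put(token,
                new HashSet<String>(SingleTermSearch.urlsForToken(searcher, token)));
        }
        return tables;
    }
}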

Besides these tables, we create another hash-table called allDocs, again keyed on the URL of a page. It has one entry for each element of the union of RA, RB, RC, RD and RE. Each entry contains the following information: url, the page to which the entry corresponds; valid, a binary flag that records whether the entry has been processed before; and levelNodes, a vector of elements, each of which holds an integer level and a vector of keywords.
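The structure of one allDocs entry could be expressed roughly as follows; the class and field names simply mirror the description above and are chosen for illustration, not taken from the project source.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LevelNode {
    int level;                                     // 0 = the page itself, 1 = one ancestor away, ...
    Set<String> keywords = new HashSet<String>();  // query tokens found within 'level' ancestors

    LevelNode(int level) { this.level = level; }
}

class DocEntry {
    String url;                      // the page this entry corresponds to
    boolean valid = false;           // has this entry already been processed?
    List<LevelNode> levelNodes = new ArrayList<LevelNode>();  // up to MAX_LEVEL nodes

    DocEntry(String url) { this.url = url; }
}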

For instance, consider the case where MAX_LEVEL is 2; in other words, we look at two levels when applying the parent-child binding criterion. Say a URL a1/ belongs to RA and its parent belongs to RB and RD. After the entry corresponding to a1/ has been evaluated, it looks like the one shown in Fig. 1 below. The key observation is that a levelNode with level value k greater than 0 can be computed from the levelNode with level value k-1 of the parent's entry in allDocs (if such an entry exists). This is the central idea behind the algorithm, and the reason for its linear running time.

Now suppose we are looking at the entry for a URL …/a1/a2/a3/ for the first time; we know this because the entry's valid value is 0. We need to construct the MAX_LEVEL levelNodes for this entry. To construct the levelNode with level value 0, we look up the URL in each of the hash-tables HA, HB, HC, HD and HE; for each table that contains the URL, we add the corresponding token to the keywords vector of the levelNode.

Fig. 1 (image not reproduced here): An entry in the hash-table allDocs

The keywords entry of the levelNode with level value 1 is the union of the keywords entry of the levelNode with level value 0 and the keywords entry of the levelNode with level value 0 in the allDocs entry for the parent …/a1/a2/. If there is no entry for the parent in allDocs, the latter set is empty, and we compute the remaining levelNodes by moving further up the chain to …/a1/, and so on. If the parent's entry does exist, we look at its valid value: if it is 1, we simply read off the required levelNodes to construct the levelNodes for …/a1/a2/a3/; if it is 0, we first process the parent's entry in the same way and then read off the required values once that entry has been completely processed. Fig. 1 is a useful aid in understanding this step as well. If at some level n the keywords entry is found to contain all the tokens in the query, the URL …/a1/a2/a3/ and its closest n ancestors are returned as a single result. Moreover, once all the tokens have been found, we need not construct any further levelNodes for this entry.

On completing the construction of the levelNodes for …/a1/a2/a3/, we set its valid value to 1. The valid flag ensures that we never process an entry more than once.
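Putting the pieces together, the parent-child pass can be sketched as follows, using the DocEntry and LevelNode classes from the earlier sketch. This is an illustrative reconstruction of the procedure just described, not the actual project source; in particular, parentUrl() is a hypothetical helper that strips the last path component of a URL.

import java.util.Map;
import java.util.Set;

public class ParentChildPass {

    static final int MAX_LEVEL = 2;   // levels 0 .. MAX_LEVEL-1, i.e. the page and one ancestor

    // Process one allDocs entry, constructing its levelNodes
    // (recursing into the parent's entry first if it has not been processed yet).
    static void process(DocEntry e, Map<String, DocEntry> allDocs,
                        Map<String, Set<String>> tables, Set<String> queryTokens) {
        if (e.valid) return;                              // never process an entry twice

        // Level 0: tokens present on the page itself (constant-time probes of HA, HB, ...).
        LevelNode level0 = new LevelNode(0);
        for (String t : queryTokens) {
            if (tables.get(t).contains(e.url)) level0.keywords.add(t);
        }
        e.levelNodes.add(level0);

        DocEntry parent = allDocs.get(parentUrl(e.url));
        if (parent != null) process(parent, allDocs, tables, queryTokens);

        // Level k: union of this entry's level k-1 and the parent's level k-1.
        for (int k = 1; k < MAX_LEVEL; k++) {
            LevelNode node = new LevelNode(k);
            node.keywords.addAll(e.levelNodes.get(k - 1).keywords);
            if (parent != null && !parent.levelNodes.isEmpty()) {
                int idx = Math.min(k - 1, parent.levelNodes.size() - 1);
                node.keywords.addAll(parent.levelNodes.get(idx).keywords);
            }
            e.levelNodes.add(node);
            if (node.keywords.containsAll(queryTokens)) {
                // Report e.url together with its k closest ancestors as one result,
                // and stop building further levelNodes for this entry.
                break;
            }
        }
        e.valid = true;
    }

    // Hypothetical helper: ".../a1/a2/a3/" -> ".../a1/a2/"; returns null at the root.
    static String parentUrl(String url) {
        String trimmed = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        int slash = trimmed.lastIndexOf('/');
        return slash < 0 ? null : trimmed.substring(0, slash + 1);
    }
}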

Now that we know how an individual entry in allDocs is handled, it is easy to see how the method works as a whole: it walks over every entry in the allDocs hash-table and computes its levelNodes, skipping any entry whose valid value is already 1. When every entry in allDocs has been examined, all the hits have been found. The hyperlink part of the algorithm is very similar in nature, and we describe it briefly next.

We reuse the set of URLs for each query word from the previous part of the algorithm. For this part, we need to find all the forward links of a page, so while we index with Lucene we also build a URL database of our own: for each page that we index, we fetch all of its hyperlinks and store them. This database lets us retrieve all the forward links of a page without having to parse the page data again. Suppose again that the query is “A B C D E”. We start with the set allDocs and, for each URL in this set, determine which query words it contains by probing the hash-tables HA, HB, HC, HD and HE. We add all the query words that this URL contains to a set called keywordsSoFar. We then fetch all the forward links of the URL and run a depth-first search. For each URL visited, we check whether it contains any of the remaining query words (those not yet in keywordsSoFar) and, if so, add them to keywordsSoFar. We continue the depth-first search from that URL until we have either found all the query words or exceeded MAX_LINK. If all the query words have been found, we display the set of URLs traversed so far as one result. If we reach a URL that contains none of the keywords, we simply fetch its forward links and continue, again until all the query words are found or MAX_LINK is exceeded.
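The hyperlink pass can be sketched in a similar spirit. Again this is an illustrative reconstruction under names chosen here, not the project's source; linkDb stands for the forward-link database built at index time.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HyperlinkPass {

    static final int MAX_LINK = 2;

    private final Map<String, Set<String>> tables;    // token -> URLs containing it (HA, HB, ...)
    private final Map<String, List<String>> linkDb;   // URL -> forward links stored at index time
    private final Set<String> queryTokens;

    HyperlinkPass(Map<String, Set<String>> tables,
                  Map<String, List<String>> linkDb,
                  Set<String> queryTokens) {
        this.tables = tables;
        this.linkDb = linkDb;
        this.queryTokens = queryTokens;
    }

    // Depth-first search from 'url'; returns the group of URLs that together
    // cover all query tokens, or null if no such group is found within MAX_LINK levels.
    List<String> dfs(String url, int depth, Set<String> keywordsSoFar, List<String> path) {
        if (depth > MAX_LINK) return null;

        Set<String> found = new HashSet<String>(keywordsSoFar);
        for (String t : queryTokens) {
            if (tables.get(t).contains(url)) found.add(t);
        }
        List<String> newPath = new ArrayList<String>(path);
        newPath.add(url);

        if (found.containsAll(queryTokens)) return newPath;   // one multi-page hit

        List<String> links = linkDb.get(url);
        if (links == null) links = Collections.emptyList();
        for (String next : links) {
            List<String> hit = dfs(next, depth + 1, found, newPath);
            if (hit != null) return hit;
        }
        return null;
    }
}

A multi-page hit would then be reported for each starting URL in allDocs for which dfs(url, 0, new HashSet<String>(), new ArrayList<String>()) returns a non-null group.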

Let l be the average time Lucene takes to perform a single-term search. The running time of our algorithm is roughly linear in l: since the results from Lucene are stored in in-memory hash-tables on which we do constant-time look-ups, the total running time is dominated by the time taken to obtain results from Lucene. Because we issue only single-term searches and the number of terms in a query is typically small (fewer than 10), this cost is also modest. Hence the running time is linear in the average running time of Lucene for single-term searches.
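In notation introduced here for illustration (q being the number of query terms; the report itself gives no formula), the total cost is roughly

T \;\approx\; q \cdot l \;+\; O\big(|\mathrm{allDocs}| \cdot \mathrm{MAX\_LEVEL} \cdot q\big),

where the second term covers the constant-time hash-table look-ups performed per entry and per level, and the first term dominates in practice.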

Section 5:

Evaluation

In order to evaluate our algorithm, we use two indicators. The first compares our algorithm to the usual approach that requires all the tokens of the search string to be present in a single page; Google, for instance, works largely in this manner, mainly for performance reasons. For the sake of discussion, we call this the single-page algorithm (SPA) and call our algorithm the multiple-page algorithm (MPA). The second indicator is a measure of the relevance of the results returned: since we use the heuristics of parent-child relationships and hyperlinks to link pages, it is possible that pages grouped together do not actually form a relevant result for the query.

In particular, the first indicator measures the proportion of queries for which the SPA returns no results while the MPA does return something. The second indicator measures the proportion of relevant results among all the results returned by our algorithm.

As mentioned before, the document space used for this evaluation is the repository of documents created by WebBase. We asked subjects from a variety of fields to try out our algorithm and judge the relevance of the results returned. They were asked to document each query (Q) they posed and to record the total number of responses from the SPA (S), the total number of responses from the MPA (M), and the number of those MPA responses that were relevant (R).

A snapshot of the table generated by them is given in Fig. 2; a more detailed listing is given in the Appendix. Note that the algorithm is only interesting when a query contains more than one token, so each query was required to have at least two non-identical tokens.

|Query (Q)          |Total # results from SPA (S) |Total # results from MPA (M) |# of relevant results from MPA (R) |
|Football newspaper |1                            |4                            |3                                  |
|wine Wednesday     |0                            |1                            |1                                  |

Fig. 2: Snapshot of the table filled in by the users

As mentioned at the start of Section 4, the algorithm takes the parameters MAX_LEVEL and MAX_LINK as input; the SPA is equivalent to setting both of these values to 1. In order to test our algorithm, and to keep the evaluation task simple for the users, we set both values to 2 for the MPA. This was sufficient to demonstrate the usefulness of our technique.

Section 6:

Results

The total number of queries used to test our algorithm was 100, evaluated by 6 different users. We found that, on average, the MPA returned 39% more relevant results than the SPA. This was calculated as (∑R - ∑S)/∑S, which gives the relative increase in relevant results returned by the MPA over those returned by the SPA. In fact, this is a conservative estimate, since the calculation assumes that all of the results returned by the SPA are relevant; since this is not always the case, the actual value is probably even higher. A more accurate measure would be to compute (R - S)/S for each query Q and then average over all Q, but we did not use this because the value of S can be zero, which is why we used the aggregate expression above.

On the whole, 77% of the results returned by the MPA were relevant. For the reason mentioned above, this was calculated as ∑R/∑M; the per-query alternative would be to compute R/M for each Q and then average over all Q, but the two expressions should converge to similar values.
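In other words, the two figures reported above correspond to

\frac{\sum_Q R - \sum_Q S}{\sum_Q S} \approx 0.39
\qquad\text{and}\qquad
\frac{\sum_Q R}{\sum_Q M} \approx 0.77,

where the sums run over all 100 evaluation queries Q.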

These values give an indication of the usefulness of the MPA: not only does it return a substantial number of relevant results beyond those returned by the SPA, it also has high overall precision.

Section 7:

Future Work

There are many ways in which work on this project could be continued. As mentioned in Section 4, the two binding criteria used are not the only ones that could have been used. Another interesting one is the sibling relationship: documents that are children of the same document can be thought of as logically linked.

While returning results to queries, we make no attempt to rank them in any way; displaying the results in ranked order could make them more useful. Also, the relationship between results returned under different binding criteria is not very clear, and in particular it is not always clear when two results are actually distinct. This problem of identifying similar results and merging them is an interesting one in itself.

Another interesting extension concerns the evaluation. As mentioned in Section 5, due to lack of time, and in order to make the application easy for the users to evaluate, we set MAX_LEVEL and MAX_LINK to 2. It would be useful to vary these parameters and compare the proportion of relevant results in each case; for instance, a plot of the proportion of relevant results against MAX_LEVEL and MAX_LINK could help determine optimal values for these parameters. Exact figures comparing the running times of the SPA and the MPA also remain to be measured.

Bibliography:

[1] S. BRIN, R. MOTWANI, L. PAGE, T. WINOGRAD. What can you do with a Web in your Pocket? Bulletin of the Technical Committee on Data Engineering, 1998.

[2] J. HIRAI, S. RAGHAVAN, H. GARCIA-MOLINA, A. PAEPCKE. WebBase: A repository of Web pages. Computer Networks, 2000.

Appendix

Sample Results

|Query (Q) |Total # results from SPA (S) |Total # results from MPA (M) |Number of relevant results (R) |

|Quarter, football |2 |4 |4 |

|Ecommerce, computer |0 |2 |1 |

|Microsoft, compiler |2 |3 |3 |

|Suif, lam |6 |12 |10 |

|Information,retrieval |20 |26 |22 |

|Imaging, synthesis |5 |15 |9 |

|Google, job |0 |2 |0 |

|Medical, health |23 |32 |26 |

|Graduate, business, courses |14 |35 |25 |

|Alumni, meeting |1 |18 |10 |

|Parking, ticket |3 |10 |6 |

|Analysis, complexity, algorithm |2 |2 |2 |

|Solaris, sun |1 |1 |1 |

|Newspaper, football |1 |3 |2 |

|Dance, classes |2 |3 |3 |

|Library, policy |42 |73 |54 |

|Map, driving, campus |20 |24 |22 |

|Laboratory, electrical, |10 |10 |10 |

|engineering | | | |

|Campus, recruitment, schedule |3 |3 |3 |

|Statistics tutorial |5 |7 |6 |

|Campus, food |10 |15 |12 |

|Housing, campus |15 |27 |20 |

|Foreign, language, classes |3 |4 |4 |

|Publications, earthquake |2 |3 |3 |

|International student visa |0 |2 |1 |

|Course enrollment deadline |4 |4 |4 |

|Library visitors |9 |42 |30 |

|Used sale |11 |11 |11 |

|University rules |18 |21 |20 |

|Publications hector |3 |17 |15 |

|Degree admission master |12 |16 |16 |

|Football ranking |3 |9 |5 |

|Medical school office building |5 |9 |5 |

|Café vending |0 |0 |0 |

|Gene replication |2 |2 |2 |

|Email newsgroup |2 |7 |5 |

|Robotics artificial intelligence|6 |8 |6 |

|Fire department |2 |3 |2 |

|Music jazz |4 |4 |4 |

Setup and Execution

We need Lucene for indexing and searching documents, so the Lucene and WebBase classes must be on the system CLASSPATH. We used the setup script provided to set these system variables. To run the indexer:

java cs276a.project.Index config.index

This will index all the pages in the WebBase repository and store all the forward links of each page under the directory “links”. This directory needs to be created before running the indexer.

To search:

java cs276a.project.Main

MAX_LINK and MAX_LEVEL are user-specified parameters that are read as input at search time.
