Document Similarity in Information Retrieval

[Pages:60]Document Similarity in Information Retrieval

Mausam (Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)

Standard Web Search Engine Architecture

crawl the web

store documents, check for duplicates,

extract links

DocIds

user query

create an inverted

index

show results To user

Search engine servers

inverted index

Slide adapted from Marti Hearst / UC Berkeley]

Indexing Subsystem

Documents

documents

assign document IDs

text

break into tokens

document numbers

tokens

stop list*

and *field

non-stoplist

stemming*

numbers

tokens

*Indicates

optional

stemmed term weighting*

operation.

terms

terms with weights

Index database

Search Subsystem

query parse query query tokens

ranked document set

stop list*

ranking*

non-stoplist tokens

stemming*

stemmed

*Indicates optional operation.

Boolean retrieved operations* document set

relevant

terms

Index database

document set

Terms vs tokens

? Terms are what results after tokenization and linguistic processing.

? Examples

? knowledge -> knowledg ? The -> the ? Removal of stop words

Matching/Ranking of Textual Documents

Major Categories of Methods 1. Exact matching (Boolean) 2. Ranking by similarity to query (vector space model) 3. Ranking of matches by importance of documents

(PageRank) 4. Combination methods

What happens in major search engines (Googlerank)

Vector representation of documents and queries

Why do this? ? Represents a large space for documents ? Compare

? Documents ? Documents with queries

? Retrieve and rank documents with regards to a specific query

- Enables methods of similarity

All search engines do this.

Boolean queries

? Document is relevant to a query of the query itself is in the document.

? Query blue and red brings back all documents with blue and red in them

? Document is either relevant or not relevant to the query.

? What about relevance ranking ? partial relevance. Vector model deals with this.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download