Document Similarity in Information Retrieval

Mausam (Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)

Standard Web Search Engine Architecture

[Diagram: crawl the web -> store documents, check for duplicates, extract links (DocIds) -> create an inverted index -> inverted index. Search engine servers receive the user query, use the inverted index, and show results to the user.]

[Slide adapted from Marti Hearst / UC Berkeley]

Indexing Subsystem

[Diagram: documents -> assign document IDs -> text -> break into tokens -> tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> term weighting* -> terms with weights -> index database. Document numbers and *field numbers are also stored in the index database.]

*Indicates optional operation.
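The indexing steps in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the slides' implementation: the names `STOP_WORDS`, `tokenize`, and `index_documents` are hypothetical, the stop list is made up, stemming is an optional pluggable function (off by default, since the slide marks it optional), and the term weight is assumed to be the raw term frequency.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}  # hypothetical stop list


def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_documents(docs, stop_words=STOP_WORDS, stem=None):
    """Build an inverted index: term -> {doc_id: weight}.

    Stop-list filtering and stemming are optional, as in the diagram.
    The weight is the raw term frequency (an assumption of this sketch).
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):            # assign document IDs
        tokens = tokenize(text)                     # break into tokens
        if stop_words:                              # stop list (optional)
            tokens = [t for t in tokens if t not in stop_words]
        if stem is not None:                        # stemming (optional)
            tokens = [stem(t) for t in tokens]
        for term, freq in Counter(tokens).items():  # term weighting
            index[term][doc_id] = freq
    return dict(index)


docs = ["The cat sat in the hat", "A cat and a dog"]
index = index_documents(docs)
print(index["cat"])
```

Storing postings as a dictionary keyed by document ID keeps the sketch short; a real index database would typically use sorted, compressed postings lists instead.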

Search Subsystem

[Diagram: query -> parse query -> query tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> Boolean operations* -> retrieved document set -> ranking* -> ranked document set. Other labels in the figure: index database, relevant document set.]

*Indicates optional operation.
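The search side can be sketched the same way. The example below is a hedged illustration only: `INDEX` is a small hand-written index, `STOP_WORDS` and `search` are hypothetical names, the Boolean operation is assumed to be AND over all query terms, and ranking is assumed to be the sum of the stored term weights.

```python
# Hypothetical pre-built index: term -> {doc_id: weight}
INDEX = {
    "cat": {0: 1, 1: 2},
    "dog": {1: 1, 2: 3},
    "hat": {0: 1},
}
STOP_WORDS = {"the", "a", "and"}  # hypothetical stop list


def search(query, index=INDEX, stop_words=STOP_WORDS):
    """Return doc IDs matching every query term, best score first."""
    # parse query -> query tokens -> non-stoplist tokens
    terms = [t for t in query.lower().split() if t not in stop_words]
    if not terms:
        return []
    # Boolean AND: keep documents that contain every query term
    postings = [set(index.get(t, {})) for t in terms]
    retrieved = set.intersection(*postings)
    # ranking: sum of term weights, highest first; ties broken by doc ID
    scores = {d: sum(index[t][d] for t in terms) for d in retrieved}
    return sorted(scores, key=lambda d: (-scores[d], d))


results = search("the cat")
both = search("cat and dog")
print(results, both)
```

Note that "and" is treated here as a stop word rather than as a Boolean operator; a query parser that supports explicit operators would need to handle it before stop-list filtering.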

Terms vs tokens

• Terms are what result after tokenization and linguistic processing.

• Examples
  • knowledge -> knowledg
  • The -> the
  • Removal of stop words
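To make the token/term distinction concrete, here is a small sketch that turns tokens into terms using the three examples listed. Everything named here is an assumption for illustration: `STOP_WORDS` is a made-up stop list, and `toy_stem` is a deliberately crude rule (drop one trailing "e") chosen only to reproduce knowledge -> knowledg; it is not the Porter stemmer or any stemmer named in the slides.

```python
STOP_WORDS = {"the", "is", "of"}  # hypothetical stop list


def toy_stem(token):
    """Toy stemmer: strip one trailing 'e' from tokens longer than 3 letters."""
    if len(token) > 3 and token.endswith("e"):
        return token[:-1]
    return token


def tokens_to_terms(tokens):
    """Apply case folding, stop-word removal, then stemming."""
    lowered = [t.lower() for t in tokens]                # The -> the
    kept = [t for t in lowered if t not in STOP_WORDS]   # remove stop words
    return [toy_stem(t) for t in kept]                   # knowledge -> knowledg


tokens = "The knowledge of retrieval".split()
terms = tokens_to_terms(tokens)
print(tokens, terms)
```

The four tokens become two terms: the stop words disappear, and the remaining tokens are normalized so that different surface forms map to the same index entry.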
