Document Similarity in Information Retrieval
[Pages:60]Document Similarity in Information Retrieval
Mausam (Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)
Standard Web Search Engine Architecture
crawl the web
store documents, check for duplicates,
extract links
DocIds
user query
create an inverted
index
show results To user
Search engine servers
inverted index
Slide adapted from Marti Hearst / UC Berkeley]
Indexing Subsystem
Documents
documents
assign document IDs
text
break into tokens
document numbers
tokens
stop list*
and *field
non-stoplist
stemming*
numbers
tokens
*Indicates
optional
stemmed term weighting*
operation.
terms
terms with weights
Index database
Search Subsystem
query parse query query tokens
ranked document set
stop list*
ranking*
non-stoplist tokens
stemming*
stemmed
*Indicates optional operation.
Boolean retrieved operations* document set
relevant
terms
Index database
document set
Terms vs tokens
? Terms are what results after tokenization and linguistic processing.
? Examples
? knowledge -> knowledg ? The -> the ? Removal of stop words
Matching/Ranking of Textual Documents
Major Categories of Methods 1. Exact matching (Boolean) 2. Ranking by similarity to query (vector space model) 3. Ranking of matches by importance of documents
(PageRank) 4. Combination methods
What happens in major search engines (Googlerank)
Vector representation of documents and queries
Why do this? ? Represents a large space for documents ? Compare
? Documents ? Documents with queries
? Retrieve and rank documents with regards to a specific query
- Enables methods of similarity
All search engines do this.
Boolean queries
? Document is relevant to a query of the query itself is in the document.
? Query blue and red brings back all documents with blue and red in them
? Document is either relevant or not relevant to the query.
? What about relevance ranking ? partial relevance. Vector model deals with this.
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- online edition c 2009 cambridge up stanford nlp group
- document similarity in information retrieval
- information retrieval
- information retrieval stanford university
- information retrieval search process techniques and
- introduction to information retrieval mathunipd
- basic concepts of information retrieval systems
- radudaniel algorithms for information retrieval introduction
Related searches
- latest invention in information technology
- current trends in information technology
- emerging trends in information technology
- top issues in information technology
- new trends in information technology
- information retrieval technique
- unsaved word document not in recovery
- using document templates in sharepoint
- document formatting in word 2013
- word document fill in blanks
- document control in sharepoint
- information retrieval system pdf