Document Similarity in Information Retrieval
Mausam (Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)
Standard Web Search Engine Architecture
crawl the web -> store documents, check for duplicates, extract links -> DocIds -> create an inverted index -> inverted index
user query -> search engine servers (using the inverted index) -> show results to user
[Slide adapted from Marti Hearst / UC Berkeley]
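Indexing Subsystem
Documents -> (documents) -> assign document IDs
assign document IDs -> (text) -> break into tokens
assign document IDs -> (document numbers and *field numbers) -> Index database
break into tokens -> (tokens) -> stop list*
stop list* -> (non-stoplist tokens) -> stemming*
stemming* -> (stemmed terms) -> term weighting*
term weighting* -> (terms with weights) -> Index database
*Indicates optional operation.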
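The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the slides: the stop list, the `tokenize` and `build_index` names, and the use of raw term frequency as the weight are all assumptions, and the optional stemming step is left out for brevity.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list for illustration; real stop lists are longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Return {term: {doc_id: term_frequency}} for a list of document strings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):               # assign document IDs
        tokens = tokenize(text)                             # break into tokens
        kept = [t for t in tokens if t not in STOP_WORDS]   # stop list (optional)
        for term, tf in Counter(kept).items():              # term weighting (raw tf, an assumption)
            index[term][doc_id] = tf                        # store in the index
    return dict(index)

docs = ["The cat sat in the hat", "A cat and a dog", "The dog is in the fog"]
index = build_index(docs)
print(index["cat"])  # {0: 1, 1: 1}
```

Each term maps to the documents that contain it, which is the inverted index the search side reads from.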
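Search Subsystem
query -> parse query -> (query tokens) -> stop list*
stop list* -> (non-stoplist tokens) -> stemming*
stemming* -> (stemmed terms) -> Boolean operations*
Index database -> Boolean operations*
Boolean operations* -> (retrieved document set)
ranking* -> (ranked document set)
Other label in the diagram: relevant document set
*Indicates optional operation.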
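A rough sketch of the search side, under stated assumptions: the toy `INDEX`, the stop list, and the function names are hypothetical, the Boolean operation is a plain AND over all query terms, and ranking is by summed term frequency. The slides do not specify any of these choices.

```python
import re

# Hypothetical stop list and toy index (term -> {doc_id: term frequency}).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}
INDEX = {
    "cat": {0: 2, 1: 1},
    "dog": {1: 1, 2: 3},
    "hat": {0: 1},
}

def parse_query(query):
    """Parse the query into tokens and drop stop-list tokens."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def search(query, index):
    """Return doc IDs matching all query terms, best score first."""
    terms = parse_query(query)
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Boolean operation (AND): keep documents that contain every term.
    retrieved = set(postings[0])
    for p in postings[1:]:
        retrieved &= set(p)
    # Ranking: score each retrieved document by its summed term frequency.
    scored = [(sum(p[d] for p in postings), d) for d in retrieved]
    return [d for score, d in sorted(scored, key=lambda x: (-x[0], x[1]))]

print(search("the cat", INDEX))      # [0, 1]
print(search("cat and dog", INDEX))  # [1]
```

Note that "and" is dropped by the stop list here, so the AND semantics come from the code, not from the query text.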
Terms vs tokens
- Terms are what result after tokenization and linguistic processing.
- Examples
  - knowledge -> knowledg
  - The -> the
  - Removal of stop words
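To make the token-to-term step concrete, here is a small sketch. The `toy_stem` function is a deliberately crude suffix stripper written for this example, not the Porter stemmer or any library routine, and the stop list is hypothetical.

```python
# Hypothetical stop list for illustration.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def toy_stem(token):
    """Strip one common suffix. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s", "e"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def to_terms(tokens):
    """Turn raw tokens into terms: case-fold, drop stop words, stem."""
    terms = []
    for token in tokens:
        lowered = token.lower()          # The -> the
        if lowered in STOP_WORDS:        # removal of stop words
            continue
        terms.append(toy_stem(lowered))  # knowledge -> knowledg
    return terms

print(to_terms(["The", "Knowledge", "of", "Retrieval"]))  # ['knowledg', 'retrieval']
```

The tokens are the raw strings from the text; the terms are what would actually be stored in the index.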