Search Engine Concepts and Techniques

SWE 642, Spring 2008
Nick Duan

February 27, 2008

Overview

Information Retrieval on the Internet

Types of digital information - data and data formats
Types of delivery mechanisms - information flow
How the game is played: from producer to consumer

Robots, Crawlers, Spiders, Agents

The discovery process
The robot exclusion protocol

Search Engine Concepts

Types of search engines
How indexing and search work
Introduction to Apache Lucene and Nutch
A Lucene example

Summary

Information Retrieval on the Internet

Internet - The fourth industrial revolution

The exponential growth of the web has created an unprecedented abundance of information - information overload

The ultimate quest - retrieving the right information at the right time
Computer tools are needed to help people with various IR tasks

Types of Information & Delivery Methods

Textual

Structured: relational, SGML/XML, "strong-typed" documents
Unstructured: free text, HTML, "weak-typed" documents

Binary

Images, videos, sounds, digitized sensor data
Encoded text information: PDF, Word, etc.

Info delivery mechanism over the Internet

Unrestricted access: pull (search) vs. push (channel feeds)
Restricted access: user authentication required

How the Game is Played

Producers

Authors/initial content publishers/web-site hosts

Service Providers

Web-site hosts

Search engines (e.g. Google), content aggregators (e.g. AOL), application-specific service providers (e.g. on-line merchants)

Consumers

Internet users, value-added re-publishers/content aggregators

Our focus is on discovery and search

Discovery: web crawling to retrieve web content and save it in a repository

Search: indexing the web-page repository, processing search queries, and presenting the results
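
To make the search side concrete, below is a minimal sketch of indexing and querying with Apache Lucene, the library introduced later in this lecture. It is illustrative only: the field names, sample document, and in-memory directory are assumptions, and the API calls follow recent Lucene releases rather than the version that was current in 2008.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();   // in-memory index, for the demo only
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Indexing: store one "crawled" page as a document with a url and a content field
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
                doc.add(new TextField("content",
                        "Search engine concepts and techniques", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Searching: parse a user query against the content field and print ranked hits
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("content", analyzer).parse("search");
                TopDocs hits = searcher.search(query, 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println(searcher.doc(sd.doc).get("url") + "  score=" + sd.score);
                }
            }
        }
    }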

Robots, Crawlers, Spiders, Aggregators

Different terminologies, but the same information discovery function:
Collecting HTML or other documents (textual or encoded textual) by following the URL links on each HTML page
Recursive exploration of URLs, or a fixed URL set

You may specify the depth of the URL tree
Keep track of visited URLs to prevent circular links
Multi-threaded or clustered configurations boost performance

Robot name is specified in the User-Agent header:

Googlebot, MSNBot, Infoseek Sidewinder...
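
Before fetching a page, a well-behaved robot checks the site's robot exclusion file, robots.txt, at the server root. A hypothetical example (the host and paths are illustrative; User-agent and Disallow are the standard directives):

    # http://example.com/robots.txt
    User-agent: Googlebot    # rules for this robot only
    Disallow: /private/      # keep this robot out of /private/

    User-agent: *            # rules for every other robot
    Disallow: /cgi-bin/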

The Web Discovery Process

[Diagram: web sites feed a URL queue, which drives the crawl loop below]

For each URL in the queue:
    If robot exclusion is on for this site, continue to the next URL
    Retrieve the web page
    Parse its HTML content
    Collect the URLs it links to and add unseen ones to the queue
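
A minimal Java sketch of this loop, assuming a seed URL is passed on the command line. The class name, regex-based link extraction, and fixed page limit are illustrative simplifications, and the robot-exclusion check from the pseudocode is marked but omitted:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleCrawler {
        // Pulls absolute http(s) links out of href attributes;
        // a real crawler would use an HTML parser instead of a regex
        private static final Pattern HREF =
                Pattern.compile("href=[\"'](https?://[^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            Deque<String> queue = new ArrayDeque<>();  // the URL queue from the diagram
            Set<String> seen = new HashSet<>();        // tracks URLs to prevent circular links
            for (String seed : args) { queue.add(seed); seen.add(seed); }

            int pagesLeft = 20;                        // page budget instead of a depth limit
            while (!queue.isEmpty() && pagesLeft-- > 0) {
                String url = queue.poll();
                // A complete crawler would check robots.txt here and skip excluded URLs
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream()))) {
                    StringBuilder page = new StringBuilder();
                    for (String line; (line = in.readLine()) != null; ) {
                        page.append(line).append('\n');
                    }
                    Matcher m = HREF.matcher(page);    // parse the HTML content for links
                    while (m.find()) {
                        String link = m.group(1);
                        if (seen.add(link)) queue.add(link);  // enqueue only unseen URLs
                    }
                    System.out.println("Fetched " + url + " (" + page.length() + " chars)");
                } catch (IOException e) {
                    System.err.println("Skipping " + url + ": " + e.getMessage());
                }
            }
        }
    }

Run it as "java SimpleCrawler http://example.com/". A production crawler such as Nutch replaces the regex with a real HTML parser, enforces politeness delays, and distributes the queue across threads or machines, matching the multi-threaded or clustered configurations mentioned earlier.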