


Introduction

Domain-specific search engines are now common in nearly all topic areas. Because they are smaller and focus on particular topics, such sources can provide more focused search than general-purpose engines such as Google. Assuming similar page-ranking schemes, a search on a smaller, more focused engine should return better results.

Metasearch applications integrate search results from multiple information sources. Metasearch can expose users to unfamiliar information sources, it can filter and rank results from the various sources, and it can process, unify, and personalize the results for human consumption.

Today’s metasearch applications are based on fixed lists of information sources. For instance, A9 provides access to a set of sources including Google Images, Amazon’s Search Within a Book, and the Internet Movie Database; other metasearch engines provide access to Google, Yahoo, Kanoodle, and others. Such metasearch clients are developed by implementing custom scrapers or web service consumers for each individual information source. When one of these metasearch systems wishes to add a new source, a programmer must modify the system’s software.

Such a programmer-based process of updating metasearch clients is not in line with the dynamic nature of information today, as the number of search engines increases at a rapid rate. This rate should only increase: Moore’s Law has made crawling and the creation of topic-specific search engines something that individuals can initiate even on their personal computers. Just as HTML democratized and exploded the creation of web pages, factors today are leading to the explosion of searchable subsets of the web.

Information seekers can try to manually keep up with search engines relevant to their information needs, but finding the pertinent sources on a particular topic is becoming more and more difficult.

What is needed are search engines for search engines: metasearch clients that help users find the right information sources and, ultimately, the right information. To handle the dynamic nature of information, these clients must work with a list of sources that updates dynamically, without the need for a programmer.

Two elements are required to make such a system a reality: a common search API and a registry at which sources can identify themselves. These elements decouple the development of information sources from that of metasearch clients. By conforming to the API and registering, search engine creators make their engines instantly available to all metasearch clients. Client software dynamically queries the registry to build a list of the currently available sources and invokes searches on any source in the list using the operations of the API. In this way, the information space can grow in a grass-roots manner.
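To make this flow concrete, here is a minimal Python sketch of the registry/API interaction just described. The `Registry` and `Source` classes and their method names are our own illustrative stand-ins, not the actual USP or UDDI interfaces.

```python
class Source:
    """Stand-in for a USP-conforming search service."""
    def __init__(self, name, corpus):
        self.name = name
        self._corpus = corpus            # {url: text} toy document store

    def keyword_search(self, keywords):
        """Return URLs of documents containing all keywords."""
        return [url for url, text in self._corpus.items()
                if all(k.lower() in text.lower() for k in keywords)]

class Registry:
    """Stand-in for the registry at which sources identify themselves."""
    def __init__(self):
        self._sources = []

    def register(self, source):          # a new engine announces itself
        self._sources.append(source)

    def get_sources(self):               # clients rebuild their list from here
        return list(self._sources)

def metasearch(registry, keywords):
    """Fan the query out to every currently registered source; no client
    change is needed when a new source registers."""
    return {s.name: s.keyword_search(keywords)
            for s in registry.get_sources()}

registry = Registry()
registry.register(Source("Google", {"http://a": "metasearch systems"}))
registry.register(Source("Feedster", {"http://b": "a blog post on metasearch"}))
results = metasearch(registry, ["metasearch"])
```

The point of the sketch is the last line: the client never names its sources, so registering a new `Source` changes its behavior without changing its code.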

In this paper, we describe the architecture of such a system, WebTop. It consists of a search API that we call the Universal Search Protocol (USP), a UDDI-based registry system, and PublishMe, a desktop application that allows ordinary users to expose parts of their desktops as information sources.

To bootstrap the system, we have developed a number of USP-conforming information access services, including ones that access information from Google, Amazon, Technorati, Feedster, and the Internet Archive’s digital library. The PublishMe application allows users to turn their desktops into USP-conforming services.

We have also developed Windows and web metasearch clients based on this architecture. These clients allow users to send a single query to any of the traditional search engines as well as to personal search engines. For example, a user interested in metasearch might select Google and “David Wolber” as the search sources before sending a query.

Universal Search Protocol

Distributed computing and remote procedure call mechanisms have been around a long time -- DCOM, CORBA, and RMI, to name a few. Recently, standards have emerged based on HTTP and XML: WSDL for publishing the interfaces to remote procedures, SOAP for actually making the remote calls, and UDDI for registering services.

One benefit of this emergence is that most development environments now provide support so that programmers can code objects and functions in their preferred language, with the environment handling the plumbing, i.e., the generation of WSDL and the conversion of function calls to distributed SOAP calls.

The standards give businesses within a particular area the mechanisms necessary to agree on and implement protocols within the domain. Given an agreed-upon WSDL file specifying the services that a business in the domain should provide, businesses are free to develop services on any platform and using any development language or environment. Client applications can then use UDDI registries to find particular business services within the domain, and access the services using the standard defined in the agreed-upon WSDL file. When new business services are implemented and registered, the clients can access them immediately, without the client program being modified. This open process is key to the proliferation of B2B applications and, more generally, to automating much of the world’s communication processes.

Whereas most standards have come from particular domains or topic areas, there is also the potential for cross-domain protocols, and in particular, protocols for search-related services. Standard protocols for search would enhance many applications, including those commonly known as metasearch systems, as well as less traditional applications with integrated search services. A search standard would allow applications to provide dynamically growing lists of information sources to their users, helping them discover new information sources rather than restricting them to a list of sources built in at the time the client application was written.

Notes on the Universal Search API

An overview of the key methods the API provides follows. Compare with the STARTS and SDARTS protocols.

Details on each method; here are some working notes:

sources that send documents over (e.g., personal sources)

some type of inheritance?

image search?

comparison to Firefox search plugins, in which one must submit the source to an administrator and describe how its results are formatted

Keyword search

In parameters:

keywords -- either a single string or a list of words/phrases.

restrictions -- date and the other restrictions found on an advanced-search window. Perhaps also a sub-library, e.g., for Google, News or Groups; the alternative is to implement sub-libraries as separate services. We believe the current API already has a parameter called “category”.

count

Out parameters:

total number of results

results -- ideally with a standardized text-matching rank as well as a popularity measurement, or even lower-level data such as the number of hits and the number of fancy hits, so that a client could rank the results however it wanted; perhaps even some way for the client to specify how to rank.
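The in/out parameters sketched in these notes might be shaped roughly as follows. All field names here are assumptions for illustration, not the actual USP schema; the re-ranking helper shows a client ranking results on its own terms, as the last note suggests.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KeywordQuery:
    keywords: List[str]                  # or a single string, pre-split
    category: Optional[str] = None       # sub-library, e.g. "News" or "Groups"
    date_after: Optional[str] = None     # advanced-search style restriction
    count: int = 10                      # maximum results requested

@dataclass
class SearchResult:
    url: str
    title: str
    text_rank: float                     # standardized text-matching score
    popularity: float                    # popularity measurement

@dataclass
class SearchResponse:
    total: int                           # total number of results available
    results: List[SearchResult] = field(default_factory=list)

def rerank_by_popularity(response):
    """Example of a client re-ranking results however it wants."""
    return sorted(response.results, key=lambda r: r.popularity, reverse=True)
```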

GetCitations (inward links)

In parameters:

metadata -- metadata describing the document whose inward links are wanted. The metadata object has fields such as title and URL, and perhaps a source-specific id; the source then matches on whatever fields it can. The alternative to such a scheme is to make the client query the registry to see what a source provides, e.g., whether it offers getCitations(url).

Note that with a RESTful binding the client could send tagged parameters, e.g., url=xxx or title=yyy.

Out parameters:

total number of results

results -- here results are ranked only on popularity.
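A minimal sketch of the inward-link lookup under the metadata scheme above, assuming a source that keeps a linkbase keyed by URL and, in this toy version, matches on the url field only; all names are illustrative.

```python
def get_citations(linkbase, metadata):
    """Inward-link lookup: `linkbase` maps target URL -> citing URLs.
    The metadata dict may carry url, title, or a source-specific id;
    this toy source matches on the url field only."""
    citing = linkbase.get(metadata.get("url"), [])
    return {"total": len(citing), "results": citing}

linkbase = {"http://mamma.com": ["http://puc-library.edu"]}
response = get_citations(linkbase, {"url": "http://mamma.com"})
```

A fuller source would also try the title or its own id before giving up, which is exactly the "deals with it however it can" flexibility the metadata scheme buys.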

Get Outward Links

This one is less obvious because, for some things, the client can compute outward links itself: if the client wants the outlinks of a URL, it can simply fetch and parse the page.

However, outward links might also be links other than hrefs. For instance, a law document will contain references to cases, e.g., Wolber vs. US. A law service can parse such references and return links to the referred-to cases.
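A toy illustration of the two kinds of outward links: plain hrefs, which the client could extract itself, and domain-specific references such as case citations, which a specialized service would resolve. The regular expressions and the case:// link scheme are invented for the example.

```python
import re

def get_outward_links(text):
    """Extract ordinary hrefs plus law-style case citations."""
    hrefs = re.findall(r'href="([^"]+)"', text)
    cases = re.findall(r'\b([A-Z][a-z]+ vs?\.? [A-Z][A-Za-z.]+)', text)
    # a law service would map each citation to a document it holds;
    # the case:// scheme here is purely illustrative
    case_links = ["case://" + c.replace(" ", "_") for c in cases]
    return hrefs + case_links

doc = '<a href="http://example.org">x</a> See Wolber vs. US for details.'
links = get_outward_links(doc)
```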

REGISTRY

UDDI has emerged as an XML standard for web service registries. Because UDDI is a universal protocol that allows all types of services to register, we developed a layer on top of UDDI that provides specific support to metasearch clients. In particular, the WebTop registry provides metadata about each source, including vocabulary information as was done with the SDARTS initiative, and it also compiles data used to measure a source’s reputation.

The key interface to the registry is the getSources method. It returns a list of all registered sources, including for each source:

endpoint URL

which of the API methods it provides

reputation measures
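Given such entries, a client can filter the registry’s list by supported method before invoking a source, rather than discovering failures at call time. The dictionary field names and endpoint URLs below are assumed for illustration.

```python
def sources_supporting(sources, method):
    """Filter a getSources response to sources implementing `method`."""
    return [s for s in sources if method in s["methods"]]

sources = [
    {"endpoint": "http://example.com/google-usp",
     "methods": ["KeywordSearch", "GetCitations"], "reputation": 0.9},
    {"endpoint": "http://example.com/wolber-desktop",
     "methods": ["KeywordSearch"], "reputation": 0.4},
]
citation_sources = sources_supporting(sources, "GetCitations")
```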

We have built a layer on top of UDDI; we still need to discuss why, and how our registry differs from a plain UDDI registry.

PUBLISHME

The USP and registry provide programmers with the ability to create search areas that are immediately accessible to WebTop clients. We also provide an application, PublishMe, that allows ordinary computer users to create and publish parts of their desktops as search areas.

PublishMe is similar to Google Desktop in that it builds and maintains a search area from a user’s desktop (documents, email, etc.). Google has been careful, due to people’s privacy concerns, to implement and characterize the desktop search area as one that only the user herself can access. PublishMe, on the other hand, provides facilities so that a user can publish her “desktop”, or parts of it, as a USP-conforming web service running directly on her personal computer. PublishMe registers this search area and service with the WebTop registry so that, in effect, a user can make her desktop immediately available to all WebTop clients.

The motivation behind PublishMe is that many of us create knowledge every day, but …

Experts.

PublishMe consists of a dialog for specifying which parts of the desktop are “open”, a file system crawler that builds the search area, a USP-conforming web service, and a tiny Cassini server that, when deployed, responds to queries from the outside world.

Currently, access specification is rudimentary: the user can specify folders from the file system that serve as top-level roots of the search area. Because privacy is an incredibly important issue, we plan to add sophistication to the access specification, including the ability to specify group access. See X for a discussion of privacy.

The file system crawler begins at the top-level roots and builds two data structures: an inverted index for keyword search, and a linkbase describing the relationships between documents. Note that the crawler treats a directory as a list of links, so that directory-contains-file is treated as an outward link just like an href found within a file. Bookmarks are considered as well; in fact, the bookmarks directory is selected as a top-level root by default. The linkbase is bi-directional so that the outside world can query a desktop to see whether it has documents that link to a particular URL (inward links).
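A simplified sketch of the two crawler data structures, using an in-memory stand-in for the file system rather than real disk traversal; note how directory containment is recorded as an outward link and how every link is also indexed in the inward direction.

```python
import re
from collections import defaultdict

def crawl(tree, files):
    """Toy crawl: `tree` maps directory -> contained paths,
    `files` maps file path -> text content."""
    index = defaultdict(set)        # word -> set of paths (inverted index)
    out_links = defaultdict(list)   # path -> outward links
    in_links = defaultdict(list)    # url/path -> paths that link to it
    for directory, children in tree.items():
        for child in children:      # containment treated as an outward link
            out_links[directory].append(child)
            in_links[child].append(directory)
    for path, text in files.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
        for url in re.findall(r'href="([^"]+)"', text):
            out_links[path].append(url)
            in_links[url].append(path)  # enables inward-link queries from outside
    return index, out_links, in_links

tree = {"/docs": ["/docs/a.html"]}
files = {"/docs/a.html": 'metasearch notes <a href="http://mamma.com">m</a>'}
index, out_links, in_links = crawl(tree, files)
```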

The crawler runs as a background process that is invoked periodically to keep the inverted index and linkbase consistent with the file system. We are also experimenting with handling file system events to help with this process.

The Cassini server deploys a single web service that conforms to USP and uses the data compiled from the file system crawl to respond to queries. Upon user login, the server is deployed and an online message is sent to the WebTop registry; when the user logs off, the registry is again notified.

CLIENT APPLICATIONS

We have developed three client applications based on the Universal Search Protocol (USP) and the WebTop registry (see webtop.cs.usfca.edu). These clients serve as proof-of-concept for the overall system, but are also interesting metasearch clients in their own right.

The first client we developed is a downloadable Windows application. This client is a search-enhanced file manager. Documents from the local file system and those retrieved from searches are displayed uniformly within a tree-view. Whereas, in a traditional file manager, the user can only expand folders, in this client the user can expand both folders and documents. Expansion of any node results in information queries being sent to selected search areas, and the results being displayed at the expanded level. The user specifies which queries are invoked on expansion by selecting the active sources and active associations. Associations include out-links, in-links, and similar-content links.

[pic]

The snapshot above shows the WebTop web client. Four “preferred” sources are shown, including one, David Wolber, that is a personal search area. The user can access the entire list of WebTop-registered sources by clicking the “More” button.

The user has selected Google and Feedster as the active information sources and performed a traditional search with the keyword “metasearch”. The system has responded by listing three results each from Google and Feedster.

Next, the user clicks on the + next to the third result, expanding “Mamma Metasearch”:

[pic]

Because the associations “Keyword” and “Inward” are selected, the system sent both a keyword search query and an inward link query to the active sources. For “Keyword” expansions, the system performs TFIDF on the document to come up with a set of characterizing words. In this case, qtypeselected, arial, and qtypes were extracted from the Mamma Metasearch page and sent to both search engines. Neither engine returned results for that combination of words. Note that the system lists the automatically identified query words in the top-right corner instead of hiding this automation from the user.
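A minimal sketch of the TFIDF step on a toy corpus. The scoring shown (term frequency times a smoothed inverse document frequency) is one common formulation and is assumed here, since the paper does not specify the exact weighting; it illustrates how unusual page words like qtypeselected float to the top.

```python
import math
import re
from collections import Counter

def characterizing_words(doc, corpus, k=3):
    """Score each word in `doc` by tf * smoothed idf against a background
    `corpus` (list of strings) and return the top k words."""
    tf = Counter(re.findall(r"[a-z]+", doc.lower()))
    n = len(corpus)
    def idf(word):
        df = sum(1 for d in corpus if word in d.lower())
        return math.log((n + 1) / (df + 1))
    scored = {w: c * idf(w) for w, c in tf.items()}
    return [w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]

corpus = ["web search engines", "search results page", "more search text"]
top = characterizing_words("qtypeselected qtypeselected arial search", corpus)
```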

The “Inward” link query did return three results from Google, each displayed with an inward arrow to its left. Each of these documents, e.g., the one titled “PUC Library”, contains a link to Mamma Metasearch (or at least Google thinks it does). Feedster returned no results, indicating that none of the documents in its database link to Mamma Metasearch.

Note that the “Outward” association is not selected. If it were, the expansion would also have displayed the hyperlinks found within Mamma Metasearch. In the current system, outlinks do not result in queries to information sources; they are handled by the client parsing the document itself.

To discuss regarding the code within the client: the call to getSources, building the list of sources, modifying the URL to call a particular source, and web service polymorphism.
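The web service polymorphism noted above might look roughly like this: one proxy class whose endpoint URL is swapped per source, so the identical call reaches any registered service. The `UspProxy` class is an illustrative stand-in; a real client would generate a SOAP proxy from the shared WSDL.

```python
class UspProxy:
    """Illustrative stand-in for a generated USP service proxy."""
    def __init__(self, endpoint):
        self.endpoint = endpoint         # set per source, from getSources

    def keyword_search(self, keywords):
        # a real client would POST a SOAP envelope to self.endpoint;
        # here we just return the routing decision to show the idea
        return {"endpoint": self.endpoint, "keywords": keywords}

# the same call works unchanged against any registered endpoint
endpoints = ["http://example.com/google-usp", "http://example.com/wolber-desktop"]
calls = [UspProxy(e).keyword_search(["metasearch"]) for e in endpoints]
```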

CUTS

[definition of metasearch from the SDARTS paper:

a metasearcher performs three main tasks. After receiving a query, it determines the best database to evaluate the query (database selection), it translates the query in a suitable form for each database (query translation), and finally, it retrieves and merges the results from the different sources (results merging) and returns them to the user using a uniform interface.]
