Software Stack



TMC ProSearch

Overview and Training

Houston Academy of Medicine

January 7, 2011

Brian DeSpain

Larry Donahue

Suzanne Smith

1 Introduction

2 A Quick Overview

3 Running a Good Search

3.1 A precise phrase

3.2 Targeting the best sources

3.3 A Bit About Logic Operators

4 Getting Results

4.1 What happens when you run a search?

4.1.1 The role of connectors - querying and retrieving

4.1.2 Ranking

4.2 Sort by

4.3 Limit to

4.4 Narrow by (Clusters)

5 More Tools

5.1 Selections

5.2 Custom Searches

5.3 Alerts

5.4 Emailing and Printing Results

5.5 Citations

5.6 Summary of All Results

6 Something’s Not Right: When to Talk to DWT

6.1 Examples of Issues that Should be Reported

6.2 What to report to DWT when a problem comes up

The User Experience in TMC ProSearch

Appendix 1: Simple Search Page

Appendix 2: Advanced Search Page

Appendix 3: Results List

Appendix 4: Unauthorized Users Cannot View Results

Appendix 5: Viewing Results

Appendix 6: Your Selections / Download Citations

Appendix 7: Email Results

Appendix 8: Print Results

Appendix 9: Bookmarking

Appendix 10: Session Preferences

Appendix 11: User Login

Appendix 12: Alerts

Appendix 13: Custom Search

Overview

HAM/TMC’s ProSearch searches your organization’s subscription material and presents results to your researchers in one aggregated list of links.

If you could leave here today knowing only a few things about TMC ProSearch, these might be the most important concepts:

1. ProSearch searches all of the databases listed on the advanced search page (or just those the user selects) simultaneously, in real time.

2. ProSearch queries every database the user selects by filling out the search form provided by the database publisher, and then retrieves the results returned by each of those database searches. ProSearch does not crawl and index database content.

3. ProSearch aggregates, deduplicates and ranks all of these results into one results list.

4. Incremental results are at the heart of how ProSearch works. As the individual content sources return results to ProSearch, they are displayed. This means that the search will take several seconds to complete, and that as more results come in, they will be added to the results set in ProSearch.

5. Valuable subscription content is protected because results links lead users through the library proxy before landing on the desired research material.
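
For readers who like to see the idea in code, the incremental behavior described in point 4 can be sketched in Python. This is only an illustration, not how ProSearch itself is built: the source names, the delays, and the query_source function are hypothetical stand-ins for real connectors.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sources: name -> (simulated delay in seconds, canned results).
SOURCES = {
    "source_a": (0.05, ["A1", "A2"]),
    "source_b": (0.20, ["B1"]),
    "source_c": (0.10, ["C1", "C2", "C3"]),
}

def query_source(name, query):
    """Stand-in for a connector: waits, then returns that source's results."""
    delay, results = SOURCES[name]
    time.sleep(delay)
    return name, [f"{r} ({query})" for r in results]

def federated_search(query, selected):
    """Query all selected sources at once and yield the growing result
    list each time another source responds."""
    merged = []
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        futures = [pool.submit(query_source, name, query) for name in selected]
        for future in as_completed(futures):
            name, results = future.result()
            merged.extend(results)
            yield name, list(merged)

for name, so_far in federated_search("dementia", list(SOURCES)):
    print(f"{name} responded; {len(so_far)} results so far")
```

Each pass through the loop corresponds to one more source reporting back, which is why the results list keeps growing for a few seconds after a search starts.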

Before we get started, let’s quickly review the parts of the search so we’ll have the same ideas in mind as we refer to the various tools and their locations. Please feel free to ask questions about anything you see on the pages of ProSearch.

• Basic Search

• Advanced Search

• Tools on the search and results pages

Running a Good Search

1 A precise phrase

Some of the most common questions our users have are about understanding and optimizing queries. Most deep web search engines are federated, which means the query is passed simultaneously to multiple engines. If the federated search provider has done their job, then the query should seamlessly match to the logic of the native interface. At Deep Web Technologies, we develop our connectors to mimic the capabilities of the native interface. However, connectors cannot exceed the capacity of the native interface in a federated solution. Very often, because we have made the federated search so seamless to our users, they make (reasonable!) assumptions about the capabilities of the sources being searched.

In some cases, collections return poor results because their search engines are low quality and handle multiple Boolean terms, exact-phrase searches, or long-phrase searches poorly or not at all.

The more complex the query you issue, the fewer ranked results you will get. Our recommendation, when using federated search, is to try simpler queries first.

Here are some simple rules to remember:

1. Not all sources are created equal. Your complex query might run well on one source, but 15 ORs strung together with 4 NOTs isn’t going to run well on most search interfaces. (That’s a real life example, by the way!) So start the query simply and then use a search-refining mechanism, like clustering, to really narrow your query.

2. Start queries simply. Ask for exactly the results you want, and don't worry about getting too many results. Again, ProSearch can help you refine results once you have them.

3. Highly technical, uncommon keywords get better results. Remember that ProSearch's sources are more technical and have much more detail than the sources you would search with, for instance, Google.

4. Fielded search is your friend. Most of ProSearch's sources support fielded search. Users should leverage this to get relevant, reasonable responses.

Let’s practice by performing a basic search using these principles and looking at the results ProSearch retrieves.

2 Targeting the best sources

Remember that you don’t have to search all of the databases ProSearch is capable of searching. Sometimes, you can target information more effectively by making decisions about what databases you’d like to query. For instance, you will naturally get less useful results for the search phrase “multi-infarct dementia” from an immunology journal than you will from a neurology journal. When you want to select only specific sources, simply go to the Advanced Search page and deselect the databases you don’t need.

Let’s practice by running a search against only a few, subject-specific sources.

3 A Bit About Logic Operators

TMC ProSearch supports the use of the following in search terms:

• AND, OR, and NOT

• Multiple search terms (ProSearch will assume AND between search terms unless the user specifies otherwise.)

• Exact phrase searching - search terms contained in double quotation marks “ ”

• Nested search terms - search terms in parentheses ()

• Wild cards:

o * - to search for the term with 0 or more additional characters.

o ? - to search for the term with exactly 1 additional character (characters include numbers, letters and punctuation).

Let’s practice by running a search using some operators you’d like to see in action.
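
The two wild cards listed above can be illustrated with a short Python sketch. It assumes that a wild card term is matched against a single whole word and that matching ignores case; both are assumptions made for the example, and the function names are invented here, not part of ProSearch.

```python
import re

def wildcard_to_regex(term):
    """Translate a search term using * and ? into a compiled regex.

    * matches zero or more characters; ? matches exactly one character.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the wild card term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

print(matches("immun*", "immunology"))  # True
print(matches("immun*", "immun"))       # True: * allows zero extra characters
print(matches("gene?", "genes"))        # True
print(matches("gene?", "gene"))         # False: ? requires one character
```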

Getting Results: What happens when you run a search?

1 The role of connectors - querying and retrieving

Most simply put, a connector is a link between two data structures. In ProSearch, the connectors link the software to the information sources being searched. When a user enters a query in the user interface, it is the connectors that query each database's search interface and then bring back the requested information.

Deep Web Technologies has spent a lot of time building the connector framework we call C2. This framework is based on JRuby, and is designed to handle the most complex APIs and authentication mechanisms used to secure content in the deep Web. Additionally, it’s extensible enough to allow the full range of metadata that might be found at specific sources to be extracted. By contrast, many traditional federated search providers simplify the metadata returned, or keep it to a subset that all sources share.
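
To make the connector idea concrete, here is a minimal Python sketch of what "querying and retrieving" might look like for one source. It is not the C2 framework (which is JRuby-based); the class names, the endpoint, the form-field names, and the pipe-delimited response format are all invented for illustration.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlencode

class Connector(ABC):
    """Hypothetical interface: one connector per information source."""

    @abstractmethod
    def build_request(self, query):
        """Translate the user's query into the source's own search request."""

    @abstractmethod
    def parse_results(self, raw):
        """Turn the source's response into a list of result dicts."""

class ExampleJournalConnector(Connector):
    # Invented endpoint and form-field names, for illustration only.
    SEARCH_URL = "https://journal.example.org/search"

    def build_request(self, query):
        return self.SEARCH_URL + "?" + urlencode({"q": query, "max": 50})

    def parse_results(self, raw):
        # Assume one "title|url|date" record per line in the response.
        results = []
        for line in raw.strip().splitlines():
            title, url, date = line.split("|")
            results.append({"title": title, "url": url, "date": date})
        return results

connector = ExampleJournalConnector()
print(connector.build_request('"multi-infarct dementia"'))
print(connector.parse_results(
    "Vascular dementia review|https://journal.example.org/a1|2010-05-01"
))
```

The point of the split is that each source gets its own build_request and parse_results, so a connector can only ask for what that source's native search form is able to do.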

2 Deduplicating (Deduping)

One of the critical problems in federated searching is de-duplication of results. Many sources contain the same journal articles and, clearly, presenting the same result multiple times isn’t useful to users. ProSearch de-dupes on multiple fields to ensure that users don’t see duplicate results across multiple sources. The application has conditional logic that compares various fields to decide whether a result should be considered a duplicate.

The de-duplication mechanism can compare multiple fields using Boolean logic. This means that fields can be matched using Boolean operators, as in (field A OR field B) or (field C AND field D). If either condition is true, the result is determined to be a duplicate and removed according to the source de-dupe order. Additional fields can be added to the mix to improve accuracy, for example (field A + field N OR field B + field D).

De-duplication order specifies which sources take priority over other sources. Customers can control this order. When results are determined to be duplicates, the copies from sources lower on the de-dupe list are removed. You could look at de-duplication order this way: you have source A, source B, and source C in your federated search application. Source A has a de-dupe order of 1 (this means this source’s results will be the highest priority). Source B has a de-dupe order of 10, and source C has a de-dupe order of 5. This means that if the same result is in both source B and source C, the result from source C will be displayed. If the same result is also in source A, only the result from source A will be displayed.

We have found the following de-duplication approach most effective for our federated search applications. The application first checks the full text URL of the result. If two results have the same full text URL, they are assumed to be duplicates, and the application does not display the result from the lower-priority source (the one with the larger de-dupe order number). Next, we check a combination of the title of the article and the publication date. If these two fields match, the results are considered duplicates and the lower-priority results are removed.
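
The rule described above (same full text URL, or same title and publication date, with the higher-priority source winning) can be sketched in Python as follows. This is an illustration under assumptions, not ProSearch's code: the dictionary keys are hypothetical, and comparing titles without regard to case is a choice made for the example. The sources and order numbers are the A, B, and C from the explanation above.

```python
def is_duplicate(a, b):
    """Two results are duplicates if they share a full text URL,
    or if both title and publication date match."""
    same_url = bool(a.get("url")) and a.get("url") == b.get("url")
    same_title_date = (
        bool(a.get("title")) and bool(a.get("date"))
        and a["title"].strip().lower() == (b.get("title") or "").strip().lower()
        and a["date"] == b.get("date")
    )
    return same_url or same_title_date

def deduplicate(results, dedupe_order):
    """Keep one copy of each result, preferring the source whose
    de-dupe order number is lowest (1 = highest priority)."""
    kept = []
    for result in sorted(results, key=lambda r: dedupe_order[r["source"]]):
        if not any(is_duplicate(result, k) for k in kept):
            kept.append(result)
    return kept

dedupe_order = {"A": 1, "B": 10, "C": 5}
results = [
    {"source": "B", "title": "Stroke and dementia", "date": "2010-03",
     "url": "http://example.org/x"},
    {"source": "C", "title": "Stroke and Dementia", "date": "2010-03",
     "url": "http://example.org/c/x"},
    {"source": "B", "title": "Vascular risk", "date": "2009-11",
     "url": "http://example.org/y"},
    {"source": "A", "title": "Vascular risk", "date": "2009-11",
     "url": "http://example.org/y"},
]
for r in deduplicate(results, dedupe_order):
    print(r["source"], r["title"])
```

Here the first article survives only in its source C copy (matched on title and date), and the second survives only in its source A copy (matched on URL).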

Finding the right balance of fields for de-duplication in a federated search application can be difficult, but Deep Web Technologies has the capability and the knowledge to determine which fields are best for your sources and your search needs.

3 Ranking

Ranking is based on relevance, with the most relevant results becoming the highest ranking results. We compute relevance by (1) creating root-words for the query terms and results, (2) conducting relevance weighting for a number of factors, and then (3) using our proprietary algorithms to assign rank for each result in the list of results.

(1) Creating Root-Words: Stemming

Stemming is the process of converting words to their base – or root – words. In the simplest case, it makes sure that a pluralized search term will find singular terms in the results, and vice versa. In English, this can be as simple as dropping “s” or “es” from words, but the process can become more complex. Consider “mouse/mice” and “person/people”.

The specific stemming algorithm we use is the Porter Stemming Algorithm.

For the most part, we do not need to stem search terms before submitting them to the collections we search. Occasionally, we may need to explicitly indicate to a collection that we want to perform a stemmed search or an exact search.
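
As a small taste of what a stemmer does, here is a Python sketch of only the first step of the Porter algorithm (step 1a, which handles plural endings). The full algorithm has several more steps, and this sketch makes no claim about how ProSearch applies it; the function name is ours.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter algorithm (plural endings only)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats", "mice"]:
    print(w, "->", porter_step_1a(w))
```

Note that the stem does not have to be a dictionary word ("poni"); it only has to be the same for related forms. Note also that this step leaves an irregular plural such as "mice" unchanged, which is one reason stemming can become more complex.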

(2) Conducting Relevance Weighting

We analyze search term occurrence within a search result, and assign weights for different factors. We look for occurrences of exact terms and stemmed terms. We can assign relative weights to different result fields. We can also assign higher weights to results from a more important collection, and to more recent results.

We also consider:

• Search Term Position – We examine where search terms appear within particular fields (e.g. title, author, snippet) and give special consideration to whether a search term occupies the first word position, the last word position, or a position relative to either.

• Search Term Density – We find significance in how often search terms appear within fields (both individual fields and the full record). Aside from counting the number of occurrences of search terms within fields, we consider the ratio of search term length to result field length. For example, a one-word title that is the same as the search term would be highly relevant.

• Search Term Proximity – We consider how close search terms occur relative to one another. When evaluating this, we look at the number of search terms within the query expression and the distance between recurring search terms. In returned results, this ratio, in conjunction with the length of the fields, can be significant.

• Search Term Ordinality – If search terms appear in the same order as specified in the search expression, this can be significant and is afforded greater weight than if the search terms appear out of order. Likewise, multiple occurrences of ordinality are important.
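
The four factors above can be turned into a toy scoring function to show how they might combine. This Python sketch is purely illustrative: the actual ranking algorithms are proprietary, the weights here are made up, and each factor is simplified (for instance, density is measured as a ratio of word counts).

```python
def tokenize(text):
    return text.lower().split()

def position_score(terms, words):
    """1.0 if a search term is the first or last word of the field, else 0."""
    if not words:
        return 0.0
    return 1.0 if words[0] in terms or words[-1] in terms else 0.0

def density_score(terms, words):
    """Fraction of the field's words that are search terms."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in terms) / len(words)

def proximity_score(terms, words):
    """1 / (smallest gap between two different search terms), else 0."""
    hits = [(i, w) for i, w in enumerate(words) if w in terms]
    gaps = [j - i for (i, a), (j, b) in zip(hits, hits[1:]) if a != b]
    return 1.0 / min(gaps) if gaps else 0.0

def ordinality_score(terms, words):
    """1.0 if every term occurs and first occurrences follow query order."""
    positions = []
    for t in terms:
        if t not in words:
            return 0.0
        positions.append(words.index(t))
    return 1.0 if positions == sorted(positions) else 0.0

# Hypothetical weights; a real engine would tune these per field and source.
WEIGHTS = {"position": 1.0, "density": 2.0, "proximity": 1.5, "ordinality": 1.0}

def field_relevance(query, field_text):
    terms, words = tokenize(query), tokenize(field_text)
    return (
        WEIGHTS["position"] * position_score(terms, words)
        + WEIGHTS["density"] * density_score(terms, words)
        + WEIGHTS["proximity"] * proximity_score(terms, words)
        + WEIGHTS["ordinality"] * ordinality_score(terms, words)
    )

query = "vascular dementia"
for title in ["Vascular dementia",
              "Dementia of vascular origin in older adults",
              "Cardiac imaging"]:
    print(f"{field_relevance(query, title):.2f}  {title}")
```

With these made-up weights, the title that matches the query exactly scores highest, the title that contains both terms out of order and further apart scores lower, and the unrelated title scores zero.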

(3) Proprietary Algorithms

Once we’ve analyzed the exact search terms and stemmed search terms against the factors above and assigned weights, we use our proprietary algorithms to assign an actual rank.

These algorithms operate on the Boolean operators AND, OR and NOT. The search query expression is evaluated from left to right. Exact phrases (contained within double-quotation-marks) are not stemmed!

If a date range is specified, the date is used as a constraining term, provided that a date is supplied in a result. If a date is not supplied in a result, the relevance for that result is assumed zero (i.e. not ranked). Note that such results may still show in the results list.
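
The date-range rule can be sketched in Python. The sketch assumes that results dated outside the range are dropped (our reading of "constraining term") and that relevance is a simple number; the field names are hypothetical.

```python
from datetime import date

def apply_date_range(results, start, end):
    """Apply a date range as a constraint on ranked results.

    Results dated inside the range keep their relevance. Results with no
    date get relevance 0 (unranked) but stay in the list. Results dated
    outside the range are dropped (an assumption of this sketch).
    """
    output = []
    for r in results:
        if r["date"] is None:
            output.append({**r, "relevance": 0.0})
        elif start <= r["date"] <= end:
            output.append(r)
    return sorted(output, key=lambda r: r["relevance"], reverse=True)

results = [
    {"title": "Undated report", "date": None, "relevance": 0.9},
    {"title": "Recent study", "date": date(2010, 6, 1), "relevance": 0.7},
    {"title": "Older study", "date": date(2001, 2, 15), "relevance": 0.8},
]
for r in apply_date_range(results, date(2005, 1, 1), date(2010, 12, 31)):
    print(r["title"], r["relevance"])
```

The undated report still appears, but at the bottom of the list with a relevance of zero.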

Finally, stop words are words considered irrelevant for searching purposes. We don’t evaluate them. The current list of stop words contains a, an, at, be, but, do, however, and so on (and the full list of stop words is available if you're really interested!)

More Tools

1 Sort by

ProSearch provides the option to sort by rank, date, title, or author. By default, results are sorted by rank. When a user chooses, for instance, to sort by "title," those results will be "re-ranked" so that results with fewer rank stars may appear higher on the list, if their titles are determined to be more relevant to the user's search.

Let’s try sorting a results set using each option.

2 Limit to

Users also have the option to limit their results set to only those results from a particular database. All other results will be filtered out of the results set.

Let’s limit our results set by database.

3 Narrow by (Clusters)

The "Narrow by," or clustering, feature gives users yet another powerful way to navigate results. ProSearch’s clustering engine analyzes results and produces “clusters,” which superficially resemble the output generated by the keyword-based systems and fixed taxonomies of other search engines. DWT’s clustering technology, however, is more akin to a document-discovery engine.

This unique approach to clustering is taken from Latent Semantic Analysis (or LSA). LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of contextual usage of words in search results. This technology provides a concept-based approach to analyzing and clustering results from a result set. Applying the LSA approach, our clustering engine analyzes the relationships between a set of documents and the terms contained within the documents to produce a set of concepts related to the results.

Let’s see what happens when we navigate through our results set by clustering.

4 Selections

The Your Selections feature gives users a simple way to keep a list of results of interest while they browse. Selected results are saved in a list for the duration of the user’s browser session, even while the user goes on to perform additional searches. Results stored in Your Selections can then be printed, emailed, or downloaded.

Let’s select some results and see what the list looks like. Then we’ll clear the selections.

5 Custom Searches

Users can take advantage of Custom Search to build a tailored search page that contains the collections and search fields they desire, using those included in ProSearch. Search pages can be organized to cater to specific departments, workgroups, academic courses or individual researchers. Not only can users quickly create their own federated search engine, they can also share or incorporate it into their own web page or blog using an easily added snippet of code (a “widget.”)

With Custom Search, users can:

• Add or remove collections at any time

• Personalize federated search engines

• Create new engines as often as needed

• Make research more efficient by searching only important, relevant collections

• Generate widgets for fast searching

Let’s practice creating a custom search.

6 Alerts

Alerts make it easy to track information, discover what other people might be doing in your field, or monitor a specific topic you're interested in. After creating an account, you can log in to the alerts section of TMC ProSearch. Click on the “create” button and you’ll be presented with a form for creating an alert.

Alerts allow you to monitor a specific search term in a wide variety of search fields. Users can search in full text, title and author, or also monitor multiple fields at the same time. This allows you, for example, to monitor the work by a specific author on a specific topic.

You can also select the databases on which to run an alert, and the alerting interval (most users select weekly). You’ll have the option of having your alert delivered via either RSS or Atom. You can take that feed and consume it in an internal CMS or add it to your blog. These tools make it easy for advanced users to share the information with other researchers or consume it in an intranet.
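
For advanced users who want to consume an alert feed in their own tools, here is a minimal Python sketch that reads the items out of an RSS 2.0 document using only the standard library. The sample feed is made up for the example; the exact elements in a real ProSearch alert feed may differ.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for an alert feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Alert: vascular dementia</title>
    <item>
      <title>Example article one</title>
      <link>https://example.org/article-1</link>
    </item>
    <item>
      <title>Example article two</title>
      <link>https://example.org/article-2</link>
    </item>
  </channel>
</rss>"""

def read_alert_items(feed_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]

for title, link in read_alert_items(SAMPLE_FEED):
    print(title, "-", link)
```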

Let’s create and test an alert.

7 Emailing and Printing Results

Most users will find the Email Results and Print Results tools quite intuitive. You can adjust the number of results that are emailed or printed. Also note that with Email Results, you can choose to email only the URLs, and you can choose between HTML and text email formats.

Let’s print and email some results from the results page, and from our Selections page.

8 Session Preferences

This tool allows you to set certain preferences for the duration of your session. You may choose the number of results to display per page, and you may also choose whether to display a notice alerting you when additional results have become available as your search completes. (Remember that incremental results, as they are brought into your ProSearch search, can cause re-ranking of the results set that you're viewing.)

Let’s adjust our preferences.

9 Citations

Accessible through Your Selections, the Direct Connect feature provides the ability to download selected results to EndNote or RefWorks.

It is important to note the difference between citations provided by content publishers and academic citations. DWT can frequently improve the citations in ProSearch by taking advantage of the connector's ability to get more metadata, whenever it is provided by the publisher. However, article citations won't exactly match the academic citation format that some researchers might expect (APA, MLA or Chicago).

10 Summary of All Results

The Summary of All Results provides information about the results that were available to ProSearch when the current search was run. There are three pieces of information given per source:

1. The source health, shown by a green check, red X or timeout symbol. If a red X appears, information about the error is available in the rollover text.

2. The number of "ranked" results, or the number of results determined by ProSearch to be relevant from the total results provided by the source.

3. The number of "available" results, or the total number of results provided by the source including those ProSearch determined were not relevant to the user's search.

Let’s have a look at the Summary of All Results.

Something’s Not Right: When to Talk to DWT

1 Examples of Issues that Should be Reported

While you should feel free to ask us any time you think there may be a problem with ProSearch, we understand it’s good to have a list of possible issues that would require our investigation and correction.

• Results links from a source stop working or point to an unexpected page.

• Some or all results links from a source are not active (are not underlined and no action takes place when you click them).

• Results links take you to a page with a message like “Sorry, the resource you requested has moved”

• Results links take you to a page with a message like “To subscribe to this resource...”

• Results links take you to a browser error page with a message like “Sorry, Firefox can’t find the server at…” (make sure to tell us what this message is!)

• A source persistently gives a red X

• A source gives a green check, displays “0 available, 0 ranked” and never returns results

• A source gives a green check, displays “somenumber available, somenumber ranked” but no results from that source appear in the results set

• A publisher is kind enough to warn you about a change to any of their search functionality - or you find out about this change by accident.

• Changes to your EZProxy or other network changes take place that might affect users’ ability to access off-network resources, or the ability of off-network resources to access TMC/HAM’s network.

• The search becomes unexpectedly slow.

• A reasonable search term across more than one source returns no results, only the message “no results available.”

• Worst case scenario: the search is unavailable. TELL US EXACTLY WHAT IS HAPPENING!

2 What to report to DWT when a problem comes up

• What is the error? For example: a red X in the Summary of All Results

• Do you see an error message? Record this (as closely as you are able.)

• Does the error happen consistently, or is it intermittent?

• What are the exact steps to reproduce the error? For example, the search term, the sources searched, any steps you took to refine your search, and so on.

• Anything else that you feel might be useful information - it is much easier for DWT to diagnose and correct a problem if we get too much information from a user than not enough.

Appendix 1: Simple Search Page

Users can initiate a federated search of multiple sources by entering their search term at a single search portal.

[pic]

Appendix 2: Advanced Search Page

[pic]

Appendix 3: Results List

[pic]

Appendix 4: Unauthorized Users Cannot View Results

Off-network users will be challenged for credentials. Any researcher who can provide valid credentials can proceed to the subscription content.

[pic]

Appendix 5: Viewing Results

On-network users and users who are off-network but have valid credentials can click through to the desired search results.

[pic]

Appendix 6: Your Selections / Download Citations

[pic]

Appendix 7: Email Results

[pic]

Appendix 8: Print Results

[pic]

Appendix 9: Bookmarking

[pic]

Appendix 10: Session Preferences

[pic]

Appendix 11: User Login

[pic]

Appendix 12: Alerts

After creating a login, users can create automatic alerts.

[pic]

Appendix 13: Custom Search

Registered users can also create custom searches.

[pic]
