Search Technologies Assessment - Archives



National Archives and Records Administration
National Archives Catalog (The Catalog)
NARA Catalog Web Sites Data Model Design
- Catalog Perspective -
Status: Final
Version 1.4
July 24, 2015

Avi Rappoport, Madhu Koneni, Kristen Martin, Terri Hobbs, Paul Nelson
Contract Number GS-35F-0541U
Order Number NAMA-13-F-0120

Contents

1 Overview
  1.1 NARA Web Site Content
  1.2 What is a DMD?
  1.3 Document Conventions
2 Web Sites and Content Extraction
  2.1 Web Sites and Sub-Sites
  2.2 Standard Metadata
  2.3 Metadata Parsing for the Site
  2.4 Metadata Parsing for NARA Blogs
    2.4.1 Parsing Date for AOTUS Blog
    2.4.2 Metadata Parsing for other National Archives Blogs
  2.5 Presidential Libraries
  2.6 Processing Non-HTML content
    2.6.1 Text Extraction
    2.6.2 Metadata Mapping for Non-HTML Content
3 File Processing
4 Mapping to Index Fields
  4.1 Field Metadata Mappings
  4.2 Mapping to Keywords Relevancy Model
5 Search Fields
6 Search Results Presentation
  6.1 Search Results (aka “Brief Results” Display)
7 Content Details Presentation

Version Control

Version | Date | Reviewer | Summary Description
0.1 | 2013-10-25 | Paul Nelson and Madhu Koneni | Initial Outline
0.2 | 2013-12-30 | Rhea Mandavilli | Updates to Sections 1, 6.1.1, 7
0.3 | 2014-1-2 | Archana Ballur Nagaraj | Updates to Section 2, 3, 4, 5, 6.1, 6.2 (ongoing)
0.4 | 2014-1-3 | Avi Rappoport | Adding blogs, presidential libraries, updates to all sections
0.5 | 2014-02-07 | Madhu Koneni | Feedback from NARA on 02/04/2010 addressed
0.6 | 2014-02-22 | Elizabeth Hobbs | Reformatted
0.7 | 2014-02-23 | Elizabeth Hobbs | Removed websites not being considered. Updated index mappings and search results.
1.0 | 2014-02-24 | Paul Nelson | Top to bottom review and cleanup.
1.1 | 2014-03-19 | Lisong Liu | Updated based on NARA review and feedback in DCRF of 3/11/14
1.2 | 2014-11-14 | Kristy Martin | Removed “Confidential to Search Technologies” text from the footer.
1.3 | 2015-06-11 | Jose Hernandez, Kristy Martin | Added the remaining sites specified in DE-60; section updated: 2.1. Changed branding for system name throughout document.
1.4 | 2015-07-24 | Kristy Martin | Updated content based on “DRCF_NAC R1P2 combined 1b_Increment 2 Design_IQS_Consolidated_7_21_15V1 (1).docx”

1 Overview

This is the Data Model Design (DMD) for the NARA Web Sites data source, which includes the main National Archives web site (and sub-sites such as blogs) and the presidential library web sites.

This document provides detailed documentation of all fields that come from these web sites and of how they will be processed through the National Archives Catalog system.
The DMD identifies and defines the following for the Web Sites data source:
- Metadata elements parsing
- Mapping of the metadata elements to the Search Engine Index fields
- How the search results are formatted (brief results)

URLs indexed into the Catalog from web sites are unusual (when compared to other Catalog data sources) in that they do not participate in many aspects of the Catalog:
- Web pages cannot be annotated (tags, comments, translations, or transcriptions).
- There is no “content detail” for web pages. Instead the user is simply taken directly to the web page.
- There are no custom advanced search fields for web pages.

Therefore, these sections will not be required in this DMD.

1.1 NARA Web Site Content

The main web site is extremely valuable for introductory and reference material about the many resources of the National Archives, from detailed databases to exhibitions. These pages answer common questions of the general public and beginning searchers. In addition, blogs within the domain provide useful information and insights into current issues and activities of the National Archives and Records Administration.

Presidential Libraries are archives for preserving and making accessible papers, records, and other historical materials of U.S. Presidents and their administrations for public study and discussion without regard for political considerations or affiliations. Presidential Libraries are important sources for historians, researchers and anyone studying our presidents and history. Web pages indexed from these resources are presented as part of search results, allowing users easy access to this information.

1.2 What is a DMD?

The purpose of a Data Model Design document (DMD) is to document and describe all relevant data fields from a data source necessary to support all Catalog functionality.
This metadata includes all data fields and their structure (nesting, type, number of values, etc.).

The DMD further describes how metadata values are transformed and stored within the Catalog. This careful accounting of data processing is required to gain a complete understanding of how every field is handled through the Catalog system.

Finally, the DMD also describes how metadata values are presented to the user, in the brief results, on the content detail page (aka the “full results”), from API calls, and in various metadata downloads.

The following diagram shows how metadata is processed and mapped for the Web Sites DMD:

[Diagram: metadata processing and mapping for the Web Sites DMD]

Note that the major section numbers are maintained across all DMDs. So even though sections 3 and 5 are not required in this DMD, empty place-holders will remain so that numbering is consistent when comparing the Web Sites DMD with other DMDs for other data sources.

Specifically, this DMD includes the following:
- Extracting metadata and text content from web pages. This includes key metadata fields (title, web area) as well as text content.
- Index Representation (section 4): how the web site metadata fields are represented in the search engine indexes.
- Brief results presentation (section 6): identifies how index fields are mapped to show the brief results for web pages.

1.3 Document Conventions

Since there are many different metadata fields for different purposes and from different systems, field mappings will be used throughout this document to clearly identify the originating source for every field, as follows:

Abbrev | Description
WEB | This abbreviation will be used for any metadata from a web page. For example: WEB/title will represent the <title></title> field extracted from the web page.
I | Fields from the search engine index. For example, “I/title” will represent the title as it is stored in the search engine index for the record.

2 Web Sites and Content Extraction

This section covers the web sites crawled by the Catalog web crawler and the metadata extraction and content processing/parsing required for each site.

2.1 Web Sites and Sub-Sites

The web sites and sub-sites to be crawled are listed below.

Root URL | Web Area

Web Site
- National Archives:

Major Areas within Web Site
- National Archives: About
- National Archives: Calendar
- National Archives: Calendar
- National Archives: Contact
- National Archives: Doing Business with NARA
- National Archives: En Español
- National Archives: Equal Employment Opportunity Program
- National Archives: FAQs
- National Archives: Federal Employees
- National Archives: Federal Records Centers
- National Archives: Federal Register
- National Archives: Forms
- National Archives: Genealogy
- National Archives: Grants
- National Archives: Jobs, Internships & Volunteering
- National Archives: Legislative Branch
- National Archives: Locations
- National Archives: Members of Congress
- National Archives: Military Records
- National Archives: Office of the Inspector General (OIG)
- National Archives: Online Exhibits
- National Archives: Open Government at the National Archives
- National Archives: Preservation
- National Archives: Presidential Libraries
- National Archives: Press/Journalists
- National Archives: Prologue Magazine
- National Archives: Publications
- National Archives: Records Management
- National Archives: Research
- National Archives: Shop
- National Archives: Teacher’s Resources
- National Archives: Veterans

- National Archives Museum
- National Commission on Terrorist Attacks upon the United States
- National Commission on Terrorist Attacks upon the United States
- Financial Crisis Inquiry Commission
- Federal Register National Archives
- The Office of the Federal Register
- The Our Documents Initiative
- The Presidential Timeline

Blogs
- National Archives: Blogs
- National Archives Blog: Annotation / NHPRC
- National Archives Blog: AOTUS Blog
- National Archives Blog: The Carter Chronicle
- National Archives Blog: Education Updates
- National Archives Blog: FOIA Ombudsman
- National Archives Blog: Inside Innovation
- National Archives Blog: Media Matters
- National Archives Blog: NARAtions
- National Archives Blog: National Declassification Center
- National Archives Blog: Prologue: Pieces of History
- National Archives Blog: Records Express
- National Archives Blog: Rediscovering Black History Blog
- National Archives Blog: The Text Message
- National Archives Blog: Transforming Classification
- National Archives Blog: The Hoover Blackboard

Presidential Libraries
- Herbert Hoover Presidential Library & Museum
- Franklin D. Roosevelt Presidential Library and Museum
- Harry S. Truman Library & Museum
- Dwight D. Eisenhower Library, Museum, and Boyhood Home
- John F. Kennedy Presidential Library and Museum
- LBJ Presidential Library
- Nixon Presidential Library & Museum
- Gerald R. Ford Presidential Library & Museum
- Jimmy Carter Library & Museum
- Ronald Reagan Presidential Library & Museum
- George Bush Presidential Library and Museum
- William J Clinton Presidential Library & Museum
- George W. Bush Presidential Library and Museum

Web Sites Not Included:

The following web sites are not included because they are not presidential libraries and are not contained within the main National Archives web domain:
- The Federal Register Blog
- Archives Wiki

2.2 Standard Metadata

The Catalog crawls the website, the associated blogs and the presidential libraries mentioned above, starting with the site root or specified start page. A set of metadata is extracted from each of the pages. Then each page plus its metadata is indexed into the Catalog.

Note: The “WEB/” prefix is used to identify all metadata fields which come from the Web crawler or are extracted from the HTML page. See section 1.3 above for more details.

The metadata to collect from each web page is listed below:

Field: WEB/type
  Type: string
  From: Mapping Table
  Description: “archivesWeb” or “presidentialWeb”

Field: WEB/url
  Type: string
  From: Crawler
  Description: URL of the web page.

Field: WEB/mime
  Type: string
  From: Text extraction or HTML page
  Description: Either “text/html” for all web pages, or the mime type as determined by text extraction.

Field: WEB/title
  Type: string
  From: HTML Page
  Description: This is the title of the webpage or the blog post. This metadata will be displayed to the user.
  If no title is found, the web area may be used instead or, for a blog, the blog title (title tag of blog web page).

Field: WEB/area
  Type: string
  From: URL Parsing and Mapping Table
  Description: The area of the web site to which a web page belongs. For example, with presidential libraries, this is the title of the library web site. For a blog, it is the title of the blog (AOTUS). Also, some sub-areas within the main web site (such as “National Archives: Teacher’s Resources”) will be separately identified.

Field: WEB/areaUrl
  Type: string
  From: Mapping Table
  Description: The URL for the web area, such as the home page for the library.

Field: WEB/date
  Type: string
  From: URL or HTML Page
  Description: The date of the web page, if identifiable.

Field: WEB/content
  Type: string
  From: HTML Page
  Description: Content of the web page. Note that some web pages may have a descriptive tag to identify content. If this does not exist, it will be all of the non-tag text.

Note that not every web site or web page will have all the above metadata. At a minimum the following are needed:
- WEB/type
- WEB/title
- WEB/area
- WEB/areaUrl
- WEB/content

2.3 Metadata Parsing for the Site

This section provides information on how the standard metadata may be obtained, followed by some examples. For the title, different tags may be appropriate if the main <title> tag is either missing or not informative for a subsection of the site.

Field: WEB/type
  Instructions: “archivesWeb”

Field: WEB/url
  Instructions: URL of web page (from crawler).

Field: WEB/mime
  Instructions: “text/html”

Field: WEB/title
  Instructions: The title will be extracted from any of the following tags:
  - <title>
  - <meta name="description">
  - <h1>
  - URL file name
  The tags are in priority order. For example, if no title can be extracted from <title>, it will be extracted from <meta name="description">.

Field: WEB/area
  Instructions: This will come from the “Web Area” column of the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.
  For example, the URL of a page under the Office of the Inspector General root URL matches that root URL in the table in section 2.1.
  Therefore, the “WEB/area” field will be set to: “National Archives: Office of the Inspector General (OIG)”.

Field: WEB/areaUrl
  Instructions: This will be the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler (the “Root URL” column of the same row that supplies WEB/area).

Field: WEB/date
  Instructions: From the <meta name="date"> tag. Parsed as YYYY-MM-DD.

Field: WEB/content
  Instructions:
  - If both <!-- startindex --> and <!-- stopindex --> exist: index all text content between the two tags.
  - If only <!-- startindex --> exists: index all text content from startindex to the end of the web page.
  - Otherwise: index all text content on the web page.
  Notes:
  - Only index text content. Do not index HTML tags, HTML attributes, or HTML attribute values.
  - Ignore all content found between <script> and </script>.
  - Do not index the content of <!-- XML comments -->.
  See below for an example.

Example of startindex & stopindex:

This example is from the top-level Teachers' Resources landing page.

    <head>
      <title>Teachers' Resources</title>
      <meta name="description" content="Resources for Teachers" />
      <meta name="date" content="2013-08-21" />
      ...
    <body>
    <div id="container">
      <div id="header">
      ...
      <div id="container2">
      <div id="content">
      <!-- startindex -->
      .. Content to index occurs in here.
      <!--stopindex-->
      <div class="menu connect">
      ...
    </body>
    </html>

2.4 Metadata Parsing for NARA Blogs

The majority of the NARA blogs have a consistent format. The exception is the AOTUS blog. The following table shows how to obtain the standard metadata.

Field: WEB/type
  Instructions: “archivesWeb”

Field: WEB/url
  Instructions: URL of web page (from crawler).

Field: WEB/mime
  Instructions: “text/html”

Field: WEB/title
  Instructions: For all blogs, the <title> tag will supply the title. However, all blog titles contain both the title of the blog and the title of the article. For example:
      AOTUS: Collector in Chief | Calling All Walt Whitman Fans
  To improve accuracy, we will remove the title of the blog from the above, leaving only “Calling All Walt Whitman Fans”.
  The title of the blog itself will be captured in the WEB/area field below.
  - For the AOTUS blog, extract the blog article title from the text after the “|” pipe character.
  - For all other blogs, extract the blog article title from the text after the “&raquo;” character.
  - If the delimiter specified above does not occur in the title, take the entire <title> content as the title.

Field: WEB/area
  Instructions: This will come from the “Web Area” column of the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

Field: WEB/areaUrl
  Instructions: This will be the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler (the “Root URL” column of the same row that supplies WEB/area).

Field: WEB/date
  Instructions: Parse the date from the text of the blog:
  - AOTUS: <p class="meta">Written on February 14, 2014 |
  - Other blogs: <span class="gray">on February 3, 2014</span>
  See examples.

Field: WEB/content
  Instructions: There appears to be no standard for marking the content area of blogs. Therefore, the entire text content of the HTML page will be indexed as the content.
  Notes:
  - Only index text content. Do not index HTML tags, HTML attributes, or HTML attribute values.
  - Ignore all content found between <script> and </script>.
  - Do not index the content of <!-- XML comments -->.

2.4.1 Parsing Date for AOTUS Blog

Dates for this blog can be identified with <p class="meta"> as shown below.

    <h2><a href="..">Happy Valentine’s Day</a></h2>
    <p class="meta">Written on February 14, 2014 | <a href="…">
    <p><a href="...">
    <p>This 1918 valentine refers to the World War I effort…
    </div>-->

2.4.2 Metadata Parsing for other National Archives Blogs

Dates for other blogs occur after the <span class="gray"> tag.

    <div class="post">
      <div>
        <img alt="...">
        <h2><a href="..." title="National Declassification Center Completes Quality Assurance of Backlog Final">National Declassification Center Completes Quality Assurance of Backlog Final</a></h2>
        <div class="small">by <a href="...">Nancy Soderberg</a> <span class="gray">on February 3, 2014</span></div>
      </div>
    </div>
    </br /><br />

2.5 Presidential Libraries

The page format of these libraries is unfortunately not consistent. The following strategy provides a set of tags to check to obtain the metadata. In some cases, the title tag of the library web pages is just the official name of the library and is the same for the various pages. In this case, if possible, a different tag should be used to obtain the title of the specific page.

Field: WEB/type
  Instructions: “presidentialWeb”

Field: WEB/url
  Instructions: URL of web page (from crawler).

Field: WEB/mime
  Instructions: “text/html”

Field: WEB/title
  Instructions: The title will be extracted from any of the following tags:
  - <title>
  - <h1>
  - <h2>
  - <meta name="description">
  - <meta name="keywords">
  - URL file name
  The tags are in priority order. For example, if no title can be extracted from <title>, it will be extracted from <h1>.

Field: WEB/area
  Instructions: This will come from the “Web Area” column of the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.
  For example, the URL of a page under the Harry S. Truman Library & Museum root URL matches that root URL in the table in section 2.1. Therefore, the “WEB/area” field will be set to: “Harry S. Truman Library & Museum”.

Field: WEB/areaUrl
  Instructions: This will be the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler (the “Root URL” column of the same row that supplies WEB/area).

Field: WEB/date
  Instructions: From the <meta name="date"> tag.
  Parsed as YYYY-MM-DD. If the tag does not exist, then leave the date blank.

Field: WEB/content
  Instructions:
  - If both <!-- startindex --> and <!-- stopindex --> exist: index all text content between the two tags.
  - If only <!-- startindex --> exists: index all text content from startindex to the end of the web page.
  - Otherwise: index all text content on the web page.
  Notes:
  - Most presidential libraries do not use this convention.
  - Only index text content. Do not index HTML tags, HTML attributes, or HTML attribute values.
  - Ignore all content found between <script> and </script>.
  - Do not index the content of <!-- XML comments -->.

2.6 Processing Non-HTML content

All non-HTML files will be processed as follows:
- All media files (video, images, audio): skipped.
- Document files (PDF, MS-Word, etc.): processed through text extraction (see below).

2.6.1 Text Extraction

All document files (PDF, MS-Word, etc.) will be run through text extraction to extract metadata and text content.

The tool for text extraction (depending on the results of the Analysis of Alternatives, currently in progress) will likely be Apache Tika.

2.6.2 Metadata Mapping for Non-HTML Content

The output of Apache Tika will be mapped to the WEB/ fields as follows:

Field: WEB/type
  Instructions: If the URL contains the main National Archives web domain, this will be “archivesWeb”. Otherwise it will be “presidentialWeb”.

Field: WEB/url
  Instructions: URL of web page (from crawler).

Field: WEB/mime
  Instructions: From <meta name="Content-Type" content=". . ."/>

Field: WEB/title
  Instructions: The title will be extracted from any of the following tags:
  - <title>
  - <h1>
  - File name
  The tags are in priority order.
  For example, if no title can be extracted from <title>, it will be extracted from <h1>.

Field: WEB/area
  Instructions: This will come from the “Web Area” column of the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

Field: WEB/areaUrl
  Instructions: This will be the longest root URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler (the “Root URL” column of the same row that supplies WEB/area).

Field: WEB/date
  Instructions: The date will come from any of the following tags:
  - <meta name="Last-Modified" content=". . ."/>
  - <meta name="date" content=". . ."/>
  - <meta name="Creation-Date" content=". . ."/>
  The tags are listed in priority order. The date will be found in the @content attribute. The date will be in ISO 8601 format. If no date exists, then leave the date blank.

Field: WEB/content
  Instructions: All text content produced by Apache Tika (minus XHTML tags and XML comments) will be indexed as the text content.

3 File Processing

Beyond the metadata parsing described in section 2, there is no file processing required for the Web Sites data source.

This section is retained to ensure that major section numbers are consistent across all Data Model Design (DMD) documents.

4 Mapping to Index Fields

The following table maps the metadata to the appropriate index field. The metadata elements are not given here because in some cases they differ based on the particular web site they originate from.
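As a rough illustration of the simpler mappings specified in section 4.1, the following Python sketch turns a dictionary of WEB/ metadata into a dictionary of I/ index fields. It is only a sketch under stated assumptions: the function names and the list of leading words are hypothetical, and the fields whose rules are not fully specified in this document (I/oldScope, I/iconType, I/fileFormat, I/dateRangeFacet) are deliberately left out.

```python
# Words dropped from the start of a title when building I/titleSort.
# This list is an assumption; the design only says "articles and prepositions".
LEADING_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "from", "by", "with",
}

TEASER_LENGTH = 500  # I/teaser is the first 500 characters of WEB/content


def title_sort(title: str) -> str:
    """Remove articles and prepositions from the start of a title."""
    words = title.split()
    # Always keep at least one word so the sort key is never empty.
    while len(words) > 1 and words[0].lower() in LEADING_WORDS:
        words.pop(0)
    return " ".join(words)


def to_index_fields(web: dict) -> dict:
    """Map WEB/ metadata (hypothetical dict layout) to I/ index fields."""
    content = web.get("WEB/content", "")
    date = web.get("WEB/date", "")  # blank when no date was identified
    return {
        "I/url": web["WEB/url"],
        "I/source": "web",
        "I/type": web["WEB/type"],
        "I/originalMimeType": web.get("WEB/mime", ""),
        "I/tabType": "all,online,web",
        "I/materialsType": "web",
        "I/title": web["WEB/title"],
        "I/titleSort": title_sort(web["WEB/title"]),
        "I/allTitles": web["WEB/title"],
        "I/webArea": web["WEB/area"],
        "I/webAreaUrl": web["WEB/areaUrl"],
        "I/content": content,
        "I/teaser": content[:TEASER_LENGTH],
        "I/titleDate": date,
        "I/productionDate": date,
        "I/productionDateQualifier": "YMD",
    }
```

Keeping the constant values ("web", "all,online,web", "YMD") next to the copied fields makes it easy to compare the sketch line by line against the mapping table.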
See the mappings of document tags to metadata given in section 1.3.

4.1 Field Metadata Mappings

Index Field | WEB Metadata Name | Purpose
I/url | WEB/url | results
I/source | “web” | search
I/type | WEB/type | search
I/oldScope | If WEB/type == “archivesWeb”: set to “”. If WEB/type == “presidentialWeb”: set to “presidential” | search, facet
I/iconType | WEB/mime (normalized) | results
I/fileFormat | WEB/mime (normalized) | results
I/originalMimeType | WEB/mime | results
I/tabType | “all,online,web” | results
I/materialsType | “web” | results
I/title | WEB/title | results, search
I/titleSort | WEB/title with articles and prepositions from the start removed. | sorting
I/allTitles | WEB/title | results, search
I/webArea | WEB/area | results
I/webAreaUrl | WEB/areaUrl | results
I/content | WEB/content | results, search, teasers
I/teaser | The first 500 characters of WEB/content | results
I/titleDate | WEB/date | results
I/dateRangeFacet | Choose the appropriate date range facet which covers the WEB/date value. Leave blank if WEB/date is blank. | facets
I/productionDate | WEB/date | results, sorting
I/productionDateQualifier | “YMD” | results

Refining by location should only return archival records, not web page results.

4.2 Mapping to Keywords Relevancy Model

Web page metadata will be mapped to the Catalog relevancy model as follows. Note that all of the fields specified (grank1, grank2, grank3, and content) will be searched by all “q=” parameters.

When multiple WEB/ fields are mapped to the same relevancy field, all of their content will be concatenated together into the same field and searched together.

WEB/ field | Relevancy Field
WEB/title | I/grank2
WEB/url | I/grank3
WEB/area | I/content
WEB/content | I/content

5 Search Fields

There are no special fields for advanced search over web sites.

6 Search Results Presentation

This section details the search results presentation for web site results.

For simple keyword searches, the top three web site matches will be returned.
They are returned as a group in the search results list, under “National Archives website pages”.

6.1 Search Results (aka “Brief Results” Display)

As shown in this example, only the top 3 website results are shown, grouped under the heading “National Archives website pages”. The globe icon is used.

[Screenshot: brief results example for web site results]

This table describes the 3 lines that make up each web site within the group.

Line | Solr Fields & Pattern
1 | {I/title} LINK: I/url
2 | {I/webArea} LINK: I/webAreaUrl
3 | {Highlighted teaser from the search engine, or I/teaser} [Note: Display first 200 characters as snippet in the result]

Notes:
- “National Archives website pages” links to a tab with all the web site search results.
- The globe icon is a link to the same tab.

7 Content Details Presentation

The links in the search results go directly to the external website, and therefore there is no content detail page for web pages within the Catalog.
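Two of the parsing rules described in sections 2.3 and 2.4, the longest-prefix lookup that supplies WEB/area and WEB/areaUrl and the removal of the blog name from blog titles, could be sketched in Python as follows. This is a minimal illustration rather than the Catalog's actual implementation: the table rows use placeholder URLs, the function names are hypothetical, and it assumes that HTML entities such as &raquo; have already been decoded and that the title is split at the first occurrence of the delimiter.

```python
from typing import Optional, Tuple

# (root URL, web area) rows standing in for the table in section 2.1.
# The URLs are placeholders, not the real root URLs.
AREA_TABLE = [
    ("https://example.org/", "Example Site"),
    ("https://example.org/oig/",
     "National Archives: Office of the Inspector General (OIG)"),
]


def match_web_area(url: str) -> Optional[Tuple[str, str]]:
    """Return (WEB/area, WEB/areaUrl) for the longest root URL prefixing url."""
    best = None
    for root, area in AREA_TABLE:
        if url.startswith(root) and (best is None or len(root) > len(best[1])):
            best = (area, root)
    return best  # None when no root URL matches


def blog_article_title(raw_title: str, is_aotus: bool) -> str:
    """Strip the blog name from a blog page <title> value."""
    # AOTUS uses "|"; other blogs use the decoded form of &raquo;.
    delimiter = "|" if is_aotus else "\u00bb"
    if delimiter not in raw_title:
        # No delimiter: take the entire <title> content as the title.
        return raw_title.strip()
    return raw_title.split(delimiter, 1)[1].strip()
```

For example, `blog_article_title("AOTUS: Collector in Chief | Calling All Walt Whitman Fans", True)` returns "Calling All Walt Whitman Fans", matching the example in section 2.4.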