
NARA's National Archives Catalog (NAC)
Detailed Design Document (Final Version)
Version 2.1
April 26, 2017

Prepared for: National Archives and Records Administration (NARA)
Prepared by: Project Performance Company, LLC, 1760 Old Meadow Road, McLean, VA 22102

Copyright © 2017, Project Performance Company, LLC. All Rights Reserved. No part of this document may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the Project Performance Company, LLC.

Record of Changes/Version History

Version | Date       | Sections Changed | Description | Person Entering Change
1.0     | 12/6/2016  | All | Initial draft. | John Henson
1.1     | 12/30/2016 | 2.1.4, 2.1.9, 2.1.10, 2.2.1, 2.2.2, 2.2.3, 2.2.5, 2.2.6, 2.3 | Updated after first round of feedback. | John Henson
1.2     | 1/5/2017   | Removed 2.2.1 (Expanded Field Functionality); updated 2.3 | Updated after discussions with NARA. | John Henson
1.3     | 3/20/2017  | All | Updated all sections with more detail; updated the API section with new field operators and clarification. | John Henson
1.4     | 3/24/2017  | — | Changed references to the DAS Whitelist to the API field name whitelist; clarified its purpose in section 2.2.6 and its contents in section 2.4.2. | John Henson
1.5     | 3/28/2017  | — | Added details regarding the Lambda/S3 challenge; updated the Lambda APS polling information. | John Henson
1.6     | 4/24/2017  | 2.1.3, 2.2.2 | Updated the Lambda/S3 information with the Lambda APS solution; explained the new default Solr sorting. | John Henson

Table of Contents

1 Introduction and Overview
  1.1 Digital Object Processing
  1.2 REST API Modifications and Enhancements
  1.3 Catalog Search Engine Optimization
  1.4 Challenges
  1.5 Development Approach and Environments
2 Detailed Functional Design and Implementation
  2.1 NAC Digital Object Processing
    2.1.1 Digital Object Processing Procedure
    2.1.2 New Object Key Format
    2.1.3 AWS Lambda
    2.1.4 Amazon S3 Triggers
    2.1.5 Image Format Conversion
    2.1.6 Risk Mitigation Strategy
    2.1.7 Statistics Tracking
    2.1.8 Scope
    2.1.9 Data Migration Strategy
    2.1.10 S3 Storage Costs
  2.2 REST API Modifications and Enhancements
    2.2.1 Solr Schema Updates and Re-indexing Operation
    2.2.2 API Limit Refactoring
    2.2.3 Comment IDs in Public Contribution Search Results
    2.2.4 Exact Search
    2.2.5 Find Records With/Without Field Values
    2.2.6 Field Whitelist
  2.3 Catalog Search Engine Enhancement
    2.3.1 Refactor Link URLs
    2.3.2 Include Additional Metadata
    2.3.3 Paginated List of All NAID Links
  2.4 Assumptions
    2.4.1 NAC Ingestion Delete Functionality
    2.4.2 API Field Name Whitelist
3 Security
  3.1 Digital Object Processing
  3.2 REST API Modifications and Enhancements
  3.3 Catalog Search Engine Optimization
4 Appendix A

1 Introduction and Overview

This Task Order (TO) addresses issues concerning three components of the NAC infrastructure: Digital Object Processing, the publicly accessible RESTful API, and the Catalog public user interface.

1.1 Digital Object Processing

The existing DAS Export ingestion process is untenable given the expected growth of the data. The current process places both metadata preparation and digital object processing in the critical path, causing significant delays in overall processing time.
While the existing Aspire application is designed to perform concurrent tasks and to process multiple data elements simultaneously, that capability has still proven too limited. To solve this problem, PPC proposes the use of Amazon Web Services (AWS) Lambda for digital object processing, with a fallback system for extra-large files (over 200Mb) and for files that might otherwise cause an error during Lambda processing. This fallback system will also be used to manage the upload of the many derivative Deep Zoom image tiles to S3. There are other benefits to this solution that will have a significant impact on overall NAC and DAS operations and cost.

1.2 REST API Modifications and Enhancements

The current Catalog API implementation has proven to be too limited and only supports a subset of the expected functionality. There are various fields that, while indexed, are not available for operations like sorting. To permit exact text matching on nested fields within Description, Authority, Objects, and Public Contributions XML, PPC will reindex the entire set of data, creating new fields in the Solr schema for each potential leaf node and attribute, indexing not only the value of the node or attribute but also the combination of the value and its path. This will also allow testing for the existence of a leaf node, or of parent nodes, as defined in the API field name whitelist.

Additionally, the current implementation of the API does not permit the full traversal of an entire search result set for a given query. The logic dictating what the limits are and how they are imposed is flawed. Currently, there is a limit configured at 10,000 records. This limit is the absolute number of documents in a result set that a user can retrieve: excluding sorting, ordering, and various other operations, if a result set contains 400,000 documents, only the first 10,000 records of that set can be retrieved.
The reason for this is that the limit is calculated as the sum of the offset (how deep into the set you want to go; where the results start) and rows (the number of records you want in a given request) parameters. The expected behavior is that the limit be imposed on the number of rows requested, so that one can traverse the entire result set, up to--in this case--10,000 records at a time. PPC will correct this issue and additionally implement a cursor-based paging mechanism that is optimized for traversing large datasets. This change will not affect the behavior of the Catalog User Interface and its limits.

Next, PPC will index the Annotations database IDs of public comments provided through the Catalog UI, adding them to the Search Index and to search output.

1.3 Catalog Search Engine Optimization

In its current implementation, the Catalog UI relies heavily on dynamically generated links that are, for example, composed when a user invokes a JavaScript function by clicking what appears to be a regular link (the act of which forces the browser to redirect to the newly generated URL). The problem is that such links are not followed by the major search engine crawlers (Google, Bing, Lycos, etc.). To solve this, the Catalog UI--and subsequently the API--will be refactored to generate static URLs that can be indexed by the search engine services.

1.4 Challenges

The new fields in the Solr schema, and the addition of public contribution IDs, will require a complete DAS export and ingest operation to reindex all the data.

There are limits to AWS Lambda Functions. Specifically, there is a soft limit of 100 simultaneously running Lambda Function executions per second. There is also a limit to the amount of disk space each function execution has to work with: 500Mb. That means a file greater than 200Mb could result in derived objects (e.g., tiled zoom images) taking up more space than is available.

The nature of the artifacts presents an upper limit to how many AWS Lambda image processing Functions can run simultaneously. AWS S3 has a hard limit (300) on the number of PUT operations per second, and the number of image tiles generated per image function invocation is in the hundreds.

1.5 Development Approach and Environments

During AWS Lambda Function development, all code will be written, built, and maintained on PPC-provided resources, and deployed to the AWS environment under the NAC AWS account. The "Lambda Alternate Processing Server (APS)" will be provisioned by the managed services vendor and configured by PPC.

Catalog API, Solr server, and UI development will be performed on PPC-owned resources, with unit and integration testing occurring in the existing NAC Development environment.

All source code will be maintained in PPC's private GitHub repository. Digital Object processing modules will be provided to NARA to be published for consumption by the general public.

To satisfy Req #1356 ("The system shall search at a minimum Threshold rate of 10 Queries per Second (QPS) per Node."), we will perform performance testing using standard test tools such as Apache JMeter, and deliver a corresponding test report.

2 Detailed Functional Design and Implementation

2.1 NAC Digital Object Processing

AWS Lambda, combined with Amazon S3 triggers, will permit the parallel processing of digital objects as they are uploaded to S3. The primary advantage over the existing implementation is the true parallel processing capability and the immediate availability of the digital objects and their derived renditions. There will be at least two Lambda Functions: one to extract text, metadata, and paginated PDF text, and one to generate thumbnails and Deep Zoom tiles.

2.1.1 Digital Object Processing Procedure

The immediate advantage of this approach is the elimination of unnecessary, redundant S3 storage.
In the existing process, DAS users upload data to a staging S3 bucket, where the NAC Ingest operation retrieves the data for processing and ultimate placement in a separate S3 bucket. There is no explicit removal of this data from the originating bucket.

The proposed approach allows NARA staff to place digital objects into a pre-designated S3 bucket and "Landing Zone" folder for permanent storage, negating the need for intermediate storage. Going forward, this pattern will yield significant cost savings for NARA. Additionally, digital object management is greatly simplified: both DAS and NAC use the same paths and filenames.

A series of steps must occur for a digital object to be made available in the Catalog:

1. NARA staff upload digital object files to the specified production "landing zone".
2. NARA staff create DAS Descriptions referencing the public URL for the newly uploaded digital object files.
3. A DAS Delta Export is performed as scheduled.
4. A NAC Ingestion operation is performed on the DAS export.
5. NAC Ingestion reaches back to S3 for digital object technical metadata and extracted text.
6. NAC Ingestion submits the record, including technical metadata and extracted text, to Solr for indexing.
7. Once indexed by Solr, DAS Descriptions and associated digital objects are visible in the Catalog.

2.1.2 New Object Key Format

To facilitate this new process, S3 object keys will be changed from the current NAID-based directory structure to the digital object path and filename. This information is already available in the Description metadata and stored in the existing Solr index. As an example, assume the new bucket is called "NARAStorage", the Landing Zone folder is called "lz", and the final folder for renditions is called "live". A NARA DAS user will upload a file called image01.jpg with a path of /dcmetro/stillpix/002/images/ to the NARAStorage bucket with key prefix "/lz".
The full object key will be "/lz/dcmetro/stillpix/002/images/image01.jpg", and it will be used as the base key for all derivative renditions: thumbnails, for example, will automatically be saved and referenced in the "NARAStorage" S3 bucket at "/live/dcmetro/stillpix/002/images/image01.jpg/opa-renditions/thumbnails/image01.jpg-thumb.jpg". This change will be seamless to the end user.

If for any reason a digital object needs to be updated, uploading the file to the S3 bucket as part of the normal process will trigger the AWS Lambda functions, and the derived renditions will be regenerated and will overwrite the previous versions, making the update immediately available in the Catalog.

2.1.3 AWS Lambda

AWS Lambda is a high-availability, highly scalable "serverless" compute infrastructure. For nearly all cases, AWS Lambda functions will handle Catalog digital object processing.

PPC proposes the development of at least two Lambda Functions: one for text extraction and one for image processing. The primary reason for two separate functions is that the ideal library for text extraction is a popular Java library, Apache Tika, while the image processing program VIPS is already available as a prepackaged NodeJS module called Sharp. Therefore, the text extraction function will be a Java function, and the image processing function a NodeJS function.

The text-extraction function will process each digital object, parsing each file for textual information and producing a text file for placement in S3. If the object is a PDF file, an additional paginated text file will also be generated.
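As a concrete sketch of the rendition key scheme described under New Object Key Format above (bucket and folder names are those from the example; the helper function name is ours, not part of the design):

```python
import posixpath

def thumbnail_key(landing_key, lz_prefix="/lz", live_prefix="/live"):
    """Derive the thumbnail rendition key for an uploaded digital object.

    Mirrors the example above: the Landing Zone key becomes the base key
    for all derivative renditions under the "live" folder.
    """
    relative = landing_key[len(lz_prefix):]      # /dcmetro/.../image01.jpg
    filename = posixpath.basename(landing_key)   # image01.jpg
    base = live_prefix + relative                # base key for all renditions
    return f"{base}/opa-renditions/thumbnails/{filename}-thumb.jpg"

key = thumbnail_key("/lz/dcmetro/stillpix/002/images/image01.jpg")
# -> /live/dcmetro/stillpix/002/images/image01.jpg/opa-renditions/thumbnails/image01.jpg-thumb.jpg
```

Because Deep Zoom tiles and extracted-text files would hang off the same base key, a single prefix listing under the object's base key finds every derived artifact for that object.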
In addition to text extraction, the text-extraction Lambda function will produce a native Java MetaData object, which will be serialized and stored in S3 alongside the text files, and subsequently retrieved and deserialized by the Catalog ingestion process for indexing in Solr.

The image processing function will perform three primary tasks: JPEG2000 conversion, thumbnail creation, and Deep Zoom tile creation. As previously described, these artifacts will be placed in S3 under the base object key of the originating digital object.

The image processing Lambda function, when run in the AWS Lambda environment, will produce the Deep Zoom image tiles, but rather than uploading those images directly to AWS S3, the function will package the images in a zip archive and upload that file to S3. A corresponding message will be queued for downstream processing by the Lambda Alternate Processing Server, which will retrieve the zip file from AWS S3, extract the images, upload them to S3 individually in a manner that avoids the S3 PUT limits (300/second), and finally remove the zip file.

These AWS Lambda Functions will be configured for the highest memory option available, 1.5Gb at the time of this writing. AWS Lambda scales horizontally with the amount of data being uploaded to S3, with a soft limit of 100 concurrent processes running at any given second. The limit may be increased by AWS upon request, with relevant statistics and justification. Performance and load testing will be performed and will inform a decision on whether to raise this limit, and by how much if necessary.

2.1.4 Amazon S3 Triggers

Because of constraints imposed on the configuration of the S3 triggers, the text extraction function will execute for all uploaded objects and, for all supported object types, will invoke the image processing Lambda function via an AWS Simple Notification Service (SNS) message.
Supported formats include JPEG2000, JPEG, GIF, PNG, BMP, PDF, and TIFF.

To save on costs, all original objects uploaded to the S3 Landing Zone should be classified as Standard Infrequent Access upon upload. This is not an automatic classification, and it would need to be specified in the configuration of the tool being used to perform the upload.

2.1.5 Image Format Conversion

Due to limitations of the libraries and/or infrastructure, certain formats must first be converted to JPEG before derivative artifacts can be generated. The formats to be converted are JPEG2000 and BMP. JPEG2000 and BMP images are first converted to JPEG, and the result is then processed normally as a consequence of being placed in the S3 Digital Object Landing Zone.

2.1.6 Risk Mitigation Strategy

In the few cases where a digital object is too large to load into the Lambda Function memory or temporary disk area, a separate processing mechanism will be used. The Lambda Alternate Processing Server (Lambda APS) will be provisioned by the managed services vendor and configured with a custom framework that incorporates the Lambda function code. Lambda Functions that encounter "extra-large" objects, or that fail for other reasons, will place the digital object trigger event information into an AWS Simple Queueing Service (SQS) queue. At the top of every hour, a polling script on the Catalog "jump" host will start the Lambda APS host if the queue has reached a preconfigured length. (By starting only at the top of the hour, NARA saves on EC2 costs.) The APS will automatically retrieve items from the queue, processing them for as long as there are items on the queue.
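The APS drain-and-shutdown behavior might be sketched as follows; the queue, processing, and shutdown hooks here are stand-ins for SQS, the re-hosted Lambda function code, and the EC2 stop call (all names are ours, not part of the design):

```python
def drain_queue(queue, process, shutdown):
    """Sketch of the Lambda APS loop: pull trigger events until the
    queue is empty, then power the host down to save on EC2 costs."""
    drained = 0
    while True:
        batch = queue.receive(max_messages=10)
        if not batch:
            shutdown()              # queue is empty: stop the instance
            return drained
        for event in batch:
            process(event)          # run the Lambda function code locally
            queue.delete(event)     # remove the handled message
        drained += len(batch)

class MemoryQueue:
    """In-memory stand-in for the SQS queue (illustration only)."""
    def __init__(self, events):
        self.events = list(events)
    def receive(self, max_messages=10):
        batch = self.events[:max_messages]
        self.events = self.events[max_messages:]
        return batch
    def delete(self, event):
        pass  # receive() already popped the message in this stub

handled = []
queue = MemoryQueue(["evt-large-tiff", "evt-failed-pdf"])
count = drain_queue(queue, handled.append, shutdown=lambda: None)
```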
When the queue is empty, the server will shut itself down to save on EC2 costs. Lambda APS shutdown is configurable via a policy file stored in the same secure, private AWS S3 bucket where the Lambda functions themselves and other related artifacts are stored: nara-nac-lambda-functions.

Three new Amazon EC2 instances will be provisioned for the Lambda APS, one for each NAC environment: development, UAT, and production. In keeping with the existing naming scheme, they will be called dl01, ul01, and pl01.

2.1.7 Statistics Tracking

AWS Lambda logs all statements to CloudWatch Logs. Pertinent digital object processing information will be logged for each Lambda Function invocation and stored in CloudWatch. PPC will install the Splunk App for AWS on the NARA Catalog Splunk server, which will offer greater performance tracking and analytical capabilities and improve troubleshooting-related activities. The Lambda APS will be configured to forward its logs to Splunk for the same tracking capability.

2.1.8 Scope

The impact of the AWS Lambda-based digital object processing on the rest of the system is minimal, but it does affect several key components. To accommodate the new S3 object key format, the Catalog API will need to be updated so that calls from the UI are given the correct artifact URLs. Since the digital object processing stages no longer occur in the NAC Ingest operation, the stages related to these activities will be removed from the Ingestion Pipeline configuration. Additionally, the text extraction stages will be modified to reach back to S3 for the already-generated text files so that they can be indexed in Solr.

2.1.9 Data Migration Strategy

To account for the already existing data, an ETL process will be developed to copy the data and renditions to the new S3 bucket using the new object key format. Once complete, the existing bucket can be placed in AWS Glacier for safekeeping during a short validation period.
After that period, a decision can be made on whether to keep the data. All copied source files will be stored with the Standard Infrequent Access S3 storage class.

The anticipated length of time to migrate the data from the current object key format to the new format is on the order of months. To account for this, the API will be configured to check both key formats in S3 for digital objects and format the results accordingly. For new and migrated object keys, the URL used to access these digital objects will change from:

The Catalog web server configuration will include the following rewrite rule to redirect requests to AWS S3:

RewriteRule "^/catalogmedia/(.*)" "$1" [L,R=302]

2.1.10 S3 Storage Costs

At the time of this writing, the monthly cost for the Standard storage class for the Catalog's roughly 84 terabytes of objects is, according to the AWS Calculator, $2,772.48. By converting the storage class of these objects to Standard Infrequent Access (IA), the monthly cost of storage becomes $1,182.72, a cost savings of over 50%. All derivative objects, such as thumbnails and tile images, will be stored as Standard storage because they are generally below the minimum size threshold (128Kb) for the Standard Infrequent Access storage class.

2.2 REST API Modifications and Enhancements

2.2.1 Solr Schema Updates and Re-indexing Operation

The expanded feature set proposed for Task Order 4 requires significant changes to the Search Index and Catalog Ingestion processes. The current XML fields (description, authority, publicContributions, objects) in the Solr index are not configured for exact-match text search, nor do they allow for efficient field detection.
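As a preview of the flattening scheme detailed in the rest of this subsection, the transformation from nested XML to path-prefixed Solr values might be sketched as follows (helper names are ours; the production implementation is a custom Solr tokenizer, and real indexed paths carry the top-level description/authority prefix):

```python
import xml.etree.ElementTree as ET

def flatten(xml_text):
    """Flatten an XML document into (dotted-path, value) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    def walk(node, path):
        children = list(node)
        if not children and node.text and node.text.strip():
            pairs.append((path, node.text.strip()))
        for child in children:
            walk(child, f"{path}.{child.tag}")
    walk(root, root.tag)
    return pairs

def solr_value(path, value):
    """Combine path and value the way the special tokenizer indexes them."""
    return f"{{{path}}}{value}"

doc = """<series>
  <naId>98765432</naId>
  <title>President Hoover</title>
  <accessRestriction>
    <accessRestrictionNote>Access Granted</accessRestrictionNote>
  </accessRestriction>
</series>"""

pairs = flatten(doc)
# [('series.naId', '98765432'), ('series.title', 'President Hoover'),
#  ('series.accessRestriction.accessRestrictionNote', 'Access Granted')]

# The new top-level Solr field is named after the last path fragment:
field = "ex_" + pairs[2][0].split(".")[-1]   # ex_accessRestrictionNote
```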
In order to allow exact searches of the fields in the API field name whitelist, new fields must be configured in Solr that index the values of the leaf nodes and the paths to those leaf nodes.

Example: Consider the following sample document:

<series>
  <naId>98765432</naId>
  <title>President Hoover</title>
  <accessRestriction>
    <accessRestrictionNote>Access Granted</accessRestrictionNote>
  </accessRestriction>
</series>

The document would be flattened to provide a key/value pair for each data element:

series.naId=98765432
series.title=President Hoover
series.accessRestriction.accessRestrictionNote=Access Granted

The new top-level field names will be based on the last fragment of the XML path. So, for the series.accessRestriction.accessRestrictionNote field, we will derive a new top-level field name called "ex_accessRestrictionNote" in the Solr schema configuration.

When indexing this field, we will use a special Solr Tokenizer that indexes not only the text value of the field but also the original path to that value in the original XML document.

Example:

series.accessRestriction.accessRestrictionNote=Access Granted

becomes, in Solr:

ex_accessRestrictionNote:"{description.series.accessRestriction.accessRestrictionNote}Access Granted"

Another example:

description.series.title=President Hoover

becomes, in Solr:

ex_title:"{description.series.title}President Hoover"

The indexed paths will include the paths to the leaf nodes and attributes, and any intermediate paths that are also listed in the API field name whitelist.

Since TO4 removes digital object processing from the Ingestion Pipeline, in order to preserve previously generated object-related metadata (including file size), the existing Objects XML will be retrieved from S3 and incorporated into the records as they are processed. If it cannot be found in S3, an attempt will be made to retrieve the Objects XML from the Solr index.
If it cannot be found in either of those locations, the ingest procedure will build the Objects XML from the DAS record; however, the technical metadata will not be generated and will not be indexed. Because the production environment should contain the latest processed Objects XML for each DAS record that has digital objects, the chance of this occurring is very small.

2.2.2 API Limit Refactoring

To accommodate unlimited traversal through a query result set, the existing "offset+rows" limit will be removed. In its place, PPC will implement a new row limit that restricts the number of rows retrieved per request, regardless of the offset value. In this way, the only time a user should receive a limit error is when they attempt to request more rows than the configured limit allows. The "offset+rows" limit is currently set to 10,000, and PPC intends to configure the new row limit at the same figure.

In addition to the traditional row count and offset pagination mechanism, PPC will expose a feature of Solr that improves "deep paging" of large datasets. This is a cursor-based pagination mechanism that requires the client to retrieve, store, and use a token provided by the server to retrieve the next set of rows in the dataset.

Solr example: Initially query the Solr index with "cursorMark=*" and "rows=1000":

[jo+TO+k]&sort=titleSort+asc%2C+opaId+desc&rows=1000&wt=json&indent=true&cursorMark=*

The result set will contain a field called "nextCursorMark" with a long encoded string value. In this example, nextCursorMark is "AoIzam9obiBjaGFybGVzIHRob21hcytkZXNjLTExMzMxOA==".
To retrieve the next 1,000 records, PPC would modify the query, replacing the value of "cursorMark" with the value assigned to "nextCursorMark" in the first set of results:

[jo+TO+k]&sort=titleSort+asc%2C+opaId+desc&rows=1000&wt=json&indent=true&cursorMark=AoIzam9obiBjaGFybGVzIHRob21hcytkZXNjLTExMzMxOA==

PPC will refactor the Catalog API code to expose this feature and update the Catalog API documentation.

Pagination performance depends on the offset (start) and rows values; we plan to limit the rows value, but not the offset. As the offset increases, so does the demand on the Solr servers. If a user sets a very high offset, their intent is probably not to find relevant documents but to collect the entire search set for some other purpose. For those cases, cursorMark deep paging is ideal. PPC proposes modifying the timeout messages to instruct the user to narrow their search or consider the cursor-based search.

The cursor-based paging mechanism and traditional pagination are mutually exclusive and cannot be combined. If a user attempts to make a request using the cursorMark parameter while also including the offset parameter, they will receive an error.

The cursorMark implementation requires a unique field to be specified in the Solr sort parameter for tie-breaking. The default sort for regular searches is by score (relevance) in descending order. For records with matching scores, results are sorted by index time in descending order: the more recently indexed record is listed first. Since we want to stay as close as possible to the default sort order, PPC will configure the cursorMark sort field list so that records are sorted first by score in descending order, then by ingestedDateTime in descending order, and finally by opaId in ascending order.
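The client-side deep-paging loop described above can be sketched as follows; `fetch_page` is a stand-in for the HTTP request to the API or Solr, and all names are ours, not part of the design:

```python
def fetch_all(fetch_page, page_size=1000):
    """Traverse an entire result set with cursorMark deep paging.

    fetch_page(cursor, rows) must return (docs, nextCursorMark).
    Solr signals exhaustion by returning an unchanged cursor mark.
    """
    cursor = "*"                       # initial cursorMark
    while True:
        docs, next_cursor = fetch_page(cursor, page_size)
        yield from docs
        if next_cursor == cursor:      # unchanged mark: set is exhausted
            return
        cursor = next_cursor

# Demo against canned pages standing in for Solr responses.
pages = {"*": ([1, 2], "mark1"), "mark1": ([3], "mark2"), "mark2": ([], "mark2")}
results = list(fetch_all(lambda cursor, rows: pages[cursor]))
# results == [1, 2, 3]
```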
IngestedDateTime is not the same as the index time: ingestedDateTime is populated before the records are sent to Solr for indexing, and records are not fully indexed in the same order that they are ingested, for a variety of reasons. To account for this, PPC will explicitly set the default sort in Solr to match the cursorMark implementation, so that result sets are identical between cursorMark and non-cursorMark requests. This does affect all Catalog search queries that do not specify an overriding sort parameter.

2.2.3 Comment IDs in Public Contribution Search Results

During the Catalog Ingestion process, publicly provided comments are combined with DAS Export metadata to be indexed in the Search Index. PPC will expand the information retrieved from the Catalog Annotations MySQL database to include the ID of each comment, include this information in the publicContributions field, and additionally create a new multi-valued field definition called "commentIDs" in the Solr schema.xml file.

To accomplish this, PPC will update the MySQL stored procedure "spIngestionGetComments" to return the "annotation_id" from the "annotation_comments" table for Descriptions and Objects that have comments associated with them. The Catalog Ingestion process will be updated to incorporate the IDs into the publicContributions XML. For example:

<publicContributions>
  <comments total="2">
    <comment id="26465" created="2015-09-10T16:56:14Z" isNaraStaff="false" user="cacaouette">test reply\nediting reply</comment>
    <comment id="5729933" created="2015-09-10T16:56:50Z" fullName="Anne E Caouette" isNaraStaff="true" user="acaouette">test comment\nediting comment</comment>
  </comments>
</publicContributions>

The "commentIDs" field will support searching by comment ID.

2.2.4 Exact Search

PPC will introduce two new operators, or "field decorators", to the API: "_is" and "_not".
By appending "_is" to a field name, the results will include only those records where the field value exactly matches the provided string. By appending "_not" to a field name, the results will include only those records where the field value does not match the provided string.

Example: A request for the following URL would return only records that have the field description.item.accessRestriction.specificAccessRestrictionArray.specificAccessRestriction.restriction.termName set to "Donor Restricted"; appending "_not" to the field name instead would return all records that do not match the provided string for that field.

In either case, the API builds a query that checks the Solr field "ex_termName" for the value "{description.item.accessRestriction.specificAccessRestrictionArray.specificAccessRestriction.restriction.termName}Donor Restricted".

2.2.5 Find Records With/Without Field Values

To enable the ability to find records with or without specific fields in the API field name whitelist, PPC will implement two new parameters: "exists" and "not_exist". Each parameter accepts a comma-separated list of field names. The "exists" parameter causes the API to return records that contain the submitted field names; the "not_exist" parameter causes the API to return records that do not contain the submitted field names. The API converts the request into a Solr filter query string value:

For exists=field1,field2,...,fieldN:

fq=ex_field1:"{path(field1)}" AND ex_field2:"{path(field2)}" ... AND ex_fieldN:"{path(fieldN)}"

For not_exist=field1,field2,...,fieldN:

fq=NOT(ex_field1:"{path(field1)}" AND ex_field2:"{path(field2)}" ... AND ex_fieldN:"{path(fieldN)}")

2.2.6 Field Whitelist

The list of fields available in the Catalog API is explicitly defined in an API field name whitelist file. This list currently contains over 916,000 different Authority, Description, Public Contribution, and Object paths that are permitted for searching.
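The translation of the exists and not_exist parameters into a Solr filter query, described above, can be sketched as follows (the function name is ours, not part of the design):

```python
def exists_filter_query(field_paths, negate=False):
    """Build the Solr fq value for the exists / not_exist parameters.

    Each whitelisted path is checked against the ex_ field named after
    its last fragment, with the full path as the indexed prefix.
    """
    clauses = []
    for path in field_paths:
        leaf = path.split(".")[-1]               # last path fragment
        clauses.append(f'ex_{leaf}:"{{{path}}}"')
    joined = " AND ".join(clauses)
    return f"NOT({joined})" if negate else joined

fq = exists_filter_query(["series.accessRestriction.accessRestrictionNote",
                          "series.title"])
# ex_accessRestrictionNote:"{series.accessRestriction.accessRestrictionNote}" AND ex_title:"{series.title}"
```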
For the purposes of this Task Order (TO4), the reindexing operation will use this list to determine all expected intermediate and leaf node paths that are to be searchable with the "exists" and "not_exist" operators and the exact-search "_is" and "_not" field decorators. Intermediate paths have no immediate text value associated with them, even though their child or grandchild nodes may have data indexed.

If a field searched using the "exists" or "not_exist" parameters, or the exact-search "_is" and "_not" field decorators, is not in the API field name whitelist, the API will return an INVALID_PARAMETER error in the results, indicating the invalid field name submitted. Existing functionality such as regular searches, refining, and exporting will continue to use this whitelist to validate field entries.

Section 2.4.2 elaborates on the assumptions about the API field name whitelist, and Appendix A contains a reference to it.

2.3 Catalog Search Engine Enhancement

2.3.1 Refactor Link URLs

There are instances in the Catalog UI HTML where links are constructed on demand using JavaScript. There are also links that are constructed at request time by the Catalog "internal" API. In the first case, where possible, the Catalog UI code will be rewritten to produce a static URL at render time rather than at the time the user clicks the link; while major search engine crawlers have built-in JavaScript renderers, they cannot be relied upon to follow these dynamically generated links. In the second case, the links are occasionally constructed using fragment identifiers, a common technique in modern "single-page" applications. Unfortunately, most major search crawlers ignore fragment identifiers. An example of this pattern can be seen in the links constructed in the "Record Hierarchy" section of the Catalog Content Detail page for a given record.
For example, currently the link for Record Group 174 is provided to the browser as ""; the static URL, however, is entirely sufficient, works as expected, and is a more accurate representation of the resource identifier.

2.3.2 Include Additional Metadata

The HTML meta description and meta keywords tags will be populated with the "Scope and Content" field data present for a record. The description will be displayed in search engine results underneath the listing link, and the search indexer will use the content of the keywords tag to associate relevant terms with a record in its index.

2.3.3 Paginated List of All NAID Links

In addition to updating the Catalog UI, PPC will implement a new, unpublished Catalog API endpoint that produces a simple listing of links to each record in the Catalog. This service will produce a simple HTML page with a list of links to sequential Catalog Content Detail pages for each NAID. At the bottom of each of these pages will be a link to the next n records. The HTML pages will also contain meta tags instructing search crawlers not to index the lists of links themselves.

To initiate the crawl, the major search engines would be seeded with the initial page URL. The Catalog web servers will be configured to redirect these requests to the corresponding API page, where the trailing number in the path is the page number. The Catalog web server httpd configuration will include a new RewriteRule to implement the redirect:

RewriteRule "^/seolist/(.*)" "/OpaAPI/api/v1/titlelist/$1" [L,R=302]

2.4 Assumptions

2.4.1 NAC Ingestion Delete Functionality

Before the paginated link application can be deployed to production, a few updates to the existing data and infrastructure need to happen. First, the NAC Ingest server needs to be updated to delete records from the database table when NAIDs are deleted from the Solr search index; as of now, they are not removed from the table. Second, previously deleted records will be removed based on a list of NAIDs provided by the DAS team.
It is assumed that these updates would be implemented in a separate RFC process prior to the release of this Task Order. This activity was completed with the deployment of NAC RFC 5066.

2.4.2 API Field Name Whitelist

The API field name whitelist is understood to be a subset of the fields that make up the DAS data model, combined with Catalog-specific fields for digital objects and public contributions. The whitelist will be used to validate fields as they are indexed in the Solr search server, and to validate user-entered query parameters.

The API field name whitelist contains fields that are not exported by DAS to NAC and therefore will not be indexed for any records.

The API field name whitelist is subject to change, and any proposed changes will be handled via RFCs. If there are any additions to the whitelist, a full re-index will be required to associate the new paths with the documents.

See Appendix A for a reference to the API field name whitelist.

3 Security

3.1 Digital Object Processing

Using standard AWS security mechanisms, PPC will ensure that only authorized IAM users have access to create and/or update objects in S3. Similar restrictions will be applied to AWS Lambda Function policies, such that only authorized accounts can execute and update the AWS Lambda Functions.

3.2 REST API Modifications and Enhancements

Any new operators and fields will be added to the existing whitelist; current security mechanisms will not be altered in any way.

3.3 Catalog Search Engine Optimization

There are no anticipated changes to the Catalog UI that would alter the security posture of the Catalog system.

4 Appendix A

The link below is a Zip file containing the API field name whitelist.