Search Technologies Assessment - Archives



National Archives and Records Administration
National Archives Catalog (The Catalog)
NARA Catalog System Design
– Catalog Perspective –
Status: Final
Version 1.6
July 8, 2015
Contract Number GS-35F-0541U
Order Number NAMA-13-F-0120

Contents

1 Overview
  1.1 High-Level Architecture
  1.2 NARA Catalog in Context
    1.2.1 NARA Catalog Production
    1.2.2 NARA Catalog Sandbox
  1.3 Applicable Requirements
    1.3.1 Sandbox Environment and Segregated Storage
    1.3.2 Performance Requirements
    1.3.3 Availability
    1.3.4 Volume
    1.3.5 Security Requirements
2 Hardware and Network Design
  2.1 Production System
    2.1.1 Assumptions
    2.1.2 Server Hardware
    2.1.3 NARA Catalog Storage Hardware
    2.1.4 Network Hardware
  2.2 Sandbox Environment
  2.3 Development System
  2.4 UAT System
    2.4.1 UAT to PROD Procedure
  2.5 Example 2014 and 2015 NARA Catalog Prod Computations
    2.5.1 Example Server Requirements
    2.5.2 Elastic Scalability
    2.5.3 Unknowns
    2.5.4 Computing Server Requirements for Index Entries of Varying Size
3 Operating System Design
  3.1 Kernel Configuration
  3.2 Memory Configuration
  3.3 Accounts
  3.4 Auditing
  3.5 Ports Configuration
  3.6 Clock Synchronization
  3.7 SSH
  3.8 Maintaining and Patching the Operating System
4 Storage Design
  4.1 Storage Technology for NARA Catalog Prod
    4.1.1 Version 1
    4.1.2 Version 2
  4.2 Structure
    4.2.1 Project Directories
    4.2.2 NAID Directories / Separate Environments
    4.2.3 SFTP Server Access
5 Backups & Recovery
  5.1 Backups
    5.1.1 Backup Schedules
    5.1.2 Backup Details
    5.1.3 Backup Storage
    5.1.4 Backup for NARA Catalog Storage
  5.2 Recovery from Server Failure
    5.2.1 Database Servers
    5.2.2 Content Processing / Ingestion Servers
    5.2.3 Search Engine Servers
    5.2.4 Application Servers
    5.2.5 Reporting, Monitoring & Admin Control
  5.3 Recovery from Site Failure
6 System Monitoring

Version Control

| Version | Date | Reviewer | Summary Description |
|---|---|---|---|
| 1.0 | 2014-03-02 | Paul Nelson | Complete first version for NARA review |
| 1.1 | 2014-03-16 | Paul Nelson | Incorporate changes from DCRF |
| 1.2 | 2014-04-11 | Madhu Koneni | Adjusted the servers configuration based on what AWS provides |
| 1.3 | 2014-05-21 | Paul Nelson | Updates from NARA SE Architecture review |
| 1.4 | 2014-11-14 | Kristy Martin | Removed “Confidential to Search Technologies” text from the footer |
| 1.5 | 2014-11-24 | Brandon Stahl | Replaced url with url |
| 1.6 | 2015-07-08 | Brandon Stahl | Rebranded OPA as NARA Catalog |

1 Overview

This document is the system design including hardware specifications for the National Archives
Catalog system currently being developed for the National Archives and Records Administration (NARA).

Specifically, this document will cover:
- Server requirements for NARA Catalog Production, including:
  - Server machines
  - Server specifications
- Disk space requirements for NARA Catalog Production, including:
  - Type of disk space
  - Size and I/O access requirements
- Networking requirements for NARA Catalog Production, including:
  - Network connectivity to the internet
  - Network connectivity to NARANet
  - Load-balancing / routing
- Requirements for other NARA Catalog Systems, including:
  - The sandbox environment
  - The developer environment
  - UAT environment
- Other system tools, including:
  - SFTP service for ingestion of digital objects
  - System monitoring

1.1 High-Level Architecture

The following diagram provides an overview of all NARA Catalog systems:

The purpose of each system is as follows:
- Content Processing – The back-end system responsible for ingestion, maintaining NARA Catalog storage, and keeping the search engine indexes up to date.
- Search Array – The search engine itself, structured as a series of independent search nodes, each one responsible for searching a portion of the entire index (index portions are called “shards” and should hold around 25-50 million records). Each search node has a redundant copy to increase query capacity and for failover.
- NARA Catalog Storage – The long-term content storage for all publicly available NARA data. NARA Catalog storage will contain a copy of NARA data so that it can be delivered quickly and efficiently to the public.
- Annotations and Registration Database – Contains registered user account information, tags, comments, transcriptions and translations, as well as bookkeeping information for all annotations such as lists of recently created or modified annotations, annotations per user, etc.
- Application Server – The system (likely made up of multiple servers) which handles all end-user and authorized user requests.
- Client – Client software will be written in JavaScript and HTML5 and will run inside the user’s web browser.

1.2 NARA Catalog in Context

The following diagrams show how NARA Catalog fits “into context” with the other processes and systems in the National Archives. There will be two contextual diagrams, one for NARA Catalog Production, and a second for the NARA Catalog Sandbox system. (Note: These diagrams are preliminary.)

1.2.1 NARA Catalog Production

The following diagram shows how NARA Catalog Production fits into the rest of the NARA environment:

Content Providers

In the above diagram, systems which provide content (or are anticipated to someday provide content) are shown on the left, and consumers of NARA Catalog Production services are shown on the right.

NARA Catalog Production will receive updates from the NARA Description & Authority Service (DAS) as well as the Digital Processing Environment (DPE) through the Trusted Repository (TBD) (only trusted content with full digital provenance should be ingested into NARA Catalog). Content updates will be provided by content owners as new files are scanned and/or content modifications are required. This can include storage of content (for example, from digitization partners) for future processing.

Consumers of NARA Catalog Services

Users of NARA Catalog include the following categories.
Of course, any single person can be in all of these roles (the categories are not exclusive):
- Non-Professional Users – These are members of the public who are professional or academic researchers. These users fall into various categories:
  - Occasional searchers – Log in to NARA on occasion to search, for example to find family members or fellow soldiers.
  - Contributors – These are users who help contribute to the archive with comments, tags, transcriptions, or translations.
  - Researchers – Researchers are looking for specific source materials for specific research goals, for example to research a biography of a famous politician.
- Third Party API users – These are third party organizations that wish to interact programmatically with the NARA Catalog system, for example to bulk export images (The Digital Public Library of America, or Wikipedia) or to create new custom interfaces for searching NARA Catalog content.
- NARA Research Support Services – These are NARA employees who help researchers. It is expected they will be users of NARA Catalog to help the public find information.
- NARA Contribution Moderators – These are NARA employees who review contributions from the public. Content which is spam or vandalism will be removed (with a comment).
- NARA Authorized Users – These are NARA employees responsible for managing the user account database. They can deactivate and re-activate registered users and respond to support call requests (e.g. password changes).

1.2.2 NARA Catalog Sandbox

The purpose of NARA Catalog Sandbox is to “trial run” new content before it is posted on-line to the general public. In this capacity, it will have a different set of content providers and consumers, as shown below:

It is expected that DPE will provide content directly to the NARA Catalog sandbox, so the content can be tested in NARA Catalog before it is written to the trusted repository.
Similarly, NARA Catalog sandbox will need to pull description data from DAS, as it would normally need to do for any sort of content ingestion.

NARA Catalog sandbox is not available for public consumption. Instead, the NARA Catalog sandbox application will be used only by:
- Content Owners – Who need to view and test their content in NARA Catalog Sandbox so they can verify its accuracy before it is moved on-line for the public.
- NARA Archivists – Who will need to view the content as well, to ensure that it meets archival standards.
- The NARA Catalog Review Board – An interdisciplinary group that verifies the quality of the content (and the data description files) as necessary before content can be officially moved to the public.

1.3 Applicable Requirements

The requirements which drive the system design are identified in the following tables, along with the section of this document to which each requirement is allocated.

1.3.1 Sandbox Environment and Segregated Storage

| Requirement | Requirement Text | Section |
|---|---|---|
| 2.3.1 | The NARA Catalog system shall provide a sandbox for a data producer to deposit records. | 2.2 |
| 2.3.1.1 | The sandbox shall allow indexing of the deposited records. | 2.2 |
| 2.3.1.2 | The sandbox shall allow searching of the deposited records by authorized users. | 2.2 |
| 2.3.2 | The sandbox shall index records that are not yet released for search by public users. | 2.2 |
| 2.3.3 | The NARA Catalog system shall exclude records from the search that are not yet released for public access. | 2.2 |
| 2.3.3.1 | The NARA Catalog system shall provide the capability for a System Administrator to set an embargo date on data that will not be available to the public. | 4.2 |
| 2.3.3.2 | The NARA Catalog system shall have a segregated storage space for digital objects that are not yet publicly available for search. | 4.2 |
| 2.10 | The NARA Catalog system shall provide a staging area to store SEIP packages that do not contain a description in DAS. | 4.2 |
| 2.11 | The NARA Catalog system shall provide a staging area to store digital objects that do not contain a description in DAS. | 4.2 |

1.3.2 Performance Requirements

| Requirement | Requirement Text | Section |
|---|---|---|
| 10.1 | The NARA Catalog system shall have response times for returning a search result. | 2.1.2 |
| 10.1.1 | The NARA Catalog system response time for returning a search result shall be less than 1 second for 90% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.1.2 | The NARA Catalog system response time for returning a search result shall be less than 2 seconds for 98% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.1.3 | The NARA Catalog system response time for returning a search result shall be less than 3 seconds for 99% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.1.4 | The NARA Catalog system response time for returning a search result shall be less than 5 seconds for 99.99% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.2 | The NARA Catalog system shall have response times for navigating between screens. | 2.1.2 |
| 10.2.1 | The NARA Catalog system response times for navigating between screens when no search is involved shall be 99% within 1 second, not including network transfer to/from the browser. | 2.1.2 |
| 10.2.2 | The NARA Catalog system response times for navigating between screens when no search is involved shall be 99.99% within 2 seconds, not including network transfer to/from the browser. | 2.1.2 |
| 10.2.3 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 1 second for 90% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.2.4 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 2 seconds for 98% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.2.5 | The NARA Catalog system response times for navigating from page to page of search results shall be less than 5 seconds for 99.99% of queries, not including network transfer to/from the browser. | 2.1.2 |
| 10.2.6 | The NARA Catalog system response times for navigating between screens that are not search results shall be a maximum of one (1) second. | 2.1.2 |
| 10.3 | The NARA Catalog system shall be capable of supporting at a minimum one (1) million user accounts. | 2.1.2.1 |
| 10.4 | The NARA Catalog system shall be able to provide sustained query performance of no less than sixty (60) queries per second for queries executed in sequence. | 2.1.2 |
| 10.4.1 | The NARA Catalog system shall provide a procedure for increasing query capacity (queries per second) as needed to handle expected capacity increases, with a maximum required lead time of two (2) weeks. | 2.5.2 |
| 10.4.2 | The NARA Catalog system shall provide a procedure for decreasing query capacity (queries per second) when increased query capacity is no longer required, but not less than the base query capacity provided at production launch. | 2.5.2 |
| 10.5 | <allocated to NARA Catalog Search Engine Design> |  |
| 10.6 | The NARA Catalog system shall support normal traffic, at a minimum, two-thousand (2,000) concurrent users. | 2.1.2 |
| 10.7 | The NARA Catalog system shall support surge traffic, at a minimum, twenty-thousand (20,000) concurrent users. | 2.1.2 |
| 10.8 | The NARA Catalog system shall be able to provide peak query performance of one-hundred (100) queries per second, for queries executed in sequence. | 2.1.2 |

1.3.3 Availability

| Requirement | Requirement Text | Section |
|---|---|---|
| 11.1 | The NARA Catalog system shall be 99.5% available for search and other processing 24 hours a day/7 days a week. | 5.2 |
| 11.2 | The NARA Catalog system shall be able to recover system functionality after a failure. | 5.2 |
| 11.2.1 | The NARA Catalog system shall be able to recover search functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.2 | The NARA Catalog system shall be able to recover search functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.3 | The NARA Catalog system shall be able to recover tags from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.4 | The NARA Catalog system shall be able to recover tags from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.5 | The NARA Catalog system shall be able to recover comments from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.6 | The NARA Catalog system shall be able to recover comments from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.7 | The NARA Catalog system shall be able to recover translations from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.8 | The NARA Catalog system shall be able to recover translations from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.9 | The NARA Catalog system shall be able to recover transcriptions from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.10 | The NARA Catalog system shall be able to recover transcriptions from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.11 | The NARA Catalog system shall be able to recover login functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.12 | The NARA Catalog system shall be able to recover login functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.13 | The NARA Catalog system shall be able to recover API functionality from any hardware failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.14 | The NARA Catalog system shall be able to recover API functionality from any software failure within the NARA Catalog system within 90 minutes. | 5.2 |
| 11.2.15 | The NARA Catalog system shall be able to recover the ingest functionality from any hardware failure within 2 days. | 5.2 |
| 11.2.16 | The NARA Catalog system shall be able to recover the ingest functionality from any software failure within 2 days. | 5.2 |
| 11.2.17 | The NARA Catalog system shall be able to recover the reporting functionality from any hardware failure within 2 days. | 5.2 |
| 11.2.18 | The NARA Catalog system shall be able to recover the reporting functionality from any software failure within 2 days. | 5.2 |
| 11.2.19 | The NARA Catalog system shall be able to recover authorized user interfaces from any hardware failure within 2 days. | 5.2 |
| 11.2.20 | The NARA Catalog system shall be able to recover authorized user interface functionality from any software failure within 2 days. | 5.2 |
| 11.2.21 | The NARA Catalog system shall be able to recover the information exchange functionality from any hardware failure within 2 days. | 5.2 |
| 11.2.22 | The NARA Catalog system shall be able to recover the information exchange functionality from any software failure within 2 days. | 5.2 |
| 11.2.23 | The NARA Catalog system shall be able to recover all functionality from a site-wide system failure within 7 days, with no more than 48 hours of data loss. | 5.3 |

1.3.4 Volume

| Requirement | Requirement Text | Section |
|---|---|---|
| 12.1 | The NARA Catalog system configuration shall provide the capability to scale on demand. | 2.5.2 |
| 12.1.1 | The NARA Catalog system architecture shall be capable of supporting a minimum of 10,000 terabytes of NARA Catalog source data, and scalable up to 57,000 terabytes of NARA Catalog source data. | 2.1.3 |
| 12.1.2 | <Allocated to Search Engine Design> | 4 |
| 12.1.2.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 500 million digital objects. | 2.1.3, 2.1.2.3 |
| 12.1.3 | <Allocated to Search Engine Design> |  |
| 12.1.3.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 20 million archival description records. | 2.1.2.3 |
| 12.1.4 | The NARA Catalog system architecture shall be capable of supporting a minimum of 20 million authority records. | 2.1.2.3 |
| 12.1.4.1 | The NARA Catalog system architecture shall be capable of holding a minimum of 10 million authority records. | 2.1.2.3 |

1.3.5 Security Requirements

The following security requirements are allocated to the system design.

| Requirement | Requirement Text | Section |
|---|---|---|
| 13.1 | The NARA Catalog system shall be implemented in compliance with NARA security guidance as provided by NARA in the NARA Catalog and Cloud Service Provider Baseline Security Controls. | 3 |
| 13.1.1 | The NARA Catalog system shall be delivered with any guest accounts disabled for COTS products installed on the system. (1.1 Access Control, AC-2) | 3.3 |
| 13.1.2 | The NARA Catalog system shall automatically terminate temporary and emergency accounts after a period not to exceed 15 days for unclassified information systems. (1.1 Access Control, AC-2 (2)) | 3.3 |
| 13.1.3 | The NARA Catalog system shall automatically disable inactive accounts after [a period not to exceed 365 days]. (1.1 Access Control, AC-2 (3)) | 3.3 |
| 13.1.4 | The NARA Catalog system shall automatically audit account creation, modification, disabling, and termination actions and notify, as required, appropriate individuals. (1.1 Access Control, AC-2 (4)) | 3.4 |
| 13.1.5 | The NARA Catalog system shall isolate the programs and data areas of users from other users and the system itself. (1.1 Access Control, AC-3) | 3, 4.2 |
| 13.1.5.1 | The NARA Catalog system shall provide for the capability to enforce role-based access control policies. (1.1 Access Control, AC-3) | 3.3 |
| 13.1.6 | The NARA Catalog system shall enforce approved authorizations for controlling the flow of information within the system and between interconnected systems in accordance with applicable policy. (1.1 Access Control, AC-4) | 3.5, 2.1.4 |
| 13.1.7 | The NARA Catalog system shall provide the capability to enforce the concept of least privilege, allowing only authorized accesses for users (and processes acting on behalf of users) which are necessary to accomplish assigned tasks in accordance with NARA missions and business functions. (1.1 Access Control, AC-6) | 4.2, 3.3 |
| 13.1.7.1 | The NARA Catalog system shall be able to enforce restrictions for access to security-related functions. (1.1 Access Control, AC-6 (1)) Examples of security functions include but are not limited to: establishing system accounts, configuring access authorizations (i.e., permissions, privileges), setting events to be audited, system programming, system and security administration, and other privileged functions. | 3.3 |
| 13.1.8 | The NARA Catalog system shall enforce a limit of [a maximum of 5] consecutive invalid login attempts by a user during a [15 minute period]. (1.1 Access Control, AC-7a) | 3.3 |
| 13.1.8.1 | The NARA Catalog system shall automatically [lock the account/node for at least 15 minutes] when the maximum number of unsuccessful attempts is exceeded. (1.1 Access Control, AC-7b) | 3.3 |
| 13.1.9 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that provides privacy and security notices consistent with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance. (1.1 Access Control, AC-8a) | 3.3 |
| 13.1.9.1 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states users are accessing a U.S. Government information system. (1.1 Access Control, AC-8a) | 3.3 |
| 13.1.9.2 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states system usage may be monitored, recorded, and subject to audit. (1.1 Access Control, AC-8a) | 3.3 |
| 13.1.9.3 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states unauthorized use of the system is prohibited and subject to criminal and civil penalties. (1.1 Access Control, AC-8a) | 3.3 |
| 13.1.9.4 | The NARA Catalog system shall display an approved system use notification message or banner before granting access to the system that states use of the system indicates consent to monitoring and recording. (1.1 Access Control, AC-8a) | 3.3 |
| 13.1.9.5 | The NARA Catalog system shall retain the notification message or banner on the screen until users take explicit actions to log on to or further access the information system. (1.1 Access Control, AC-8b) | 3.3 |
| 13.1.9.6 | The NARA Catalog system shall display the system use information when appropriate, before granting further access. (1.1 Access Control, AC-8c) | 3.3 |
| 13.1.9.6.1 | The NARA Catalog system shall display references, if any, to monitoring, recording, or auditing that are consistent with privacy accommodations for such systems that generally prohibit those activities; and include in the notice given to public users of the information system a description of the authorized uses of the system. (1.1 Access Control, AC-8c) | 3.3 |
| 13.1.10 | The NARA Catalog system shall be capable of auditing successful and unsuccessful account logon events, account management events, object access, policy change, privilege functions, process tracking, and system events. (1.3 Audit and Accountability, AU-2) | 3.4 |
| 13.1.11 | The NARA Catalog system shall be capable of auditing all administrator activity, authentication checks, authorization checks, data deletions, data access, data changes, and permission changes. (1.3 Audit and Accountability, AU-2) | 3.4 |
| 13.1.12 | The NARA Catalog system shall produce audit records that contain sufficient information to, at a minimum, establish what type of event occurred, when (date and time) the event occurred, where the event occurred, the source of the event, the outcome (success or failure) of the event, and the identity of any user/subject associated with the event. (1.3 Audit and Accountability, AU-3) | 3.4 |
| 13.1.12.1 | For data requiring moderate or high integrity, the NARA Catalog system shall include [the date and time of the event; the component of the information system (e.g., software component, hardware component) where the event occurred; type of event; subject identity; and the outcome (success or failure) of the event] in the audit records for audit events identified by type, location, or subject. (1.3 Audit and Accountability, AU-3 (1)) | 3.4 |
| 13.1.13 | The NARA Catalog system shall allocate audit record storage capacity and configure auditing to reduce the likelihood of such capacity being exceeded. (1.3 Audit and Accountability, AU-4) | 3.4 |
| 13.1.14 | The NARA Catalog system shall alert designated NARA officials in the event of an audit processing failure. (1.3 Audit and Accountability, AU-5a) | 3.4, 6 |
| 13.1.14.1 | The NARA Catalog system shall overwrite the oldest audit records after an audit processing failure, for low or moderate integrity information systems. (1.3 Audit and Accountability, AU-5b) | 3.4 |
| 13.1.15 | The NARA Catalog system shall provide an audit reduction and report generation capability. (1.3 Audit and Accountability, AU-7) | 3.4 |
| 13.1.16 | The NARA Catalog system shall provide the capability to automatically process audit records for events of interest based on selectable event criteria. (1.3 Audit and Accountability, AU-7(1)) | 3.4 |
| 13.1.17 | The NARA Catalog system shall use internal system clocks to generate time stamps for audit records. (1.3 Audit and Accountability, AU-8) | 3.6 |
| 13.1.18 | The NARA Catalog system shall synchronize internal information system clocks [or at least every 24 hours] with [NARA’s authoritative time source]. (1.3 Audit and Accountability, AU-8 (1)) | 3.6 |
| 13.1.19 | The NARA Catalog system shall protect audit information and audit tools from unauthorized access, modification, and deletion. (1.3 Audit and Accountability, AU-9) | 3.4 |
| 13.1.19.1 | The NARA Catalog system shall provide the capability to log actual and attempted machine access to the audit log. (1.3 Audit and Accountability, AU-9) | 3.4 |
| 13.1.20 | The NARA Catalog system shall provide audit record generation capability for the list of auditable events defined in AU-2. (1.3 Audit and Accountability, AU-12a) | 3.4 |
| 13.1.20.1 | The NARA Catalog system shall allow designated NARA personnel to select which auditable events are to be audited by specific components of the system. (1.3 Audit and Accountability, AU-12b) | 3.4 |
| 13.1.20.2 | The NARA Catalog system shall generate audit records for the list of audited events defined in AU-2 with the content as defined in AU-3. (1.3 Audit and Accountability, AU-12c) | 3.4 |
| 13.1.20.3 | The NARA Catalog system shall capture error logs from COTS products. (1.3 Audit and Accountability, AU-12) | 2.1.2.4 |
| 13.1.20.4 | The NARA Catalog system shall capture Operating System errors. (1.3 Audit and Accountability, AU-12) | 3.4 |
| 13.1.20.5 | <Allocated to NARA Catalog Application Server Design> |  |
| 13.1.20.6 | The NARA Catalog system shall co-locate COTS error logs from different locations to a common storage location. (1.3 Audit and Accountability, AU-12) | 2.1.2.4 |
| 13.1.20.7 | <Allocated to NARA Catalog Ingestion Design, NARA Catalog Application Server Design, NARA Catalog Search Engine Design> |  |
| 13.1.20.8 | The NARA Catalog system shall provide error detection when accessing memory via parity and/or hardware register checking, as available by the cloud environment selected by the government for hosting the NARA Catalog servers. (1.3 Audit and Accountability, AU-12) | 3.2 |
| 13.1.21 | The NARA Catalog system shall implement configuration settings for information technology products employed within the information system using [Security Architecture security configuration checklists approved and published by NARA IT Security Staff (NHI)] that reflect the most restrictive mode consistent with operational requirements. (1.5 Configuration Management, CM-6) | 3 |
| 13.1.21.1 | <Allocated to NARA Catalog Application Server Design for MySQL and JBoss Configuration> |  |
| 13.1.22 | The NARA Catalog system shall use the Center for Internet Security guidelines (Level 1) to disable ports, protocols, and/or services identified in the configuration guides. (1.5 Configuration Management, CM-7) | 3.5 |
| 13.1.23 | The NARA Catalog system shall provide the capability for backup of the system. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.24 | The NARA Catalog system shall provide the capability for the backup of COTS product files as required to restore operational capability. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.25 | The NARA Catalog system shall provide the capability for the backup of application files as required to restore operational capability. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.26 | The NARA Catalog system shall provide the capability for the backup of configuration support files as required to restore operational capability. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.27 | The NARA Catalog system shall provide the capability for the backup of the files listed in the NARA Catalog Administration Guide, Section 5. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.28 | The NARA Catalog system shall provide the capability to cancel a scheduled backup process, subject to permissions. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.29 | The NARA Catalog system shall provide the capability to cancel a manual backup process, subject to permissions. (1.6 Contingency Planning, CP-9) | 5.1 |
| 13.1.30 | The NARA Catalog system shall provide the capability to recover the system. (1.6 Contingency Planning, CP-10) | 5.2, 5.3 |
| 13.1.31 | The NARA Catalog system shall provide the capability to recover COTS product files to restore operational capability. (1.6 Contingency Planning, CP-10) | 5.2, 5.3 |
| 13.1.32 | The NARA Catalog system shall provide the capability to recover application files to restore operational capability. (1.6 Contingency Planning, CP-10) | 5.2, 5.3 |
| 13.1.33 | The NARA Catalog system shall provide the capability to recover configuration support files to restore operational capability. (1.6 Contingency Planning, CP-10) | 5.2, 5.3 |
| 13.1.34 | The NARA Catalog system shall provide the capability to recover from a hardware failure. (1.6 Contingency Planning, CP-10) | 5.2, 5.3 |
| 13.1.35 | The NARA Catalog system shall provide the capability to recover from a physical site outage. (1.6 Contingency Planning, CP-10) | 5.2, 5.3 |
| 13.1.36 | The NARA Catalog system shall uniquely identify and authenticate users (or processes acting on behalf of users). (1.7 Identification & Authentication, IA-2) | 3.3 |
| 13.1.37 | The NARA Catalog system shall protect authenticator content from unauthorized disclosure and modification. (1.7 Identification & Authentication, IA-5) | 3.3 |
| 13.1.38 | The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce minimum password complexity of [a case sensitive, 8-character mix of upper case letters, lower case letters, numbers, and special characters, including at least one of each]. (1.7 Identification & Authentication, IA-5(1)) | 3.3 |
| 13.1.38.1 | The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce at least a [four character change] when new passwords are created. (1.7 Identification & Authentication, IA-5(1)) | 3.3 |
| 13.1.38.2 | The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, encrypt passwords in storage and in transmission. (1.7 Identification & Authentication, IA-5(1)) | 3.3 |
| 13.1.38.3 | The NARA Catalog system shall, for password-based authentication for non-public NARA Catalog user accounts, enforce password minimum and maximum lifetime restrictions of [1 day minimum, 90 day maximum]; and prohibit password reuse for [a minimum of 5 for unclassified information systems] generations. (1.7 Identification & Authentication, IA-5(1)) | 3.3 |
| 13.1.38.4 | <Not allocated to system design; pertains to public users only> |  |
| 13.1.39 | The NARA Catalog system shall obscure feedback of authentication information during the authentication process to protect the information from possible exploitation/use by unauthorized individuals. (1.7 Identification & Authentication, IA-6) | 3.3 |
| 13.1.40 | The NARA Catalog system shall use mechanisms for authentication to a cryptographic module that meet the requirements of applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance for such authentication. (1.7 Identification & Authentication, IA-7) NARA Guidance: This requirement means that cryptographic modules used for identification and authentication must meet FIPS 140-2 standards. | 3.3 |
| 13.1.41 | The NARA Catalog system shall uniquely identify and authenticate non-NARA users (or processes acting on behalf of non-NARA users). (1.7 Identification & Authentication, IA-8) | 3.3 |
| 13.1.42 | The NARA Catalog system shall separate user functionality (including user interface services) from information system management functionality. (System and Communications Protection, SC-2) Supplemental Guidance: Information system management functionality includes, for example, functions necessary to administer databases, network components, workstations, or servers, and typically requires privileged user access. The separation of user functionality from information system management functionality is either physical or logical and is accomplished by using different computers, different central processing units, different instances of the operating system, different network addresses, combinations of these methods, or other methods as appropriate. An example of this type of separation is observed in web administrative interfaces that use separate authentication methods for users of any other information system resources. This may include isolating the administrative interface on a different domain and with additional access controls. | 3.3 |
| 13.1.43 | "The NARA Catalog system shall prevent unauthorized and unintended information transfer via shared system resources.
(System and Communications Protection, SC-4) Supplemental Guidance: The purpose of this control is to prevent information, including encrypted representations of information, produced by the actions of a prior user/role (or the actions of a process acting on behalf of a prior user/role) from being available to any current user/role (or current process) that obtains access to a shared system resource (e.g., registers, main memory, secondary storage) after that resource has been released back to the information system. Control of information in shared resources is also referred to as object reuse. This control does not address: (i) information remanence which refers to residual representation of data that has been in some way nominally erased or removed; (ii) covert channels where shared resources are manipulated to achieve a violation of information flow restrictions; or (iii) components in the information system for which there is only a single user/role." REF _Ref381556929 \r \h 3.313.1.44The NARA Catalog system shall monitor and control communications at the external boundary of the system and at key internal boundaries within the system. (System and Communications Protection, SC-7a) REF _Ref381557093 \r \h 2.1.413.1.44.1The NARA Catalog system shall connect to external networks or information systems only through managed interfaces consisting of boundary protection devices arranged in accordance with the NARA security architecture. (System and Communications Protection, SC-7b) REF _Ref381557093 \r \h 2.1.413.1.44.2The NARA Catalog system shall configure external firewalls to permit only the minimum protocols through that are required for the system to function. (System and Communications Protection, SC-7) REF _Ref381557093 \r \h 2.1.413.1.44.3The NARA Catalog system shall configure external firewalls to ignore external ICMP 'echo' requests to the system. 
(System and Communications Protection, SC-7) REF _Ref381557093 \r \h 2.1.413.1.44.4The NARA Catalog system shall configure external firewalls to ignore external UDP 'chargen' requests to the system. (System and Communications Protection, SC-7) REF _Ref381557093 \r \h 2.1.413.1.45The NARA Catalog system shall protect the integrity of transmitted information. (System and Communications Protection, SC-8) REF _Ref381557338 \r \h 3.7, REF _Ref381557345 \r \h 4.2.313.1.46The NARA Catalog system shall employ cryptographic mechanisms to recognize changes to information during transmission. (System and Communications Protection, SC-8 (1)) REF _Ref381557338 \r \h 3.7, REF _Ref381557345 \r \h 4.2.313.1.47The NARA Catalog system shall protect the confidentiality of transmitted information. (System and Communications Protection, SC-9) REF _Ref381557338 \r \h 3.7, REF _Ref381557345 \r \h 4.2.313.1.48The NARA Catalog system shall employ cryptographic mechanisms to prevent unauthorized disclosure of information during transmission. (System and Communications Protection, SC-9(1)) REF _Ref381557338 \r \h 3.7, REF _Ref381557345 \r \h 4.2.313.1.49"The NARA Catalog system shall terminate the network connection associated with a communications session at the end of the session or after no more than 30 minutes of inactivity for a backend user. (System and Communications Protection, SC-10) SC-10 Guidance: Long running batch jobs and other necessary operations are not subject to this time limit. REF _Ref381557367 \r \h 2.1.413.1.50<Not allocated to system design pertains to public users only>13.1.51The NARA Catalog system shall implement required cryptographic protections using cryptographic modules that comply with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance. (System and Communications Protection, SC-13) NARA Guidance: This requirement means that any cryptographic modules used must meet FIPS 140-2 standards." 
See sections 3.7, 4.2.3.

13.1.52 The NARA Catalog system shall protect the integrity and availability of publicly available information and applications. (System and Communications Protection, SC-14) See section 4.

13.1.53 The NARA Catalog system shall prohibit remote activation of collaborative computing devices (if collaborative computing mechanisms are used). (System and Communications Protection, SC-15) See section 3.

13.1.54 The NARA Catalog system shall provide mechanisms to protect the authenticity of communications sessions. (System and Communications Protection, SC-23) See sections 3.7, 4.2.3.

13.1.55 The NARA Catalog system shall protect the confidentiality and integrity of information at rest. (System and Communications Protection, SC-28) See sections 4, 5.1.

13.1.56 <Allocated to NARA Catalog Application Server Design>

13.1.57 <Allocated to NARA Catalog Application Server Design>


Hardware and Network Design

This section covers the anticipated hardware and network required to meet the initial NARA Catalog production system requirements.

Production System

Assumptions

This production system is scaled to meet the following stated requirements:
- 500 million digital objects
- 30 million records (20 million descriptions, 10 million authorities)
- 2000 sustained concurrent users, 20,000 peak

Further, we assume that every digital object is a separate digital object file as specified in the current system with an <object> tag in the archival description.

See section 2.5 for an example of how to compute 2014 and 2015 NARA Catalog Prod requirements for smaller volumes or a mixture of different types of index entries.

Server Hardware

The following diagram shows all of the hardware servers and networks anticipated for the NARA Catalog Production system.

All standard server machines are expected to be modern machines with the following minimum characteristics:
- 2.5mb processor cache per core
- 3 GHz CPU clock speed or better
- SAS (Serial Attached SCSI) hard drives
  - Note: SATA drives will not provide sufficient I/O bandwidth capacity for NARA Catalog applications.
  - SAN storage is also a viable option, as long as it can sustain the required I/O operations per second.
- RAID 1, RAID 5, or RAID 10 for all hard drives
- Disks for servers must not be shared
  - The architecture is designed to be a “share nothing” system
  - The only shared storage is NARA Catalog Storage
- All hard disk drives on each machine must be dedicated spindles.
  - This is especially critical for servers listed with IOPS of “high” below

Specific requirements on RAM and number of processing cores per server are identified below:

System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
database | primary | 1 | 122gb | 16 | 250gb | high | MySQL Server
database | failover | 1 | 122gb | 12 | 250gb | high | MySQL Server
Content Processing | primary | 2 | 30gb | 8 | 2tb | med | Content processing & SFTP server. Note that two servers provide capacity. Failure of one server will reduce ingestion capacity.
Search Engine | primary | 25 | 60.5gb | 16 | 1tb | high | Solr Search. Search servers for 530 million medium-sized index entries divided 25 ways.
Search Engine | failover | 25 | 60.5gb | 16 | 1tb | high | Failover row, holds index replicas for primary row.
Web Application | primary | 4 | 30gb | 16 | 1tb | low | Holds application servers to handle API requests from end-user interfaces. 4 servers are recommended for load balancing and failover. Disk space is for holding 1yr of log data.
Bulk Export | primary | 1 | 30gb | 8 | 100gb | low | Server to process bulk exports in background. Output is written to NARA Catalog Storage.
Reporting, Monitoring, Server Control | primary | 1 | 30gb | 16 | 2tb | low | Holds the reporting application, Zookeeper server management, and system monitoring tools. May hold SFTP server as well. Disk space is for holding 2yrs of log data for reporting functions.
Reporting, Monitoring, Server Control | failover | 1 | 30gb | 16 | 2tb | low | Failover server for admin functions. Disk space is for holding 2yrs of log data for reporting functions.

Note:
- “Cnt” is the number of servers for the specified configuration
- RAM, Cores, and Hard Disk are “per server” values.
- Database recommendations come from this sizing guide.

Hard Disk Drive Provisioning

The hard disk numbers above identify different IOPS (I/O Operations Per Second) levels:
- high – 1000-2000 IOPS
- medium – 250-500 IOPS
- low – 100-250 IOPS

Disk Space for Database Servers

The following spreadsheet provides a very rough estimate of the disk space required for the users and annotations tables.

Notes:
- Requirements estimate 1,000,000 users. Estimates for the number of transcriptions, translations, tags, etc., are based on this estimate.
- The number of bytes of data per row and for indexes is estimated based on current table designs.
- A multiplier of 4x is provided for expansion in the MySQL InnoDB database structure.

Table | Bytes per row: Data | Bytes per row: Indexes | Count | Total (gb)
transcriptions | 443 | 100 | 1,000,000 | 2.02
translations | 443 | 100 | 500,000 | 1.01
tags | 148 | 50 | 10,000,000 | 7.36
comments | 328 | 50 | 2,000,000 | 2.81
annotations log | 563 | 200 | 15,600,000 | 44.25
accounts | 274 | 100 | 1,000,000 | 1.39
Total | | | | 58.85

Since these estimates are very rough, a total disk space of 250gb per server is recommended to provide a buffer of roughly 4x for growth in requirements or miscalculations.

Disk Space for Content Processing Servers

The content processing servers will maintain local cache files for the following:
- All original DAS XML files
  - Total number of DAS XML files expected (initial configuration): 20 million (descriptions) + 10 million (authority records) = 30 million total
  - Average DAS XML size: ~10kb
  - Total required disk space: 10,000 * 30,000,000 = 300gb
- A copy of all ARC XML files
  - Total number of ARC XML files expected (initial configuration): 20 million (descriptions) + 10 million (authority records) = 30 million total
  - Average ARC XML size: ~10kb
  - Total required disk space: 10,000 * 30,000,000 = 300gb
- Database cache of parent records and counts
  - Parent records: 20,000,000 * 0.25 = 5 million
    - About 25% of DAS descriptions are a parent record
  - Authority records: 10 million
  - Total records in the database cache: 10 million + 5 million = 15 million
  - Expected bytes per record: 1000
  - Total disk space required: 15,000,000 * 1,000 = 15gb
- Total disk space estimated: 300gb + 300gb + 15gb = 615gb
- Recommended disk space: 2tb to account for expansion of requirements and unexpected growth

Search Engine Sizing

Size per Index Entry

Each index entry is relatively verbose:
- Entire ARC XML description = 10K bytes / entry
- Technical metadata for each digital object = 1K bytes / entry
- Extracted text content = 4K bytes / entry
  - Note that even though most entries are images (with no extracted text content), PDF files are typically provided which contain OCR text for all of the images.
  - Therefore, the average of 4K bytes / entry holds
- Additional metadata fields: 2K bytes / entry
- Total size: 17K bytes / entry

Note: These index sizes are substantially larger than the current OPA Pilot system because the full XML description and full object metadata XML is indexed with every index entry, as is required to handle the API use cases discussed with NARA.

Index Entries / Node

The general consensus for search engine indexes is:
- Small documents (1K-5K / entry): 50 million index entries / node
- Medium documents (5K-50K / entry): 25 million index entries / node
- Large documents (50K+ / entry): 10 million index entries / node

Therefore, Search Technologies recommends sizing each machine at around 25 million index entries per node.

Total Number of Index Entries

Total number of records to be indexed into NARA Catalog:
- 20 million archival descriptions
- 10 million authority records
- 500 million digital objects
- Total: 530 million index entries

Total Number of Servers

Based on the above estimates, the total number of servers recommended will be:
- 530,000,000 / 25,000,000 ≈ 21 servers
- Rounding up = 25 servers
- Two replicas for query performance and scalability
- Total servers: 25 * 2 = 50 servers

Index Space

The storage required for each search engine server is computed as follows:
- Total data content = 530 million entries * 17K bytes / entry = 9.1tb
- Index content = 9.1tb (same as content size)
- Total disk required: 18.2tb
- Round up: 20tb
- Disk per server: 20tb / 25 servers = 800gb / server
- Recommended disk space per server: 1tb / server

Disk Space for Application Servers and Reporting Servers

In order to provide the reports required by the NARA Catalog reporting requirements, every API access will be recorded in log files.

An estimate of API accesses includes:
- 60 queries / second “sustained” usage (Rqmt 10.4)
- “Normal traffic” of 2,000 concurrent users (Rqmt 10.6)
  - Assuming 1 API call every 10 seconds per user, this provides 200 API calls / second
- Total: 260 API calls / second
- Assuming each API call requires about 100 bytes of log data
- Disk usage: 26,000 bytes / second = 2.1gb / day of logs generated
- Hold 1 year's worth of logs: 2.1gb * 365 = 766.5gb of logs = 1tb of disk space (rounded up)

NARA Catalog Storage Hardware

NARA Catalog Storage requirements can be computed in several ways:
- Requirement 12.1.1: 10,000tb of storage (10 petabytes)
- Compute space for 500 million digital objects (requirement 12.1.2.1):
  - Current space = 8.4tb for 1.6 million digital objects
  - Scaling up: 8.4 * 500 / 1.6 = 2,625tb (2.6 petabytes)
- Current space for all images + digitization partner images: ~85tb

Network Hardware

Network hardware required includes:
- Load balancer: Internet → Application Servers
  - Load balance requests from the internet across 4 application servers
  - The load balancer to the internet will provide boundary control to external systems (Rqmt 13.1.44, 13.1.44.1)
  - It will be configured to only allow minimum protocols (Rqmt 13.1.44.2), namely HTTP and HTTPS to the application server.
  - The external ICMP ‘echo’ request will be ignored (Rqmt 13.1.44.3), as will external UDP ‘chargen’ requests (Rqmt 13.1.44.4)
- Router / Firewall
  - For ingesting new content via SFTP (push)
  - For registering new updates from DAS (pull)
  - The router to NARANet will provide boundary control to NARANet systems (Rqmt 13.1.44, 13.1.44.1)
  - It will be configured to only allow minimum protocols (Rqmt 13.1.44.2), namely HTTP / HTTPS to/from DAS, and SFTP to NARA Catalog storage.
  - The external ICMP ‘echo’ request will be ignored (Rqmt 13.1.44.3), as will external UDP ‘chargen’ requests (Rqmt 13.1.44.4)

Network Layout

The recommended network layout is shown in the following diagram:

With the above architecture, routes are carefully controlled to provide as much isolation from internet traffic as possible. In the above diagram the arrow represents “allowed inbound traffic”.
Arrows for SSH for system administration are not shown.

The specific routes and network security required will include:
- Internet traffic → application servers (HTTP / HTTPS).
- Application servers → Private Sub-Net A:
  - Access to NARA Catalog Database Servers (read/write)
  - Access to Search Servers (read only)
    - The search servers will be configured to limit application servers to the “/select” URL path.
  - Access to NARA Catalog Storage
    - Configured using NFS mount on the Application Servers.
    - Application Servers → Digital Objects (read only)
    - Application Servers → Bulk-Export Area (read/write)
- Content Processing / Reporting & Admin Control → Search, Database, NARA Catalog Storage
  - The servers in private subnets A and B will be able to access each other as needed:
    - Database - Content Processing (read only)
    - Content Processing → Search (write only)
    - Content Processing → NARA Catalog Storage (read/write)
    - Reporting & Admin Control → All Servers (read/write)
- Servers in NARANet will need to be connected to select servers in NARA Catalog:
  - DAS - Content Processing (read only)
  - SFTP → NARA Catalog Storage (read/write to select directories)

Sandbox Environment

The following is the system diagram for the sandbox system:

See section 2.1.2 above for details on the types of server machines required.

Specific requirements on RAM and number of processing cores per server for the sandbox environment are identified below:

System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
Content Processing | primary | 1 | 30gb | 8 | 2tb | med | Content processing & SFTP server.
Search Engine | primary | 4 | 60.5gb | 16 | 1tb | high | Solr Search. Search servers for 100 million index entries.
Application | primary | 1 | 30gb | 16 | 100gb | low | Holds application servers to handle API requests from end-user interfaces.

The details of the sizing and disk space required per machine are the same as in section 2.1.

Development System

The system diagram for the development system is shown below:

Specific requirements on RAM and number of processing cores per server for the development environment are identified below:

System | Purpose | Cnt | RAM | Cores | HD | IOPS | Comments
database | primary | 1 | 122gb | 16 | 250gb | med | MySQL Server
Content Processing | primary | 1 | 30gb | 8 | 2tb | med | Content processing & SFTP server. Note that two servers provide capacity. Failure of one server will reduce ingestion capacity.
Search Engine | primary | 4 | 60.5gb | 16 | 1tb | med | Solr Search. Search servers. Can be configured either as 4x1 (four partitions, no failover) or 2x2 (two partitions with failover) as needs dictate.
Application | primary | 1 | 30gb | 16 | 500gb | low | Holds application servers to handle API requests from end-user interfaces.
Application | failover | 1 | 30gb | 16 | 500gb | low | Additional application server for testing session data persistence across multiple servers.
Reporting, Monitoring, Server Control | primary | 1 | 30gb | 8 | 1tb | low | Holds the reporting application, Zookeeper server management, and system monitoring tools. May hold SFTP server as well.

The details of the sizing and disk space required per machine are the same as in section 2.1.

UAT System

For each new release of NARA Catalog, a UAT system will be required. It is recommended that this system be substantially the same as the production system shown above in section 2.1.

If a true, elastically scalable cloud environment is available, Search Technologies recommends provisioning the UAT system only “as needed”, around major release dates. This is shown in the following diagram:

[Diagram: release timeline. At release, NEW_UAT becomes PROD and PROD becomes OLD_PROD. Two systems are required only for a 3-month window around release. Milestones: launch NEW_UAT environment; start UAT test; PROD burn-in complete, shut down OLD_PROD.]

UAT to PROD Procedure

The recommended process for fielding a new UAT system is as follows:

1. Two months before “go live”, launch a new set of virtual machines in the configuration shown in section 2.1 → NEW_UAT
   - This system should be the same configuration as shown in section 2.1.
2. Deploy a completely new version of NARA Catalog to the NEW_UAT system.
3. Migrate data to NEW_UAT as needed.
   - Restore backups to NEW_UAT.
   - Reprocess updates since the backup was made.
   - This should *not* require a new copy of NARA Catalog Storage.
   - Instead, NEW_UAT will operate on a “test packages” area.
   - Packages will be copied from the production area and modified as needed.
4. Complete UAT test on NEW_UAT.
5. When the new system is ready to go live to production, perform a final system validation:
   - Complete a final backup restore
   - Reprocess updates since the backup was made
   - Complete a final system validation test
6. Put NEW_UAT online.
   - Route requests so that traffic now goes to NEW_UAT
   - NEW_UAT now becomes PROD
   - PRODUCTION now becomes OLD_PROD
7. Monitor and validate PROD to ensure smooth operation.
8. If there is a fatal problem with NEW_UAT:
   - Restore: OLD_PROD → PROD
   - Fix the problem.
   - Return to step 4 above.
9. Once PROD (formerly NEW_UAT) is safe and running smoothly (past the burn-in period):
   - Shut down OLD_PROD.
   - Release the virtual machines back to the cloud.

Example 2014 and 2015 NARA Catalog Prod Computations

This section covers an example of how 2014 and 2015 server requirements could be computed.

Note: This information is based on data sets known to Search Technologies which will need to be migrated into NARA Catalog Production in 2014 and 2015. Naturally, Search Technologies is not aware of all of the potential data migrations into NARA Catalog Production which are planned for 2014.
Therefore, this section can be viewed merely as an example of how server requirements could be scaled down should the 2014 and 2015 volumes be lower than specified in the NARA Catalog requirements spreadsheet.

Example Server Requirements

The current OPA Pilot system has the following characteristics:
- Current digital objects: 1.6 million
- Current archival descriptions: 8 million
- Current authority records: 1.05 million

Expected growth for calendar year 2015:
- EOP Packages: 360,000 (15,000 messages x 24 months)
- Digital partner objects: 12 million

Based on the above information, Search Technologies believes that 50 servers for search may overestimate the requirements for NARA Catalog Production for calendar years 2014 and 2015.

Based on the above requirements, the total number of index entries could rise to:
- Archival descriptions: 8 million + 25% = 10 million
- Authority records: 1.05 million + 25% = 1.35 million
- Digital objects: (1.6 million + 25%) + 12 million = 14 million
- EOP Packages: 360,000 x 2 (for description and digital object) = 0.72 million
- Total index entries: 26 million

To handle 26 million index entries, the following server counts could be modified:
- Reduce search engines: 50 → 4

Additional servers may be reduced depending on the rate of adoption of NARA Catalog Production:
- How much will the APIs be used?
- How many simultaneous users will NARA Catalog Production have in 2014 and 2015?

Current OPA Pilot usage is relatively light (0.1 QPS average, 7 QPS peak).
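The reduction from 50 to 4 search servers follows from the same arithmetic used for the full-scale sizing: divide the index entries by the per-node capacity, round up to whole partitions, and multiply by the number of replica rows. The following is a minimal sketch of that calculation, not part of the delivered system; the function and constant names are illustrative assumptions, while the 25 million entries per node and the two replica rows are taken from the sizing discussion in this document.

```python
import math

# Figures taken from this document; the names are illustrative only.
ENTRIES_PER_NODE = 25_000_000  # "medium" index entries per search node
REPLICAS = 2                   # primary row plus one replica/failover row


def search_servers(index_entries: int,
                   entries_per_node: int = ENTRIES_PER_NODE,
                   replicas: int = REPLICAS) -> int:
    """Return the search server count: whole partitions times replica rows."""
    partitions = math.ceil(index_entries / entries_per_node)
    return partitions * replicas


# Index entries estimated above for 2014 and 2015:
# descriptions + authorities + digital objects + EOP packages
estimate = 10_000_000 + 1_350_000 + 14_000_000 + 720_000

print(f"{estimate:,}")           # 26,070,000
print(search_servers(estimate))  # 4
```

For the full 530 million entries the same function returns 44 (22 partitions in two rows); the production sizing rounds the partition count up further, to 25, which gives the 50 servers quoted for the production system.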
Given historical usage, the Application servers may be reduced to 2 (from 4) and the bulk export server may be co-located with the “Reporting, Monitoring, and Server Control” servers, leading to further reductions in hardware requirements for 2014 and 2015.

Elastic Scalability

Depending on the time it takes to provision new hardware (a function of the cloud environment), reducing hardware for 2014 and 2015 could be a relatively “safe” option, for the following reasons:
- Adding search server rows for additional QPS will require about 2 weeks
  - Machine instances can be created to launch servers quickly.
  - Servers can be added as new “slave replicas” with simple configuration
  - 2 weeks would be required for initial index replication, testing, and to account for possible roll-backs and re-attempts should something go wrong.
- Adding additional index partitions for additional content will require about 6 weeks
  - Machine instances can be created to launch servers quickly.
  - Servers can be added as new “partitions” with simple configuration
  - 6 weeks would be required to re-balance the documents across the partitions, which may require
- Adding new application servers for additional end-user capacity will require about 3 days
  - New application servers can be added at any time
  - No complex data replications are required (they all share a master database)
  - All API and UI transactions are stateless (state is carried in cookies and on the client)

Note that these times (2 weeks for a search row, 6 weeks for additional index partitions, 3 days for additional application servers) could be reduced with additional testing, scripting, and process documentation.

Unknowns

There are a number of unknowns in the calculations above which could cause the systems for 2014 and 2015 to be substantially larger. Specifically:
- Will all of AAD be indexed as granules? 105 million records
  - Note: Depending on API requirements, these could be “small” records.
  - Many more “small” records can be packed into a single server (as many as 50 million, instead of the 25 million recommended for “medium” records)
- Will every name in the 1940 census be indexed as granules? 130 million records
  - Note: Depending on API requirements, these could be “small” records. See above.
- What other major initiatives will be required?

Computing Server Requirements for Index Entries of Varying Size

For index partition computations, a general understanding of the size of the index entry is required. For the purposes of NARA, documents can be classified as “small”, “medium”, and (possibly) “large”, as follows:
- Small
  - A single row from a database table
  - A half-page of text
- Medium
  - Anything with <archival-description> XML is automatically medium or larger
  - XML metadata for multiple objects
  - One or two pages of text
- Large
  - Over 25 pages of text

Computations involving “small”, “medium”, and “large” index entries should be based on the following formula:

Index partitions = (number of small entries) / 50 million + (number of medium entries) / 25 million + (number of large entries) / 10 million

Total servers = (index partitions) * (replicas)

Currently we expect replicas = 2 to handle the QPS rates required by NARA Catalog.

Note: This formula only works if the small entries are truly small. For example, indexing the entire <archival-description> XML with every small entry will automatically turn all index entries to “medium” size.

For example, if all of AAD and all names in the 1940 census are indexed as “small” entries, then the following computations hold:
- Medium entries (from previous sub-sections): 26 million
- Small entries (from AAD & 1940 census): 234 million
- Index partitions: (26 million / 25 million) + (234 million / 50 million) = 5.72
- Round up: 6
- Total search engine servers: 6 * 2 = 12

Operating System Design

The operating system recommended for NARA Catalog is Red Hat Linux, or similar.
Red Hat is on the NARA TRM as a recommended Linux variant.

Kernel Configuration

The kernel configuration will be as delivered. No kernel customizations are required.

Memory Configuration

The initial memory configuration will be based on default values for Linux.

Optimal kernel memory parameters (such as shmmax, file-max, swappiness, huge memory pages, etc.) will be determined based on search engine and MySQL performance tuning, as needed.

Parameters modified as needed to achieve the required performance will be documented in the administration guide.

Parameters required for parity checking (Rqmt 13.1.20.8) will be configured as well. [TBD – Requires help from NARA security team to determine correct parameters]

Accounts

Unix accounts will be managed as follows:
- Guest accounts for COTS will be disabled. (Rqmt 13.1.1)
- Separate accounts for server processes (Rqmt 13.1.7, 13.1.42)
- Separate accounts for operating system account management (Rqmt 13.1.7, 13.1.42)
- Login attempts will be limited to a maximum of 5 consecutive invalid attempts by a user during a 15 minute period (Rqmt 13.1.8)
- The NARA Catalog system shall automatically [lock the account/node for at least 15 minutes] when the maximum number of unsuccessful attempts is exceeded. (Rqmt 13.1.8.1)
- The NARA Catalog system shall display an approved system use notification message before granting access. (Rqmts 13.1.9, 13.1.9.1, 13.1.9.2, 13.1.9.3, 13.1.9.4, 13.1.9.5, 13.1.9.6, 13.1.9.6.1) [TBD – need “approved system use notification message” from NARA for Linux accounts]
- Enforce minimum password rules, including:
  - Case sensitive, 8-character mix of upper case letters, lower case letters, numbers and special characters including at least one of each (Rqmt 13.1.38)
  - Enforce at least a four character change when new passwords are created (Rqmt 13.1.38.1)
  - Require password encryption (means requiring SSH for system access) (Rqmt 13.1.38.2)
  - Enforce password minimum and maximum lifetime restrictions (1 day minimum, 90 day maximum) and prohibit password reuse for a minimum of 5 generations (Rqmt 13.1.38.3)
- Shall use FIPS 140-2 standards for cryptographic modules (Rqmt 13.1.40)

Auditing

Auditing will be done with the help of the NARA security team, and based on NARA Linux recommended configurations.

We anticipate that this will include:
- Administrator auditing with the “psacct” package and perhaps other packages (such as rootsh logging) (Rqmts 13.1.10, 13.1.11)
- Appropriate configuration of syslogd, including auditing of successful and unsuccessful account events (Rqmts 13.1.10, 13.1.12)
- Verification of logs generated to /var/log/security and /var/log/audit/audit.log
- Protection of logs from unauthorized modification (Rqmts 13.1.19, 13.1.19.1)
- Capturing operating system errors (Rqmt 13.1.20.4)

[TBD – Details will require standard recommended configurations from NARA security team]

Ports Configuration

All ports will be initially “turned off” for all servers. Then ports will be individually turned on as required for inter-process and external communications to ensure that only the minimum number of ports are enabled. (Rqmt 13.1.22)

Clock Synchronization

The Linux systems will be configured for internal system clocks for auditing, and for synchronizing clocks with NARA’s authoritative time source [TBD – what is NARA’s authoritative time source?]. (Rqmts 13.1.17, 13.1.18)

SSH

Only SSH (Secure Shell) will be allowed into NARA Catalog servers for system administration.

Maintaining and Patching the Operating System

The operating system will be maintained and patched using the cloud-recommended procedures.

This may involve:
- Halting updates to the system.
  - For example, turn off index updates by the ingestion servers.
  - This will limit the amount of data which is changing. Most servers will now have “idle” systems with files that do not change.
- Taking the server to be patched off-line.
- Patching the operating system as necessary.
- Bringing the server back on-line.
- Re-synchronizing database files as necessary.
  - If the ingestion servers are idle, then search engine indexes, application servers, and ingestion servers will not require re-synchronization.
  - Therefore, only the RDBMS may still be receiving updates that require synchronization when a database server is taken off-line, patched, and then brought back on-line.

All critical servers have fail-over siblings, which allows either one or the other to be taken off-line for patching as necessary.

Storage Design

This section covers the design of “NARA Catalog Storage”.

This section does not cover the disk required for each individual server (see section 2.1 for more information about individual server disk).

Storage Technology for NARA Catalog Prod

The storage technology to be used for NARA Catalog Prod will need to be discussed with the cloud provider.
The following are expected storage technologies based on the provider chosen.

Version 1
It is expected that NARA Catalog storage for NARA Catalog Prod, version 1, will be a simple, mounted disk drive. The underlying technology will depend on the cloud environment chosen:
- For FDC – this will be a NetApp disk
- For Amazon Cloud – this will be Elastic Block Store (EBS)
- For other cloud systems – this will be standard mounted disk drives.
Note: If using a cloud system, an additional server may be required to service NFS requests from all other NARA Catalog server machines.
The exact storage mechanism will need to be discussed with the cloud provider, once this is determined by NARA.
Depending on the cloud environment and the storage options provided, additional tasks may be required to achieve the high IOPS required by the database and the search engine servers for NARA Catalog. This may include:
- Striping the disks for higher I/O performance
- Having many separate volumes instead of a small number of very large volumes
- Using different types of disks (e.g. “Provisioned” storage vs “Standard” storage)
The exact configuration and disk mounting steps will be determined once the cloud environment and storage technology are determined.

Server Access to NARA Catalog Storage
NARA Catalog Storage will be NFS mounted to all of the servers that require access to it.
This includes:
- Ingestion servers (read/write)
- Application servers (read/write)
- Reporting and server management servers (read only access)
- Bulk export server (read/write)
For more fine-grained security controls, see below.
Note that the search engine servers and database servers will not require access to NARA Catalog Storage.

Version 2
Again, depending on the cloud provider, NARA Catalog will use a shared high-volume cloud storage technology, specifically Amazon S3. Amazon S3 has a better price per terabyte than mounted disk drives. However, this will need to be deferred to version 2 for the following reasons:
- Cloud storage providers are accessed through custom RESTful interfaces
  - This requires additional programming for reading and writing every file to the storage system.
- The performance metrics are unknown
  - Additional benchmarking and performance testing will be required
- Additional management and monitoring tools may be required

Structure
The structure of NARA Catalog storage is shown in the following diagram:
The sub-directories are as follows:
/opa/
  bulk/ – Holds bulk-export files
    <export-files> – Currently holds bulk-export files. May be divided into multiple mount points later if the size / quantity of the bulk exports require it.
  dev/ – Holds content for the development environment
    <naid-directories> – See section 4.2.2
  future/ – Embargoed & digitization partner content (R-2.3.3.2)
    <project-directories> – Every project has a separate directory / mount
  pre/ – The pre-ingestion area, a holding area for new content
    updates/ – Used to hold new digital objects to ingest
      quarantine – Holds quarantined update packages (R-2.10, R-2.11)
    eop/ – Every project has a separate directory / mount
      quarantine – Holds quarantined EOP packages (R-2.10, R-2.11)
    <other-projects>/ – Every project has a separate directory / mount
      quarantine – Holds quarantined packages for the specified project (R-2.10, R-2.11)
  prod/ – Holds content for the production environment
    <naid-directories> – See section 4.2.2
  sandbox/ – Holds content for the sandbox environment
    <naid-directories> – See section 4.2.2

Project Directories
One project directory will be created per project.

For the “future” directory
- Project directories will be for different digitization projects / partners
- This may contain embargoed data
- Access controls will be based on the individual project needs:
  - Full access to the project manager and designates
    - Access may be revoked once the project is “complete”
  - Full access to system administrators
  - No access to any NARA Catalog server process

For the “pre” directory
- “eop” – Holds new SEIPs from the EOP system
  - “quarantine” – Packages which fail are copied to quarantine. The reason for the failure will be in the log files.
  - Access controls:
    - Full access for systems and users which produce EOP SEIP packages
    - Full access to system administrators
    - Full access to NARA Catalog ingestion servers
    - No access to other systems or individuals
- “updates” – Holds partial NARA Catalog-IP directories for updated digital objects.
  - Directories will be named with the description “naid”, and contain:
    - objects.xml – An XML file describing the object ID, object information, and files for each object ID.
    - content – A sub-directory holding the actual content files.
  - “quarantine” – Pre-ingestion packages which fail are copied to quarantine. The reason for the failure will be in the log files.
  - Access controls:
    - Full access for systems and users which produce new digital objects
    - Full access to system administrators
    - Full access to NARA Catalog ingestion servers
    - No access to other systems or individuals
- <other projects> – Other directories may be created as necessary to handle additional data flows for new projects.
  - For example, a new pre-ingestion project directory will be created for each digitization partner project
  - Access controls:
    - Full access to the project owner
    - Full access to system administrators
    - Full access to NARA Catalog ingestion servers
    - No access to other systems or individuals

What’s a Mount Point?
At this juncture, without knowing the cloud environment and without a definitive answer on the storage technology to be used, it is impossible to know which project directories will be separate mount points.
Tentatively:
- Every project directory inside “future” will be a separate mount point
  - It is expected that each of these directories will represent a significant amount of data.
  - Currently, there are 64 TB of data in the “future” directory in OPA Pilot.
- The entire “pre” directory could be a single mount point.
  - Since this is a transient directory, it is not expected to require much disk space.
  - However, some of the <other-projects> may need to be separate mount points (TBD) depending on the amount of data and whether content is provided all at once, or in batches.

NAID Directories / Separate Environments
Dev/Test, Sandbox, Production, and Backup will all have NAID-based directory structures.
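The path derivation specified in the remainder of this section can be illustrated with a short Python sketch. It is only an illustration of the Level0 to Level3 rules, not the delivered implementation. The function name `naid_package_path`, the `root` default of `/opa`, and the choice not to zero-pad the Level1 and Level2 directory names are assumptions; the unpadded form matches the `/opa/prod/36/5415/des-5541536` example given in this section.

```python
def naid_package_path(naid: int,
                      environment: str = "prod",
                      num_level1_dirs: int = 100,
                      root: str = "/opa") -> str:
    """Build the storage directory for one package from its NAID.

    Level0: environment name ("dev", "prod", "sandbox")
    Level1: NAID mod numLevel1Dirs
    Level2: (NAID / numLevel1Dirs) mod 10000, using integer division
    Level3: "des-" followed by the full NAID
    """
    if naid < 0:
        raise ValueError("NAID must be non-negative")
    if num_level1_dirs <= 0:
        raise ValueError("num_level1_dirs must be positive")

    level1 = naid % num_level1_dirs             # lowest digits spread the load
    level2 = (naid // num_level1_dirs) % 10000  # next four digits
    # Assumption: directory names are written without zero padding.
    return f"{root}/{environment}/{level1}/{level2}/des-{naid}"


if __name__ == "__main__":
    # Reproduces the worked example (100 Level1 directories).
    print(naid_package_path(5541536))
    # Same NAID with the 10 Level1 directories planned for the first release.
    print(naid_package_path(5541536, num_level1_dirs=10))
```

Because Level1 depends on the configured number of Level1 directories, the same NAID maps to a different path if that setting changes, so any such change would imply relocating existing packages.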
The NAID will be the same NAID for the object as specified in DAS for the associated archival description. The directory structure will be as follows:
- Level0: Environment (“dev”, “prod”, “sandbox”)
- Level1: NAID mod (numLevel1Dirs)
  - Each environment will have a configured number of “Level1” directories.
  - For the first production release this will be 10
  - Maximum anticipated is 100
  - Depending on the type of storage, each top-level directory may be a different mount-point.
- Level2: (NAID/(numLevel1Dirs)) mod 10000
  - Each level2 directory will contain up to 10,000 sub-directories.
- Level3: des-NAID
  - The entire NARA Catalog-ID will be used as the directory name for the NARA Catalog Information Package.
For example, for the NAID 5541536, the following levels would apply (taking numLevel1Dirs = 100, which is the value the final path implies):
- Level0: prod
- Level1: 5541536 mod 100 = 36
- Level2: (5541536 / 100) mod 10000 = 55415 mod 10000 = 5415 (integer division)
- Level3: des-5541536
And the final directory would be: /opa/prod/36/5415/des-5541536
The purpose of using the lowest digits for level1 and level2 is to distribute files evenly amongst those levels, so that no single directory path is likely to grow disproportionately compared with the others.
Total capacity of the storage, assuming:
- Level1 directories = 100
- Level2 directories = 10,000
- Level3 directories = 10,000
…is 100 × 10,000 × 10,000 = 10 billion packages (e.g. 10 billion descriptions).

Access Controls
The access controls for each of the different environments will be:
- “dev”
  - Full access – All NARA Catalog staff (developers, system administrators, testers, etc.)
  - Full access – All development server accounts
- “prod”
  - Full access – Ingestion server accounts
  - Full access – Application server accounts
    - This is required to write transcriptions and translations into NARA Catalog Storage.
    - This may be changed to read/write, depending on whether or not application servers need to create sub-directories inside of NARA Catalog-IPs [Design TBD]
  - Read access – Reporting, monitoring, and server management accounts
  - Read access – Bulk-export server account
  - Full access – System administrators
- “sandbox”
  - Full access – Sandbox ingestion server accounts
  - Read access – Sandbox application server accounts
  - Full access – System administrators

SFTP Server Access
SFTP servers will be installed on the content processing servers. SFTP servers will have access to:
- /opa/pre – The pre-ingestion area
- /opa/future – The future projects / digitization partner / embargoed data area
Notes on SFTP configuration:
- Anonymous SFTP access must be disabled.
- All users will require an account on NARA Catalog in order to upload content via SFTP.
- SFTP will only be available from NARANet.
- Access to the /opa/pre and the /opa/future directories will be configured with operating system access control as described in the previous sections.

Backups & Recovery
This section covers backup and recovery methods.

Backups
Note that backups are required for the production environment only.

Backup Schedules
The following backup strategies will be required:
- MySQL Databases
  - Daily incremental backup
  - Weekly full backup
- Search engine index files
  - Daily incremental backup
  - Weekly full backup
- Content Processing / Ingestion servers
  - Weekly full backup of cache files
- Application Servers
  - Weekly copy of log files to the reporting servers.
- Reporting, Monitoring, and Admin Control
  - Weekly incremental backup of log files

Backup Details
Details on the backup mechanism for each type of server will be outlined in the individual design documents:
- For search engines: NARA Catalog Search Engine Design
- For application servers: NARA Catalog Application Server Design
- For the MySQL Database: NARA Catalog Application Server Design
- For the Reporting, Monitoring and Admin Servers: NARA Catalog Reporting Design
- For content processing: NARA Catalog Ingestion Design
Note that all backup scripts and processes will be fully documented in the Administrator Guide when the system is delivered.
Backups of COTS software and system configuration files will be done after every deployment. This will be documented in the Deployment Guide.

Backup Storage
Backup Storage will be implemented with Amazon AWS “Glacier” storage.

Backup for NARA Catalog Storage
In order to meet up-time and recovery requirements for site-wide disaster scenarios, all of NARA Catalog storage will need to be backed up. Further, the backup should be done off-site.
Due to the size of NARA Catalog storage, the backup method will depend on the cloud environment chosen for NARA Catalog. The method of backup will be determined after consultation with NARA and the cloud provider.

Recovery from Server Failure
The recovery method will depend on the type of server.

Database Servers
A primary and a failover exist for the MySQL database servers. A failure of either server will mean operating with a single server until its sibling server can be restored and the database mirrored.
See the NARA Catalog Application Server Design for details on the database recovery process.

Content Processing / Ingestion Servers
Recovery times for content processing servers are longer (2 days) than for other servers (90 minutes).
Therefore, the recovery process for content processing will be:
1. Launch a new virtual machine for the server
2. Deploy the appropriate software to the server
   - Steps 1 & 2 could be combined if a “machine image” of the server is saved as part of the deployment procedure.
   - Note that maintenance of machine images is not specified as a requirement.
3. Copy the appropriate backups to the server.
4. Reprocess updates since the last backup was saved.
See the NARA Catalog Ingestion Design for details on the ingestion server recovery process.

Search Engine Servers
A primary and a failover server exist for each search engine server. Therefore, a failure of either server will mean operating with a single server for the specified index partition until its sibling server can be restored and the index copied. See the NARA Catalog Search Engine Design for details on the recovery process.

Application Servers
Application servers can be recovered at any time by simply launching a new instance of the server and adding it to the server farm.
No recovery of backups is required.
See the NARA Catalog Application Server Design for details on the recovery process.

Reporting, Monitoring & Admin Control
A primary and a failover exist for the MySQL database servers. A failure of either server will mean operating with a single server until its sibling server can be restored and the backups recovered.
See the NARA Catalog Reporting Design for details on the recovery process for reporting functions.
See the NARA Catalog Search Engine Design for details on the recovery process for Zookeeper.

Recovery from Site Failure
Recovery from site failure will require the following steps:
1. Launch new copies of all server instances.
2. Restore all databases from the latest backups.
3. Reprocess updates made since the latest backups were taken.
4. Re-index records as required.
It is conceivable that a complete re-index of all NARA Catalog content will be required to recover from a site failure. If this is the case, then multiple content ingestion servers may need to be launched to reprocess records in parallel, in order to perform a complete re-index within 7 days.

System Monitoring
System monitoring will be performed using Amazon CloudWatch monitoring services.