Introduction - National Archives



National Archives and Records Administration
Preventative, Incidental, and Routine Maintenance Document
Version 1.0
July 10, 2018

Prepared for:
National Archives and Records Administration II (NARA II)
College Park, MD

Prepared by:
1760 Old Meadow Road
McLean, VA 22102

This document contains proprietary information provided by DSA Company.
Handle in accordance with proprietary and confidential restrictions.

Acknowledgements:
DSA would like to acknowledge the significant technical contributions made by the following staff:
Dr. Urmi Majumder
Adil Latiwala
Bertina Battou
Raven Moore
Byrav Kadalur
Timothy Reynolds
Ephriam Toh
Gang Chen
Veluna Christopher
Fawad Shaikh

Record of Changes
Preventive, Incidental, and Routine Maintenance Document Change Log

Version   Date of Change   Changed by                     Summary of Changes
.05       06/20/2018       Urmi Majumder & Fawad Shaikh   Initial document
.06       06/25/2018       Fawad Shaikh                   Update the Template and Format of the document
.07       07/09/2018       Gang Chen                      Added information for NAC
1.0       07/10/2018       Kellye Sheehan                 Minor editing changes

DSA Internal Approvals

Name             Role              Date
Kellye Sheehan   Program Manager   7/10/2018

Modifications made to this plan since the last printing are as follows:

Table of Contents
1 Introduction
2 Overview for Preventive Maintenance
3 Scope
4 DAS Presentation Tier
  4.1 DAS Service Tier
  4.2 DAS Data Tier Plan
  4.3 DAS Infrastructure Plan
  4.4 DAS Preventative Maintenance Schedule
5 NAC System Preventive Maintenance
  5.1 NAC System Presentation Tier
    5.1.1 Daily Monitoring
    5.1.2 Weekly Monitoring
  5.2 NAC Service Tier
    5.2.1 Daily Monitoring
    5.2.2 Weekly Reports
  5.3 NAC Data Tier Plan
  5.4 NAC Infrastructure Plan
  5.5 NAC Preventative Maintenance Schedule

1 Introduction

This document contains, or points to, content describing the following material:
- Preventive Maintenance
- Incident Management
- Routine Maintenance

Section A.
The National Archives and Records Administration (NARA) and Data Systems Analysts (DSA) have established and agreed upon a policy for how the contractor handles a ‘Preventive Maintenance Window’. This is a regular period of time that should be used when scheduling planned outages to services. The Preventive Maintenance Window is not intended to be used for emergency work, nor should it delay the responses needed to handle critical and significant incidents. This document contains the material that pertains to the Preventive Maintenance Window.

Section B.
The content for Incident Management already exists in a separate document called “Incident Management Plan”. The most recent version of that document was provided to the Government on 4/19/2018.

Section C.
Regarding Routine Maintenance documentation: DSA already provides separate weekly O&M Release reports describing our ongoing plans for Routine Maintenance, and the List of Known Defects is reported weekly out of TFS as part of our Routine Maintenance documents. In addition, we regularly provide a document called Scheduled Maintenance Window Document, which also relates to Routine Maintenance.
The most recent version of the Scheduled Maintenance Window Document was provided to the Government on 4/19/2018.

2 Overview for Preventive Maintenance

DSA has developed a regularly scheduled preventive maintenance plan. The purpose of this plan is to assure the continued and prolonged operational health of both the DAS and NAC system environments. In addition, with the AWS server migration from PV to HVM now complete, improvements in the reliability and stability of the systems and applications will be observed and reviewed progressively.

This maintenance plan includes high-level descriptions of the systems that will be monitored and maintained: the system presentation tier, service tier, data tier, infrastructure, and the Amazon Web Services (AWS) environment hosting the application components.

3 Scope

For the DAS system, the presentation tier, service tier, and data tier environments will have two established checklists. The first set of checklists is to be executed nightly (including on all weekends) and will be automated going forward. The second set of checklists is for manual execution on the 1st and 3rd weekends of every month. This is a more thorough checklist with enhanced assessments involved in its completion. The DAS infrastructure tier will have one checklist, which will be performed on the 1st and 3rd weekends of every month.

For the NAC system, daily monitoring processes and weekly reports will be established to check the presentation tier, service tier, data ingestion, and digital object storage periodically and automatically. With the retirement of the PRTG monitoring system, AWS CloudWatch replaces PRTG for monitoring hardware utilization and health checks at the system tier. Splunk will continue to collect system log information daily and generate weekly reports. Validation of data ingestion and the Lambda process will be implemented. Backup of the system and service logs will also be automated.
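The checklist cadence described in the Scope (nightly, plus the 1st and 3rd weekends of each month) can be sketched in code. The following is a minimal illustration only, not DSA's scheduling logic. It assumes a weekend is numbered by the position of its Saturday within the month and that a Sunday belongs to the preceding Saturday's weekend; the function names weekend_ordinal and checklists_due are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional


def weekend_ordinal(day: date) -> Optional[int]:
    """Return n if `day` is part of the n-th weekend of its Saturday's month.

    Assumption (not stated in the plan): a weekend is numbered by its
    Saturday, and a Sunday belongs to the Saturday immediately before it.
    Returns None for Monday through Friday.
    """
    weekday = day.weekday()  # Monday=0 ... Saturday=5, Sunday=6
    if weekday < 5:
        return None
    saturday = day if weekday == 5 else day - timedelta(days=1)
    # Days 1-7 hold the 1st Saturday, days 8-14 the 2nd, and so on.
    return (saturday.day - 1) // 7 + 1


def checklists_due(day: date) -> List[str]:
    """List which checklist sets apply on `day` under the stated cadence."""
    due = ["nightly"]  # the nightly set runs every day, weekends included
    if weekend_ordinal(day) in (1, 3):
        due.append("weekend")  # the more thorough manual set
    return due
```

Under these assumptions, Saturday 21 July 2018 (the third Saturday of the month) would require both checklist sets, while Saturday 14 July 2018 would require only the nightly set.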
4 DAS Presentation Tier

Nightly Verifications
- Verify the continual operational status of the Apache web server application that hosts the DAS UI deployments
  - This includes verifying the operating integrity of Apache as well as analysing Apache logs for any errors
- Verify the continual operational status of the Apache server infrastructure hosting the DAS UI
  - This includes verifying the operating integrity of the Red Hat server that powers the Apache web server application as well as analysing the AWS instance for defects or scheduled maintenance.

1st & 3rd Weekend Verifications
- Verify the continuity of the Apache web server’s Amazon Machine Image (AMI)
  - This includes comparing the running copy of the server to the image backup on file
  - A new server image will be created if differences are found

4.1 DAS Service Tier

Nightly Verifications
- Verify the continual operational status of the JBoss platform application environment
  - This includes verifying the operating integrity of the JBoss Java application environment (JVM)
- Verify the continual operational status of the JBoss platform logging processes.
  - This includes verifying the operational status of the log rotate process that manages daily JBoss logging
- Verify the storage of JBoss platform log files
  - This includes verifying that the JBoss platform application logs are being uploaded to an offsite permanent storage location.

1st & 3rd Weekend Verifications
- Verify the health of the JBoss SOA platform
  - This includes checking for possible memory leaks and investigating any errors discovered in the JBoss application logs or the logging process.
- Verify connectivity to the NARA Enterprise LDAP (eLDAP) system
- Verify the authenticity of the NARA eLDAP SSL certificate and compare it to the currently installed certificate.
  - This is intended to catch any certificate changes before they impact the NARA DAS users.
- Verify the continuity of the JBoss SOA platform Amazon Machine Image (AMI)
  - This includes comparing the running copy of the server to the image backup on file
  - A new server image will be created if differences are found
  - This does not include deployments, as those are stored and maintained elsewhere, then delivered to the application server.

4.2 DAS Data Tier Plan

Nightly Tasks
- The Oracle Database will be completely re-indexed to support normal business operations and migrate imported or modified data into the primary database tables.
  - This includes rebuilding the Oracle context indexes on large tables.
- Oracle Database logging will be verified and managed.
  - This includes migrating logs no longer required by the Oracle Database software off the Oracle server to a permanent storage location.
- Verify Oracle Database server volume health.
  - This includes comparing the storage capacity of the Oracle server’s storage volumes against the current storage utilization data.
  - This also includes analysing the accrual of Oracle data as compared to the data volume capacity to ascertain the health of the volume and volume data accrual rates.
  - This will assist PPC with identifying any volume health or capacity issues before they become problematic.

1st & 3rd Weekend Verifications
- Oracle database architectural health will be verified
  - Health of Oracle ASM (Automatic Storage Management) and data volumes will be assessed
  - Oracle table partitions will be assessed, and partitions will be added if existing partitions are filling up with data
  - Any alerts which have been issued by the database will be assessed
  - These include the following items:
    - Health of database software
    - Health of ASM instance
    - Health of table partitions
- Verify the continuity of the Oracle database Amazon Machine Image (AMI)
  - This includes comparing the running copy of the server to the image backup on file
  - A new server image will be created if differences are found

4.3 DAS Infrastructure Plan

1st & 3rd Weekend Verifications
- Verify integrity of Red Hat & Oracle Enterprise Linux servers
  - This includes verifying patches, running processes, and monitoring overall system health
- Verify integrity of server volumes
  - This includes verifying the volumes are operating at a high level of quality and efficiency by performing regular read/write and availability tests
- Verify that Amazon Machine Images (AMIs) are up to date
  - This includes launching and testing existing AMIs to verify they function in the current DAS Production operating environment.
- Verify Amazon Web Services maintenance schedules and, if required, identify any NARA DAS servers that may be part of AWS scheduled maintenance.
  - This includes reboots of server instances to accommodate AWS hardware and virtual hypervisor maintenance.
- Verify the continuity of the Amazon Machine Image (AMI) infrastructure
  - This includes spinning up system AMIs and verifying they boot and load into the operating system environment properly
  - New server images will be created if existing images are found defective

4.4 DAS Preventative Maintenance Schedule

Nightly Maintenance Window
- NARA provides PPC with a nightly maintenance window of 9PM EST to 5AM EST
- Nightly maintenance windows provide for activities such as:
  - Assess any alerts issued by the Oracle Enterprise Manager system
  - Merge imported data into primary table space
  - Rebuild XML text indexes
  - Perform a backup of the database tables
  - Verify the health of the database software
  - Verify the health of the database operating system
  - Verify the health of the JBoss SOA platform software
  - Verify the health of the JBoss servers and operating systems

Weekend Maintenance Window
- NARA provides PPC with a weekend maintenance window on the 1st and 3rd weekend of every month, starting at 9PM EST Friday and continuing to 5AM EST Monday
- It is possible that the OPA export cannot be performed during the weekend maintenance.
  If PPC is unable to perform the OPA export, this must be communicated to the NARA DAS Project Manager and NARA DAS Program Manager at least two weeks in advance for authorization from the NARA PM and the NARA Program Manager.
- Weekend maintenance windows provide opportunities for PPC to execute nightly maintenance activities as well as more extensive maintenance actions, such as:
  - Assess and implement solutions to alerts issued by the Oracle Enterprise Manager system
  - Management of the ASM instance
  - Management of the table partitions

5 NAC System Preventive Maintenance

5.1 NAC System Presentation Tier

5.1.1 Daily Monitoring
- Two AWS CloudWatch dashboards (Production-Lambda and Production-Web) were established for daily monitoring of the NAC PROD system. They provide the following NAC system information:
  - Time Interval: 5 minutes; Duration: 1h – 15 months
  - Server Instances: pw01 – pw04; pa01 – pa04; ps01 – ps04
  - AWS/Lambda service functions and AWS/Queue: metadata-extraction; vips-processing; lambda-dlq-prod
  - Verify the continual operational status of the NAC PROD system by visually checking that the entire PROD system (including web servers, API servers, Solr search engine servers, and AWS/Lambda services) is healthy
  - Monitoring will be extended to other PROD servers (such as database, Lambda app, and content processing servers) and to UAT and DEV servers.
- The Splunk Monitor Console provides daily disk space usage for all 17 PROD servers
  - Time Interval: 2 hours
  - Server Instances: pw01 – pw04; pa01 – pa04; pdb01 – pdb02; pcp01 – pcp02; ps01 – ps04; pl01
  - This verifies the disk space usage of each PROD server so that enough disk space is maintained for data and logs.
- Shut down the 17 UAT server instances daily to reduce server maintenance and cost
  - Stop Time: 9pm – 6am; Start Time: 6am – 9pm
  - Server Instances: uw01 – uw04; ua01 – ua04; udb01 – udb02; ucp01 – ucp02; us01 – us04; ul01

5.1.2 Weekly Monitoring
- Every week, all PROD and UAT servers are inspected for health status, memory and CPU utilization, and disk space usage
  - This includes creating AMIs (Amazon Machine Images) for each server every Friday
  - A new server AMI will be created as needed when the server configuration changes (such as network configuration or security group)
- Every Friday, review and inspect AWS notifications
  - Verify any impact from AWS service or hardware retirement plans or schedules
  - Plan or schedule modifications to the NAC system in line with AWS changes

5.2 NAC Service Tier

5.2.1 Daily Monitoring

Daily Manual Verifications
  i. Verify the NAC web application on each PROD web server (pw01, pw02, pw03, pw04) with a health check via the squid proxy service;
  ii. Verify the NAC API application on each PROD API server (pa01, pa02, pa03, pa04) with a test query via the squid proxy service;
  iii. Verify the NAC Solr service on each PROD Solr server (ps01, ps02, ps03, ps04) with the Apache Solr admin utility tool to check the status of the Solr nodes on both shard1 and shard2;
  iv. Verify the NAC content processing service on each PROD cp server (pcp01, pcp02) with the Aspire utility tool to check the status of the DAS Feeder and Annotation services;
  v. During work hours (i.e. 6am – 6pm), perform the above verification steps (i, ii, iii, iv) for the UAT environment;

5.2.2 Weekly Reports

Weekly Manual Verifications
- Verify the weekly data content process with the following Splunk reports:
  - Authority Records in the last week
  - Descriptions with Digital Objects in the last week
  - Descriptions without Digital Objects in the last week
  - Webpages (, Presidential Libraries) in the last week

Weekly Reports Sent to NARA Clients
- A total of 16 Splunk reports are sent out each week for NARA staff and clients to review and verify

5.3 NAC Data Tier

Weekly Routine Data Ingestion
- Post validation for each weekly NAC data ingestion
  - Verify and analyse the counts of completed and errored records after each weekly data ingestion;
  - Validate the ingested data records in the Solr server with scripts (TBD);
- Validation for Digital Object URL Links
  - Verify the valid and invalid URL links for digital objects via the API after weekly data ingestion
  - Verify the existence of digital objects in AWS storage buckets;

On-Demand Data Ingestion
- Recover data missing after NAC data ingestion
- Recover data missing after the DAS export process

5.4 NAC Infrastructure Plan

1st & 3rd Weekend Verifications
- Verify integrity of Red Hat & Amazon Enterprise Linux servers
  - This includes verifying patches, running processes, and monitoring overall system health
- Verify integrity of server volumes
  - This includes verifying the volumes are operating at a high level of quality and efficiency by performing regular read/write and availability tests
- Verify that Amazon Machine Images (AMIs) are up to date
  - This includes launching and testing existing AMIs to verify they function in the current NAC Production operating environment.
  - New server images will be created if existing images are found defective

5.5 NAC Preventative Maintenance Schedule

Monthly System Patch for each server in each NAC environment

Nightly Maintenance Window
- NARA authorizes DSA to use a nightly maintenance window between 9PM EST and 5AM EST
- Nightly maintenance windows provide time for preventive maintenance activities, which can include any of the following:
  - Refresh the servers to clear server memory or cache issues;
  - Flush TCP/IP ports to clean up network traffic;
  - Compress historical logs or data to free disk space;
  - Add disk space as necessary.
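Because the nightly window runs from 9PM to 5AM, it crosses midnight, so a simple "start <= now < end" comparison would never match. The following is a minimal sketch of a guard that an automated maintenance task could call before starting work. It assumes the time passed in is already expressed in the same local time zone as the window, treats the start as inclusive and the end as exclusive, and uses the hypothetical name in_nightly_window; none of these details come from the plan itself.

```python
from datetime import time

# Window bounds taken from the schedule above (9PM to 5AM, local EST).
WINDOW_START = time(21, 0)
WINDOW_END = time(5, 0)


def in_nightly_window(now: time,
                      start: time = WINDOW_START,
                      end: time = WINDOW_END) -> bool:
    """Return True if `now` falls inside the maintenance window.

    Assumption: `now` is already in the window's local time zone; the
    start is inclusive and the end is exclusive.
    """
    if start <= end:
        # Window contained within a single day, e.g. 01:00 to 04:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 21:00 to 05:00.
    return now >= start or now < end
```

A scheduled job could call in_nightly_window(datetime.now().time()) and exit early when it returns False, so that work such as log compression only starts inside the authorized window.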