AWS Well-Architected Framework

Reliability Pillar

Copyright © 2024 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

Abstract and introduction
  Introduction

Reliability
  Shared Responsibility Model for Resiliency
  Design principles
  Definitions
    Resiliency, and the components of reliability
    Availability
    Disaster Recovery (DR) objectives
  Understanding availability needs

Foundations
  Manage service quotas and constraints
    REL01-BP01 Aware of service quotas and constraints
    REL01-BP02 Manage service quotas across accounts and regions
    REL01-BP03 Accommodate fixed service quotas and constraints through architecture
    REL01-BP04 Monitor and manage quotas
    REL01-BP05 Automate quota management
    REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
  Plan your network topology
    REL02-BP01 Use highly available network connectivity for your workload public endpoints
    REL02-BP02 Provision redundant connectivity between private networks in the cloud and on-premises environments
    REL02-BP03 Ensure IP subnet allocation accounts for expansion and availability
    REL02-BP04 Prefer hub-and-spoke topologies over many-to-many mesh
    REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces where they are connected

Workload architecture
  Design your workload service architecture
    REL03-BP01 Choose how to segment your workload
    REL03-BP02 Build services focused on specific business domains and functionality
    REL03-BP03 Provide service contracts per API
  Design interactions in a distributed system to prevent failures
    REL04-BP01 Identify which kind of distributed system is required
    REL04-BP02 Implement loosely coupled dependencies
    REL04-BP03 Do constant work
    REL04-BP04 Make all responses idempotent
  Design interactions in a distributed system to mitigate or withstand failures
    REL05-BP01 Implement graceful degradation to transform applicable hard dependencies into soft dependencies
    REL05-BP02 Throttle requests
    REL05-BP03 Control and limit retry calls
    REL05-BP04 Fail fast and limit queues
    REL05-BP05 Set client timeouts
    REL05-BP06 Make services stateless where possible
    REL05-BP07 Implement emergency levers

Change management
  Monitor workload resources
    REL06-BP01 Monitor all components for the workload (Generation)
    REL06-BP02 Define and calculate metrics (Aggregation)
    REL06-BP03 Send notifications (Real-time processing and alarming)
    REL06-BP04 Automate responses (Real-time processing and alarming)
    REL06-BP05 Analytics
    REL06-BP06 Conduct reviews regularly
    REL06-BP07 Monitor end-to-end tracing of requests through your system
  Design your workload to adapt to changes in demand
    REL07-BP01 Use automation when obtaining or scaling resources
    REL07-BP02 Obtain resources upon detection of impairment to a workload
    REL07-BP03 Obtain resources upon detection that more resources are needed for a workload
    REL07-BP04 Load test your workload
  Implement change
    REL08-BP01 Use runbooks for standard activities such as deployment
    REL08-BP02 Integrate functional testing as part of your deployment
    REL08-BP03 Integrate resiliency testing as part of your deployment
    REL08-BP04 Deploy using immutable infrastructure
    REL08-BP05 Deploy changes with automation

Failure management
  Back up data
    REL09-BP01 Identify and back up all data that needs to be backed up, or reproduce the data from sources
    REL09-BP02 Secure and encrypt backups
    REL09-BP03 Perform data backup automatically
    REL09-BP04 Perform periodic recovery of the data to verify backup integrity and processes
  Use fault isolation to protect your workload
    REL10-BP01 Deploy the workload to multiple locations
    REL10-BP02 Select the appropriate locations for your multi-location deployment
    REL10-BP03 Automate recovery for components constrained to a single location
    REL10-BP04 Use bulkhead architectures to limit scope of impact
  Design your workload to withstand component failures
    REL11-BP01 Monitor all components of the workload to detect failures
    REL11-BP02 Fail over to healthy resources
    REL11-BP03 Automate healing on all layers
    REL11-BP04 Rely on the data plane and not the control plane during recovery
    REL11-BP05 Use static stability to prevent bimodal behavior
    REL11-BP06 Send notifications when events impact availability
    REL11-BP07 Architect your product to meet availability targets and uptime service level agreements (SLAs)
  Test reliability
    REL12-BP01 Use playbooks to investigate failures
    REL12-BP02 Perform post-incident analysis
    REL12-BP03 Test functional requirements
    REL12-BP04 Test scaling and performance requirements
    REL12-BP05 Test resiliency using chaos engineering
    REL12-BP06 Conduct game days regularly
  Plan for Disaster Recovery (DR)
    REL13-BP01 Define recovery objectives for downtime and data loss
    REL13-BP02 Use defined recovery strategies to meet the recovery objectives
    REL13-BP03 Test disaster recovery implementation to validate the implementation
    REL13-BP04 Manage configuration drift at the DR site or Region
    REL13-BP05 Automate recovery

Example implementations for availability goals
  Dependency selection
  Single-Region scenarios
    2 9s (99%) scenario
    3 9s (99.9%) scenario
    4 9s (99.99%) scenario
  Multi-Region scenarios
    3½ 9s (99.95%) with a Recovery Time between 5 and 30 Minutes
    5 9s (99.999%) or higher scenario with a recovery time under one minute
  Resources
    Documentation
    Labs
    External Links
    Books

Conclusion
Contributors
Further reading
Document revisions
Appendix A: Designed-For Availability for Select AWS Services
Notices
AWS Glossary

Reliability Pillar - AWS Well-Architected Framework

Publication date: December 6, 2023 (Document revisions)

The focus of this paper is the reliability pillar of the AWS Well-Architected Framework. It provides guidance to help customers apply best practices in the design, delivery, and maintenance of Amazon Web Services (AWS) environments.

Introduction

The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building workloads on AWS. By using the Framework, you will learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud. It provides a way to consistently measure your architectures against best practices and identify areas for improvement. We believe that having well-architected workloads greatly increases the likelihood of business success.

The AWS Well-Architected Framework is based on six pillars:

• Operational Excellence
• Security
• Reliability
• Performance Efficiency
• Cost Optimization
• Sustainability

This paper focuses on the reliability pillar and how to apply it to your solutions. Achieving reliability can be challenging in traditional on-premises environments due to single points of failure, lack of automation, and lack of elasticity. By adopting the practices in this paper you will build architectures that have strong foundations, resilient architecture, consistent change management, and proven failure recovery processes.

This paper is intended for those in technology roles, such as chief technology officers (CTOs), architects, developers, and operations team members. After reading this paper, you will understand AWS best practices and strategies to use when designing cloud architectures for reliability. This paper includes high-level implementation details and architectural patterns, as well as references to additional resources.
