This article was co-authored with Luke Sudgen, Post-Trade Principal DevOps Engineer, and Padraig Murphy, Post-Trade Solutions Architect, from the London Stock Exchange Group.

In this article, we will discuss some failure scenarios that have been tested by London Stock Exchange Group (LSEG) Post Trade Technology teams at an AWS-supported Chaos Engineering event. Chaos engineering allows LSEG to simulate real-world failures in its cloud systems in controlled experiments. This methodology improves resiliency and observability, reducing risk and enabling regulatory compliance before deployment to production.

Introduction, tools and methodology

As a heavily regulated provider of global financial market infrastructure, LSEG is always looking for opportunities to improve workload resilience. LSEG and AWS partnered to organize and run a 3-day AWS Experience-Based Acceleration (EBA) event to perform chaos engineering experiments on key workloads. The event was sponsored and led by the architecture function and brought together cross-functional Post Trade technical teams from several workstreams. The experiments were run using AWS Fault Injection Service (FIS), following the experimentation methodology described in the blog post Verify the resiliency of your workloads using Chaos Engineering.

The resilience of modern distributed cloud systems can be continually improved by examining workload architectures and recovery, evaluating standard operating procedures (SOPs), and creating SOP alerts and recovery automations. AWS Resilience Hub provides a comprehensive suite of tools to start these activities.

Another key activity for validating and improving your resilience posture is chaos engineering, a methodology that deliberately injects failures into systems through controlled experiments. Chaos engineering helps customers create real-world failure conditions that can uncover hidden bugs, monitoring blind spots, and hard-to-find bottlenecks in distributed systems. This makes it a very useful tool in regulated industries such as financial services.

Architectural overview

The architectural diagram in Figure 1 shows a three-tier application deployed in virtual private clouds (VPCs) with a multi-AZ configuration.

The web application runs in a public subnet on an Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling group. It connects to an Amazon Relational Database Service (Amazon RDS) database located in a private subnet and to on-premises services, which makes this a hybrid architecture. Additionally, a number of internal services run in containers in a separate VPC. FIS provides a controlled environment to validate the robustness of the architecture against various failure scenarios, such as:

Amazon EC2 instance failure causing the application or container pod on the machine to also fail

Amazon RDS DB instance reboot or failover

Severe network latency degradation

Network connectivity disruption

Amazon Elastic Block Store (Amazon EBS) volume failure (I/O pause, disk full)

Amazon EC2 instance and container failure

The goal of this use case is to assess the resiliency of the application or container pod running on Amazon EC2 instances, and to identify how the system adapts and continues operating when an instance is unexpectedly disrupted or becomes unstable. You can use the aws:ec2:stop-instances or aws:ec2:terminate-instances FIS actions to mimic different failure modes of EC2 instances. The response of running containers to different instance failures was also evaluated. If you run containers within a managed AWS service such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS), you can use the FIS actions for ECS tasks and EKS pods.
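As an illustration of the instance-stop scenario, the following minimal sketch builds an FIS experiment template as a plain Python dictionary. The tag key and value, the target and action names, and the role ARN are hypothetical placeholders, not values used by LSEG. The `"source": "none"` stop condition is used only to keep the sketch short; a real experiment would normally reference a CloudWatch alarm instead.

```python
import json


def ec2_stop_template(tag_key: str, tag_value: str, role_arn: str) -> dict:
    """Build an FIS experiment template that stops one tagged EC2 instance."""
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "targets": {
            # "oneInstance" is an arbitrary label chosen for this sketch.
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration; FIS starts the instance again afterwards.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Assumption: no guardrail alarm here; prefer a CloudWatch alarm in practice.
        "stopConditions": [{"source": "none"}],
        "roleArn": role_arn,
    }


# Hypothetical tag and role ARN, for illustration only.
template = ec2_stop_template(
    "chaos-ready", "true", "arn:aws:iam::111122223333:role/fis-experiment-role"
)
print(json.dumps(template, indent=2))
```

A dictionary of this shape could then be passed as keyword arguments to the `create_experiment_template` call of the boto3 `fis` client, together with a client token. Swapping the action ID for aws:ec2:terminate-instances (and dropping the parameters) would model the termination case.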

Amazon RDS failure

RDS failure is another common scenario. It helps you identify and troubleshoot problems caused by failovers and reboots of managed database nodes. FIS can inject reboot and failover conditions into managed RDS instances so that you can understand bottlenecks and issues related to failovers, synchronization failures, and other database failure modes.
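The two database failure modes map to two FIS actions, aws:rds:reboot-db-instances and aws:rds:failover-db-cluster. The sketch below shows, under stated assumptions, how the `actions` entry of an experiment template might be built for each. The helper name and the target label are hypothetical; the target label must match a target defined elsewhere in the same template.

```python
def rds_failure_action(mode: str, target_name: str) -> dict:
    """Return an FIS action definition for an RDS reboot or cluster failover."""
    if mode == "reboot":
        return {
            "actionId": "aws:rds:reboot-db-instances",
            # forceFailover only has an effect on Multi-AZ DB instances.
            "parameters": {"forceFailover": "true"},
            "targets": {"DBInstances": target_name},
        }
    if mode == "failover":
        return {
            "actionId": "aws:rds:failover-db-cluster",
            "targets": {"Clusters": target_name},
        }
    raise ValueError(f"unsupported mode: {mode!r}")


# "primaryDb" is a hypothetical target label for illustration.
reboot = rds_failure_action("reboot", "primaryDb")
failover = rds_failure_action("failover", "primaryDb")
print(reboot["actionId"], failover["actionId"])
```

The reboot action targets resources of type aws:rds:db, while the failover action targets aws:rds:cluster, so the matching `targets` section differs between the two modes.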

Severe network latency degradation

Network latency degradation injects latency into the network interface that connects two systems. This helps you understand how these systems handle delayed data transfer and how prepared your operational response is (alerts, metrics, and remediation). This FIS action (aws:ssm:send-command/AWSFIS-Run-Network-Latency) uses the Linux traffic control (tc) utility.
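The latency action runs an AWS-owned SSM document on the target instances, and its document parameters are passed as a JSON-encoded string. The following sketch shows one way such an action definition might be assembled. The helper name, Region, and target label are hypothetical, and the document parameter names (`DelayMilliseconds`, `DurationSeconds`, `InstallDependencies`) should be checked against the current version of the AWSFIS-Run-Network-Latency document before use.

```python
import json


def latency_action(region: str, target_name: str, delay_ms: int, minutes: int) -> dict:
    """Return an FIS action that adds network latency through an SSM document."""
    document_parameters = {
        "DelayMilliseconds": str(delay_ms),
        "DurationSeconds": str(minutes * 60),
        "InstallDependencies": "True",  # let the document install tc if missing
    }
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": (
                f"arn:aws:ssm:{region}::document/AWSFIS-Run-Network-Latency"
            ),
            # FIS expects the document parameters as a JSON string, not a dict.
            "documentParameters": json.dumps(document_parameters),
            "duration": f"PT{minutes}M",
        },
        "targets": {"Instances": target_name},
    }


# Hypothetical Region and target label: 200 ms of delay for 2 minutes.
action = latency_action("eu-west-2", "appInstances", delay_ms=200, minutes=2)
print(json.dumps(action, indent=2))
```

Keeping the action-level `duration` and the document-level `DurationSeconds` derived from the same input avoids the two values drifting apart.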

Network connectivity disruption

Connectivity issues such as traffic disruptions can be simulated with FIS network actions. FIS supports the aws:network:disrupt-connectivity action to test the resiliency of your applications when connectivity is fully or partially lost within a subnet. It also supports disrupting connectivity (including cross-Region) with other AWS network components such as route tables or AWS Transit Gateway.
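The aws:network:disrupt-connectivity action takes a `scope` parameter that selects which traffic is denied for the targeted subnets. As a rough sketch, the helper below validates the scope before building the action definition. The helper name, target label, and prefix list ID are hypothetical.

```python
from typing import Optional

# Scope values documented for aws:network:disrupt-connectivity.
ALLOWED_SCOPES = {"all", "availability-zone", "dynamodb", "prefix-list", "s3", "vpc"}


def disrupt_connectivity_action(
    scope: str,
    target_name: str,
    duration: str = "PT5M",
    prefix_list_id: Optional[str] = None,
) -> dict:
    """Return an FIS action that denies the chosen traffic for target subnets."""
    if scope not in ALLOWED_SCOPES:
        raise ValueError(f"unsupported scope: {scope!r}")
    parameters = {"duration": duration, "scope": scope}
    if scope == "prefix-list":
        # The prefix-list scope needs to know which prefix list to block.
        if prefix_list_id is None:
            raise ValueError("prefix_list_id is required for scope 'prefix-list'")
        parameters["prefixListIdentifier"] = prefix_list_id
    return {
        "actionId": "aws:network:disrupt-connectivity",
        "parameters": parameters,
        "targets": {"Subnets": target_name},
    }


# "appSubnets" is a hypothetical target label of type aws:ec2:subnet.
az_isolation = disrupt_connectivity_action("availability-zone", "appSubnets")
print(az_isolation)
```

Starting with a narrow scope such as `s3` or `dynamodb` and widening to `all` in later experiments keeps the blast radius small while observability is still being validated.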

Amazon EBS volume failure (I/O pause)

Disk failure is especially problematic in systems that depend on real-time operations. It can cause transactions to fail due to I/O or storage errors during peak activity of heavy workloads. EBS volume failure actions test system behavior under different disk failure scenarios. FIS supports the aws:ebs:pause-volume-io action to suspend I/O operations on target EBS volumes, as well as other failure modes. Target volumes must be in the same Availability Zone and must be attached to instances built on the AWS Nitro System.
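The same-Availability-Zone constraint can be checked before an experiment is created. The sketch below builds the `targets` and `actions` parts of a template for aws:ebs:pause-volume-io and rejects a volume set that spans more than one zone. The input record layout (`arn`, `availability_zone`), the labels, and the ARNs are hypothetical, and the Nitro requirement is not checked here.

```python
def pause_io_template_parts(volumes: list, duration: str = "PT5M") -> tuple:
    """Return (targets, actions) for an I/O pause on volumes in a single AZ."""
    zones = {v["availability_zone"] for v in volumes}
    if len(zones) != 1:
        # Covers both an empty list and volumes spread across several zones.
        raise ValueError(f"volumes must share exactly one AZ, got: {sorted(zones)}")
    targets = {
        "pausedVolumes": {
            "resourceType": "aws:ec2:ebs-volume",
            "resourceArns": [v["arn"] for v in volumes],
            "selectionMode": "ALL",
        }
    }
    actions = {
        "pauseIo": {
            "actionId": "aws:ebs:pause-volume-io",
            "parameters": {"duration": duration},
            "targets": {"Volumes": "pausedVolumes"},
        }
    }
    return targets, actions


# Hypothetical volume records for illustration only.
volumes = [
    {"arn": "arn:aws:ec2:eu-west-2:111122223333:volume/vol-0aaa", "availability_zone": "eu-west-2a"},
    {"arn": "arn:aws:ec2:eu-west-2:111122223333:volume/vol-0bbb", "availability_zone": "eu-west-2a"},
]
targets, actions = pause_io_template_parts(volumes)
print(actions["pauseIo"]["actionId"], len(targets["pausedVolumes"]["resourceArns"]))
```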

Results and conclusion

As a result of the experiments, the LSEG teams identified a series of architectural improvements aimed at reducing application recovery time and improving the granularity of metrics and alerts. A second tangible result is that the teams now have a reusable chaos engineering methodology and toolset. Hosting regular in-person cross-functional events is a great way to establish a chaos engineering practice in your organization.

You can start your resilience journey on AWS today with AWS Resilience Hub.