Date: 2024-05-21



To keep the Spanner database working reliably, Google engineers use chaos testing: they inject failures into production-like instances to verify that the system behaves correctly even in the face of unexpected failures.

As Google engineer James Corbett explains, Spanner is built on a solid foundation of low failure rates provided by machine, disk, and network hardware. However, this alone is not enough to guarantee proper operation under all conditions, including data corruption due to bad memory or disks, network failures, and software errors.

According to Corbett, fault-tolerant techniques such as checksums to detect data corruption, data replication, and the Paxos algorithm for consensus are key to masking failures and achieving high reliability. But to ensure everything works as expected, these techniques need to be exercised and proven effective. That's where chaos testing comes in. It consists of intentionally injecting failures into a production-like instance at a much higher rate than would occur in production.

Google runs more than 1,000 system tests per week to verify that Spanner's design and implementation actually hide faults and provide reliable service. Each test creates a new Spanner instance that runs on the same computing platform as production Spanner and consists of several hundred processes using the same dependent systems (file system, locking services, etc.) as production Spanner.

The injected faults belong to several categories, including server crashes, file faults, RPC faults, memory/quota faults, and cloud faults.

A server crash can be induced at any time by sending a SIGABRT signal, which triggers the recovery logic. Recovery involves aborting all distributed transactions coordinated by the crashed server, forcing all clients accessing that server to fail over to another server, and using an on-disk log of all operations to avoid losing data that was held only in memory.
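As a rough illustration of this kind of crash injection, the sketch below starts a toy worker process, kills it with SIGABRT, and then replays the log it left on disk. It is not Spanner's test harness: the worker script, the `crash_and_recover` helper, and the log format are all hypothetical, and the example assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Hypothetical worker: it appends every operation to an on-disk log and
# syncs it before moving on, so no operation lives only in memory.
WORKER = r"""
import os, sys, time
with open(sys.argv[1], "a") as log:
    i = 0
    while True:
        log.write(f"op-{i}\n")
        log.flush()
        os.fsync(log.fileno())
        i += 1
        time.sleep(0.01)
"""


def crash_and_recover(timeout_s: float = 10.0):
    """Start the worker, crash it with SIGABRT, then replay its log."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "ops.log")
        proc = subprocess.Popen([sys.executable, "-c", WORKER, log_path])

        # Wait until the worker has logged at least one operation.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if os.path.exists(log_path) and os.path.getsize(log_path) > 0:
                break
            time.sleep(0.01)

        proc.send_signal(signal.SIGABRT)  # inject the crash
        proc.wait()

        # "Recovery": keep only complete lines from the log.
        with open(log_path) as log:
            recovered = [line.strip() for line in log if line.endswith("\n")]
        return proc.returncode, recovered


if __name__ == "__main__":
    code, ops = crash_and_recover()
    print("exit status:", code, "recovered operations:", len(ops))
```

On POSIX, `subprocess` reports a child killed by a signal with a negative return code, so a successful injection shows up as `-signal.SIGABRT`.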

File faults are injected by intercepting all file system calls and randomly modifying their results. For example, an intercepted call may return an error code, corrupt the content being read or written, or never return, which triggers a timeout.
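The idea of intercepting file calls and perturbing their results can be sketched with a small wrapper around a file object. The `FaultyFile` class and its probability parameters are hypothetical names chosen for illustration, and the sketch covers only errors and corruption on reads; the "never return" case is left out.

```python
import io
import random


class FaultyFile:
    """Wraps a binary file object and randomly perturbs read() results."""

    def __init__(self, inner, rng, p_error=0.1, p_corrupt=0.1):
        self.inner = inner
        self.rng = rng              # seeded random.Random for reproducibility
        self.p_error = p_error      # probability of an injected I/O error
        self.p_corrupt = p_corrupt  # probability of corrupting the data

    def read(self, size=-1):
        roll = self.rng.random()
        if roll < self.p_error:
            raise OSError("injected I/O error")
        data = self.inner.read(size)
        if data and roll < self.p_error + self.p_corrupt:
            corrupted = bytearray(data)
            pos = self.rng.randrange(len(corrupted))
            corrupted[pos] ^= 0xFF  # flip every bit of one byte
            return bytes(corrupted)
        return data


if __name__ == "__main__":
    f = FaultyFile(io.BytesIO(b"payload"), random.Random(42), p_corrupt=0.5)
    try:
        print(f.read())
    except OSError as err:
        print("read failed:", err)
```

Passing in a seeded `random.Random` keeps a failing run reproducible, which matters when a test has to be replayed to debug the fault it uncovered.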

A similar approach is taken for interprocess communication over RPC. Here, RPC calls are intercepted and either delayed or made to return error codes, simulating network partitions, remote system crashes, or bandwidth throttling.
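A minimal sketch of the same interception idea for RPCs might wrap a client stub so that some calls fail and others are delayed. The `with_rpc_faults` helper, the `RpcFault` exception, and the parameter names are assumptions made for this example, not part of any real RPC library.

```python
import random
import time


class RpcFault(Exception):
    """Hypothetical error standing in for a failed RPC status."""


def with_rpc_faults(call, rng, p_error=0.1, p_delay=0.1, delay_s=0.5,
                    sleep=time.sleep):
    """Return a wrapper around `call` that injects errors and delays."""

    def wrapper(*args, **kwargs):
        roll = rng.random()
        if roll < p_error:
            # Looks to the caller like a partition or a crashed peer.
            raise RpcFault("injected RPC failure")
        if roll < p_error + p_delay:
            # Looks to the caller like a slow or throttled network.
            sleep(delay_s)
        return call(*args, **kwargs)

    return wrapper


if __name__ == "__main__":
    def lookup(key):  # stand-in for a real RPC stub
        return f"value-for-{key}"

    flaky_lookup = with_rpc_faults(lookup, random.Random(7), delay_s=0.05)
    for key in ("a", "b", "c"):
        try:
            print(flaky_lookup(key))
        except RpcFault as err:
            print("call failed:", err)
```

The `sleep` parameter is injectable so that a test can record the requested delays instead of actually waiting.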

When it comes to memory faults, the Spanner team focuses on two specific behaviors. One simulates a pushback situation, in which a server becomes overloaded and clients start redirecting requests to less busy replicas; the other leaks enough memory that the process is killed. Similarly, the tests simulate “quota exceeded” errors, whether for per-user disk space, memory, or flash storage.
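The pushback behavior can be pictured from the client's side as follows. This is only a sketch under simplifying assumptions: the `Overloaded` exception and the `call_with_failover` helper are hypothetical, and real clients would also weigh replica load and back off between attempts.

```python
class Overloaded(Exception):
    """Hypothetical pushback signal raised by an overloaded replica."""


def call_with_failover(replicas, request):
    """Try each replica in order, moving on whenever one pushes back."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Overloaded as err:
            last_error = err  # remember the pushback and try the next one
    raise RuntimeError("all replicas pushed back") from last_error


if __name__ == "__main__":
    def busy_replica(request):
        raise Overloaded("too many in-flight requests")

    def idle_replica(request):
        return f"handled {request}"

    print(call_with_failover([busy_replica, idle_replica], "read row 1"))
```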

Cloud fault injection is intended to test unusual conditions related to the Spanner API front-end servers. Here, a Spanner API front-end server is crashed, forcing client sessions to migrate to other front-end servers, to verify that clients see no impact other than additional latency.

Finally, Google engineers also simulate an entire region becoming unreachable, for example because of a file system or network outage, forcing Spanner to serve data from a quorum of the other regions according to the Paxos algorithm.
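The quorum condition behind this test is simple to state: Paxos can make progress only while a strict majority of voting replicas remains reachable. The check below illustrates that rule with one voting replica per region; the `has_quorum` helper and the region names are hypothetical and do not reflect how Spanner configures its replicas.

```python
def has_quorum(voting_regions, unreachable):
    """Return True if a strict majority of voting regions is reachable."""
    voters = set(voting_regions)
    reachable = voters - set(unreachable)
    return len(reachable) > len(voters) // 2


if __name__ == "__main__":
    regions = ["region-a", "region-b", "region-c"]  # hypothetical names
    print(has_quorum(regions, ["region-a"]))              # one region lost
    print(has_quorum(regions, ["region-a", "region-b"]))  # two regions lost
```

With three voting regions, losing one still leaves two of three reachable, which is a majority; losing two does not.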

Corbett concludes that this approach, which combines fault-tolerant design with continuous chaos testing, allows Google to effectively validate Spanner's reliability.

