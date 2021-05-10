



Google has devised a machine learning (ML) model that predicts disk failures with 98% accuracy. The aim is to reduce the data recovery work needed when a disk actually fails.

According to a Google blog post by technical program manager Nitin Agarwal and AI engineer Rostam Dinyari, Google manages millions of hard disk drives (HDDs), some of which fail. If these failures are not identified in time, they can cause serious outages across many products and services.

When a non-fatal problem, something short of an outright failure, occurs on a disk in a Google data center, the data is first read off (drained from) the drive. The drive is then taken out of production use, diagnosed, repaired and put back into production. Google wanted a way to predict such disk failures accurately to avoid wasting that time.

Google worked with HDD maker Seagate and consultants from Accenture to build a machine learning model for predicting disk failures. It was built using Google Cloud and based on the two most common drive types. The goal was to predict the probability that a disk is a recurring-failure disk (one that fails within 30 days or has had three or more problems).
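The recurring-failure definition above can be sketched as a labeling rule. This is only an illustration: the `DiskEvent` record, its field names, and the choice to count problems inside the same 30-day window are assumptions, not details from Google's blog.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW = timedelta(days=30)  # horizon named in the article
PROBLEM_THRESHOLD = 3        # "three or more problems"


@dataclass(frozen=True)
class DiskEvent:
    """Hypothetical repair-log record; the field names are illustrative."""
    disk_id: str
    day: date
    kind: str  # "problem" (non-fatal) or "failure"


def is_recurring_failure(events, disk_id, as_of):
    """Label one disk from its events in [as_of, as_of + WINDOW)."""
    in_window = [
        e for e in events
        if e.disk_id == disk_id and as_of <= e.day < as_of + WINDOW
    ]
    failed = any(e.kind == "failure" for e in in_window)
    problems = sum(e.kind == "problem" for e in in_window)
    # Assumption: the problem count is taken over the same 30-day window.
    return failed or problems >= PROBLEM_THRESHOLD
```

A label like this would be the target column a classifier is trained on; the real labeling logic may differ.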

The model needs a steady supply of data to work. The engineers not only built the ML model but also set up an automated data pipeline that moves HDD telemetry into a data store and feeds it to the model.

The source HDD data amounted to hundreds of parameters and terabytes of volume, drawn from billions of rows of drive-level SMART (Self-Monitoring, Analysis and Reporting Technology) data read hourly. It was supplemented by host system data, repair logs, online vendor diagnostic (OVD) or field-accessible reliability metric (FARM) logs, and manufacturing data for each disk drive.

Figure: Google’s data pipeline for the production machine learning model of disk failures

The engineers used a DevOps approach to integrate the data pipeline with the model and keep it fed with data, a practice known as MLOps. They used tools and services such as AutoML Tables, Terraform, TensorFlow, BigQuery and Dataflow. Two models were devised: an AutoML Tables classifier and a custom deep Transformer-based model built with TensorFlow.

The AutoML Tables design used aggregations of time-series features, such as a disk's minimum, maximum and average read error rates. These were combined with non-time-series features such as drive model type.
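As a rough illustration of this kind of feature table, the sketch below collapses hourly readings into one row per disk and attaches a static attribute. The function name, column names and input shapes are hypothetical; the actual pipeline ran on services such as BigQuery and Dataflow rather than in-memory Python.

```python
from collections import defaultdict
from statistics import mean


def build_feature_rows(readings, drive_models):
    """Aggregate hourly samples into one tabular row per disk.

    readings: iterable of (disk_id, read_error_rate) samples.
    drive_models: dict mapping disk_id to drive model name (static feature).
    """
    by_disk = defaultdict(list)
    for disk_id, rate in readings:
        by_disk[disk_id].append(rate)

    rows = {}
    for disk_id, rates in by_disk.items():
        rows[disk_id] = {
            # Time-series aggregations over the observation window.
            "read_error_rate_min": min(rates),
            "read_error_rate_max": max(rates),
            "read_error_rate_avg": mean(rates),
            # Non-time-series feature attached to the same row.
            "drive_model": drive_models.get(disk_id, "unknown"),
        }
    return rows
```

Each resulting row is a fixed-width record, which is the shape a tabular classifier expects.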

The alternative Transformer model is fed the raw time-series data directly. Non-time-series data is fed into a deep neural network, whose output is concatenated with the output of the Transformer and used to predict potential failures.
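The two-branch layout described here can be illustrated with a toy sketch in plain Python. It is not the TensorFlow model: mean pooling over time stands in for the Transformer encoder, a single dense layer stands in for the deep neural network, and all weights are arbitrary placeholders rather than trained values.

```python
import math


def mean_pool(sequence):
    """Stand-in for the Transformer encoder: average each feature over time."""
    steps = len(sequence)
    width = len(sequence[0])
    return [sum(step[i] for step in sequence) / steps for i in range(width)]


def dense_relu(inputs, weights, biases):
    """One fully connected layer with ReLU; weights[j] is the row for output j."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_failure_probability(sequence, static, static_w, static_b,
                                head_w, head_b):
    ts_vec = mean_pool(sequence)                         # time-series branch
    static_vec = dense_relu(static, static_w, static_b)  # non-time-series branch
    fused = ts_vec + static_vec                          # concatenate both outputs
    logit = sum(w * x for w, x in zip(head_w, fused)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))                # sigmoid -> probability
```

The point of the sketch is the fusion step: each branch produces its own vector, and the prediction head sees both side by side.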

Once the model is deployed, its predictions are stored and compared with the actual drive repair logs 30 days later. The AutoML model achieved 98% accuracy, compared with 70-80% for the custom Transformer and deep neural network design.
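The comparison of stored predictions with later repair logs might look like the following sketch. The dictionaries and the `evaluate` helper are hypothetical, and precision and recall are included only for context, since the article reports a single accuracy figure.

```python
def evaluate(predictions, outcomes):
    """Score stored predictions against outcomes observed 30 days later.

    predictions: dict disk_id -> bool (predicted recurring failure).
    outcomes: dict disk_id -> bool (what the repair log later showed).
    Disks missing from the repair-log outcomes are not scored.
    """
    scored = [d for d in predictions if d in outcomes]
    tp = sum(predictions[d] and outcomes[d] for d in scored)
    fp = sum(predictions[d] and not outcomes[d] for d in scored)
    tn = sum(not predictions[d] and not outcomes[d] for d in scored)
    fn = sum(not predictions[d] and outcomes[d] for d in scored)
    return {
        "accuracy": (tp + tn) / len(scored) if scored else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }
```

Because failing disks are rare, accuracy alone can look high even for a weak model, which is why the sketch reports the other two measures alongside it.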

The blog authors said they were able to explain the model by identifying the main reasons behind recurring failures, enabling ground teams to take proactive action to reduce failures in operations before they occur.

Elias Glavinas, Director of Tools & Automation, Quality Data Analytics at Seagate, said: “[...] manual effort. In addition to that, simple and automated model retraining and deployment helped make this a very successful project.”

The engineers said: “The business case for predicting HDD failures with ML-based systems is becoming stronger. If engineers have a larger window in which to identify failing disks, they can not only reduce costs but also prevent problems before they affect end users. There are already plans to extend the system to support all Seagate drives.”

