What Data Can Tell Us about Hard Drive Reliability

Published on November 16, 2016

The company Backblaze provides cloud backup and storage services for both personal and business use. You may be surprised to learn that these days the size of a data center is measured in megawatts (MW) of electricity consumed rather than in disk space or the square footage of the floor where the computers reside. For example, one of the largest data centers in China currently consumes 150 MW of electricity, which is enough to power over 100,000 homes.
This article analyses nine months of drive failure data from Backblaze. Arcadia Instant shows which models and brands tend to be more durable and which fail more often. In Arcadia Instant, you can visualize all this data at the desired level of granularity, with customizable apps and visuals.

This is the full application:

[Screenshot: the full application]

Each visual in this app answers a specific question. Let’s look at them one by one.

Which brand has the most failed drives?
The first visual shows which brands had the most failed drives. Samsung and Seagate lead, followed by Hitachi, Western Digital and so on.

[Screenshot: number of failed drives by brand]

Which brand has the highest drive failure rate?
It’s not fair to look only at raw numbers, because the customer may have purchased more drives of one particular brand and fewer of the others. To correct for this, the data needs to be normalized. The second visual shows the percentage of each brand’s drives that failed.

[Screenshot: drive failure percentage by brand]

Once the data is normalized, you can see that Western Digital has the highest failure rate, followed by Seagate.
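To make the normalization concrete, here is a minimal sketch in Python. The sample drives are hypothetical and are not taken from the Backblaze data; the function name `failure_rates` is also just an illustration. It shows how a brand with more failed drives in absolute terms can still have the lower failure rate.

```python
from collections import defaultdict

# Hypothetical sample, one entry per drive: (brand, serial_number, failed).
# Seagate: 8 drives, 2 failed. Western Digital: 2 drives, 1 failed.
drives = (
    [("Seagate", f"S{i}", i < 2) for i in range(8)]
    + [("Western Digital", f"W{i}", i < 1) for i in range(2)]
)

def failure_rates(drives):
    """Return {brand: (failed_count, total_count, percent_failed)}."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for brand, _serial, has_failed in drives:
        totals[brand] += 1
        if has_failed:
            failed[brand] += 1
    return {b: (failed[b], totals[b], 100.0 * failed[b] / totals[b]) for b in totals}

rates = failure_rates(drives)
# Sort by failure percentage, highest first.
for brand, (f, n, pct) in sorted(rates.items(), key=lambda kv: -kv[1][2]):
    print(f"{brand}: {f}/{n} failed ({pct:.1f}%)")
```

In this made-up sample Seagate has more failed drives (2 versus 1), yet Western Digital has the higher failure rate (50 percent versus 25 percent), which is the same kind of reversal the two visuals show.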

What’s the number of drive failure incidents per month?
You can see in table format how the monthly counts stack up.

[Screenshot: table of monthly record counts]

The table shows that in month one there were 1.29 million records. These records are not failed drives; they are events logged by the hard drives. For example, when a drive skips a particular sector, the error is recorded as an event. A single drive might therefore log a thousand things going wrong before it fails completely.
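The difference between counting records and counting failed drives can be sketched as follows. This assumes a simplified, hypothetical row layout of (record date, serial number, failure flag); the dates and serial numbers are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (record_date, serial_number, failure_flag).
rows = [
    (date(2016, 1, 5), "A", 0),
    (date(2016, 1, 6), "A", 0),
    (date(2016, 1, 7), "A", 1),  # drive A fails in January
    (date(2016, 1, 7), "B", 0),
    (date(2016, 2, 1), "B", 0),
    (date(2016, 2, 2), "B", 1),  # drive B fails in February
]

def monthly_counts(rows):
    """Return (records_per_month, failures_per_month) keyed by 'YYYY-MM'."""
    records = Counter()
    failures = Counter()
    for day, _serial, failure in rows:
        month = day.strftime("%Y-%m")
        records[month] += 1         # every row is one record
        failures[month] += failure  # only rows flagged as failures count here
    return records, failures

records, failures = monthly_counts(rows)
for month in sorted(records):
    print(month, "records:", records[month], "failures:", failures[month])
```

Even in this tiny sample, the record count per month is larger than the number of failed drives, which is why the 1.29 million figure should not be read as a count of failures.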

Which are the least reliable models?
Let’s also take a look at which models are most prone to failure.

[Screenshot: table of failure rates by model]

This table makes it clear that the ST320005XXX from Seagate is the model that fails most often, with a 1.5 percent failure rate. This is substantially higher than the next Seagate model, which has a 0.6 percent failure rate.

What happens to the failed drives?
The next chart shows how Backblaze handles failed drives. In theory, when a drive fails, you want to take it out of commission as soon as possible so that you don’t lose your customers’ precious data. This chart shows drives that continued to run after failure. Hovering over a row in the application shows a tooltip with the number of days the drive ran after failure, along with additional information about the drive such as its serial number. Wherever you see a break, that’s where they stopped running the drive.

[Screenshot: chart of drives running after failure]

The chart shows that, for the most part, they’re pretty good at pulling drives. In the top ten rows, the failed drives were pulled right away, on the day they failed.

However, the next two rows – number 11 and 12 – indicate a couple of instances where they left a failed drive in for three more days, and in some cases, even six more days.

[Screenshot: rows 11 and 12, failed drives left in for several days]

This suggests that they didn’t catch a drive failure for a couple of days – maybe it was a weekend. It is also possible that the failure reporting mechanism did not report the malfunction in a timely manner.

Interestingly, there are also instances where a failed drive was pulled out of commission and then put back. For instance, row 16 in the following visual shows a hard drive that failed, was pulled out of the system, and was put back in after a couple of days. It ran for a day or two and was then pulled out of the system again.

[Screenshot: row 16, a failed drive pulled and then put back]

Maybe they changed one of the components of that hard drive and put it back into production, but the drive failed again soon afterward and had to be decommissioned.

There are also instances where a failed hard drive was successfully fixed and the repaired drive didn’t fail for another six months. Take a look at the second, third, and fourth rows from the bottom in the following visual.

[Screenshot: repaired drives that ran for another six months]

The drive lasted another six months after the fix, but it failed again, and at that point it was decommissioned.
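The "days run after failure" figure discussed in this section can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the rows use a hypothetical (record date, serial number, failure flag) layout, and a drive is treated as still running on any day it has a record. It is not necessarily how the chart in the application was built.

```python
from datetime import date

# Hypothetical rows: (record_date, serial_number, failure_flag).
rows = [
    (date(2016, 3, 1), "A", 0),
    (date(2016, 3, 2), "A", 1),  # drive A: no records after the failure day
    (date(2016, 3, 1), "B", 0),
    (date(2016, 3, 2), "B", 1),
    (date(2016, 3, 3), "B", 0),
    (date(2016, 3, 5), "B", 0),  # drive B: still reporting three days later
]

def days_run_after_failure(rows):
    """Return {serial: days between the first failure and the last record}."""
    first_failure = {}
    last_seen = {}
    for day, serial, failure in rows:
        if failure and (serial not in first_failure or day < first_failure[serial]):
            first_failure[serial] = day
        if serial not in last_seen or day > last_seen[serial]:
            last_seen[serial] = day
    # Only drives with at least one failure record are reported.
    return {s: (last_seen[s] - first_failure[s]).days for s in first_failure}

after_failure = days_run_after_failure(rows)
for serial, days in sorted(after_failure.items()):
    print(f"drive {serial} ran {days} day(s) after failure")
```

A value of zero corresponds to a drive pulled on the day it failed; larger values correspond to the rows where a failed drive was left in service.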

Raw SMART Metrics Leading to Failure
This standalone visual shows the SMART metrics. These are the events that occurred within the lifespan of a hard drive; you can think of them as error codes. As you can see here, most of these error codes don’t occur very often. The x-axis counts backward from the day of failure: zero is the day the drive fails, 1 is one day before failure, 2 is two days before failure, and so on.

[Screenshot: raw SMART metrics by days before failure]

A couple of items stand out, for example SMART 198. The chart shows that error code 198 tends to shoot up across drives about two days before failure. Error code 197 also shoots up, about three days before failure. Based on this analysis, a hard drive that has been throwing error 197 for two days is likely to fail in the very near future. This can help predict which drives are about to fail: pulling drives when they throw these particular error codes can reduce the probability of a customer losing data because of a drive failure.
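The backward-counting x-axis described above can be reproduced by aligning each drive’s records on its own failure day and averaging a metric at each offset. The sketch below assumes a hypothetical row layout of (record date, serial number, failure flag, raw SMART 198 value) and at most one failure per drive; the values are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (record_date, serial_number, failure_flag, smart_198_raw).
rows = [
    (date(2016, 4, 1), "A", 0, 0),
    (date(2016, 4, 2), "A", 0, 8),
    (date(2016, 4, 3), "A", 1, 16),
    (date(2016, 4, 8), "B", 0, 0),
    (date(2016, 4, 9), "B", 0, 4),
    (date(2016, 4, 10), "B", 1, 24),
]

def mean_by_days_before_failure(rows):
    """Return {days_before_failure: mean metric value}, where 0 is the failure day."""
    # Assumes at most one failure record per drive.
    failure_day = {serial: day for day, serial, failure, _value in rows if failure}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, serial, _failure, value in rows:
        if serial not in failure_day:
            continue  # drive never failed, so it has no failure day to align on
        offset = (failure_day[serial] - day).days
        if offset < 0:
            continue  # ignore records logged after the failure
        sums[offset] += value
        counts[offset] += 1
    return {offset: sums[offset] / counts[offset] for offset in sorted(sums)}

profile = mean_by_days_before_failure(rows)
for offset, mean_value in profile.items():
    print(f"{offset} day(s) before failure: mean SMART 198 = {mean_value}")
```

Because every drive is shifted so that its failure falls at zero, a metric that rises shortly before failure shows up as a consistent climb toward the left edge of the axis, regardless of the calendar date on which each drive failed.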

Custom Visualization Options
Lastly, this is an overview page that has some customized elements to it.

[Screenshot: overview page with customized elements]

As you can see, the font color and the font type are changed in these visuals. The dashboard also has a fixed width, so its layout isn’t affected when you resize the window; all the elements within the application are bounded by that fixed width. Features like these show how extensible the Arcadia Data platform is.

The scatter and dot plots at the top right and at the bottom also show the range of charts and graphs that can be created in Arcadia Instant.

These are just a few of the built-in visualization options available in Arcadia Instant. You can choose from a variety of chart types such as bar charts, trend lines, pie charts, area graphs, geographical maps and heat maps, and you can add many other types of visualization by leveraging JavaScript and CSS. You can also bring in external elements from other HTML code you may have, as well as raw SQL.

To try your hand at this versatile visual analytics software, download it for free here. The data used in this analysis is publicly available on the Backblaze website.