If you’ve ever had a system fail-slow, you know how maddening it is. The lights are on, the fans are running, but nobody is home. Is it software? A background process run amok?
Naturally, finding these problems took a minimum of hours and often days, weeks, or even months. In one case an entire team of engineers was pulled off a project to diagnose a bug, at a cost of tens of thousands of dollars.
Root causes
The paper summarizes the causes of the 101 fail-slow incidents the researchers analyzed. Network problems were the #1 cause, followed by CPU, disk, SSD, and memory. Most of the network failures were permanent, while SSDs and CPUs had the most transient errors.
Nor does the root cause necessarily rest with the slow hardware, as in the case above where a power-hungry application on some servers caused other servers to slow down. In another case the vendor couldn't reproduce the user's high-altitude failure mode at their sea-level facility.
The Storage Bits take
Any sysadmin plagued by slowdowns should read this paper. The researchers' taxonomy and examples are sure to be helpful in expanding one's vision of what could be happening.
For (one more) example,
In one condition, a fan firmware would not react quickly enough when CPU-intensive jobs were running, and as a result the CPUs entered thermal throttle (reduced speed) before the fans had the chance to cool down the CPUs.
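If you want to check whether something like that is biting your own Linux servers, here's a minimal sketch of the idea. It isn't from the paper; the sysfs paths are standard Linux, but the 80% threshold is my own assumption, so treat it as a starting point rather than a diagnostic tool.

    # Sketch: flag CPU cores running well below their rated maximum frequency
    # (a hint of thermal throttling) and report the hottest thermal zone.
    # The 0.8 threshold is an assumption; tune it for your hardware and workload.
    from pathlib import Path

    CPUS = Path("/sys/devices/system/cpu")
    THERMAL = Path("/sys/class/thermal")

    def read_int(path):
        return int(path.read_text().strip())

    def check():
        slow = []
        for cpu in sorted(CPUS.glob("cpu[0-9]*")):
            freq = cpu / "cpufreq"
            if not freq.exists():
                continue
            cur = read_int(freq / "scaling_cur_freq")      # kHz
            top = read_int(freq / "cpuinfo_max_freq")      # kHz
            if cur < 0.8 * top:
                slow.append((cpu.name, cur, top))
        temps = []
        for zone in THERMAL.glob("thermal_zone*"):
            try:
                temps.append(read_int(zone / "temp") / 1000.0)  # millidegrees C
            except (OSError, ValueError):
                pass
        return slow, temps

    if __name__ == "__main__":
        slow, temps = check()
        for name, cur, top in slow:
            print(f"{name}: {cur/1000:.0f} MHz of {top/1000:.0f} MHz max")
        if temps:
            print(f"hottest thermal zone: {max(temps):.1f} C")

Run it while a CPU-intensive job is active: cores stuck far below their maximum frequency alongside high thermal-zone readings are exactly the fail-slow pattern the paper describes.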
All in all, a fascinating compendium of failure statistics and types. And for those of us who don’t manage large clusters, a welcome sense of many bullets dodged. Whew!
Courteous comments welcome, of course.