
How to lose files with your processor failing to calculate 1.1^53 correctly


Moin moin,

This week's paper discusses how silent hardware failures can lead to actual user-facing errors. In the paper, Facebook describes how some files went missing because a faulty CPU computed a power function incorrectly, returning 0 for 1.1^53 (the correct value is roughly 156.2). These failures were never reported anywhere, and the system seemed completely healthy. It is super interesting to learn about this new error vector for large-scale applications.
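To make the failure concrete, here is a minimal Python sketch of a "known-answer" check: compute 1.1^53 and compare the integer part against the value a healthy CPU should produce (156). The function name `known_answer_check` and its parameters are my own illustration, not something taken from the paper, and a single check like this only covers one instruction path with one input.

```python
import math

def known_answer_check(base=1.1, exponent=53, expected_int=156):
    """Return True if int(base ** exponent) matches the expected value.

    Hypothetical helper for illustration: on a healthy machine
    1.1 ** 53 is roughly 156.2, so the integer part is 156.
    """
    result = math.pow(base, exponent)
    return int(result) == expected_int

if __name__ == "__main__":
    if known_answer_check():
        print("power function looks healthy for this input")
    else:
        print("unexpected result: possible silent data corruption")
```

A real fleet-wide test would run many such checks with different inputs and on every core, since a defect may only show up for specific data on a specific core.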


Abstract:

Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that leads to SDCs. We discuss a real-world example of silent data corruption within a data center application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration on how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs detected for these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.
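As a rough idea of what "fault-tolerant software" could mean at the smallest scale, the Python sketch below computes a power in two different ways and refuses to continue if the results disagree. This is my own illustration and not the mitigation described in the paper; the names `power_by_multiplication` and `checked_power` are hypothetical, and redundant computation on the same faulty core is not guaranteed to catch every defect.

```python
import math

def power_by_multiplication(base, exponent):
    """Compute base ** exponent by repeated multiplication.

    Assumes exponent is a non-negative integer.
    """
    result = 1.0
    for _ in range(exponent):
        result *= base
    return result

def checked_power(base, exponent, rel_tol=1e-9):
    """Compute a power twice via different code paths and compare.

    Raises ArithmeticError if the two results disagree beyond rel_tol,
    instead of silently passing a wrong value on to the caller.
    """
    fast = math.pow(base, exponent)
    slow = power_by_multiplication(base, exponent)
    if not math.isclose(fast, slow, rel_tol=rel_tol):
        raise ArithmeticError(
            f"power mismatch for {base}^{exponent}: {fast} vs {slow}"
        )
    return fast

if __name__ == "__main__":
    print(int(checked_power(1.1, 53)))
```

The trade-off is extra compute for every checked operation, so a check like this would only make sense around values that decide something important, such as whether a file gets written at all.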

Download Link:

https://arxiv.org/pdf/2102.11245.pdf


It would be awesome if you could help grow our little paper community even more by sharing it with your circles (you can also @eu_frey me on Twitter for retweets :D):

simon-frey.com/weeklycspaper

If you have any paper recommendations for me, please do not hesitate to reach out via [email protected] (please keep the Backend & DevOps topic focus in mind).


With much love,

Simon Frey