Tigerbeetle's Storage Fault Model

by LAC-Tech

throw0101b

> Disk performance and read and write latencies can sometimes be volatile, causing latency spikes on the order of seconds.

See Brendan Gregg (and Bryan Cantrill) shouting at hard drives for Sun Microsystems' Fishworks project (ZFS for data and DTrace for instrumentation):

* https://www.youtube.com/watch?v=tDacjrSCeq4

Retrospectives on the video:

* https://www.youtube.com/watch?v=lMPozJFC8g0

* https://www.youtube.com/watch?v=_IYzD_NR0W4

jorangreef

ZFS was an inspiration for TigerBeetle. I love this talk by Jeff Bonwick and Bill Moore:

“ZFS: The Last Word in File Systems”

https://www.youtube.com/watch?v=NRoUC9P1PmA

allknowingfrog

The "Storage Fault Model" section is about halfway down the page.

https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/DE...

LAC-Tech

Thanks, looks like HN truncated the hash fragment when I submitted it.

baq

> TigerBeetle detects and repairs disk corruption (3.45% per 32 months, per disk), detects and repairs misdirected writes where the disk firmware writes to the wrong sector (0.042% per 17 months, per disk)

I'm kind of speechless actually, both at the fact that it can do that and the fact that disk firmwares are actually so bad

morelisp

These numbers didn't ring true at all to me, even for spinning disks. And indeed, I don't think the documentation presents them with correct context - I can't figure out what's "per disk".

> A total of 3.45% of 1.53 million disks developed latent sector errors over a period of 32 months.

IOW, over 2-3 years 3.5% of disks will develop at least one error. Also,

> For most disk models, more than 80% of disks with latent sector errors have fewer than 50 errors.
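To make the "percentage of disks" reading concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. The annualized number assumes disks develop their first error at a constant rate across the 32-month window, which is my simplification and not something the study claims.

```python
# Figures quoted above from the latent sector error study.
total_disks = 1_530_000
fraction_with_errors = 0.0345  # share of disks with at least one error
window_months = 32

# Number of disks that developed at least one latent sector error.
affected_disks = round(total_disks * fraction_with_errors)

# Rough annualized share of disks. ASSUMPTION: first errors appear at a
# constant rate over the window (not stated in the study).
annualized_fraction = 1 - (1 - fraction_with_errors) ** (12 / window_months)

print(f"{affected_disks} disks with at least one error")
print(f"roughly {annualized_fraction:.2%} of disks per year")
```

The point is that the 3.45% describes how many disks were affected at all, not a per-disk error rate.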

jorangreef

Thanks! (and well spotted!)

You're right, we had them as per disk instead of as a percentage of disks. The link was also to Bairavasundaram's Latent Sector Error analysis study instead of to their follow-up corruption analysis study (https://www.usenix.org/legacy/events/fast08/tech/full_papers...), which is what we meant.

I've also clarified whether these are for Enterprise or Nearline HDD. There are other studies also for flash, as well as filesystems, but Bairavasundaram was used here for illustration because of the large scale of the data, as well as the wide variety of faults covered (from LSEs through to corruption) all side by side.

Here's the update:

  TigerBeetle recovers correctly from Latent Sector Errors
  (e.g. 1.4% of Enterprise HDD disks per year on average)
  detects and repairs disk corruption or misdirected I/O
  where firmware reads/writes the wrong sector (e.g. 0.466%
  of Nearline HDD disks per year on average), and detects
  data tampering with hash-chained cryptographic checksums.

Would appreciate hearing if you think we can make this clearer still.
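For readers unfamiliar with the term, a hash-chained checksum can be sketched roughly as below. This is a minimal illustration and not TigerBeetle's on-disk format: SHA-256 is a stand-in hash, and `build_chain` and `first_bad_index` are hypothetical names.

```python
import hashlib

GENESIS = b"\x00" * 32  # assumed parent checksum for the first record


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_chain(payloads):
    """Link each record to its predecessor by hashing parent + payload."""
    chain = []
    parent = GENESIS
    for payload in payloads:
        digest = checksum(parent + payload)
        chain.append({"payload": payload, "parent": parent, "checksum": digest})
        parent = digest
    return chain


def first_bad_index(chain):
    """Return the index of the first record that breaks the chain, or None."""
    parent = GENESIS
    for index, record in enumerate(chain):
        expected = checksum(record["parent"] + record["payload"])
        if record["parent"] != parent or record["checksum"] != expected:
            return index
        parent = record["checksum"]
    return None


chain = build_chain([b"transfer 1", b"transfer 2", b"transfer 3"])
chain[1]["payload"] = b"transfer 2 (tampered)"
print(first_bad_index(chain))
```

Because every record commits to its predecessor's checksum, changing one record and recomputing its own checksum still breaks the link from the next record. The chain is only as trustworthy as its most recent checksum, which has to be held somewhere the attacker cannot rewrite.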

jorangreef

It's possible thanks to “Protocol-Aware Recovery for Consensus-Based Storage” ('18), which TigerBeetle uses to leverage the global redundancy available in the consensus protocol to recover from storage faults in the local storage engine. If you already have replicated durability that you can tap into, like we do, this is at least 3x more cost-efficient than logical RAID, and stronger.
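The general idea of repairing a local storage fault from the cluster's redundancy can be sketched as follows. This is a toy illustration and not the algorithm from the paper or TigerBeetle's code: `Replica` and `read_with_repair` are hypothetical names, SHA-256 is a stand-in, and the checksum is stored next to the data for brevity.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Replica:
    """Toy replica: maps a block address to (data, checksum)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, address, data):
        self.blocks[address] = (data, checksum(data))

    def read_local(self, address):
        """Return the data only if it still matches its checksum."""
        entry = self.blocks.get(address)
        if entry is None:
            return None
        data, expected = entry
        return data if checksum(data) == expected else None


def read_with_repair(replica, peers, address):
    """Read locally; on a checksum failure, fetch an intact copy from a peer."""
    data = replica.read_local(address)
    if data is not None:
        return data
    for peer in peers:
        data = peer.read_local(address)
        if data is not None:
            replica.write(address, data)  # repair the local copy
            return data
    raise IOError(f"block {address} is damaged on every replica")


replicas = [Replica(f"r{i}") for i in range(3)]
for r in replicas:
    r.write(7, b"transfer")
# Simulate corruption on r0: the data changes, the stored checksum does not.
replicas[0].blocks[7] = (b"garbage!", replicas[0].blocks[7][1])
print(read_with_repair(replicas[0], replicas[1:], 7))
```

A real system would keep the expected checksum away from the block itself, for example in the record that references it, so that a misdirected write to the wrong sector is also caught.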

UW-Madison have done some terrific research on disk firmware (and filesystem) bugs. As a recent example, there was also a bug in XFS in May that could result in misdirected writes IIRC: https://bugzilla.redhat.com/show_bug.cgi?id=2208553.

With the storage fault model clearly defined, TigerBeetle can then be tested against it, using storage fault injection on the read/write path, but at much higher fault probabilities, to see how storage faults interact with the storage engine and consensus protocol.

For example, the simulator injects faults with an 8-9% chance per I/O on the read and write path, and it is aware of "f", i.e. how many storage faults it can inject across replicas while expecting (and asserting) that the consensus protocol remains available.
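A budgeted fault injector along these lines might look like the sketch below. This is not TigerBeetle's simulator: the function names are hypothetical, and I am assuming "f" can be modeled as the maximum number of damaged copies of any one block, which simplifies what the comment describes.

```python
import random


def inject_faults(replica_count, block_count, fault_probability, f, seed):
    """Choose (replica, block) pairs to corrupt, with at most f per block."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    faults = set()
    for block in range(block_count):
        faulty_copies = 0
        # Visit replicas in random order so no replica is favored.
        for replica in rng.sample(range(replica_count), replica_count):
            if faulty_copies < f and rng.random() < fault_probability:
                faults.add((replica, block))
                faulty_copies += 1
    return faults


def every_block_recoverable(faults, replica_count, block_count):
    """True if each block still has at least one intact copy."""
    for block in range(block_count):
        intact = sum((r, block) not in faults for r in range(replica_count))
        if intact == 0:
            return False
    return True


faults = inject_faults(
    replica_count=3, block_count=1000, fault_probability=0.09, f=2, seed=42
)
assert every_block_recoverable(faults, replica_count=3, block_count=1000)
print(f"injected {len(faults)} faults, all blocks still recoverable")
```

Injecting faults at a far higher rate than real disks exhibit exercises the recovery paths constantly, and the budget is what lets the test assert availability instead of merely observing it.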

We normally run our simulator on the command line, but as a fun hack, we used Zig to compile TigerBeetle to WASM and then drew graphics to hook into the real events, so you can see a whole simulated virtual cluster running purely client side in a browser tab: https://tigerbeetle.com/blog/2023-07-11-we-put-a-distributed...

throw0101b

> I'm kind of speechless actually, both at the fact that it can do that and the fact that disk firmwares are actually so bad

The Sun/Solaris ZFS folks used to talk about this a lot in the early days as a way of evangelizing the idea of checksums covering everything in the file system. Bryan Cantrill has given a number of talks/rants on firmware, e.g., "Zebras All the Way Down":

* https://www.youtube.com/watch?v=fE2KDzZaxvE

jorangreef

I've always loved ZFS' design decision to use checksums for end-to-end integrity all the way down, and this made an impact on TigerBeetle.
