RAMP-TAO Transactions

Layering Atomic Transactions on Facebook’s Online Graph Store

Audrey Cheng · August 18, 2021

What the research is

RAMP-TAO is a new protocol that improves the developer experience on TAO, Facebook’s online social graph store, by providing stronger transactional guarantees. It is the first protocol to provide transactional semantics over an eventually consistent massive-scale data store while still preserving the system’s overall reliability and performance. RAMP-TAO enables an intuitive read transaction API, so developers do not have to investigate and handle rare anomalies.

Our research demonstrates that RAMP-TAO is feasible to deploy in production: it adds 0.42 percent memory overhead and lets the vast majority of reads (over 99.9 percent) complete in one round trip to the local cache, with tail latency on par with existing TAO reads.

How it works

RAMP-TAO draws inspiration from the Read Atomic Multi-Partition (RAMP) protocols. These protocols prevent fractured reads, in which a reader observes only some of a transaction's writes, but they impose unacceptably high performance and storage overheads for TAO.
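To make the notion of a fractured read concrete, here is a minimal sketch of how one could be detected in a set of read results. It assumes each stored version carries the ID of the transaction that wrote it and the set of sibling keys written by the same transaction. The names `Version`, `txn_id`, `siblings`, and `is_fractured` are hypothetical and chosen for illustration; they are not TAO's or RAMP-TAO's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    txn_id: int                        # assumption: larger IDs are newer writes
    siblings: frozenset = frozenset()  # other keys written by the same transaction


def is_fractured(read_set: dict[str, Version]) -> bool:
    """Return True if the read set shows only part of some transaction.

    The read is fractured when we observe a transaction's write to one key
    but an older version of a sibling key that the same transaction wrote.
    """
    for version in read_set.values():
        for sibling_key in version.siblings:
            other = read_set.get(sibling_key)
            if other is not None and other.txn_id < version.txn_id:
                return True
    return False


# Transaction 2 atomically updates both "a" and "b".
old_b = Version("b", "b1", txn_id=1)
new_a = Version("a", "a2", txn_id=2, siblings=frozenset({"b"}))
new_b = Version("b", "b2", txn_id=2, siblings=frozenset({"a"}))

print(is_fractured({"a": new_a, "b": old_b}))  # True: new "a" with stale "b"
print(is_fractured({"a": new_a, "b": new_b}))  # False: both writes visible
```
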

The RAMP-TAO protocol provides stronger guarantees by layering them on top of TAO and is similar in spirit to the “bolt-on” approach. Since TAO, like most systems, ensures that all data is eventually consistent, we only need to guard against fractured reads for recent, transactionally updated data. We leverage this knowledge to minimize performance and storage overheads, especially for the existing TAO workloads that are not transactional.
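The idea of guarding only recent, transactionally updated data can be sketched roughly as follows. This is an illustrative two-round read in the spirit of RAMP-style protocols, not the RAMP-TAO algorithm from the paper. It assumes a hypothetical `cache` that maps each key to a `(value, txn_id)` pair and a hypothetical `recent_txn` buffer that holds the write sets of transactions that may still be propagating; non-transactional data uses `txn_id` 0 here.

```python
def read_atomic(keys, cache, recent_txn):
    """Read `keys` without fractured reads; return (values, rounds_used).

    cache:      key -> (value, txn_id), the eventually consistent local view
    recent_txn: txn_id -> {key: value}, write sets of recent transactions
    Both structures are assumptions made for this sketch.
    """
    # Round 1: read whatever the local cache currently holds.
    first = {k: cache[k] for k in keys}

    # Check only versions written by recent transactions. Older data is
    # assumed to have fully propagated, so it needs no extra work.
    required = {}
    for _, txn_id in first.values():
        writes = recent_txn.get(txn_id)
        if writes is None:
            continue
        for sibling in writes:
            if sibling in first and first[sibling][1] < txn_id:
                required[sibling] = max(required.get(sibling, 0), txn_id)

    result = {k: value for k, (value, _) in first.items()}
    if not required:
        return result, 1  # fast path: a single round suffices

    # Round 2: fetch the specific missing versions (here, from the buffer).
    for key, txn_id in required.items():
        result[key] = recent_txn[txn_id][key]
    return result, 2


# Transaction 2 wrote "a" and "b", but only "a" has reached this cache so far.
cache = {"a": ("a2", 2), "b": ("b1", 1), "c": ("c0", 0)}
recent_txn = {2: {"a": "a2", "b": "b2"}}

print(read_atomic(["a", "b"], cache, recent_txn))  # ({'a': 'a2', 'b': 'b2'}, 2)
print(read_atomic(["c"], cache, recent_txn))       # ({'c': 'c0'}, 1)
```

In this sketch, reads that touch no recent transactional writes pay only a dictionary lookup on top of the normal cache read, which mirrors the goal of keeping overhead low for non-transactional workloads.
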

Why it matters

TAO serves billions of reads and millions of writes per second and supports many of Facebook’s applications. It has traditionally prioritized availability and scalability to serve its large, read-dominant workloads. By layering stronger transactional guarantees on top of the existing system, we can provide more intuitive system behavior while retaining its reliability and efficiency. This strategy also enables RAMP-TAO to be cache-friendly, hotspot tolerant, and extensible to different data stores. Moreover, we incur overhead only for applications that opt in, rather than causing every application to take a performance hit.

Although we focus on TAO in this work, these properties are crucial to large-scale, read-optimized systems. Our layered approach can be a practical solution for other systems, many of which have sought to strengthen their guarantees.

For more details

Check out our VLDB paper.

Note: This blog post is also published on Facebook’s engineering site.