tl;dr: Current caching policies optimize for the wrong metric on transactional workloads; we should really be caching the groups of objects that are accessed together.
To improve latency at scale, application developers often layer caching systems, such as Memcached and Redis, over standard data stores. Core to their performance are caching policies that determine which keys to retain in cache. While these algorithms are optimized to serve individual requests, they provide surprisingly poor performance for transactions.
To see this, we evaluate the performance of two popular caching algorithms, least recently used (LRU) and least frequently used (LFU), on a Meta production workload from TAOBench. Specifically, we measure the fraction of cached keys that do not improve overall transaction latency (“unhelpful” cached keys). As the figure below shows, up to 95% of cached keys do not help performance.
Standard caching algorithms perform poorly on transactional workloads.
Standard caching algorithms perform poorly because they optimize for object hit rate (OHR): how often requested objects can be served from cache. However, this metric is fundamentally flawed for transactions.
Transactions (sets of operations that appear to be executed atomically) often have many operations that can execute in parallel. As a result, they have an implicit all-or-nothing property: all objects requested in parallel must be present in cache, or there will be little performance improvement because latency is dictated by the slowest access.
A transaction does not see any latency benefits unless all keys are cached together.
Consider for example a simple transaction that reads A, B, and C in parallel. If A and B are in cache but C is not, we get no latency improvement because we still need to wait for C to be fetched from disk. Only when A, B, and C are all cached do we see a latency reduction for the transaction. In this case, “hot” (frequently accessed) keys A and B are requested in parallel alongside a “cold” (rarely accessed) key C. In effect, cold keys can “contaminate” (degrade the cacheability of) hot keys.
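To make this concrete, here is a minimal Python sketch (with made-up latency numbers) of how the slowest access dictates the latency of a parallel read set:

```python
# Illustrative latencies (assumptions, not measured numbers).
CACHE_HIT_MS = 1    # serving a key from cache
DB_READ_MS = 10     # fetching a key from the backing store

def parallel_read_latency(keys, cache):
    """Latency of a parallel read set is dictated by its slowest access."""
    return max(CACHE_HIT_MS if k in cache else DB_READ_MS for k in keys)

# A and B cached but C missing: no improvement, we still wait on C.
print(parallel_read_latency({"A", "B", "C"}, cache={"A", "B"}))        # 10
# Only when A, B, and C are all cached does the transaction get faster.
print(parallel_read_latency({"A", "B", "C"}, cache={"A", "B", "C"}))   # 1
```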
Cache contamination explains why LRU and LFU performed so poorly on the TAOBench workload. Transactions in this workload access either a combination of hot keys and warm keys, or hot keys and cold keys. Single-object algorithms, which use only individual object features to score keys, retain only hot keys but evict most warm keys and all cold keys. As a result, they yield little performance improvement for transactions.
Clearly, OHR is insufficient for transactions. We present a new metric, transactional hit rate, which precisely measures how much caching improves transaction latency. A transactional hit occurs each time the chain of non-cached, sequential accesses in a transaction is shortened by one.
When all keys are cached together, we have a transactional hit.
Intuitively, transactional hit rate represents how often the groups of keys accessed in parallel by transactions are available in cache. This metric captures performance improvements when caching for transactions, much like how its single-object counterpart, object hit rate, does so for individual requests. A formal definition can be found in our OSDI paper.
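As a rough illustration (not the formal definition, which is in the paper), we can model a transaction as a sequence of parallel read sets and count one transactional hit for each set that is served entirely from cache:

```python
def transactional_hits(levels, cache):
    """levels: the parallel read sets of a transaction, in execution order.

    Each level whose keys are all cached removes one step from the chain of
    non-cached, sequential accesses, i.e. one transactional hit.
    """
    return sum(1 for level in levels if all(k in cache for k in level))

# A transaction that reads {A, B, C} in parallel, then reads {D}.
levels = [{"A", "B", "C"}, {"D"}]
print(transactional_hits(levels, cache={"A", "B", "D"}))       # 1 (only {D})
print(transactional_hits(levels, cache={"A", "B", "C", "D"}))  # 2
```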
Traditional heuristics perform poorly on transactional hit rate because they fail to identify the keys that must be cached as a group in order to achieve a transactional hit. This notion of grouping is central to developing a transactionally aware caching policy.
While grouping keys together might seem simple, there are two significant challenges. First, the number of possible groups can be exponential with respect to transaction size, even for simple transactions. To address this issue, we make the observation that many groups give the same number of transactional hits, so they are equivalent in the context of caching. We introduce interchangeable groups to formally capture this concept (more details in our paper). This optimization allows us to reduce the number of groups we need to consider from exponential to linear in most cases.
Furthermore, grouping requires access to transaction dependency graphs, which are typically extracted via static analysis from application code. However, this information may not always be available (as for the TAOBench workload). Consequently, we design a simplified protocol to dynamically infer groups based on request execution times. Specifically, we consider requests sent in parallel by a transaction to be a group. In practice, many applications batch parallel reads to the cache, which often provides an explicit API to support these requests.
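Here is a simplified sketch of that inference, assuming each request carries its transaction ID and a send timestamp (the names and the 1 ms window are illustrative):

```python
from collections import defaultdict

WINDOW_MS = 1.0  # assumed threshold for "sent in parallel"

def infer_groups(requests):
    """requests: iterable of (txn_id, key, send_time_ms) tuples."""
    by_txn = defaultdict(list)
    for txn_id, key, t in requests:
        by_txn[txn_id].append((t, key))

    groups = []
    for reqs in by_txn.values():
        reqs.sort()
        current, start = [], None
        for t, key in reqs:
            if start is not None and t - start > WINDOW_MS:
                groups.append(set(current))  # close the previous group
                current, start = [], None
            if start is None:
                start = t
            current.append(key)
        groups.append(set(current))
    return groups

# A and B were sent together; C was issued later, so it forms its own group.
print(infer_groups([(1, "A", 0.0), (1, "B", 0.2), (1, "C", 5.0)]))
```

Requests arriving in one batched read (e.g., a single multi-get call) trivially fall within the same window.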
With our insights from grouping, we develop DeToX, the first high-performance caching system that optimizes for transactional hit rate. DeToX’s core component is its caching policy. Like standard caching algorithms, DeToX assigns scores to objects and evicts those with the lowest values. As such, its policy is easily adaptable to existing caching systems.
Previous caching policies use heuristics based on individual object features, such as frequency, recency, and size, to score objects. We use the corresponding group features instead: DeToX scores groups of keys within transactions and averages group scores to compute individual key scores. Below, we briefly discuss how we compute group frequency; more details are available in our paper.
Group frequency is computed as the minimum frequency of all keys in a group. The minimum frequency is most important because of the all-or-nothing property: unless all the requested items in a group are cached, there will be no performance improvement. Accordingly, the key with the minimum frequency determines the cacheability of the entire group.
Using the minimum frequency allows us to capture the effects of cache contamination. For example, if a high-frequency key X is accessed only with cold keys (each with much lower frequency than X), then it is not beneficial to cache X.
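Here is a hedged sketch of that scoring idea, using frequency alone (DeToX also incorporates other group features such as recency and size; all names below are illustrative):

```python
from collections import defaultdict

def group_frequency(group, key_freq):
    # All-or-nothing: the coldest key bounds how cacheable the group is.
    return min(key_freq[k] for k in group)

def key_scores(groups, key_freq):
    """Score each key as the average score of the groups it appears in."""
    totals, counts = defaultdict(float), defaultdict(int)
    for group in groups:
        score = group_frequency(group, key_freq)
        for k in group:
            totals[k] += score
            counts[k] += 1
    return {k: totals[k] / counts[k] for k in totals}

# Hot key X is only ever accessed alongside cold keys, so it scores poorly
# and is not worth caching (cache contamination).
freq = {"X": 100, "c1": 1, "c2": 1}
print(key_scores([{"X", "c1"}, {"X", "c2"}], freq))
```

As with standard policies, the keys with the lowest scores are the eviction candidates.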
Our system provides significant performance improvements on an experimental setup using Redis as the cache over a database of either Postgres or TiKV. We achieve up to a 1.3x increase in transactional hit rate and a 3.4x improvement in cache efficiency on a range of real-world workloads and popular OLTP benchmarks. These wins translate into up to 30% improvements in throughput and latency.
DeToX provides significant improvements for transactional hit rate, cache efficiency, throughput, and latency.
DeToX optimizes caching for transactions by storing the groups of keys that are accessed together. There is a principled reason for its performance: grouping enables DeToX to maximize transactional hit rate. Please check out our OSDI paper for more details. Happy caching!