Enhancing Least Frequently Used Caches with Dynamic Aging

Ben Parli
10 min read · Apr 6, 2020


It's been said that cache invalidation is one of the two hard things in Computer Science. Two of the more common cache invalidation policies are Least Recently Used (LRU) and Least Frequently Used (LFU), with LRU being the more popular of the two. At last count there were 402 GitHub repositories tagged as some derivative of an LRU cache and a mere 36 tagged as some derivative of LFU.

But the LFU policy is important in its own right and fits a wide variety of workloads as well. Indeed, the ever-useful Squid proxy used LRU as its only replacement policy option up until version 2.4, when it introduced a variant of LFU called LFU with Dynamic Aging (LFUDA), along with Greedy Dual-Size (GDS), as alternative options. Given the conventional wisdom above on the difficulty of cache invalidation, isn't it curious that LRU is so often the obvious choice? I decided to spend some time understanding a couple of very interesting LFU variations, namely LFUDA and Greedy Dual-Size with Frequency (GDSF).

Why Least Frequently Used

In a nutshell, when the cache is full LRU decides what to evict based on when a given item was last used, while (pure) LFU decides based on the overall usage of a given item. Another way to put it: LRU evicts objects that haven't been accessed recently, while LFU evicts objects that are less popular in the long term.

Oftentimes LRU just makes the most sense. However, it's not hard to think of a workload where an LRU cache becomes polluted and LFU would be the more useful of the two. A typical way to think about this is to picture a CDN caching web assets. As described above, LFU evicts items which are less popular in the long term. Items that are continually useful, however, should stay in the cache, since their hit counts will keep climbing over time. A web page asset like a company's logo (being as I'm writing this during the Coronavirus pandemic, the Netflix logo seems fitting) is one example. To load the page faster, these sorts of assets that are accessed over and over are best kept cached.

In contrast, web assets with some initial hype behind them (sticking with the Netflix theme, think Tiger King) may end up useful in the cache for only a short time, while over the long term the more frequently accessed assets prove more beneficial in the cache. This is the sweet spot of LFU: we know with certainty that the overall most-used items will be cached regardless of time. It's great for caching objects with usage patterns that aren't likely to change much over time.

Side-note on O(1) LFU

Quick side-note on the LFU implementation itself: while an LFU cache is often implemented as a min-heap or red-black tree, an improved implementation uses the O(1) LFU algorithm, and that's what's used in this implementation. To quickly summarize, it uses a hash table together with two levels of doubly linked lists, the hash table providing quick access to each individual item. An item's structure is made up of the item's actual value, its frequency (hits), its priority key, and a pointer to the item's place in a linked list (the freqNode field). The size of the item's value is also stored in order to track the running size of the overall cache.

type item struct {
	key         interface{}
	value       interface{}
	size        float64       // size of the value, for tracking the cache's running size
	hits        float64       // access frequency
	priorityKey float64       // eviction priority; equal to hits under pure LFU
	freqNode    *list.Element // the item's place in a frequency node's list
}

An example O(1) LFU is below. The top list (dark blue, running horizontally) is an ordering of all the current frequencies, referred to as a frequency list in the paper. Each of these nodes, called frequency nodes, in the frequency list contains its own list of items. All items in this frequency node’s list (lighter blue, running vertically) will share the same priority key (frequency).

So in the above example, items Q1, E, P1, and XX all have priority keys of 112 (indicating they’ve each been accessed 112 times).

By the way, why are we referring to priority keys instead of frequencies when the two are equal? Because the priority key can be set according to any policy we want, something we'll see in a bit. Under the basic LFU policy the priority key simply equals the frequency.

Back to the example. If some item, let's say A, is accessed again:

  • The item is retrieved from the hash table using the A key.
  • Item A's frequency is updated from 4 to 5, and its priority key is updated along with it.
  • It no longer belongs on the 4 frequency node; if a 5 frequency node existed, it would be moved there.
  • Since in our example there is no existing 5 frequency node, we create one and insert it into the frequency list between Key 4 and Key 12.
  • Finally, item A is added to the new frequency node's (5) list and removed from frequency node 4's list.
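
In code, that access path might look something like the following sketch. It builds on the item struct from earlier; the cache and freqNode types (and their fields) are my own illustrative assumptions for this walkthrough, not a definitive implementation.

package lfuda

import "container/list"

// freqNode is one node in the horizontal frequency list; items holds
// the vertical list of items that share its priority key.
type freqNode struct {
	priorityKey float64
	items       map[*item]struct{}
}

// cache pairs the hash table with the frequency list.
type cache struct {
	items    map[interface{}]*item
	freqList *list.List // freqNodes in ascending priorityKey order
}

// Get retrieves an item and promotes it to the next frequency node,
// following the steps above. Every operation here is O(1).
func (c *cache) Get(key interface{}) (interface{}, bool) {
	it, ok := c.items[key] // hash table lookup
	if !ok {
		return nil, false
	}
	it.hits++
	it.priorityKey = it.hits // pure LFU: priority key equals frequency

	cur := it.freqNode
	next := cur.Next()
	// Reuse the next frequency node if it already carries the new
	// priority key; otherwise splice a new node in right after cur.
	if next == nil || next.Value.(*freqNode).priorityKey != it.priorityKey {
		next = c.freqList.InsertAfter(&freqNode{
			priorityKey: it.priorityKey,
			items:       make(map[*item]struct{}),
		}, cur)
	}
	// Move the item between the two nodes' lists.
	next.Value.(*freqNode).items[it] = struct{}{}
	delete(cur.Value.(*freqNode).items, it)
	if len(cur.Value.(*freqNode).items) == 0 {
		c.freqList.Remove(cur) // drop now-empty frequency nodes
	}
	it.freqNode = next
	return it.value, true
}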

Why LFU with Dynamic Aging

So LFU is great for some workloads, but everything comes with tradeoffs and LFU is no different. LFU runs into problems when an object is so hyped initially that it makes its way toward the top of the LFU list but is never as useful again. If the hype is a one-time spurt, the cache has now become polluted; that is, the hyped object was useful for some period, but is no longer needed. The hyped object is now taking up room in the cache, room that could be better used by some other item, which is now more susceptible to eviction even though it may be more useful over the longer term. This also poses a problem for any new items entering the cache: they're subject to being evicted very soon since they start with a low counter.

Going back to the example cache above, suppose item M in the 530 frequency node was cached and incremented very quickly to reach its current position, then never accessed again. From the looks of things it will be there a while. The larger item M's value is (in bytes), the more polluted the cache is.

All this leads to variants of LFU which incorporate the idea of “aging”. In some sense this attempts to get the best of both the LRU and LFU worlds by applying a time bias to the cache. Enter LFU with Dynamic Aging (LFUDA). Now the cache has a property called age, which is increased whenever an item is evicted. Specifically, the cache age is set to the evicted item’s priority key. In this way the cache age will always be less than or equal to the minimum priority key in the cache.

Now we see why maintaining an item’s priority key is useful. Under LFUDA, an item’s priority key is set to that item’s frequency plus the cache age. The priority key is set/re-set every time the item is accessed and also when it’s first added to the cache.

So the priority key of an item under pure LFU looks like this:

Ki = Fi

where Ki is the priority key and Fi is the access frequency of item i. I’ve also been referring to frequency as hits.

And now the priority key of an item under LFUDA:

Ki = Fi + L

where L is the cache age.
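
As a minimal sketch against the same hypothetical types from the earlier code (plus an assumed age field added to the cache struct), only two spots change under LFUDA: the priority key update on each access, and the eviction path, which records the victim's priority key as the new cache age. These helpers are illustrative, not the lfuda-go package's actual API.

// touch is called on every access; under LFUDA the priority key is the
// item's frequency plus the current cache age (Ki = Fi + L).
func (c *cache) touch(it *item) {
	it.hits++
	it.priorityKey = it.hits + c.age
}

// evictOne removes an item from the minimum frequency node (the front
// of the frequency list) and inherits its priority key as the cache
// age, keeping the age <= the minimum priority key left in the cache.
func (c *cache) evictOne() {
	front := c.freqList.Front()
	node := front.Value.(*freqNode)
	for victim := range node.items { // any item on the minimum node will do
		c.age = victim.priorityKey
		delete(node.items, victim)
		delete(c.items, victim.key)
		break
	}
	if len(node.items) == 0 {
		c.freqList.Remove(front)
	}
}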

Recall, we want to address the issue of a previously very popular object that then becomes unpopular. Before, it would remain in the cache for a long time, preventing newly added or less popular objects from replacing it. Now, through Dynamic Aging, the priority key of such an object falls behind the rest of the cache, eventually making it eligible for eviction.

Continuing from the above example, let’s also say the cache is full, the cache age is 0, and we perform a Set operation for item BE. Item TT is evicted from the lowest frequency node (Key: 4) since we need to make room (we’ll assume evicting item TT alone clears enough room in the cache). The cache age is updated from 0 to the newly evicted item TT’s priority key (4). The newly set item, BE, has its hits counter set to 1, and its priority key is set accordingly to 1 + the cache age (4) = 5. Since there is already a frequency node 5, the new item BE is added to its list of items.

Now what happens when item W (on frequency node 23) is accessed again? Its hits counter is incremented and its priority key updated (23 + 1 + the cache age (4) = 28). There is no existing frequency node representing a priority key of 28, so one is created and inserted into the frequency list, and item W is added to its (Key: 28) items list.

So now you can hopefully see how the aging factor will eventually evict previously popular items that may now be polluting the cache. The “aging” happens relative to the items being accessed around them.

Why Greedy Dual-Size with Frequency

I mentioned before that cache pollution becomes more impactful when the polluting items contain large values, since they take up valuable space in the cache. The Greedy Dual-Size with Frequency (GDSF) policy counters this by considering size in its priority key calculation. Somewhat aggressively, the priority key now divides an item’s frequency by its size, so for a given frequency a larger item will actually have a lower priority key than a smaller item with the same frequency. In this way, the larger the item, the more popular it must be to stay in the cache. Cache space is heavily prioritized under this policy, and over time the cache will likely come to hold a greater number of smaller items.

Ki = Fi / Si + L

where Si is the size of item i in the cache.
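
Sketched against the same hypothetical types as before, the only change from the LFUDA update is the division by the item's stored size:

// touchGDSF: for a given frequency, a larger value now means a lower
// priority key (Ki = Fi/Si + L), so big items must earn their space.
func (c *cache) touchGDSF(it *item) {
	it.hits++
	it.priorityKey = it.hits/it.size + c.age
}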

So LFUDA combats cache pollution by aging the cache and making less recently accessed objects eligible for eviction sooner than they otherwise would have been under LFU. GDSF takes this a step further by prioritizing smaller objects in the cache, making current pollution less impactful. It should be noted, of course, that this could have adverse effects if the larger objects should actually have stayed in the cache longer.

Works in Theory, but Does It Work in Practice?

I put all this into practice in a Go package at https://github.com/bparli/lfuda-go. The package contains policies for LFU, LFUDA, and GDSF as described above.

To compare the LFU, LFUDA, and GDSF cache performances I wrote a simple caching file server. For the purposes of this exercise it just serves from a directory of various-sized bin files and is set to cache 100MB. At this size, some of the files in the directory will take up ~3% of the cache space, so under contention, pollution could matter. Most of the files are much smaller, however. Finally, to simulate a usage pattern, I wrote a simple client script; the client GETs random files from the server, but some more often than others. The first 10 files are very frequently accessed, the next 190 less frequently. Then, to simulate the pollution issues described above, files not likely to be requested again (or not very often) are accessed in spurts to make them temporarily popular.
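
For a rough idea of what such a client looks like, here is a sketch of the request loop. The server URL, file naming scheme, weights, and burst length are my guesses at the shape of the test, not the actual benchmark script.

package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// get requests a single file from the caching file server.
func get(n int) {
	resp, err := http.Get(fmt.Sprintf("http://localhost:8080/file-%d.bin", n))
	if err != nil {
		return
	}
	resp.Body.Close()
}

func main() {
	for i := 0; i < 10000; i++ {
		switch r := rand.Float64(); {
		case r < 0.6: // hot set: the first 10 files take most of the traffic
			get(rand.Intn(10))
		case r < 0.95: // warm set: the next 190 files, requested less often
			get(10 + rand.Intn(190))
		default: // pollution: a cold file accessed in a one-time spurt
			n := 200 + rand.Intn(50)
			for j := 0; j < 20; j++ {
				get(n)
			}
		}
	}
}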

First, the response times were generally the same for all the cache policies. From the No Cache times, we can see the caches all buy us some improved performance.

+----------+-----------+----------+----------+
| Policy   | Mean (ms) | p95 (ms) | p99 (ms) |
+----------+-----------+----------+----------+
| LFU      | 1.33      | 2        | 3        |
| LFUDA    | 1.36      | 2        | 3        |
| GDSF     | 1.33      | 2        | 3        |
| No Cache | 1.9       | 3        | 4        |
+----------+-----------+----------+----------+

Now for the more interesting part: in terms of cache hit ratio, LFUDA outperformed regular LFU, indicating the polluting element of the usage pattern did inhibit LFU to some extent. Interestingly, though, GDSF outperformed both.

The other interesting part of the results is the length of the cache (the number of items it holds) when the test finishes. While LFU and LFUDA ended at roughly the same length, GDSF contained more than twice as many items at the end of the test. As noted above, GDSF aggressively prioritizes smaller items, and this is reflected in the Cache Length column of the results. At least for this usage pattern, keeping as many items in the cache as possible yields the best cache performance.

+--------+-------------------+--------------+
| Policy | Hits/Misses Ratio | Cache Length |
+--------+-------------------+--------------+
| LFU    | 4.83              | 64           |
| LFUDA  | 5.87              | 57           |
| GDSF   | 6.77              | 154          |
+--------+-------------------+--------------+

Although the cache hit ratios indicate better performance, the response times table doesn’t show anything as obvious. I suspect this is because the cost of loading objects from disk on the example file server isn’t actually very high. So although the cache buys some performance for this pretend usage pattern, the improvement would be much more obvious if the cost of fetching objects (say, from over the internet) were higher.

Conclusion

So Dynamic Aging can be a powerful extension of the standard LFU policy. By aging the cache, formerly popular items are proactively aged out and evicted, reducing cache pollution. It should also be noted there are other LFU-with-Aging policies; however, they carry higher management overhead in that they require tuning parameters to function well. The Dynamic Aging variants described here are parameter-less, which makes their adoption that much more compelling.
