2024-08-30

Shortcomings of the Redis 5 Expiration Algorithm

Background

A few months ago, I encountered a peculiar problem with the Redis 5 expiration key algorithm. Whenever a large amount of keys was being expired at the same time, the service my team owned would suffer from a significant latency spike.

After some analysis, it was determined that the cause was from legacy business logic that called an EXPIRE on a set of keys with cardinality on the order of hundred thousands to millions. The business logic would set the same expiration time for all the keys so that all of these keys would have a time-to-live (TTL) timestamp ending at the exact same time. To make matters worse, these latency spikes would occur every hour because the business logic ran this mass expiration on a schedule.

End users of Redis like myself often don't consider TTL to have a considerable impact on performance. In this case, mass expirations did have a perceivable impact on performance. How does Redis actually handle keys that have expired?

A quick look into Redis 5's key expiration algorithm

From the Redis documentation, Redis employs a lazy deletion algorithm. Expired keys are not necessarily deleted the moment their TTL is up.

Redis keys are expired in two ways: a passive way and an active way.

A key is passively expired when a client tries to access it and the key is timed out.

However, this is not enough as there are expired keys that will never be accessed again. These keys should be expired anyway, so periodically, Redis tests a few keys at random amongst the set of keys with an expiration. All the keys that are already expired are deleted from the keyspace.

—Redis.io

Why does Redis perform this lazy deletion? Deleting a key the moment it expires requires constant checking of whether or not that key is expired. It does not make sense to waste CPU cycles polling keys at a such high rate so that keys are deleted the moment they expire. Instead, Redis expires keys when they are accessed or based on a lazy batch deletion algorithm which randomly samples keys and runs at the millisecond level.

We can see the entire logic for the Redis 5 lazy expiration algorithm over in expire.c. The bulk of the logic is wrapped in activeExpireCycle.

void activeExpireCycle(int type)

According to the definitions in server.h, there are two types of expiration cycles which are conditionally run depending on the input to the activeExpireCycle function:

#define ACTIVE_EXPIRE_CYCLE_SLOW 0
#define ACTIVE_EXPIRE_CYCLE_FAST 1

Here is a brief explanation of the cycles and expiration algorithm straight from the source.

/* Try to expire a few timed out keys. The algorithm used is adaptive and
 * will use few CPU cycles if there are few expiring keys, otherwise
 * it will get more aggressive to avoid that too much memory is used by
 * keys that can be removed from the keyspace.
 *
 * No more than CRON_DBS_PER_CALL databases are tested at every
 * iteration.
 *
 * This kind of call is used when Redis detects that timelimit_exit is
 * true, so there is more work to do, and we do it more incrementally from
 * the beforeSleep() function of the event loop.
 *
 * Expire cycle type:
 *
 * If type is ACTIVE_EXPIRE_CYCLE_FAST the function will try to run a
 * "fast" expire cycle that takes no longer than EXPIRE_FAST_CYCLE_DURATION
 * microseconds, and is not repeated again before the same amount of time.
 *
 * If type is ACTIVE_EXPIRE_CYCLE_SLOW, that normal expire cycle is
 * executed, where the time limit is a percentage of the REDIS_HZ period
 * as specified by the ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC define. */

To supplement the comments in the codebase, here is some more context on how the function is called in the database:

  • The activeExpireCycle function is called with ACTIVE_EXPIRE_CYCLE_SLOW in the databasesCron function. From the name of databasesCron, we can tell that this is a periodically scheduled task which is run in the background.
  • The activeExpireCycle function is called with ACTIVE_EXPIRE_CYCLE_FAST in the beforeSleep function. This means that the fast cycle is run as a part of the Redis event loop.

Deep dive into the expiration algorithm

To understand the expiration logic, let's jump right into the main execution loop of the expiration algorithm.

/* Continue to expire if at the end of the cycle more than 25%
 * of the keys were expired. */
do {
    unsigned long num, slots;
    long long now, ttl_sum;
    int ttl_samples;
    iteration++;
 
    /* If there is nothing to expire try next DB ASAP. */
    if ((num = dictSize(db->expires)) == 0) {
        db->avg_ttl = 0;
        break;
    }
    slots = dictSlots(db->expires);
    now = mstime();
 
    /* When there are less than 1% filled slots getting random
     * keys is expensive, so stop here waiting for better times...
     * The dictionary will be resized asap. */
    if (num && slots > DICT_HT_INITIAL_SIZE &&
        (num*100/slots < 1)) break;
 
    /* The main collection cycle. Sample random keys among keys
     * with an expire set, checking for expired ones. */
    expired = 0;
    ttl_sum = 0;
    ttl_samples = 0;
 
    if (num > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP)
        num = ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP;
 
    while (num--) {
        dictEntry *de;
        long long ttl;
 
        if ((de = dictGetRandomKey(db->expires)) == NULL) break;
        ttl = dictGetSignedIntegerVal(de)-now;
        if (activeExpireCycleTryExpire(db,de,now)) expired++;
        if (ttl > 0) {
            /* We want the average TTL of keys yet not expired. */
            ttl_sum += ttl;
            ttl_samples++;
        }
        total_sampled++;
    }
    total_expired += expired;
 
    /* Update the average TTL stats for this database. */
    if (ttl_samples) {
        long long avg_ttl = ttl_sum/ttl_samples;
 
        /* Do a simple running average with a few samples.
         * We just use the current estimate with a weight of 2%
         * and the previous estimate with a weight of 98%. */
        if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
        db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
    }
 
    /* We can't block forever here even if there are many keys to
     * expire. So after a given amount of milliseconds return to the
     * caller waiting for the other active expire cycle. */
    if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
        elapsed = ustime()-start;
        if (elapsed > timelimit) {
            timelimit_exit = 1;
            server.stat_expired_time_cap_reached_count++;
            break;
        }
    }
    /* We don't repeat the cycle if there are less than 25% of keys
     * found expired in the current DB. */
} while (expired > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4);

It appears a bit daunting, so let's break it down piece by piece. At the start of the loop, we immediately encounter a conditional.

    /* If there is nothing to expire try next DB ASAP. */
    if ((num = dictSize(db->expires)) == 0) {
        db->avg_ttl = 0;
        break;
    }

This first conditional of the loop is an early exit condition. If the current Redis database does not have any keys to expire, then immediately move on and skip running the rest of the algorithm on this database. At the same time, this sets num to the number of entries set in the internal hashtable.

    /* When there are less than 1% filled slots getting random
     * keys is expensive, so stop here waiting for better times...
     * The dictionary will be resized asap. */
    if (num && slots > DICT_HT_INITIAL_SIZE &&
        // BLOG NOTE: this is to convert to integer and avoid working with doubles
        (num*100/slots < 1)) break; 

Next, we check whether or not the usage of the dict is greater than 1%. If so, then we don't want to run the expiration algorithm because sampling (more on this in a bit) has a low probability of finding an expired key. This is another early exit condition. slots is defined to be the total amount of available slots in the dict. num, as mentioned before, is the total amount of used slots in the dict.

    if (num > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP)
        num = ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP;

In each iteration of the do-while loop, we will sample ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP keys in the dict for possible deletion if they are expired. Why sample? In Redis 5, we don't have any way of knowing which keys have expired, so the best we can do is randomly check if a key is expired. If there are fewer than ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP entries in the dict, then we only want to sample num times. The above logic caps num at ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP.

The Redis default number of max samples check per iteration is 20, which is actually not much. However, it is important to remember that we are sampling 20 keys in each iteration of the do-while loop that is part of a single activeExpireCycle call, so in reality, we sample much more keys per expiration cycle.

#define ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP 20 /* Loopkups per loop. */

After the conditionals, we enter the main collection code in the expiration algorithm. Between 0 and 20 times, we attempt to find a random entry in the dict, then run activeExpireCycleTryExpire on it. We keep track of how many keys that we expire in the expired variable.

    while (num--) {
        dictEntry *de;
        long long ttl;
 
        if ((de = dictGetRandomKey(db->expires)) == NULL) break;
        ttl = dictGetSignedIntegerVal(de)-now;
        if (activeExpireCycleTryExpire(db,de,now)) expired++;
        if (ttl > 0) {
            /* We want the average TTL of keys yet not expired. */
            ttl_sum += ttl;
            ttl_samples++;
        }
        total_sampled++;
    }

The activeExpireCycleTryExpire is just a dict key delete function with some extra stuff. I won't go too much into it since it does exactly what it sounds like, but you can take a look at it here.

    /* Update the average TTL stats for this database. */
    if (ttl_samples) {
        long long avg_ttl = ttl_sum/ttl_samples;
 
        /* Do a simple running average with a few samples.
         * We just use the current estimate with a weight of 2%
         * and the previous estimate with a weight of 98%. */
        if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
        db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
    }

Once the key sampling and deletion is complete, we collect some running average stats for the database.

    /* We can't block forever here even if there are many keys to
     * expire. So after a given amount of milliseconds return to the
     * caller waiting for the other active expire cycle. */
    if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
        elapsed = ustime()-start;
        if (elapsed > timelimit) {
            timelimit_exit = 1;
            server.stat_expired_time_cap_reached_count++;
            break;
        }
    }

The last conditional in the loop is an important part of the algorithm. The expiration algorithm has a timelimit on how long it can spend sampling and deleting keys. Every few iterations, we check whether or not we have surpassed this timelimit. If so, we mark that we exited from the timelimit by setting timelimit_exit = 1 and then exit out of the expiration loop.

while (expired > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4);

The final piece of the expiration algorithm is the condition in the do-while loop. We continue the loop if the amount of successfully expired keys is greater than 25% of all keys tested for expiration. In the default case, ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP is 20, so we check that we expired more than 5 keys.

This completes the main code logic of the Redis 5 expiration algorithm. To summarize:

  1. Each time the activeExpireCycle function is called, we enter a do-while loop
  2. In this loop, we attempt to test 20 random keys in the dict and expire those which have TTL timestamp past the current time
  3. We exit the loop when we run out of time allocated to run the expiration algorithm or the amount of keys deleted is less than 25% of sampled keys

That's all. Simple, right?

But what about the slow and fast expire cycles? It turns out the main difference between the two is just the timelimit and where they are called. I will cover the differences next.

Slow expire cycle

The slow expire cycle has a timelimit based on ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC and REDIS_HZ.

/* We can use at max ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC percentage of CPU time
 * per iteration. Since this function gets called with a frequency of
 * server.hz times per second, the following is the max amount of
 * microseconds we can spend in this function. */
timelimit = 1000000*ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC/server.hz/100;

These values are all defined in server.h.

#define CONFIG_DEFAULT_DYNAMIC_HZ 1             /* Adapt hz to # of clients.*/
#define CONFIG_DEFAULT_HZ        10             /* Time interrupt calls/sec. */
#define CONFIG_MIN_HZ            1
#define CONFIG_MAX_HZ            500
 
// ...
 
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* CPU max % for keys collection */

By default, ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC is set to 25 and server.hz is set to 10. This gives us a timelimit of 25 milliseconds. The slow cycle can reach the timelimit if it uses too much time without expiring fewer than 25% of tested keys. As a refresher, recall that it is called based on a cron schedule. This scheduling is slower, so it is very possible that a high amount of expired keys can accumulate between calls to the slow cycle. To decrease the amount of key expirations that the slow cycle must handle, there exists a fast cycle that is shorter, but called more often.

Fast expire cycle

As mentioned earlier, the fast cycle is called at the end of each event loop cycle to quickly clean up a few keys every loop cycle. The fast expire cycle attempts to expire keys within a short time limit. Since it is called in the main event loop, it is run much more often than the slow cycle. From the source code, we can see that the fast expire timelimit is based on the EXPIRE_FAST_CYCLE_DURATION amount, which is 1 millisecond by default.

#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* Microseconds */

Because the fast cycle is called frequently, there is also an attempt to skip the fast cycle and perform a no-op if the previous fast cycle did not reach its timelimit of 1 millisecond (i.e. there are barely any expired keys).

if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
    /* Don't start a fast cycle if the previous cycle did not exit
     * for time limit. Also don't repeat a fast cycle for the same period
     * as the fast cycle total duration itself. */
    if (!timelimit_exit) return;
    if (start < last_fast_cycle + ACTIVE_EXPIRE_CYCLE_FAST_DURATION*2) return;
    last_fast_cycle = start;
}

There is not much more than this on how the fast and slow cycles are different. Now that we understand how Redis 5 performs key expiration and the different cycles, we can easily see what the issue is when the amount of expired keys is high.

The problem

How does this expiration algorithm react when the it is overwhelmed by a high amount of keys being expired at once? If both the fast and slow expire cycles cannot clear enough keys in the amount of time the Redis instance is configured with, then the expiration function will sample for the maximum amount of time each time a cycle is run. The database will spend more CPU cycles sampling for keys to expire instead of completing other requests. This leads to a spike in latency and possible accumulation of expired keys if they are not cleared out fast enough.

The Twitter developers encountered a similar scenario a couple years ago and even wrote their own blog post about it.

Solutions

Twitter's solution was to tune the ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP of the expiration algorithm in order to locate a sweetspot between latency and resource utilization. While this solution worked out for Twitter, the Redis cluster configuration at my company is not directly managed by my team. It is difficult to access low level configurations as the Redis team does not expose an API for fine level parameterization.

Another solution is to upgrade Redis to a newer major version. Redis 6+ uses a radix tree to delete keys. To conceptualize how this works, a key's TTL timestamp is inserted into a trie-esque data structure. To find all keys that have expired, we just need to navigate through the radix tree and delete all keys earlier than the Redis instance's local clock time. Using a radix tree, we avoid having to sample at random because we can directly find the expired keys based off their timestamps. While upgrading Redis versions would have likely solved our problem, this solution is easier said than done. In practice, we would have to perform a live migration of our gigantic Redis clusters.

I wish I could say that I ended up compiling a custom version of Redis with an optimized configuration like Twitter or porting over a radix tree from Redis 6. What actually ended up happening is that I just changed some of our business logic to avoid frequently expiring high amounts of keys at the same time. A boring result, but an expected (and welcomed) one when you are working in industry. Changing low level infrastructure can be a nightmare at a big company :)

Takeaways

This was my first time diving deep into an open source codebase to understand more about certain behaviors and design decisions of the software. Here are some of the things I learned:

  • In use cases where applications have high throughput, we must consider the impact of all operations that are run on across the pipeline. A key expiration in Redis is not a "free" operation that can be handwaved away when we are pushing the limits of the database.
  • The Redis source code contains a lot of comments that make it a pretty easy read. Give it a look sometime! You'll be surprised by how intuitive a lot of it is.
  • The existence of lolwut.c which prints the Redis version and some ASCII art when you run the LOLWUT command.