2022-05-26

Counting Distinct Elements in a Stream

Distinct elements problem

Most problems in computer science become challenging when they are brought to scale. The distinct elements problem is not a difficult one to understand. Given a group of elements that may be identical or unique, determine the number of distinct elements. For example, below we have a list of integers:

[2,3,4,2,2,3,5][2, 3, 4, 2, 2, 3, 5]

If all of these elements are added into a set, then the number of distinct elements is 4.

[2,3,4,2,2,3,5]{2,3,4,5}[2, 3, 4, 2, 2, 3, 5] \to \{2, 3, 4, 5\} {2,3,4,5}=4|\{2, 3, 4, 5\}| = 4

It’s clear that we are just talking about the cardinality of the set.

This is the go-to solution for simple variants of this problem. This is not interesting enough though, so let’s put a small twist on the original problem. In mathematical theory, a set can contain as many elements as it wants, but in practice, a set is stored in memory. What if we have so many elements that we cannot store our set in memory (e.g. we are Google and we want to determine how many unique searches there are per day)?

Suddenly, our solution doesn’t work in practice.

Streams and sketches

Many problems that can be solved with our traditional data structures fail to scale to massive datasets. In data science, we often have a stream of data coming in and we do not nearly have enough space to store all the information in a single database. We still want to holistically analyze this data, but realistically we only see each data point once when it is coming in through a stream. This is because it becomes less feasible to analyze and query all data points from a distributed database to a single location after each data point is stored.

These types of problems are commonly solved with sketch algorithms. Sketch algorithms are a family of algorithms that can compress a stream of data such that the compressed representation can be utilized in answering queries. By allowing for a small amount of error, we can significantly reduce the amount of storage necessary to compute the result of our queries. The catch is that because the dataset is compressed, it may not guarantee the exact solution to the query. However, it is possible to bound the error of the solution in such a way that the solution is still meaningful to us.

MinHash

The core of the our algorithm for the Count-Distinct problem in streams is an elegant application of randomized hashing known as MinHash (Flajolet-Martin 1985). Suppose we have a pairwise-independent randomized hashing function h:U[0,1]h : U \to [0, 1] that has a real valued output. Then, MinHash is defined as:

# X is a dataset stream and x_i is
# an individual element of X
def min_hash(X):
  s = 1
  for x_i in X:
    s = min(s, h(x_i))
  return 1/s - 1

With only four lines of pseudocode, the MinHash algorithm is shockingly short: just hash each element and then return the inverse of the minimum hash subtracted by 1. There isn’t much more—the return value of the min_hash is the estimated total distinct elements in the dataset XX.

Remember, the randomized hash function returns a float in the range of [0,1][0, 1]. The implication here is that this inverse of the minimum hash value of the dataset is at least 1 and at most infinity. As a result, the return value of min_hash is in the range [0,][0, \infty].

The math of MinHash

As with many elegant algorithms, the theory behind the algorithm is usually more complex. Fortunately, the theory for MinHash only requires an introductory university level statistics background (expected value, variance, probability densities, concentration bounds, etc) for the most part.

First, let us get some intuition for MinHash. We will recycle our example from the beginning.

[2,3,4,2,2,3,5][2, 3, 4, 2, 2, 3, 5]

Our hash function should hash duplicate numbers to the same hash value (i.e. all 2s must map to the same float), so there only exists four unique hash values from this list of numbers. An initial observation is that the number of unique hash values is equivalent to the number of distinct elements in the list. Yet, if we try to store the number of unique hash values, we run into the same issue as before: storing unique hash values will utilize an in-memory data structure.

Instead, we keep only the minimum hash value, which must somehow tell us information about how many distinct elements exist in the stream. In our pseudocode, we denote ss as the minimum hash value of the elements in XX. Can we retrieve the number of distinct elements just from the minimum hash ss?


Claim 1. For a dataset with all distinct elements, the CDF of ss is 1(1x)n1 - (1-x)^n, where nn is the number of elements in the dataset.

Proof.  By definition, the CDF of ss is F(x)=Pr(s<x)F(x) = Pr(s < x). Let XX be our dataset, XiX_i be the ii-th element of XX, and hh be our randomized hash function.

F(x)=Pr(s<x)=1Pr(s>x)=1Pr(min(h(X1),,h(Xn))>x)=1i=1nPr(Xi>x)=1(1x)n\begin{align*} F(x) &= Pr(s < x)\\ &= 1 - Pr(s > x)\\ &= 1 - Pr(\min(h(X_1), \ldots, h(X_n)) > x)\\ &= 1 - \prod_{i=1}^n Pr(X_i > x)\\ &= 1 - (1 - x)^n\\ \end{align*}

We start off by determining the CDF of ss. This is useful because the CDF gets us access to calculating the expected value of ss. Notice that our proof has only shown this to be true assuming all elements in XX are unique. Because identical elements are guaranteed to have the same hash from our randomized hash function, we can disregard any extra copies of of the same element. Let dd be the true number of distinct elements in XX. Then, the CDF becomes 1(1x)d1 - (1 - x)^d. From here, we can solve for the expected value of ss.


Claim 2. The expected value of ss is a function of dd. In particular,

E[s]=11+d\mathbb{E}[s] = \frac{1}{1+d}

Proof.  We can first take the derivative of the CDF to calculate the PDF, which we can then use to derive E[s]\mathbb{E}[s].

CDF(x)=F(x)=1(1x)dPDF(x)=ddxCDF(x)=d(1x)d1E[s]=01xPDF(x)dx=01xd(1x)d1dx=1d+1\begin{align*} CDF(x) &= F(x)\\ &= 1 - (1 - x)^d\\ PDF(x) &= \frac{d}{dx} CDF(x)\\ &= d(1 - x)^{d-1}\\ \mathbb{E}[s] &= \int_0^1 xPDF(x)dx\\ &= \int_0^1 xd(1-x)^{d-1}dx\\ &= \frac{1}{d+1}\\ \end{align*}

It then follows from E[s]=1d+1\mathbb{E}[s] = \frac{1}{d+1} that d=1E[s]1d = \frac{1}{\mathbb{E}[s]} - 1.

Intuitively, this result actually makes a lot of sense. If we have dd distinct elements in our dataset, then our expectation is that these dd elements approximately split the range [0,1][0, 1] into d+1d+1 equal parts. Imagine we have three distinct elements. If these three elements were to be perfectly distributed uniformly within the range, then we would have an element approximately at 0.250.25, 0.50.5, and 0.750.75, creating four subranges. The minimum hash ss would be 0.25, yielding a distinct element count approximation of 4. This is exactly 1d+1\frac{1}{d+1}.

Of course, ss isn’t guaranteed to be around 0.2 because everything is random, but we know that as long as ss is close to E[s]\mathbb{E}[s], then our estimated distinct elements d^\hat{d} is close to dd. If we can get ss to be close to E[s]\mathbb{E}[s], then this is good enough to approximate dd. That being said, it is important to recognize that E[d^]d\mathbb{E}[\hat{d}] \neq d.

Note: there is a slightly more involved proof to bound d^\hat{d} which I might append later on. The general intuition is that the bounds on d^\hat{d} correspond to the error ϵ\epsilon on ss, so we can treat any bounds on the error of ss as approximate bounds on d^\hat{d}.

Distinct elements with MinHash

If MinHash returns the number of distinct elements as long as ss is close to its expectation, then aren’t we done here?

Not quite.

Randomized hashing suffers from its element of randomness. It’s not impossible that an unlucky randomized hash function may cause all elements to hash close together such that ss is not close to E[s]\mathbb{E}[s]. Let’s use our previous example.

[2,3,4,2,2,3,5][0.95,0.96,0.97,0.95,0.95,0.96,0.98][2, 3, 4, 2, 2, 3, 5] \to [0.95, 0.96, 0.97, 0.95, 0.95, 0.96, 0.98]

The minimum hash is 0.950.95, so our estimated distinct elements is 1/(0.95)1.051/(0.95) \approx 1.05, about 1.

As with anything random, we can minimize the variance of ss by taking more trials of MinHash. Let kk be a pre-specified value denoting how many trials we want to take and hj:U[0,1]h_j : U \to [0, 1] be the jj-th randomized hash function.

def distinct_elements(X, k):
  s = []
 
  # Initialize k elements of s to 1
  for j in range(k):
    s.append(1)
 
  for x_i in X:
    # Perform MinHash k times per element in X
    for j in range(k):
      s[j] = min(s[j], h_j(x_i))
 
  avg_s = 1/k * sum(s)
  return 1/avg_s - 1

The more hash functions (trials) we use, the better we approximate the number of distinct elements in XX. Let’s take a look at the variance of a single hash function.


Claim 4. Var[s]1(d+1)2Var[s] \leq \frac{1}{(d+1)^2}.

Proof.  We use the definition of variance while trading a precise expression for a more concise expression representing an upper bound.

Var[s]=E[s2]E[s]2=2(d+1)(d+2)1(d+1)22(d+1)21(d+1)21(d+1)2E[s]2\begin{align*} Var[s] &= \mathbb{E}[s^2] - \mathbb{E}[s]^2\\ &= \frac{2}{(d+1)(d+2)} - \frac{1}{(d+1)^2}\\ &\leq \frac{2}{(d+1)^2} - \frac{1}{(d+1)^2}\\ &\leq \frac{1}{(d+1)^2}\\ &\leq \mathbb{E}[s]^2\\ \end{align*}

With kk hash functions, Var[s]1kE[s]2Var[s] \leq \frac{1}{k}\mathbb{E}[s]^2, so our variance decreases with each hash function. We want it to be extremely likely that whatever ss we get in the end is within ϵ\epsilon percent of E[s]\mathbb{E}[s], where ϵ\epsilon is our tolerated percentage of error from its expectation.

sE[s]ϵE[s]|s - \mathbb{E}[s]| \leq \epsilon\mathbb{E}[s]

Fortunately for us, a Chebyshev’s Inequality can exactly bound the likelihood ss is within ϵ\epsilon percent error.

Pr(sE[s]ϵE[s])Var[s](ϵE[s])2E[s]2kϵ2E[s]21kϵ2Pr(sE[s]ϵE[s])=11kϵ2\begin{align*} Pr(|s - \mathbb{E}[s]| \geq \epsilon \mathbb{E}[s]) &\leq \frac{Var[s]}{(\epsilon \mathbb{E}[s])^2}\\ &\leq \frac{\mathbb{E}[s]^2}{k\epsilon^2 \mathbb{E}[s]^2}\\ &\leq \frac{1}{k\epsilon^2}\\ Pr(|s - \mathbb{E}[s]| \leq \epsilon\mathbb{E}[s]) &= 1 - \frac{1}{k \epsilon^2}\\ \end{align*}

The probability that ss exceeds E[s]\mathbb{E}[s] by ϵ\epsilon percent is at most 1kϵ2\frac{1}{k\epsilon^2}. Since ϵ\epsilon and kk are factors we control, we can set them such that they satisfy a predetermined acceptable level of error in ss (and by extension d^\hat{d}).

For example, if we want our sketch to be successful 95% of the time in returning an ss that is within ϵ\epsilon percent, our error rate δ=0.05\delta = 0.05. Then, if we want to ensure that ss is within 1% of its expectation, ϵ=0.01\epsilon = 0.01. The question now is how many hash functions do we need?

δ=1kϵ20.05=1k(0.01)2k=200000\begin{align*} \delta &= \frac{1}{k\epsilon^2}\\ 0.05 &= \frac{1}{k(0.01)^2}\\ k &= 200000\\ \end{align*}

From our derivations, if we utilize 200000 hash functions, 95% of the time that we run our algorithm, we can achieve a result with ss within 1% of E[s]\mathbb{E}[s].

More generally, k=1δϵ2k = \frac{1}{\delta\epsilon^2}.

And with that, we have written an algorithm that can estimate the amount of distinct elements up to an acceptable error with a certain likelihood! Admittedly, our results seem less than ideal. About one out of 20 times, our algorithm fails to provide a distinct element count within our desired error percent. If this sketch is run many times, 5% is still a somewhat nonnegligible error rate. In the next section, I will talk about a neat optimization that decreases our space usage which allows us to use lower error rates. This will increase our success rate for the same amount of space.

Median trick optimization

One part of understanding this algorithm that I handwaved earlier was how the ϵ\epsilon on ss corresponds to the error on d^\hat{d}. Since we don’t actually care about the error of ss, only the error on d^\hat{d} in practice, we assumed that bounding the error on ss would also have corresponding bounds on the error of d^\hat{d}. This much is true. However, the formal proof of the bounds on d^\hat{d} actually limits δ\delta to a certain range, which prevents us from setting an arbitrary δ\delta as our error rate.

The “median trick” is a way to allow us to use any δ\delta that we want, at the cost of an O(log1/δ)O(\log{1/\delta}) constant factor in space usage.

def distinct_elements(X, k, t):
  s = []
  d = []
 
  # For each trial, store k hashes
  for t_i in range(t):
    s.append([])
    for j in range(k):
      s[t_i].append(1)
 
  for x_i in X:
    # Perform t trials
    for t_i in range(t):
      # Perform MinHash k times per element in X
      for j in range(k):
        s[t_i][j] = min(s[t_i][j], h_j(x_i))
 
  for t_i in range(t):
    avg_s = 1/k * sum(s[t_i])
    d.append(1/avg_s - 1)
 
  return median(d)

The median is often used as an alternative to the mean because it is more robust when it comes to outliers. As the median (our new d^\hat{d}) is likely to be within [d(1ϵ),d(1+ϵ)][d(1-\epsilon), d(1+\epsilon)], we can actually relax our error rate in MinHash to a much larger value (for example, we can use 0.2 instead of 0.05) which stays constant and does not become a dependency in our space analysis anymore.

Note: the value of δ\delta here is a bit confusing. In our original analysis, δ\delta was the rate of failure for the kk MinHash algorithm. When we apply the median trick, δ\delta becomes the rate of failure of the entire median trick algorithm. When we say we are relaxing the error rate in MinHash, we are saying we allow the kk MinHash algorithm to fail more often. However, the error rate δ\delta is the failure rate of the algorithm as a whole (probability of not returning a d^\hat{d} within some ϵ\epsilon percent of dd).

I will not go into the proof here because the intuition and statistics required is a more complex than introductory statistics (Chernoff bound). But, the point to take away is that by attempting approximately O(log(1/δ))O(\log(1/\delta)) trials of our original algorithm, we can achieve an arbitrary error rate δ\delta that we can control.

Since the space requirement is loosened, we can achieve even greater accuracy while demanding the same amount of space as before.

Space analysis

The total amount of space is directly dependent on the amount of hash functions. For each hash function, we must store a float. In addition, for each trial, we have to use kk hash functions. Originally, the number of hash functions we used was

k=1δϵ2k = \frac{1}{\delta\epsilon^2}

where δ\delta was our error rate of our sketch and ϵ\epsilon was our margin of error for ss.

After applying the median trick, the number of hash functions becomes

k=O(log1/δϵ2)k = O\left(\frac{\log{1/\delta}}{\epsilon^2}\right)

The main difference between the space usage is the 1δ\frac{1}{\delta} versus clog1/δc\log{1/\delta}, where cc is just a constant induced by the big-O notation.

Finally, take a look at whether or not our space analysis includes the value of dd or nn anywhere! The space that our algorithm requires does not depend on total number of elements at all, making it extremely scalable.

In practice and further reading

There is an issue with the hash function that we described. The function h:U[0,1]h : U \to [0, 1] is not possible in practice because we cannot create a hash function with a continuous range. To solve this, we can just discretize the range between [0,1][0, 1] for hash values.

Alternatively, the HyperLogLog algorithm is a more practical extension of the Flajolet-Martin (MinHash and LogLog). Instead of hashing each element to a continuous range and taking the minimum hash, each element is hashed to a binary number and we take the maximum over the number of leading zeros. The amount of space this algorithm uses is tiny—each hash only uses O(loglogn)O(\log \log n) bits, which is why the algorithm is named this way. The HyperLogLog algorithm is often known for its usage in Google’s large scale systems, which I will provide in the resources below.

Conclusion

And that’s that! The MinHash algorithm honestly blew my mind when I first learned about it. It was so surprising to me that we could sacrifice the exactness of our solution for a drastic decrease in space. Since then, I have been amazed by all the approximation algorithms out there being used for big data. I truly hope that you found this algorithm as interesting as I did! Here are some other fascinating sketch algorithms:

  • Bloom filters
  • Count-min sketch
  • Locality-sensitive hashing

All credit for what I have written goes to Prof. Cameron Musco’s 2021 offering of CS514 at UMass (probably my favorite class I have taken so far), which I will link in the resources below.

Resources