Counting Distinct Elements in a Stream
- Distinct elements problem
- Streams and sketches
- MinHash
- The math of MinHash
- Distinct elements with MinHash
- Median trick optimization
- Space analysis
- In practice and further reading
- Conclusion
- Resources
Distinct elements problem
Most problems in computer science become challenging when they are brought to scale. The distinct elements problem is not a difficult one to understand: given a group of elements that may be identical or unique, determine the number of distinct elements. For example, consider the following list of integers: 2, 5, 2, 7, 2, 9, 5.
If all of these elements are added into a set, then the number of distinct elements is 4.
It's clear that we are just talking about the cardinality of the set.
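In code, that naive exact approach might look like the following minimal sketch (the helper name and the little driver at the bottom are mine, added just to make it runnable):

def count_distinct_exact(stream):
    # Store every element we see; the set silently drops duplicates.
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)

print(count_distinct_exact([2, 5, 2, 7, 2, 9, 5]))  # 4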
A set is the go-to solution for simple variants of this problem, but it is not interesting enough, so let's put a small twist on the original problem. In mathematical theory, a set can contain arbitrarily many elements, but in practice, a set is stored in memory. What if we have so many elements that we cannot store our set in memory (e.g. we are Google and we want to determine how many unique searches there are per day)?
Suddenly, our solution doesn't work in practice.
Streams and sketches
Many problems that can be solved with our traditional data structures fail to scale to massive datasets. In data science, we often have a stream of data coming in, and we do not have nearly enough space to store all of the information in a single database. We still want to analyze this data holistically, but realistically we only see each data point once as it arrives in the stream, because it is rarely feasible to ship every stored data point from a distributed database back to a single location for analysis.
These types of problems are commonly solved with sketch algorithms. Sketch algorithms are a family of algorithms that can compress a stream of data such that the compressed representation can be utilized in answering queries. By allowing for a small amount of error, we can significantly reduce the amount of storage necessary to compute the result of our queries. The catch is that because the dataset is compressed, it may not guarantee the exact solution to the query. However, it is possible to bound the error of the solution in such a way that the solution is still meaningful to us.
MinHash
The core of our algorithm for the Count-Distinct problem in streams is an elegant application of randomized hashing known as MinHash (Flajolet-Martin, 1985). Suppose we have a randomized hash function $h$ whose output is a uniformly random real value in $[0, 1]$, where distinct elements receive independent hash values (an idealization we will rely on in the analysis). Then, MinHash is defined as:
# X is a dataset stream and x_i is
# an individual element of X
def min_hash(X):
    s = 1
    for x_i in X:
        s = min(s, h(x_i))
    return 1/s - 1
With only four lines of pseudocode, the MinHash algorithm is shockingly short: just hash each element, track the minimum hash, and return its inverse minus 1. There isn't much more to it: the return value of min_hash is the estimated number of distinct elements in the dataset $X$.
Remember, the randomized hash function returns a float in the range $[0, 1]$. The implication here is that the inverse of the minimum hash value of the dataset is at least 1 and at most infinity. As a result, the return value of min_hash is in the range $[0, \infty)$.
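As a concrete illustration (this is my own runnable rendering, not code from the original), here is min_hash with a stand-in hash function h built from Python's blake2b: a 64-bit digest divided by $2^{64}$ plays the role of a uniform value in $[0, 1)$, which foreshadows the discretization discussed in the practical notes at the end.

import hashlib

def h(x):
    # Stand-in for the idealized hash: map x to a pseudo-uniform
    # float in [0, 1) via a 64-bit digest divided by 2**64.
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def min_hash(X):
    s = 1.0
    for x_i in X:
        s = min(s, h(x_i))
    return 1 / s - 1

# Duplicates hash identically, so they cannot change the minimum.
print(min_hash([2, 5, 2, 7, 2, 9, 5]))  # a (noisy) estimate of 4

With a single hash function the estimate is quite noisy, which is exactly what the analysis below addresses.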
The math of MinHash
As with many elegant algorithms, the theory behind the algorithm is more complex than the algorithm itself. Fortunately, the theory for MinHash mostly requires only an introductory university-level statistics background (expected value, variance, probability densities, concentration bounds, etc.).
First, let us get some intuition for MinHash. We will recycle our example from the beginning.
Our hash function should hash duplicate numbers to the same hash value (i.e. all 2s must map to the same float), so there exist only four unique hash values for this list of numbers. An initial observation is that the number of unique hash values is equal to the number of distinct elements in the list. Yet, if we try to store the unique hash values themselves, we run into the same issue as before: storing them requires an in-memory data structure that grows with the number of distinct elements.
Instead, we keep only the minimum hash value, which must somehow tell us information about how many distinct elements exist in the stream. In our pseudocode, we denote $s$ as the minimum hash value of the elements in $X$. Can we retrieve the number of distinct elements just from the minimum hash $s$?
Claim 1. For a dataset with all distinct elements, the CDF of $s$ is $\Pr[s \le \lambda] = 1 - (1 - \lambda)^n$, where $n$ is the number of elements in the dataset.
Proof. By definition, the CDF of $s$ is $F_s(\lambda) = \Pr[s \le \lambda]$. Let $X$ be our dataset, $x_i$ be the $i$-th element of $X$, and $h$ be our randomized hash function. The minimum hash exceeds $\lambda$ only when every hash does, so
$$\Pr[s \le \lambda] = 1 - \Pr[h(x_1) > \lambda, \ldots, h(x_n) > \lambda] = 1 - \prod_{i=1}^{n} \Pr[h(x_i) > \lambda] = 1 - (1 - \lambda)^n.$$
We start off by determining the CDF of $s$. This is useful because the CDF gets us access to calculating the expected value of $s$. Notice that our proof has only shown this to be true assuming all elements in $X$ are unique. Because identical elements are guaranteed to have the same hash from our randomized hash function, we can disregard any extra copies of the same element. Let $d$ be the true number of distinct elements in $X$. Then, the CDF becomes $\Pr[s \le \lambda] = 1 - (1 - \lambda)^d$. From here, we can solve for the expected value of $s$.
Claim 2. The expected value of $s$ is a function of $d$. In particular, $\mathbb{E}[s] = \frac{1}{d+1}$.
Proof. We can first take the derivative of the CDF to calculate the PDF, which we can then use to derive $\mathbb{E}[s]$:
$$f_s(\lambda) = \frac{\mathrm{d}}{\mathrm{d}\lambda}\left[1 - (1 - \lambda)^d\right] = d(1 - \lambda)^{d-1}, \qquad \mathbb{E}[s] = \int_0^1 \lambda \cdot d(1 - \lambda)^{d-1} \, \mathrm{d}\lambda = \frac{1}{d+1}.$$
It then follows from $\mathbb{E}[s] = \frac{1}{d+1}$ that $d = \frac{1}{\mathbb{E}[s]} - 1$, which is exactly the quantity min_hash estimates by plugging in $s$ for $\mathbb{E}[s]$.
Intuitively, this result actually makes a lot of sense. If we have $d$ distinct elements in our dataset, then our expectation is that these elements approximately split the range $[0, 1]$ into $d + 1$ equal parts. Imagine we have three distinct elements. If these three elements were to be perfectly distributed uniformly within the range, then we would have an element approximately at $0.25$, $0.5$, and $0.75$, creating four subranges. The minimum hash would be $0.25$, so $\frac{1}{s} = 4$ counts the subranges and $\frac{1}{s} - 1 = 3$ recovers the number of distinct elements. This is exactly $d$.
Of course, $s$ isn't guaranteed to land exactly on $0.25$ because everything is random, but we know that as long as $s$ is close to $\mathbb{E}[s]$, then our estimated distinct element count is close to $d$. If we can get $s$ to be close to $\frac{1}{d+1}$, then this is good enough to approximate $d$. That being said, it is important to recognize that $\mathbb{E}\!\left[\frac{1}{s} - 1\right] \neq d$ in general, since $\mathbb{E}\!\left[\frac{1}{s}\right] \neq \frac{1}{\mathbb{E}[s]}$.
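If you want to sanity-check Claim 2 numerically, here is a small simulation of my own (the choice of $d$ and the trial count are arbitrary): it repeatedly draws $d$ independent uniform hash values and compares the average minimum to $\frac{1}{d+1}$.

import random

def average_min_hash(d, trials=100_000, seed=0):
    # Empirically estimate E[s] by repeatedly taking the minimum
    # of d independent uniform(0, 1) hash values.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(d))
    return total / trials

d = 4
print(average_min_hash(d))  # close to 0.2
print(1 / (d + 1))          # exactly 0.2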
Note: there is a slightly more involved proof to bound the error of $\frac{1}{s} - 1$ itself, which I might append later on. The general intuition is that bounds on the error of $s$ correspond to bounds on the error of $\frac{1}{s} - 1$, so we can treat any bounds on the error of $s$ as approximate bounds on the error of our estimate.
Distinct elements with MinHash
If MinHash returns the number of distinct elements as long as $s$ is close to its expectation, then aren't we done here?
Not quite.
Randomized hashing suffers from its element of randomness. It's not impossible for an unlucky randomized hash function to cause all elements to hash close together, so that $s$ is not close to $\mathbb{E}[s]$. Let's use our previous example.
Suppose every element of our list happens to hash near $0.5$. The minimum hash is then about $0.5$, so our estimated distinct element count is $\frac{1}{0.5} - 1 = 1$: about 1, far from the true count of 4.
As with anything random, we can reduce the variance of our estimate by taking more trials of MinHash. Let $k$ be a pre-specified value denoting how many trials we want to take and $h_j$ be the $j$-th randomized hash function.
def distinct_elements(X, k):
    s = []
    # Initialize k elements of s to 1
    for j in range(k):
        s.append(1)
    for x_i in X:
        # Perform MinHash k times per element in X
        for j in range(k):
            s[j] = min(s[j], h_j(x_i))
    avg_s = 1/k * sum(s)
    return 1/avg_s - 1
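To actually run this pseudocode we need $k$ different hash functions. One hedged way to get them (my own choice, not prescribed above) is to salt a single hash with the index j; everything else below mirrors the pseudocode.

import hashlib

def make_hash(j):
    # The j-th hash function: salt the input with j so that
    # different j values behave like independent hash functions.
    def h_j(x):
        digest = hashlib.blake2b(f"{j}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64
    return h_j

def distinct_elements(X, k):
    hashes = [make_hash(j) for j in range(k)]
    s = [1.0] * k
    for x_i in X:
        for j in range(k):
            s[j] = min(s[j], hashes[j](x_i))
    avg_s = sum(s) / k
    return 1 / avg_s - 1

print(distinct_elements([2, 5, 2, 7, 2, 9, 5], k=100))  # roughly 4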
The more hash functions (trials) we use, the better we approximate the number of distinct elements in $X$. Let's take a look at the variance of a single hash function.
Claim 4. $\mathrm{Var}[s] \le \frac{1}{(d+1)^2} = \mathbb{E}[s]^2$.
Proof. We use the definition of variance while trading a precise expression for a more concise upper bound:
$$\mathrm{Var}[s] = \mathbb{E}[s^2] - \mathbb{E}[s]^2 = \frac{2}{(d+1)(d+2)} - \frac{1}{(d+1)^2} \le \frac{2}{(d+1)^2} - \frac{1}{(d+1)^2} = \frac{1}{(d+1)^2},$$
where $\mathbb{E}[s^2] = \int_0^1 \lambda^2 \cdot d(1 - \lambda)^{d-1} \, \mathrm{d}\lambda = \frac{2}{(d+1)(d+2)}$ follows from the same PDF as before.
With $k$ hash functions, the average $\bar{s} = \frac{1}{k}\sum_{j} s_j$ (avg_s in the pseudocode) satisfies $\mathbb{E}[\bar{s}] = \mathbb{E}[s]$ and $\mathrm{Var}[\bar{s}] = \frac{\mathrm{Var}[s]}{k} \le \frac{1}{k(d+1)^2}$, so our variance decreases with each additional hash function. We want it to be extremely likely that whatever $\bar{s}$ we get in the end is within a relative error of $\epsilon$ of $\mathbb{E}[s]$, where $\epsilon$ is our tolerated fraction of error from the expectation.
Fortunately for us, Chebyshev's inequality bounds exactly how likely $\bar{s}$ is to stay within that relative error:
$$\Pr\!\left[\,|\bar{s} - \mathbb{E}[s]| \ge \epsilon\,\mathbb{E}[s]\,\right] \le \frac{\mathrm{Var}[\bar{s}]}{\epsilon^2\,\mathbb{E}[s]^2} \le \frac{1}{k\epsilon^2}.$$
The probability that $\bar{s}$ deviates from $\mathbb{E}[s]$ by more than an $\epsilon$ fraction is at most $\frac{1}{k\epsilon^2}$. Since $k$ and $\epsilon$ are factors we control, we can set them such that they satisfy a predetermined acceptable level of error in $\bar{s}$ (and by extension $\frac{1}{\bar{s}} - 1$).
For example, if we want our sketch to be successful 95% of the time in returning an $\bar{s}$ within the desired relative error, our failure rate is $\delta = 0.05$. Then, if we want to ensure that $\bar{s}$ is within 1% of its expectation, $\epsilon = 0.01$. The question now is how many hash functions do we need?
From our derivations, we need $\frac{1}{k\epsilon^2} \le \delta$, i.e. $k \ge \frac{1}{\delta\epsilon^2} = \frac{1}{0.05 \cdot 0.01^2} = 200{,}000$. So if we utilize 200,000 hash functions, 95% of the time that we run our algorithm, we achieve a result with $\bar{s}$ within 1% of $\mathbb{E}[s]$.
More generally, $k = O\!\left(\frac{1}{\delta\epsilon^2}\right)$.
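If it helps, that bound translates directly into a tiny helper (my own convenience function, using the same $\delta$ and $\epsilon$ as above):

import math

def required_hash_functions(delta, eps):
    # Chebyshev gives failure probability at most 1/(k * eps^2),
    # so k >= 1/(delta * eps^2) makes it at most delta.
    return math.ceil(1 / (delta * eps ** 2))

print(required_hash_functions(0.05, 0.01))  # 200000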
And with that, we have written an algorithm that can estimate the number of distinct elements up to an acceptable error with a certain likelihood! Admittedly, our results seem less than ideal. About one out of every 20 runs, our algorithm fails to provide a distinct element count within our desired error. If this sketch is run many times, 5% is still a somewhat non-negligible failure rate. In the next section, I will talk about a neat optimization that decreases our space usage, which allows us to use lower error rates. This will increase our success rate for the same amount of space.
Median trick optimization
One part of understanding this algorithm that I handwaved earlier was how the error on $\bar{s}$ corresponds to the error on $\frac{1}{\bar{s}} - 1$. Since in practice we don't actually care about the error of $\bar{s}$, only the error of $\frac{1}{\bar{s}} - 1$, we assumed that bounding the error on $\bar{s}$ would also give corresponding bounds on the error of $\frac{1}{\bar{s}} - 1$. This much is true. However, the formal proof of those bounds limits $\epsilon$ to a certain range, and the $\frac{1}{\delta}$ factor in $k$ makes very small failure rates expensive, which prevents us from setting an arbitrarily small $\delta$ as our error rate.
The "median trick" is a way to allow us to use any that we want, at the cost of an constant factor in space usage.
def distinct_elements(X, k, t):
    s = []
    estimates = []
    # For each of the t trials, store k minimum hash values
    for t_i in range(t):
        s.append([])
        for j in range(k):
            s[t_i].append(1)
    for x_i in X:
        # Perform t independent trials
        for t_i in range(t):
            # Perform MinHash k times per element in X;
            # h(t_i, j, .) denotes the j-th hash function of trial t_i,
            # so each trial uses its own independent hash functions
            for j in range(k):
                s[t_i][j] = min(s[t_i][j], h(t_i, j, x_i))
    # Turn each trial into an estimate, then report the median
    for t_i in range(t):
        avg_s = 1/k * sum(s[t_i])
        estimates.append(1/avg_s - 1)
    return median(estimates)
The median is often used as an alternative to the mean because it is more robust to outliers. As the median (our new estimate of $d$) is likely to be within our error tolerance as long as most individual trials are, we can actually relax the per-trial error rate in MinHash to a much larger value (for example, we can use 0.2 instead of 0.05), which stays constant and does not become a $\frac{1}{\delta}$ dependency in our space analysis anymore.
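A toy illustration of that robustness (the per-trial estimates here are made up): a single badly failed trial drags the mean far from the truth but barely moves the median.

from statistics import mean, median

# Hypothetical per-trial estimates for a stream with 4 distinct
# elements, where one trial failed badly.
estimates = [4.1, 3.8, 4.0, 97.0, 3.9]

print(mean(estimates))    # ~22.6, ruined by the single outlier
print(median(estimates))  # 4.0, still close to the truth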
Note: the value of $\delta$ here is a bit confusing. In our original analysis, $\delta$ was the rate of failure for the MinHash algorithm. When we apply the median trick, $\delta$ becomes the rate of failure of the entire median-trick algorithm. When we say we are relaxing the error rate in MinHash, we are saying we allow each individual MinHash trial to fail more often. However, the error rate $\delta$ is the failure rate of the algorithm as a whole (the probability of not returning an estimate within some percent of $d$).
I will not go into the proof here because the statistics required (a Chernoff bound) goes a bit beyond introductory statistics. But the point to take away is that by running approximately $O\!\left(\log\frac{1}{\delta}\right)$ trials of our original algorithm, we can achieve an arbitrarily small error rate $\delta$ that we control.
Since the space requirement is loosened, we can achieve even greater reliability while demanding roughly the same amount of space as before.
Space analysis
The total amount of space is directly proportional to the number of hash functions: for each hash function, we must store one float (its current minimum hash). In addition, for each trial, we have to use $k$ hash functions. Originally, the number of hash functions we used was
$$k = O\!\left(\frac{1}{\delta\epsilon^2}\right),$$
where $\delta$ was the error rate of our sketch and $\epsilon$ was our margin of error for $\bar{s}$.
After applying the median trick, the total number of hash functions becomes
$$k \cdot t = O\!\left(\frac{\log\frac{1}{\delta}}{\epsilon^2}\right).$$
The main difference between the two space usages is the $\frac{1}{\delta}$ versus the $c \log\frac{1}{\delta}$, where $c$ is just a constant induced by the big-O notation.
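To get a feel for the gap, take a very small failure rate such as $\delta = 10^{-6}$ (the shared $\frac{1}{\epsilon^2}$ factor is identical on both sides): the Chebyshev-only bound pays a factor of $\frac{1}{\delta} = 10^{6}$, while the median trick pays only $\log_2 \frac{1}{\delta} \approx 20$ times the big-O constant.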
Finally, notice that our space analysis does not include the stream length $n$ or the distinct count $d$ anywhere! The space that our algorithm requires does not depend on the total number of elements at all, making it extremely scalable.
In practice and further reading
There is an issue with the hash function that we described: it is not possible in practice because we cannot create a hash function with a truly continuous range. To solve this, we can just discretize the range $[0, 1]$ of hash values (for example, hash to a 64-bit integer and divide by $2^{64}$).
Alternatively, the HyperLogLog algorithm is a more practical descendant of the Flajolet-Martin line of work (which also includes MinHash and LogLog). Instead of hashing each element to a continuous range and taking the minimum hash, each element is hashed to a binary string and we track the maximum number of leading zeros. The amount of space this algorithm uses is tiny: each counter only needs $O(\log\log n)$ bits, which is where the algorithm gets its name. The HyperLogLog algorithm is often known for its usage in Google's large-scale systems, which I will provide in the resources below.
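As a rough illustration of the leading-zeros idea (a heavily simplified, single-register sketch of my own; real HyperLogLog keeps many registers and applies bias corrections):

import hashlib

def h64(x):
    # Hash an element to a 64-bit integer.
    digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def leading_zeros(v, bits=64):
    # Number of leading zero bits in the 64-bit representation of v.
    return bits - v.bit_length()

def rough_loglog_estimate(X):
    # If the largest number of leading zeros seen is R, then on the
    # order of 2**R distinct hashes were needed to produce it, so we
    # return 2**R as a crude estimate. Storing R takes only
    # O(log log n) bits, hence the "LogLog" name.
    R = 0
    for x in X:
        R = max(R, leading_zeros(h64(x)))
    return 2 ** R

print(rough_loglog_estimate(range(100_000)))  # crude single-register estimate of 100000 (high variance)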
Conclusion
And that's that! The MinHash algorithm honestly blew my mind when I first learned about it. It was so surprising to me that we could sacrifice the exactness of our solution for a drastic decrease in space. Since then, I have been amazed by all the approximation algorithms out there being used for big data. I truly hope that you found this algorithm as interesting as I did! Here are some other fascinating sketch algorithms:
- Bloom filters
- Count-min sketch
- Locality-sensitive hashing
All credit for what I have written goes to Prof. Cameron Musco's 2021 offering of CS514 at UMass (probably my favorite class I have taken so far), which I will link in the resources below.