Misra and Gries defined the heavy-hitters problem
(though they did not introduce the term heavy-hitters[4]) and described the first algorithm
for it in the paper Finding repeated elements.[1] Their algorithm
extends the Boyer-Moore majority finding algorithm
in a significant way.
One version of the heavy-hitters problem is as follows: Given is a
bagb of n elements and an integer k ≥ 2. Find the values that
occur more than n ÷ k times in b. The Misra-Gries algorithm solves
the problem by making two passes over the values in b, while storing
at most k values from b and their number of occurrences during the
course of the algorithm.
A bag is like a set in which the same value may occur multiple
times. Assume that a bag is available as an array b[0:n – 1] of n elements.
In the abstract description of the algorithm, we treat b
and its segments also as bags. Henceforth, a heavy hitter of
bag b is a value that occurs more than n ÷ k times in it, for some integer k, k≥2.
A k-reduced bag for bag b is derived from b by
repeating the following operation until no longer possible: Delete k distinct elements from b. From its definition, a k-reduced bag contains fewer than k different values.
The following theorem is easy to prove:
Theorem 1. Each heavy-hitter of b is an element of a k-reduced bag for b.
The first pass of the heavy-hitters computation constructs a k-reduced
bag t. The second pass declares an element of t to be a heavy-hitter if
it occurs more than n ÷ k times in b. According to Theorem 1, this
procedure determines all and only the heavy-hitters. The second pass
is easy to program, so we describe only the first pass.
In order to construct t, scan the values in b in arbitrary order, for
specificity the following algorithm scans them in the order of
increasing indices. Invariant P of the
algorithm is that t is a k-reduced bag for the scanned values and d is
the number of distinct values in t. Initially, no value has been
scanned, t is the empty bag, and d is zero.
P: 0 ≤ i ≤ n t is a k-reduced bag for b[0:i – 1] d is the number of distinct values in t0 ≤ d < k
Whenever element b[i] is scanned, in order to preserve the
invariant: (1) if b[i] is not in t, add it to t and increase d by 1,
(2) if b[i] is in t, add it to t but don't modify d, and
(3) if d becomes equal to k, reduce t by deleting k distinct values from
it and update d appropriately.
algorithm Misra–Gries is
t, d := { }, 0
for i from 0 to n-1 doif b[i] t then
t, d:= t ∪ {b[i]}, d+1
else
t, d:= t ∪ {b[i]}, d
endifif d = k then
Delete k distinct values from t; update dendifendfor
A possible implementation of t is as a set of pairs of the form
(vi, ci) where each vi is a distinct value in t
and ci is the number of occurrences of vi in t.
Then d is the size of this set. The
step "Delete k distinct values from t" amounts to reducing each ci by
1 and then removing any pair (vi, ci) from the set if ci becomes 0.
Using an AVL tree implementation of t, the algorithm has a
running time of O(n log k). In order to assess the space requirement, assume that the elements of
b can have m possible values, so the storage of a value vi needs
O(log m) bits. Since each counter ci may have a value as high as
n, its storage needs O(log n) bits. Therefore, for O(k) value-counter pairs,
the space requirement is O(k (log n + log m)).
Summaries
In the field of streaming algorithms, the output of the Misra-Gries algorithm in the first pass may be called a summary, and such summaries are used to solve the frequent elements problem in the data stream model. A streaming algorithm makes a small, bounded number of
passes over a list of data items called a stream. It processes the elements using at most logarithmic amount of extra space in the size of the list to produce an answer.
^A number of articles cite Misra-Gries[1] as one of the first to provide a heavy hitters algorithm.
For example, see David P. Woodruff's article heavy hitter algorithms in data streams[2] and the article by Pandey et. al. on heavy hitters using external memory.[3]