Efficiently estimating the number of unique elements in a large list

This problem is somewhat similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.

I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in it. A typical dataset may contain anywhere from a few to millions of unique elements.

Of course, the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (i.e. all unique elements encountered so far).

Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).

I'm wondering whether there is a statistical approach that would let me make a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining only a relatively small amount of state during the scan.

The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (i.e. you can put them in a HashSet if you want to). Typically they will be strings or numbers.

Best answer

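You could use a Bloom filter to get a reasonable lower bound. Make one pass over the data, inserting each item and counting those that were definitely not already in the set. Because a Bloom filter can report false positives but never false negatives, the count can only undershoot the true number of unique elements.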
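A minimal sketch of this idea in Python, not a tuned implementation. The class name `BloomCounter`, the default sizes, and the use of SHA-256 over `repr(item)` with double hashing are all assumptions made for illustration; none of them come from the answer itself.

```python
import hashlib


class BloomCounter:
    """Counts items that the Bloom filter has definitely not seen before."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        # num_bits and num_hashes are illustrative defaults (assumptions).
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)
        self.count = 0

    def _positions(self, item):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                # At least one unset bit means the item was never inserted.
                is_new = True
                self.bits[byte] |= mask
        if is_new:
            self.count += 1


def bloom_lower_bound(items, num_bits=1 << 20, num_hashes=4):
    """Single pass over items; returns a lower bound on the distinct count."""
    counter = BloomCounter(num_bits, num_hashes)
    for item in items:
        counter.add(item)
    return counter.count
```

The state is fixed at `num_bits` bits regardless of the input size. As the filter fills up, more new items are mistaken for seen ones, so the bound gets looser.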

Answers

If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end (4 in this example, since about a quarter of the hash values pass the filter) to approximate the total number of unique elements.
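The sampling scheme above could be sketched as follows. The function names and the choice of SHA-256 truncated to 32 bits are assumptions for illustration; any well-mixed hash should work.

```python
import hashlib


def hash32(item):
    """Hypothetical helper: a 32-bit hash taken from a SHA-256 digest."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def sampled_distinct_estimate(items, prefix_bits=2):
    """Keep only items whose hash starts with prefix_bits zero bits,
    then scale the sample size back up by 2**prefix_bits."""
    kept = set()
    shift = 32 - prefix_bits
    for item in items:
        if hash32(item) >> shift == 0:
            kept.add(item)
    return len(kept) * (1 << prefix_bits)
```

Raising `prefix_bits` shrinks the retained set by half per extra bit, at the cost of a noisier estimate. The memory used still grows with the number of unique elements, only with a smaller constant.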

This problem is well addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and the most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bit vector just like you would for a Bloom filter (except that only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
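A minimal sketch of Linear Counting using the formula above. The function name, the default bit vector size, and the SHA-256-based hash are assumptions for illustration; the paper discusses how to size the bit vector for a target accuracy.

```python
import hashlib
import math


def linear_count(items, total_bits=1 << 20):
    """Estimate the distinct count as -total_bits * ln(unset_bits / total_bits)."""
    bits = bytearray((total_bits + 7) // 8)
    for item in items:
        # One hash function maps each item to a single bit position.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        pos = int.from_bytes(digest[:8], "big") % total_bits
        bits[pos >> 3] |= 1 << (pos & 7)
    set_bits = sum(bin(b).count("1") for b in bits)
    unset_bits = total_bits - set_bits
    if unset_bits == 0:
        # ln(0) is undefined: the vector is too small for this input.
        raise ValueError("bit vector is saturated; use a larger total_bits")
    return -total_bits * math.log(unset_bits / total_bits)
```

The estimate degrades as the bit vector approaches saturation, so `total_bits` should be chosen with the expected number of unique elements in mind.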

Nobody has mentioned HyperLogLog, an approximate algorithm designed specifically for this problem.
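The answer gives no details, so the following is only a rough sketch of the standard HyperLogLog estimator with its small-range correction. The function name, the precision `p`, and the 64-bit hash taken from SHA-256 are assumptions for illustration; production libraries add further bias corrections.

```python
import hashlib
import math


def hyperloglog_estimate(items, p=12):
    """Estimate the distinct count using 2**p small registers."""
    if p < 7:
        # The alpha constant below is the one for 128 or more registers.
        raise ValueError("this sketch assumes p >= 7")
    m = 1 << p
    registers = [0] * m
    rest_bits = 64 - p
    for item in items:
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> rest_bits                 # first p bits pick a register
        rest = h & ((1 << rest_bits) - 1)      # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if rank > registers[index]:
            registers[index] = rank
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Small-range correction: fall back to linear counting on registers.
        return m * math.log(m / zeros)
    return raw
```

With `p=12` the state is 4096 small registers, and the typical relative error is roughly 1.04 / sqrt(4096), or about 1.6%, independent of the dataset size.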




