For posterity, here's the answer Xueling received on the Hadoop mailing list:
First, further detail from Xueling:
The datasets won't be updated often, but queries against a dataset are frequent, and the quicker the query the better. For example, we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables) and the slowest query against the biggest table (400,000,000 records) is around 12 minutes. So if using any Hadoop product can speed up the search, then that product is what we are looking for.
The response, from Cloudera's Todd Lipcon:
In that case, I would recommend the following:
- Put all of your data on HDFS
- Write a MapReduce job that sorts the data by position of match
As a second output of this job, you can write a "sparse index" - basically a set of entries, each recording a position of match, the byte offset of that record in the sorted file, and the number of entries following it - so you're basically giving offsets into every 10K records or so. If you index every 10K records, then 5 billion records total will mean 500,000 index entries. Each index entry shouldn't be more than 20 bytes, so 500,000 entries will be about 10MB. This is super easy to fit into memory. (You could probably index every 100th record instead and end up with around 1GB, still feasible to hold in memory.)
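To make that concrete, here is a minimal sketch (not code from the thread) of such a job, assuming each input record is a tab-separated text line whose first field is the position of match, and that a single reducer writes one globally sorted file plus the sparse index as a named second output. The class names, the index line layout (position, byte offset, record count), and the 10,000-record interval are illustrative choices, not anything specified in the reply. In practice you would likely use a TotalOrderPartitioner with many reducers and keep one sparse index per output partition; the single-reducer version just keeps the idea easy to see.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SortAndIndexJob {

    /** Emit (position, whole record); the shuffle then sorts records by position. */
    public static class PositionMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable fileOffset, Text line, Context ctx)
                throws IOException, InterruptedException {
            long position = Long.parseLong(line.toString().split("\t", 2)[0]);
            ctx.write(new LongWritable(position), line);
        }
    }

    /** Write sorted records to the main output, plus a sparse-index line
     *  (position, byte offset, records in block) every INDEX_INTERVAL records. */
    public static class IndexingReducer
            extends Reducer<LongWritable, Text, NullWritable, Text> {
        private static final long INDEX_INTERVAL = 10_000;
        private MultipleOutputs<NullWritable, Text> mos;
        private long bytesWritten = 0, recordsInBlock = 0;
        private long blockOffset = 0, blockPosition = 0;

        @Override
        protected void setup(Context ctx) {
            mos = new MultipleOutputs<>(ctx);
        }

        @Override
        protected void reduce(LongWritable position, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            for (Text record : records) {
                if (recordsInBlock == 0) {                  // start of a new index block
                    blockOffset = bytesWritten;
                    blockPosition = position.get();
                }
                ctx.write(NullWritable.get(), record);      // sorted data file
                bytesWritten += record.getLength() + 1;     // +1 for the newline
                if (++recordsInBlock == INDEX_INTERVAL) {
                    flushIndexEntry();
                }
            }
        }

        private void flushIndexEntry() throws IOException, InterruptedException {
            mos.write("index", NullWritable.get(),
                      new Text(blockPosition + "\t" + blockOffset + "\t" + recordsInBlock));
            recordsInBlock = 0;
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            if (recordsInBlock > 0) flushIndexEntry();      // last partial block
            mos.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort by position + sparse index");
        job.setJarByClass(SortAndIndexJob.class);
        job.setMapperClass(PositionMapper.class);
        job.setReducerClass(IndexingReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);   // single reducer => one globally sorted output file
        MultipleOutputs.addNamedOutput(job, "index", TextOutputFormat.class,
                                       NullWritable.class, Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```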
Then, to satisfy your count-range query, you can simply scan your in-memory sparse index. Some of the indexed blocks will be completely included in the range, in which case you just add up their "number of entries following" values. The start and finish blocks will be partially covered, so you can use the file offset info to open that file on HDFS, start reading at those offsets, and finish the count.
Total time per query should be <100ms, no problem.
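And a sketch of the query side, under the same assumptions about the record and index formats; loading the small index file into memory is omitted. The Hadoop calls (Configuration, Path, FileSystem, FSDataInputStream) are real APIs, while the class, method, and field names around them are made up for illustration. Fully covered blocks are answered straight from the in-memory index, and only the partially covered boundary blocks cost a seek into HDFS, which is why each query amounts to a couple of short reads rather than a table scan.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RangeCounter {

    /** One sparse-index entry, kept in memory: the position of the first record
     *  in the block, the block's byte offset in the sorted file, and how many
     *  records the block covers. */
    public static class IndexEntry {
        final long position, byteOffset, recordCount;
        IndexEntry(long position, long byteOffset, long recordCount) {
            this.position = position;
            this.byteOffset = byteOffset;
            this.recordCount = recordCount;
        }
    }

    /** Count records whose position falls in [start, end], touching HDFS only
     *  for the partially covered boundary blocks. */
    public static long countInRange(List<IndexEntry> index, Path sortedFile,
                                    Configuration conf, long start, long end)
            throws IOException {
        FileSystem fs = sortedFile.getFileSystem(conf);
        long count = 0;
        for (int i = 0; i < index.size(); i++) {
            IndexEntry block = index.get(i);
            long nextPos = (i + 1 < index.size()) ? index.get(i + 1).position
                                                  : Long.MAX_VALUE;
            if (nextPos < start || block.position > end) {
                continue;                               // block entirely outside the range
            } else if (block.position >= start && nextPos <= end) {
                count += block.recordCount;             // block entirely inside: no I/O
            } else {
                count += scanBlock(fs, sortedFile, block, start, end); // boundary block
            }
        }
        return count;
    }

    /** Seek to a boundary block and count its records that fall in the range. */
    private static long scanBlock(FileSystem fs, Path file, IndexEntry block,
                                  long start, long end) throws IOException {
        long count = 0;
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(block.byteOffset);
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            for (long r = 0; r < block.recordCount; r++) {
                String line = reader.readLine();
                if (line == null) break;
                long pos = Long.parseLong(line.split("\t", 2)[0]);
                if (pos > end) break;                   // records are sorted by position
                if (pos >= start) count++;
            }
        }
        return count;
    }
}
```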
A few subsequent replies suggested HBase.