Which Hadoop product is more appropriate for a quick query on a large data set?
  • Asked: 2009-12-12 03:12:41
  • Tags: hadoop

I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set).

The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear, a sample record from the data set is shown below.

One line (record) looks like:

1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 ***103570835*** F .. 23G 24C

The highlighted field is called the "position of match", and the query we are interested in is the number of sequences in a certain range of this "position of match". For instance, the range could be "position of match" > 200 and "position of match" + 36 < 200,000.
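In other words, the query is a count over a simple per-record predicate. As a point of reference, a minimal single-machine sketch is shown below, assuming whitespace-separated fields with the "position of match" as the eighth field, as in the sample record (the asterisks above are only highlighting, not part of the data); a linear scan like this is exactly what becomes too slow at billions of records:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class RangeCount {
        public static void main(String[] args) throws IOException {
            long count = 0;
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Fields are assumed whitespace-separated; the "position of match"
                    // is taken to be the eighth field (index 7), as in the sample record.
                    String[] fields = line.trim().split("\\s+");
                    long pos = Long.parseLong(fields[7]);
                    // Example range from the question: pos > 200 and pos + 36 < 200,000
                    if (pos > 200 && pos + 36 < 200000) {
                        count++;
                    }
                }
            }
            System.out.println(count);
        }
    }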

Any suggestions on the Hadoop product I should start with to accomplish the task? HBase, Pig, Hive, or ...?

Answers

Rough guideline: If you need lots of queries that return fast and do not need to aggregate data, you want to use HBase. If you are looking at tasks that are more analysis and aggregation-focused, you want Pig or Hive.

HBase allows you to specify start and end rows for scans, meaning it should satisfy the query example you provide, and it seems most appropriate for your use case.
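For instance, if the table were keyed by a zero-padded "position of match" (a row-key design assumed here for illustration, not something stated in the thread, chosen so that lexicographic order matches numeric order), the example query becomes a bounded scan using the classic HBase client API; the table name is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PositionRangeScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "sequences");  // hypothetical table name

            // Rows are assumed to be keyed by the zero-padded position of match, so a
            // start/stop row pair expresses "position > 200 and position + 36 < 200,000".
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(String.format("%012d", 201L)));    // inclusive
            scan.setStopRow(Bytes.toBytes(String.format("%012d", 199964L)));  // exclusive

            long count = 0;
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    count++;  // only counting rows; values are not inspected
                }
            } finally {
                scanner.close();
                table.close();
            }
            System.out.println(count);
        }
    }

(For very wide ranges, counting on the client this way ships every matching row over the network; the bounded scan is just the core idea.)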

For posterity, here's the answer Xueling received on the Hadoop mailing list:

First, further detail from Xueling:

The datasets won't be updated often, but queries against a data set are frequent. The quicker the query, the better. For example, we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables), and the slowest query against the biggest table (400,000,000 records) takes around 12 minutes. So if any Hadoop product can speed up the search, then that product is what we are looking for.

The response, from Cloudera's Todd Lipcon:

In that case, I would recommend the following:

  1. Put all of your data on HDFS
  2. Write a MapReduce job that sorts the data by position of match
  3. As a second output of this job, you can write a "sparse index" - basically a set of entries

where you're basically giving offsets into every 10K records or so. If you index every 10K records, then 5 billion records in total will mean 500,000 index entries. Each index entry shouldn't be more than 20 bytes, so 500,000 entries will be about 10MB. This is super easy to fit into memory. (You could even index every 100th record instead and end up with around 1GB, which still fits in memory.)
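(The quoted reply does not spell out the entry format. From the description, each entry would need the first "position of match" in its block, where that block starts on HDFS, and how many records follow; a minimal sketch, with field names that are mine rather than from the original answer:)

    // Hedged sketch of one sparse-index entry; field names are illustrative.
    public class SparseIndexEntry {
        public long positionOfMatch;  // first "position of match" in this block
        public String hdfsPath;       // position-sorted data file on HDFS
        public long byteOffset;       // byte offset where this block starts in that file
        public long recordCount;      // number of records in the block ("entries following")
    }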

Then to satisfy your count-range query, you can simply scan your in-memory sparse index. Some of the indexed blocks will be completely included in the range, in which case you just add up the "number of entries following" column. The start and finish blocks will be partially covered, so you can use the file offset info to load them off HDFS, start reading at those offsets, and finish the count.

Total time per query should be <100ms, no problem.
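Putting the quoted approach together, a rough sketch of the query side is below. It builds on the hypothetical SparseIndexEntry above; the field index, helper names, and the assumption that each stored byte offset falls on a record boundary are illustrative, not part of the original reply:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SparseIndexCounter {

        // Counts records whose "position of match" p satisfies lo < p < hi, using
        // the in-memory sparse index over the position-sorted data. Blocks entirely
        // inside the range contribute their stored counts; the partially covered
        // boundary blocks are read from HDFS starting at their byte offsets.
        static long countInRange(FileSystem fs, List<SparseIndexEntry> index,
                                 long lo, long hi) throws IOException {
            long count = 0;
            for (int i = 0; i < index.size(); i++) {
                SparseIndexEntry e = index.get(i);
                long blockStart = e.positionOfMatch;
                long blockEnd = (i + 1 < index.size())
                        ? index.get(i + 1).positionOfMatch : Long.MAX_VALUE;

                if (blockEnd <= lo || blockStart >= hi) {
                    continue;                          // entirely outside the range
                } else if (blockStart > lo && blockEnd < hi) {
                    count += e.recordCount;            // entirely inside: add stored count
                } else {
                    count += scanBlock(fs, e, lo, hi); // boundary block: read and count
                }
            }
            return count;
        }

        // Reads one block from its byte offset (assumed to sit on a record boundary)
        // and counts the records whose position falls in the range.
        static long scanBlock(FileSystem fs, SparseIndexEntry e, long lo, long hi)
                throws IOException {
            long matches = 0;
            try (FSDataInputStream in = fs.open(new Path(e.hdfsPath))) {
                in.seek(e.byteOffset);
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                String line;
                long seen = 0;
                while (seen < e.recordCount && (line = reader.readLine()) != null) {
                    long pos = Long.parseLong(line.trim().split("\\s+")[7]);
                    if (pos > lo && pos < hi) {
                        matches++;
                    }
                    seen++;
                }
            }
            return matches;
        }
    }

Here fs would come from FileSystem.get(conf), and the index list would be loaded from the small index file written by the MapReduce job; for the example query, lo and hi would be 200 and 199,964.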

A few subsequent replies suggested HBase.

You could also take a quick look at JAQL (http://code.google.com/p/jaql/), though unfortunately it's for querying JSON data. But maybe it helps anyway.

You may want to look at NoSQL database approaches like HBase or Cassandra. I would prefer HBase, as it has a growing community.




