Question

I have 1M words in my dictionary. Whenever a user issues a query on my website, I check whether the query contains words from my dictionary and increment the counter for each matching word individually. For example, if a user types in "Obama is a president" and "Obama" and "president" are in my dictionary, then I should increment the counters for "Obama" and "president" by 1.
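To make the counting step concrete, here is a minimal in-memory sketch. It assumes whitespace tokenization, case-sensitive matching, and at most one increment per word per query; none of these details are fixed by the question, and the function name is only illustrative.

```python
from collections import Counter

def count_dictionary_words(query, dictionary, counters):
    """Increment the counter of every dictionary word found in the query."""
    # Assumption: words are separated by whitespace and compared case-sensitively.
    # set() makes sure a word repeated in one query is counted only once.
    for word in set(query.split()):
        if word in dictionary:
            counters[word] += 1
    return counters

dictionary = {"Obama", "president"}
counters = Counter()
count_dictionary_words("Obama is a president", dictionary, counters)
print(dict(counters))
```

In HBase the `counters[word] += 1` line would correspond to an atomic increment on the row for that word.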

From time to time, I want to see the top 100 words (the most queried words). If I use HBase to store the counters, what schema should I use? I have not come up with an efficient one yet.

If I use the dictionary word as the row key and "counter" as the column key, then updating the counter (increment) is very efficient. But it's very hard to sort and return the top 100.

Can anyone give some good advice? Thanks.

Answer 1

You can use the natural schema (row key as word and column as count) and use IHBase to get a secondary index on the count column. See https://issues.apache.org/jira/browse/HBASE-2037 for the initial implementation; the current code lives at http://github.com/ykulbak/ihbase.

Answer 2

From Adobe's presentation at HBaseCon 2012 (slide 28 in particular), I suggest using two tables with this sort of data structure for the row keys:

name

President => 1000
Test => 900

count

429461296:President => dummyvalue
429461396:Test => dummyvalue

The second table's row keys are derived by using Long.MAX_VALUE - count at that point in time.

As you get new words, just add "count:word" (with the count inverted as described above) as a row key to the count table. That way, the top words are always returned first when you scan the table.
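A rough simulation of the two-table idea in plain Python, with dicts standing in for the HBase tables and `sorted()` standing in for a scan in row-key order. The zero-padding to a fixed width and the removal of the stale key on update are my assumptions; the answer above does not spell them out, and all function names are hypothetical.

```python
LONG_MAX = 2**63 - 1            # same value as Java's Long.MAX_VALUE
KEY_WIDTH = len(str(LONG_MAX))  # 19 digits

def count_row_key(word, count):
    """Build the 'inverted count:word' row key for the count table."""
    inverted = LONG_MAX - count
    # Assumption: zero-pad so that lexicographic order matches numeric order.
    return f"{inverted:0{KEY_WIDTH}d}:{word}"

def update_count(name_table, count_table, word, delta=1):
    """Add delta to a word's count and keep both tables in step."""
    old = name_table.get(word, 0)
    new = old + delta
    if old:
        # Assumption: drop the stale key so each word appears only once.
        count_table.pop(count_row_key(word, old), None)
    name_table[word] = new
    count_table[count_row_key(word, new)] = b""  # dummy value

def top_words(count_table, n):
    """Return the first n words in row-key order, like a limited scan."""
    return [key.split(":", 1)[1] for key in sorted(count_table)[:n]]

name_table, count_table = {}, {}
for word, count in [("President", 1000), ("Test", 900), ("Other", 5)]:
    update_count(name_table, count_table, word, count)
update_count(name_table, count_table, "Test", 200)  # Test is now at 1100
print(top_words(count_table, 2))
```

Note that in this sketch every increment costs a delete plus a put on the count table, and the two writes are not atomic; whether that is acceptable depends on the query rate.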

Answer 3

Sorting 1M longs can be done in memory, so what?
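As a quick check of that claim, the sketch below picks the top 100 out of 1M counters held in a plain dict. The counter values are random placeholders, not real data, and `top_n` is just an illustrative name.

```python
import heapq
import random

def top_n(counters, n=100):
    """Return the n (word, count) pairs with the highest counts, largest first."""
    # heapq.nlargest avoids sorting all entries when n is small.
    return heapq.nlargest(n, counters.items(), key=lambda kv: kv[1])

# Placeholder data: 1M made-up words with random counts.
random.seed(0)
counters = {f"word{i}": random.randrange(1_000_000) for i in range(1_000_000)}

top100 = top_n(counters)
print(top100[:3])
```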

Store the words x, y, z issued at time t in a table as key: t, cols: word:x=1, word:y=1, word:z=1. Then use a MapReduce job to sum up the counts per word and get the top 100.

This also enables further analysis.
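The time-keyed layout and the summing job can be mimicked in a few lines of plain Python. This is only a local stand-in for a real MapReduce job; the sample rows, timestamps, and function names are invented for illustration.

```python
from collections import Counter
import heapq

# Hypothetical rows: the key is the query time, each column is "word:<term>" = 1.
rows = {
    1001: {"word:Obama": 1, "word:president": 1},
    1002: {"word:Obama": 1},
    1003: {"word:president": 1, "word:election": 1},
    1004: {"word:Obama": 1},
}

def map_row(columns):
    """Map step: emit (word, 1) for every word column of one row."""
    for column, value in columns.items():
        family, word = column.split(":", 1)
        if family == "word":
            yield word, value

def reduce_counts(pairs):
    """Reduce step: sum the emitted values per word."""
    totals = Counter()
    for word, value in pairs:
        totals[word] += value
    return totals

def top_words(rows, n=100):
    """Run map and reduce over all rows, then keep the n largest totals."""
    pairs = (pair for columns in rows.values() for pair in map_row(columns))
    totals = reduce_counts(pairs)
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

print(top_words(rows, 2))
```

Because the raw events keep their timestamps, the same rows could be restricted to a time range before the map step, which is one form of the further analysis mentioned above.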
