HBase schema design: how to make sorting easy?
  • Time: 2010-03-25 15:28:20
  • Tags:
  • olap
  • hbase

I have 1M words in my dictionary. Whenever a user issues a query on my website, I check whether the query contains any words from my dictionary and increment the counter for each matching word. For example, if a user types "Obama is a president" and both "Obama" and "president" are in my dictionary, then I should increment the counters for "Obama" and "president" by 1 each.

From time to time, I want to see the top 100 words (the most queried words). If I use HBase to store the counters, what schema should I use? I have not come up with an efficient one yet.

If I use the dictionary word as the row key and "counter" as the column key, then updating (incrementing) the counter is very efficient. But it's very hard to sort and return the top 100.

Can anyone give some good advice? Thanks.

Answers

You can use the natural schema (row key as word and column as count) and use IHBase to get a secondary index on the count column. See https://issues.apache.org/jira/browse/HBASE-2037 for the initial implementation; the current code lives at http://github.com/ykulbak/ihbase.

Based on Adobe's presentation at HBaseCon 2012 (slide 28 in particular), I suggest using two tables and this sort of data structure for the row key:

name

President => 1000
Test => 900

count

429461296:President => dummyvalue
429461396:Test => dummyvalue

The second table's row keys are derived by computing Long.MAX_VALUE - count at that point in time.

As you get new words, just add the "count:word" as a row key to the count table. That way, you always have the top words returned first when you scan the table.
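
The key scheme above can be sketched without an HBase cluster. The following is a minimal in-memory stand-in, not HBase client code: `TopWordsIndex`, `index_key`, and the fixed-width 8-byte encoding are hypothetical choices made for illustration. It assumes the stale "count:word" row is deleted whenever a count changes, which the answer does not spell out, and it packs the number as big-endian bytes so that byte order matches numeric order.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def index_key(count: int, word: str) -> bytes:
    """Row key for the count table: 8-byte big-endian (LONG_MAX - count), ':', word."""
    return struct.pack(">q", LONG_MAX - count) + b":" + word.encode("utf-8")


class TopWordsIndex:
    """In-memory stand-in for the two tables (hypothetical helper, not an HBase API)."""

    def __init__(self):
        self.name_table = {}   # word -> current count
        self.count_table = {}  # index key -> dummy value

    def increment(self, word: str, delta: int = 1) -> int:
        old = self.name_table.get(word, 0)
        new = old + delta
        self.name_table[word] = new
        # Assumption: remove the stale index row so each word is indexed once.
        if old:
            self.count_table.pop(index_key(old, word), None)
        self.count_table[index_key(new, word)] = b""
        return new

    def top(self, n: int):
        """Emulate a scan: rows come back in ascending byte order of the key."""
        rows = sorted(self.count_table)[:n]
        return [
            (key[9:].decode("utf-8"), LONG_MAX - struct.unpack(">q", key[:8])[0])
            for key in rows
        ]


idx = TopWordsIndex()
for _ in range(3):
    idx.increment("President")
for _ in range(2):
    idx.increment("Test")
idx.increment("Obama")
print(idx.top(2))
```

In this sketch the update of the name table and the rewrite of the index row are two separate steps; against a real cluster they would not be atomic across tables, so a reader may briefly see a stale or missing index row.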

Sorting 1M longs can be done in memory, so why is this a problem?
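
As a rough illustration of the in-memory approach, assuming the counters have already been read out of the table into a plain dict (the `top_words` helper and the sample data are hypothetical), a heap-based selection avoids even a full sort:

```python
import heapq


def top_words(counts, n=100):
    """Return the n (word, count) pairs with the largest counts, highest first."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


# Hypothetical counters, as they might look after scanning the word table.
counts = {"Obama": 5, "president": 3, "is": 1}
print(top_words(counts, n=2))
```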

Store the words x, y, z issued at time t in a table as key: t, cols: word:x=1, word:y=1, word:z=1. Then use a MapReduce job to sum up the counts for each word and get the top 100.

This also enables further analysis.
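
The aggregation step can be sketched in plain Python rather than as an actual MapReduce job. The row layout follows the answer (timestamp key, one `word:<w>` column per word); the function names `map_phase`, `reduce_phase`, and `top_n` are hypothetical, and the rows are assumed to have been scanned into `(timestamp, columns)` pairs.

```python
from collections import Counter


def map_phase(rows):
    """Emit (word, value) for every 'word:<w>' column in every timestamped row."""
    for _timestamp, columns in rows:
        for column, value in columns.items():
            family, _, word = column.partition(":")
            if family == "word":
                yield word, value


def reduce_phase(pairs):
    """Sum the emitted values per word."""
    totals = Counter()
    for word, value in pairs:
        totals[word] += value
    return totals


def top_n(rows, n=100):
    """Return the n most frequent words with their summed counts."""
    return reduce_phase(map_phase(rows)).most_common(n)


# Hypothetical query log: two queries at times 1 and 2.
rows = [
    (1, {"word:Obama": 1, "word:president": 1}),
    (2, {"word:Obama": 1}),
]
print(top_n(rows, n=1))
```

Because the raw rows keep their timestamps, the same scan could be restricted to a time range to get, say, the top words for a single day.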




