Cassandra buffered read of millions of columns

I've got a Cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to read a full row (all 2 million columns), but things start failing all over the place before the read can finish. I'd like to do some kind of buffered read.

Ideally I'd like to do something like this using Pycassa (no, this isn't the proper way to call get; it's just so you can get the idea):

results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break

    # Merge these results into the main one
    results.update(buffer)

    # Update the offset
    start += len(buffer)

Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify a column name for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.

So how can I accomplish a buffered read of all the columns in a single row? Thanks.

Best answer

From the pycassa 1.0.8 documentation, it would appear that you could use something like the following [pseudocode]:

results = {}
startColumn = ""
while True:
    # Fetch blocks of size 100
    buffer = get(key, column_start=startColumn, column_finish="", column_count=100)
    # iterate over the returned values
    # set startColumn to the name of the last column returned
    # stop once a block comes back with fewer than 100 columns

Remember that on each subsequent call you'll only get 99 new results, because the call also returns startColumn, which you've already seen. I'm not skilled enough in Python yet to iterate over buffer and extract the column names.
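For what it's worth, here is a minimal Python sketch of that paging loop. It is not taken from the pycassa docs: `read_wide_row` and `get_slice` are hypothetical names, and the sketch assumes `get_slice` behaves the way the answer describes `get`, i.e. it accepts `column_start` and `column_count`, treats `column_start` as inclusive, and returns an ordered mapping of column name to value.

```python
def read_wide_row(get_slice, key, page_size=100):
    """Yield (column_name, value) pairs for one row, one page at a time.

    Assumption: get_slice(key, column_start=..., column_count=...) returns
    an ordered mapping of columns, and column_start is inclusive.
    """
    if page_size < 2:
        # With an inclusive start, a page of 1 would never advance.
        raise ValueError("page_size must be at least 2")

    start = ""          # "" means "from the beginning of the row"
    first_page = True
    while True:
        page = get_slice(key, column_start=start, column_count=page_size)
        names = list(page.keys())

        # Every page after the first begins with the column we already
        # returned as the last item of the previous page, so skip it.
        for name in names[0 if first_page else 1:]:
            yield name, page[name]

        # A short page means we have reached the end of the row.
        if len(names) < page_size:
            break

        start = names[-1]
        first_page = False
```

With pycassa this would presumably be called as `read_wide_row(column_family.get, key)`. Because it is a generator, the 2 million columns never have to sit in memory at once unless you collect them yourself. If the row can be empty or missing, the caller would still need to handle whatever exception the real `get` raises in that case.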

Other answers

In v1.7.1+ of pycassa you can use xget and get a row as wide as 2**63-1 columns.

for name, value in cf.xget(key, column_count=2**63-1):
    # do something with the column; each item is a (name, value) pair
    print(name, value)



