I've got a Cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.
Ideally I'd like to do something like this using Pycassa (no, this isn't the proper way to call get, it's just so you can get the idea):
results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break
    # Merge these results into the main one
    results.update(buffer)
    # Update the offset
    start += len(buffer)
Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify a column name for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.
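If the answer is to page by the last column name seen on each fetch, here's my best guess at how that would look. It's simulated in plain Python over a sorted dict since I can't share a running cluster: fake_get is a hypothetical stand-in for column_family.get, not a real pycassa call, and the skip-the-repeat handling is there because column_start appears to be inclusive.

```python
from collections import OrderedDict

# Fake "row": 2500 columns with sortable names, standing in for a wide row.
ROW = OrderedDict(("col%07d" % i, i) for i in range(2500))

def fake_get(key, column_start="", column_count=100):
    """Hypothetical stand-in for ColumnFamily.get: returns up to
    column_count columns whose names are >= column_start, in sorted order.
    "" means "from the beginning of the row"."""
    names = [n for n in ROW if n >= column_start]
    return OrderedDict((n, ROW[n]) for n in names[:column_count])

def buffered_read(key, buffer_size=500):
    results = OrderedDict()
    start = ""  # start from the beginning of the row
    while True:
        buf = fake_get(key, column_start=start, column_count=buffer_size)
        if start:
            # column_start is inclusive, so drop the column we already have
            buf.pop(start, None)
        if not buf:
            break
        results.update(buf)
        # Next page begins at the last column name we saw
        start = next(reversed(buf))
    return results
```

Running buffered_read("row1") against the fake row recovers all 2500 columns in order, so the paging logic itself seems sound; what I don't know is whether this is the intended way to drive pycassa.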
So how can I accomplish a buffered read of all the columns in a single row? Thanks.