Cassandra multiget performance

I've got a Cassandra cluster with a fairly small number of rows (2 million or so, which I would hope is "small" for Cassandra). Each row is keyed on a unique UUID, and each row has about 200 columns (give or take a few). All in all these are pretty small rows: no binary data or large amounts of text, just short strings.

I've just finished the initial import into the Cassandra cluster from our old database. I've tuned the hell out of Cassandra on each machine. There were hundreds of millions of writes, but no reads. Now that it's time to USE this thing, I'm finding that read speeds are absolutely dismal. I'm doing a multiget using pycassa on anywhere from 500 to 10000 rows at a time. Even at 500 rows, the performance is awful, sometimes taking 30+ seconds.

What would cause this type of behavior? What sort of things would you recommend after a large import like this? Thanks.

Best answer

Sounds like you are I/O-bottlenecked. Cassandra does about 4000 reads/s per core, IF your data fits in RAM. Otherwise you will be seek-bound just like anything else.
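To put rough numbers on that, here is a back-of-envelope sketch. Only the 4000 reads/s figure comes from the answer; the 10 ms seek time and the one-seek-per-row cost are assumptions picked for illustration, so treat the output as an order-of-magnitude guide rather than a measurement.

```python
# Back-of-envelope estimate of multiget time, in-memory vs. seek-bound.
READS_PER_SEC_PER_CORE = 4000   # figure quoted in the answer (data fits in RAM)
ASSUMED_SEEK_SECONDS = 0.010    # assumption: ~10 ms per random disk seek
ASSUMED_SEEKS_PER_ROW = 1       # assumption: each row read costs one seek


def in_memory_seconds(rows, cores=1):
    """Time to read `rows` rows when the data is served from RAM."""
    return rows / (READS_PER_SEC_PER_CORE * cores)


def seek_bound_seconds(rows, seeks_per_row=ASSUMED_SEEKS_PER_ROW,
                       seek_seconds=ASSUMED_SEEK_SECONDS):
    """Time to read `rows` rows when every read has to hit the disk."""
    return rows * seeks_per_row * seek_seconds


for rows in (500, 10000):
    print(rows, "rows:",
          "in RAM ~%.3f s," % in_memory_seconds(rows),
          "seek-bound ~%.1f s" % seek_bound_seconds(rows))
```

Under these assumptions a 500-row multiget takes a fraction of a second from RAM but several seconds once it is seek-bound, and more than one seek per row would push it higher still.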

I note that normally "tuning the hell" out of a system is reserved for AFTER you start putting load on it. :)

See:

Other answers

Is it an option to split up the multiget into smaller chunks? By doing this you would be able to spread your reads across multiple nodes and potentially increase your performance, both by spreading the load across nodes and by having smaller packets to deserialize.
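A minimal sketch of that chunking idea follows. The helper name `multiget_in_chunks` and its parameters are hypothetical. `fetch` stands in for whatever call performs one multiget (for example, a bound pycassa `ColumnFamily.multiget`) and is assumed to be safe to call from several threads and to return a dict keyed by row key.

```python
from concurrent.futures import ThreadPoolExecutor


def multiget_in_chunks(fetch, keys, chunk_size=100, workers=4):
    """Split `keys` into chunks and fetch the chunks concurrently.

    `fetch` is a hypothetical callable: it takes a list of keys and
    returns a dict mapping each key to its row.
    """
    keys = list(keys)
    chunks = [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields the partial results in chunk order.
        for partial in pool.map(fetch, chunks):
            results.update(partial)
    return results


if __name__ == "__main__":
    # Stand-in for a real multiget call, so the sketch runs without a cluster.
    def fake_fetch(chunk):
        return {key: {"name": "row-%s" % key} for key in chunk}

    rows = multiget_in_chunks(fake_fetch, range(500), chunk_size=100)
    print(len(rows))
```

The chunk size and worker count here are placeholders. The right values depend on the cluster and are worth measuring rather than guessing.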

That brings me to the next question: what is your read consistency level set to? In addition to an I/O bottleneck, as @jbellis mentioned, you could also have a network traffic issue if you are requiring a particularly high level of consistency.
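To make the cost of a higher consistency level concrete, the sketch below counts how many replicas must answer each read at the ONE, QUORUM, and ALL levels. The function name is hypothetical, and the replication factor of 3 in the demo is an assumption, since the question does not state it.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond before a read succeeds."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %r" % level)


if __name__ == "__main__":
    rf = 3  # assumption for illustration only
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, replicas_required(level, rf))
```

Each extra replica that has to respond adds network round trips per row, and that cost multiplies across a multiget of thousands of keys.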




