I'm using Cassandra 2.0.9 to store a fairly large amount of data, say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried:
sstable2json - it produces very large JSON files which are hard to parse, because the tool puts the data in one row and uses a complicated schema (e.g. a 300 MB data file becomes ~2 GB of JSON); it takes a long time to dump, and Cassandra likes to rename the source files according to its internal mechanism
COPY - causes timeouts, even on fairly fast EC2 instances, for large numbers of records
CAPTURE - as above, causes timeouts
reads with pagination - I used a timeuuid for it, but it returns only about 1,500 records per second
I use an Amazon EC2 instance with fast storage, 15 GB of RAM and 4 cores.
Is there any better option for exporting gigabytes of data from Cassandra to CSV?
Solution
Because using COPY is quite challenging when you are trying to export a table with millions of rows from Cassandra, what I have done is to create a simple tool that fetches the data chunk by chunk (paginated) from the Cassandra table and exports it to CSV.
Here is an example solution using the Java driver from DataStax.
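A minimal sketch of that chunk-by-chunk approach, assuming the DataStax Java driver and a hypothetical keyspace `my_keyspace` with a table `my_table(id uuid, name text, value bigint)` — adjust contact point, schema and column types to your own setup. The key call is `setFetchSize`, which makes the driver page through the result set transparently instead of loading the whole table into memory:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class CassandraCsvExporter {

    public static void main(String[] args) throws IOException {
        // Contact point, keyspace and table names are placeholders.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        try (BufferedWriter out = new BufferedWriter(new FileWriter("export.csv"))) {
            // setFetchSize controls the page size: the driver pulls 5000 rows
            // per request and fetches the next page lazily while we iterate.
            Statement stmt = new SimpleStatement("SELECT id, name, value FROM my_table");
            stmt.setFetchSize(5000);
            ResultSet rs = session.execute(stmt);

            out.write("id,name,value");
            out.newLine();
            for (Row row : rs) { // paging happens behind this iterator
                out.write(row.getUUID("id") + ","
                        + escape(row.getString("name")) + ","
                        + row.getLong("value"));
                out.newLine();
            }
        } finally {
            session.close();
            cluster.close();
        }
    }

    // Minimal CSV escaping: quote fields containing commas or quotes,
    // doubling any embedded quotes.
    public static String escape(String s) {
        if (s == null) return "";
        if (s.contains(",") || s.contains("\"")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }
}
```

Keeping the fetch size moderate (a few thousand rows) avoids the coordinator timeouts that COPY runs into, because each request only asks for one page rather than the whole result.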