CSV: Comma-Separated Values (more generally, delimiter-separated values)
Pros: Human-readable; supported by virtually all tools.
Cons:
- I/O and storage inefficient (uncompressed)
- No rich types: every value is a string
- Linear scanning (projections and predicates): whole rows are read even when only some columns or rows are needed; if all the data sits in a single file, the entire file must be scanned
- Other issues: delimiters inside data, newlines within data.
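The delimiter, newline, and string-only issues above can be seen in a small sketch. It uses Python's standard `csv` module; the sample rows are made up for illustration. Naive splitting breaks on embedded commas and newlines, while a quoting-aware reader recovers the fields, but every value still comes back as a string.

```python
import csv
import io

# Made-up sample rows: one value contains the delimiter, one contains a newline.
rows = [["id", "comment"], [1, "hello, world"], [2, "line one\nline two"]]

# Naive approach: join on commas and split on newlines, with no quoting.
naive_text = "\n".join(",".join(str(v) for v in row) for row in rows)
naive_parsed = [line.split(",") for line in naive_text.split("\n")]

# Quoting-aware approach: csv.writer quotes fields that contain
# the delimiter or a newline, and csv.reader undoes the quoting.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
parsed = list(csv.reader(io.StringIO(buf.getvalue(), newline="")))

print(len(naive_parsed))  # more "rows" than were written
print(naive_parsed[1])    # the comment was split into two fields
print(parsed[1])          # fields recovered, but the id is now the string "1"
```

Even with correct quoting, the integer `1` is read back as `"1"`: the format itself carries no type information.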
JSON and XML: much more verbose than CSV in terms of storage
Pros: Readable, some level of schema support
Cons:
- Duplicated schema (field names are repeated in every record)
- Horrible in terms of storage
- Not splittable *, linear lookups
- Aggregations require all data to be loaded into memory
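A rough way to see the duplicated-schema cost is to serialize the same records as JSON and as CSV and compare sizes. This is only a sketch: the records and field names below are invented, and the exact ratio depends on the data.

```python
import csv
import io
import json

# Invented sample data: 1000 small records with the same three fields.
records = [{"id": i, "name": f"user{i}", "active": True} for i in range(1000)]

# JSON repeats every field name inside every record.
json_text = json.dumps(records)

# CSV states the field names once, in the header row.
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["id", "name", "active"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

json_bytes = len(json_text.encode("utf-8"))
csv_bytes = len(csv_text.encode("utf-8"))
key_occurrences = json_text.count('"name"')

print(json_bytes, csv_bytes)  # JSON is noticeably larger for this data
print(key_occurrences)        # the key "name" appears once per record
```

On the other hand, `json.loads(json_text)[0]["id"]` comes back as an integer rather than a string, which is the "some level of schema support" noted in the pros.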
Binary Formats
Thrift (by Facebook)
Column (field) names do not matter in the serialized data; fields are identified by their numeric IDs.
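Why names do not matter can be illustrated with a toy encoding that writes only a numeric field ID next to each value. This is not Thrift's actual wire format or API; the schemas, the helper functions, and the fixed 10-byte layout are all assumptions made for the illustration.

```python
import struct

# Hypothetical schemas mapping field id -> field name.
# The reader has renamed both fields but kept the same ids.
WRITER_SCHEMA = {1: "user_id", 2: "score"}
READER_SCHEMA = {1: "id", 2: "points"}

FIELD = struct.Struct(">hq")  # 2-byte field id + 8-byte integer value


def encode(record, schema):
    """Write (field id, value) pairs; field names never reach the output."""
    name_to_id = {name: fid for fid, name in schema.items()}
    return b"".join(FIELD.pack(name_to_id[name], value)
                    for name, value in record.items())


def decode(data, schema):
    """Read (field id, value) pairs and label them with the reader's names."""
    record = {}
    for offset in range(0, len(data), FIELD.size):
        fid, value = FIELD.unpack_from(data, offset)
        if fid in schema:  # ids unknown to the reader are skipped
            record[schema[fid]] = value
    return record


payload = encode({"user_id": 7, "score": 42}, WRITER_SCHEMA)
decoded = decode(payload, READER_SCHEMA)
print(decoded)
```

Because only the IDs are stored, renaming a field in the schema does not invalidate data that was already written, as long as the ID stays the same.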
ThriftC (Thrift compiler): generates the Thrift entities; rerunning ThriftC regenerates the whole