I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.
foo,bar,baz,quux
11,"first line, second column",13.0,6
210,"second column of second line",23.1,5
(of course it's longer, and the number of quoted commas is not necessarily one or 0, nor is the text predictable.) The text might also have (escaped) double-quotes within double-quotes, or not have double-quotes altogether for a typically-quoted field. The only assumption we can make is that there are no quoted newlines, so we can split lines trivially using \n.
Now, I'd like to extract a specific column (say, the third one) and print it on standard output, one value per line. I can't simply use commas as field delimiters (and thus, e.g., use cut); rather, I need something more sophisticated. What could that be?
Note: I'm using bash on a Linux system.
Solution
Here is a quick and dirty Python csvcut. Python's csv module already knows about the various CSV dialects, quoting rules and so on, so all you need is a thin wrapper.
The first argument is the one-based index of the field you wish to extract, as in
csvcut 3 sample.csv
which extracts the third column from the (possibly quoted) CSV file sample.csv.
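#!/usr/bin/env python3
import csv
import sys

# Use plain \n line endings; the csv writer defaults to \r\n
writer = csv.writer(sys.stdout, lineterminator='\n')
# The command-line index is one-based; Python indexing is zero-based
col = int(sys.argv[1]) - 1
for filename in sys.argv[2:]:
    # newline='' lets the csv module handle line endings itself
    with open(filename, newline='') as handle:
        for row in csv.reader(handle):
            # writerow() expects a sequence of fields, so wrap the value in a list
            writer.writerow([row[col]])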
To do: error handling, extraction of multiple columns. (Not hard per se; use row[2:5] to extract columns 3, 4, and 5; but I'm too lazy to write a proper command-line argument parser.)
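For what it's worth, the multiple-column case could be sketched roughly as follows. This is only an illustration, not a finished tool: the helper names parse_columns and extract are made up here, the cut-style spec syntax ("1,3-5") is an assumption about what a convenient interface might look like, and the sample data is simply the first two lines from the question.

```python
#!/usr/bin/env python3
import csv
import io


def parse_columns(spec):
    """Turn a cut-style spec such as "1,3-5" into zero-based indices."""
    indices = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        # One-based inclusive range -> zero-based indices
        indices.extend(range(start - 1, end))
    return indices


def extract(handle, indices, out):
    """Copy the selected columns of every CSV row from handle to out."""
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(handle):
        # Skip indices beyond the end of a short row rather than crash
        writer.writerow([row[i] for i in indices if i < len(row)])


# Demo on the first two lines of the sample from the question
sample = 'foo,bar,baz,quux\n11,"first line, second column",13.0,6\n'
result = io.StringIO()
extract(io.StringIO(sample), parse_columns("2-3"), result)
print(result.getvalue(), end="")
```

Note that the output is itself CSV, so a value containing a comma is quoted again on the way out. If you want the bare values, print the fields directly instead of going through csv.writer.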