Shell: parsing commas inside quotes. How to extract a column from a CSV with quoted commas, using the shell?

Latest recommended article published on 2023-07-06 10:46:42

琥蛮

Article tags: shell, parsing commas inside quotes

I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.

foo,bar,baz,quux

11,"first line, second column",13.0,6

210,"second column of second line",23.1,5

(of course it's longer, and the number of quoted commas is not necessarily one or 0, nor is the text predictable.) The text might also have (escaped) double-quotes within double-quotes, or not have double-quotes altogether for a typically-quoted field. The only assumption we can make is that there are no quoted newlines, so we can split lines trivially using \n.

Now, I'd like to extract a specific column (say, the third one) and print it on standard output, one value per line. I can't simply use commas as field delimiters (and thus, e.g., use cut); rather, I need to do something more sophisticated. What could that be?

Note: I'm using bash on a Linux system.

Solution

Here is a quick and dirty Python csvcut. The Python csv library already knows everything about various CSV dialects etc so you just need a thin wrapper.

The first argument should express the index of the field you wish to extract, like

csvcut 3 sample.csv

to extract the third column from the (possibly quoted, etc.) CSV file sample.csv.

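#!/usr/bin/env python3

import csv
import sys

# Use a plain newline so the output works well in shell pipelines
writer = csv.writer(sys.stdout, lineterminator='\n')

# The argument is one-based; Python indexing is zero-based
col = int(sys.argv[1]) - 1

for path in sys.argv[2:]:
    # newline='' lets the csv module handle line endings itself
    with open(path, newline='') as handle:
        for row in csv.reader(handle):
            # writerow expects a sequence of fields, so wrap the value in a list
            writer.writerow([row[col]])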

To do: error handling, extraction of multiple columns. (Not hard per se; use row[2:5] to extract columns 3, 4, and 5; but I'm too lazy to write a proper command-line argument parser.)
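One possible way to approach those to-do items is sketched below. It is only an illustration, not part of the original script: the helper names parse_columns and cut_rows, the "2,4-5" column syntax, and the choice to emit an empty string for a missing field are all assumptions made for this example.

```python
import csv
import io


def parse_columns(spec):
    """Turn a one-based spec such as "3" or "2,4-5" into zero-based indexes.

    The spec syntax is an assumption for this sketch (loosely modelled on cut -f).
    """
    indexes = []
    for part in spec.split(","):
        first, sep, last = part.partition("-")
        start = int(first)  # raises ValueError on non-numeric input
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"bad column range: {part!r}")
        indexes.extend(range(start - 1, end))
    return indexes


def cut_rows(handle, indexes):
    """Yield the selected fields of each CSV row; a missing field becomes ''."""
    for row in csv.reader(handle):
        yield [row[i] if i < len(row) else "" for i in indexes]


# Demo on the sample data from the question, read from memory instead of a file
sample = io.StringIO(
    'foo,bar,baz,quux\n'
    '11,"first line, second column",13.0,6\n'
    '210,"second column of second line",23.1,5\n'
)
for fields in cut_rows(sample, parse_columns("3")):
    print(fields[0])
```

To use this from the command line, the two helpers could be wired to sys.argv in the same way as the script above, catching ValueError (bad column spec) and OSError (unreadable file) and reporting them on standard error.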