How can I extract a column from a CSV with quoted commas, using the shell?

I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.

foo,bar,baz,quux
11,"first line, second column",13.0,6
210,"second column of second line",23.1,5

(Of course it's longer, and the number of quoted commas is not necessarily one or zero, nor is the text predictable.) The text might also have (escaped) double quotes within double quotes, or lack double quotes altogether for a field that is typically quoted. The only assumption we can make is that there are no quoted newlines, so we can split lines trivially using \n.

Now, I'd like to extract a specific column (say, the third one) and print it on standard output, one value per line. I can't simply use commas as field delimiters (and thus, e.g., use cut); rather, I need something more sophisticated. What could that be?

Note: I'm using bash on a Linux system.

Solution

Here is a quick and dirty Python csvcut. The Python csv library already knows about the various CSV dialects, quoting rules, etc., so you just need a thin wrapper.

The first argument should express the index of the field you wish to extract, like

csvcut 3 sample.csv

to extract the third column from the (possibly quoted, etc.) CSV file sample.csv.

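#!/usr/bin/env python3
import csv
import sys

# Use a plain "\n" terminator; the csv module's default is "\r\n"
writer = csv.writer(sys.stdout, lineterminator="\n")
# The field number on the command line is 1-based; Python indexing is zero-based
col = int(sys.argv[1]) - 1
for path in sys.argv[2:]:
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            # writerow expects a sequence of fields, so wrap the single value in a list
            writer.writerow([row[col]])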

To do: error handling, extraction of multiple columns. (Not hard per se; use row[2:5] to extract columns 3, 4, and 5; but I'm too lazy to write a proper command-line argument parser.)
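The to-do items above could be approached roughly as follows. This is only a sketch, not part of the original script: the helper names parse_columns and cut_rows, the column spec syntax ("3" or "1,3-5"), and the choice to pad short rows with empty strings are all assumptions made for illustration. The demo runs against the sample data from the question instead of a file.

```python
import csv
import io
import sys


def parse_columns(spec):
    """Turn a 1-based spec such as '3' or '1,3-5' into zero-based indexes.

    Hypothetical helper; raises ValueError on a malformed spec.
    """
    indexes = []
    for part in spec.split(","):
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"bad column range: {part!r}")
        indexes.extend(range(start - 1, end))
    return indexes


def cut_rows(rows, indexes):
    """Yield the selected fields of each row.

    Assumption: a row that is too short yields '' for the missing fields
    instead of raising IndexError.
    """
    for row in rows:
        yield [row[i] if i < len(row) else "" for i in indexes]


# Sample data from the question, read from memory instead of a file.
sample = io.StringIO(
    'foo,bar,baz,quux\n'
    '11,"first line, second column",13.0,6\n'
    '210,"second column of second line",23.1,5\n'
)
writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerows(cut_rows(csv.reader(sample), parse_columns("2-3")))
```

To use this in the script, pass sys.argv[1] to parse_columns (reporting a ValueError on standard error) and feed each opened file's csv.reader to cut_rows. Since the output goes through csv.writer, a field that contains a comma is quoted again on output, so the result is still valid CSV.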
