Fastest way to convert a tab-delimited file to CSV in Linux

Latest recommended article published on 2024-02-19 20:00:00

文昌读书小苑


I have a tab-delimited file that has over 200 million lines. What's the fastest way in linux to convert this to a csv file? This file does have multiple lines of header information which I'll need to strip out down the road, but the number of lines of header is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.

Just to clarify, there are no embedded tabs in this file.

9 answers

If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.

In the example below, printf emits a literal tab between the two words:

$ printf 'hello\tworld\n' | tr "\\t" ","

hello,world

Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
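For comparison, the blanket byte-for-byte replacement that tr performs can be sketched in Python. This is only an illustration under the same assumption (no embedded tabs or commas that need protecting); the function name tabs_to_commas and the chunk_size parameter are hypothetical, and it is not expected to outperform tr.

```python
import io


def tabs_to_commas(src, dst, chunk_size=1 << 20):
    """Copy src to dst, replacing every tab byte with a comma (like tr)."""
    table = bytes.maketrans(b"\t", b",")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # translate() maps each byte through the table; no parsing happens.
        dst.write(chunk.translate(table))


# Small in-memory demonstration; real use would pass binary file objects,
# e.g. open("input.tsv", "rb") and open("output.csv", "wb").
source = io.BytesIO(b"hello\tworld\none\ttwo\tthree\n")
target = io.BytesIO()
tabs_to_commas(source, target)
print(target.getvalue().decode())
```

Reading in fixed-size chunks keeps memory use flat regardless of how many lines the file has.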

If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:

import sys
import csv

# Read tab-separated rows from stdin and write comma-separated rows to stdout.
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)

for row in tabin:
    commaout.writerow(row)

Run it from a shell as follows:

python script.py < input.tsv > output.csv
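To see why the csv module handles embedded commas where a plain character swap does not, here is a small in-memory sketch of the same reader/writer pairing. The helper name tsv_to_csv_text and the sample data are made up for illustration.

```python
import csv
import io


def tsv_to_csv_text(tsv_text):
    """Convert a TSV string to a CSV string using the csv module's dialects."""
    reader = csv.reader(io.StringIO(tsv_text, newline=""), dialect=csv.excel_tab)
    out = io.StringIO(newline="")
    writer = csv.writer(out, dialect=csv.excel)
    writer.writerows(reader)
    return out.getvalue()


# The second field of the data row contains a comma, so the writer quotes it.
sample = "id\tname\n1\tSmith, John\n"
result = tsv_to_csv_text(sample)
print(result)
```

The excel dialect terminates rows with \r\n, so the output uses CRLF line endings even on Linux.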

perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv

Perl is generally faster at this sort of thing than sed, awk, or Python.

sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile

Damn the critics, quote everything, CSV doesn't care.

<tab> is the actual tab character. \t didn't work for me. In bash, press Ctrl-V and then Tab to enter it.
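The "quote everything" idea can also be sketched with Python's csv module by setting quoting=csv.QUOTE_ALL. Note one difference from the sed command: this sketch doubles embedded quotes rather than backslash-escaping them. The helper name quote_all_fields and the sample row are hypothetical.

```python
import csv
import io


def quote_all_fields(rows):
    """Serialize rows as CSV with every field wrapped in double quotes."""
    out = io.StringIO(newline="")
    # QUOTE_ALL quotes every field; embedded quotes are doubled by default.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# One row with a plain field, a field containing a quote, and an empty field.
text = quote_all_fields([["flop", 'fl"ap', ""]])
print(text)
```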

@ignacio-vazquez-abrams's Python solution is great! For people who need to parse delimiters other than tab, the library actually allows you to set an arbitrary delimiter. Here is my modified version to handle pipe-delimited files:

import sys
import csv

# Read pipe-delimited rows from stdin and write comma-separated rows to stdout.
pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)

for row in pipein:
    commaout.writerow(row)

If you want to convert the whole TSV file into a CSV file:

$ cat data.tsv | tr "\\t" "," > data.csv

If you want to omit some fields:

$ cat data.tsv | cut -f1,2,3 | tr "\\t" "," > data.csv

The above command converts data.tsv into a data.csv file containing only the first three fields.
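The field selection done by cut can be sketched in Python as well. This is a rough equivalent under the assumption that fields contain no embedded tabs; the generator name keep_fields and the sample lines are illustrative only, and indexes are zero-based here whereas cut counts from 1.

```python
import csv
import io


def keep_fields(tsv_lines, indexes):
    """Yield rows containing only the requested zero-based field indexes."""
    # QUOTE_NONE makes the reader treat quote characters as ordinary data.
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield [row[i] for i in indexes if i < len(row)]


# Keep the first three of four fields, like cut -f1,2,3.
lines = ["a\tb\tc\td\n", "1\t2\t3\t4\n"]
out = io.StringIO(newline="")
csv.writer(out, lineterminator="\n").writerows(keep_fields(lines, [0, 1, 2]))
print(out.getvalue())
```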

Assuming you don't want to change the header and you don't have embedded tabs:

$ cat file
header header header
one two three

$ awk 'NR>1{$1=$1}1' OFS="," file
header header header
one,two,three

NR>1 skips the first header line, which is printed unchanged. You mentioned that you know how many header lines there are, so use the correct number for your own case. With this, you do not need to call any other external commands; a single awk command does the job.

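Another way, if you have blank columns and care about preserving them (with the default field separator, awk treats a run of whitespace as a single separator, so empty columns would be dropped by the command above):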

awk 'NR>1{gsub("\t",",")}1' file

Using sed:

sed '2,$y/\t/,/' file #skip 1 line header and translate (same as tr)
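The header handling in the awk and sed commands above (leave the first lines alone, translate the rest) can be sketched in Python as follows. The function name convert_after_header and its header_lines parameter are hypothetical; set header_lines to the known number of header lines in your file.

```python
def convert_after_header(lines, header_lines=1):
    """Yield header lines unchanged and the remaining lines with tabs as commas."""
    for number, line in enumerate(lines, start=1):
        if number <= header_lines:
            yield line  # header: passed through untouched
        else:
            # Plain replacement keeps empty columns, like gsub("\t", ",").
            yield line.replace("\t", ",")


sample = ["header header header\n", "one\ttwo\tthree\n", "a\t\tc\n"]
result = list(convert_after_header(sample, header_lines=1))
print("".join(result), end="")
```

To strip the header instead of keeping it, the first branch could simply skip the line rather than yield it.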

The following awk one-liner supports quoting and quote-escaping (each embedded quote is doubled):

printf "flop\tflap\"" | awk -F '\t' '{ for (i = 1; i <= NF; i++) { gsub(/"/, "\"\"", $i); printf "\"%s\"", $i; if (i < NF) printf "," }; printf "\n" }'

gives

"flop","flap"""

I think it is better not to cat the file, because the extra process and pipe may cause problems with a large file. A better way may be to redirect the file into tr directly:

$ tr ',' '\t' < csvfile.csv > tabdelimitedFile.txt

The command reads csvfile.csv and stores the result, tab-separated, in tabdelimitedFile.txt. Note that this is the reverse of the conversion asked about; for TSV to CSV, swap the two tr arguments.