python怎样删去csv中重复行_如何从CSV文件中删除重复的行?

Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy......

row2 tesg ghh

row2 tesg ghh

row2 tesg ghh

....

row3 tesg ghh

row3 tesg ghh

...

row4 tesg ghh

Expected results:

row1 test tyy......

row2 tesg ghh

....

row3 tesg ghh

...

row4 tesg ghh

Where can I start to accomplish this with PHP?

解决方案

A straight-to-the point method would be to read the file in line-by-line and keep track of each row you've previously seen. If the current row has already been seen, skip it.

Something like the following (untested) code may work:

// array to hold all "seen" lines

$lines = array();

// open the csv file

if (($handle = fopen("test.csv", "r")) !== false) {

// read each line into an array

while (($data = fgetcsv($handle, 8192, ",")) !== false) {

// build a "line" from the parsed data

$line = join(",", $data);

// if the line has been seen, skip it

if (isset($lines[$line])) continue;

// save the line

$lines[$line] = true;

}

fclose($handle);

}

// build the new content-data

$contents = '';

foreach ($lines as $line => $bool) $contents .= $line . "\r\n";

// save it to a new file

file_put_contents("test_unique.csv", $contents);

?>

This code uses fgetcsv() and uses a space comma as your column-delimiter (based on the sample-data in your question comment).

Storing every line that has been seen, as above, will assure to remove all duplicate lines in the file regardless of whether-or-not they're directly following one another or not. If they're always going to be back-to-back, a more simple method (and more memory conscious) would be to store only the last-seen line and then compare against the current one.

UPDATE (duplicate lines via the SKU-column, not full-line)

Based on sample data provided in a comment, the "duplicate lines" aren't actually equal (though they are similar, they differ by a good number of columns). The similarity between them can be linked to a single column, the sku.

The following is an expanded version of the above code. This block will parse the first line (column-list) of the CSV file to determine which column contains the sku code. From there, it will keep a unique list of SKU codes seen and if the current line has a "new" code, it will write that line to the new "unique" file using fputcsv():

// array to hold all unique lines

$lines = array();

// array to hold all unique SKU codes

$skus = array();

// index of the `sku` column

$skuIndex = -1;

// open the "save-file"

if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {

// open the csv file

if (($readHandle = fopen("test.csv", "r")) !== false) {

// read each line into an array

while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {

if ($skuIndex == -1) {

// we need to determine what column the "sku" is; this will identify

// the "unique" rows

foreach ($data as $index => $column) {

if ($column == 'sku') {

$skuIndex = $index;

break;

}

}

if ($skuIndex == -1) {

echo "Couldn't determine the SKU-column.";

die();

}

// write this line to the file

fputcsv($saveHandle, $data);

}

// if the sku has been seen, skip it

if (isset($skus[$data[$skuIndex]])) continue;

$skus[$data[$skuIndex]] = true;

// write this line to the file

fputcsv($saveHandle, $data);

}

fclose($readHandle);

}

fclose($saveHandle);

}

?>

Overall, this method is far-more memory friendly as it doesn't need to save a copy of every line in memory (only the SKU codes).

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值