Merging CSV files with over a million rows:
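import os
import pandas as pd

def find_csv():
    # Find the files in the current folder whose extension is .csv,
    # skipping output.csv so a rerun does not merge the previous result
    path_list = [x for x in os.listdir('.')
                 if os.path.isfile(x) and os.path.splitext(x)[1] == '.csv'
                 and x != 'output.csv']
    return path_list

if __name__ == '__main__':
    csvpath_list = find_csv()
    frames = []
    for csv_file in csvpath_list:
        # 'ANSI' is a Windows-only alias for the system code page (mbcs)
        frames.append(pd.read_csv(csv_file, encoding='ANSI'))
    # Concatenate once instead of growing the DataFrame inside the loop
    data = pd.concat(frames)
    data.to_csv('output.csv', index=False, encoding='utf-8-sig')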
Note: the CSV files must sit directly in the current directory (subfolders are not searched).
Errors from the code block above:
error:
numpy.core._exceptions.MemoryError: Unable to allocate 482. MiB for an array with shape (9, 7016359) and data type int64
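A: Not enough virtual memory. The merged data no longer fits in memory; increase the system's virtual memory or process the files in smaller pieces.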
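One way to sidestep the MemoryError without touching system settings is to never hold the whole result in memory. The sketch below is not the original script: `merge_csv_in_chunks`, its parameters, and the default chunk size are hypothetical names chosen for illustration, and it assumes every input file has the same columns in the same order.

```python
import glob
import os
import pandas as pd

def merge_csv_in_chunks(pattern, out_path, chunksize=100_000, encoding='utf-8'):
    """Stream every CSV matching `pattern` into `out_path`, one chunk at a time.

    Assumption: all input files share the same columns in the same order.
    Returns the number of data rows written.
    """
    out_abs = os.path.abspath(out_path)
    # Build the input list before the output file exists, and never read the output itself.
    paths = [p for p in sorted(glob.glob(pattern)) if os.path.abspath(p) != out_abs]
    rows = 0
    first = True
    # newline='' lets pandas control the line endings itself.
    with open(out_path, 'w', encoding='utf-8-sig', newline='') as out:
        for path in paths:
            for chunk in pd.read_csv(path, chunksize=chunksize, encoding=encoding):
                # Only the very first chunk writes the header row.
                chunk.to_csv(out, header=first, index=False)
                first = False
                rows += len(chunk)
    return rows
```

For the files in this note the call would look like `merge_csv_in_chunks('*.csv', 'output.csv', encoding='ANSI')` on Windows. Peak memory then depends on `chunksize` rather than on the total row count.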
error:
UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1: No map...
A: Remove the Chinese characters from all of the files.
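If deleting the Chinese text is not acceptable, another option is to try more than one encoding when reading. This is only a sketch: `read_csv_with_fallback` and its default encoding list are assumptions, not part of the original script, and a decode that succeeds does not prove the encoding is the right one, so the result is still worth checking by eye.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=('utf-8-sig', 'gb18030'), **kwargs):
    """Try each encoding in order; return (DataFrame, encoding_that_worked)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs), enc
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    if last_error is None:
        raise ValueError('no encodings were given')
    raise last_error
```

Extra keyword arguments such as `sep='\t'` are passed straight through to `pd.read_csv`.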
Reading a TSV file with millions of rows and splitting it into multiple CSV files
import pandas as pd

# Read the file into a DataFrame
data = pd.read_csv("zhuli202401.tsv", encoding='gb18030')

# k: number of CSV files to write; size: rows per file
k = 2
size = 1000000

for i in range(k):
    df = data[size*i:size*(i+1)]
    df.to_csv(f'zhuli{i+1}.csv', index=False)

# Read the pieces back to check them
file1 = pd.read_csv("zhuli1.csv")
print(file1)
print("\n")
file2 = pd.read_csv("zhuli2.csv")
print(file2)
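The code above has one problem:
all of the columns are read as a single column.
Fix it as follows (adding sep is enough; print the data in the run window to see what separates the cells, in my case it showed \t):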
data = pd.read_csv("zhuli202401.tsv",sep='\t',encoding='gb18030')
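The split itself can also be done while reading, so the number of output files does not have to be fixed in advance and the full file is never loaded at once. A minimal sketch, assuming a tab-separated input; `split_tsv` and its parameters are hypothetical names, not the original code:

```python
import pandas as pd

def split_tsv(src, prefix, rows_per_file=1_000_000, encoding='gb18030'):
    """Write `src` out as prefix1.csv, prefix2.csv, ... with at most
    `rows_per_file` data rows each. Returns the list of files written."""
    written = []
    # chunksize makes read_csv return an iterator of DataFrames instead of one DataFrame.
    reader = pd.read_csv(src, sep='\t', encoding=encoding, chunksize=rows_per_file)
    for i, chunk in enumerate(reader, start=1):
        out = f'{prefix}{i}.csv'
        chunk.to_csv(out, index=False)
        written.append(out)
    return written
```

With the file from this note, `split_tsv('zhuli202401.tsv', 'zhuli')` would produce `zhuli1.csv`, `zhuli2.csv`, and so on until the input runs out.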
error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte
A: Add the encoding parameter: data = pd.read_csv("zhuli202401.tsv", encoding='gb18030')
Other
Q:
sys:1: DtypeWarning: Columns (14) have mixed types.Specify dtype option on import or set low_memory=False.
A: Add the low_memory parameter:
data = pd.read_csv('202311.tsv', sep=',', header=None, low_memory=False)
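The warning text also names a second option, specifying the dtype on import. The sketch below uses a tiny in-memory sample rather than the real file, and column 1 stands in for column 14 of the warning; reading the mixed column as text is an assumption about what the data should be.

```python
import io
import pandas as pd

# Tiny stand-in for the real file: column 1 mixes text and numbers.
sample = io.StringIO("1,a\n2,3\n")

# With header=None the columns are labeled 0, 1, 2, ..., so the dtype
# mapping is keyed by column position.
df = pd.read_csv(sample, header=None, dtype={1: str})
values = df[1].tolist()
```

For the file in the warning the mapping would be `dtype={14: str}`, which keeps every value in that column as text instead of letting pandas guess per block of rows.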