Original code
import pandas as pd

# Read the CSV file into a DataFrame
data = pd.read_csv('../data/test.csv')
Problem
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte
Cause
The CSV file being read contains bytes that are not valid UTF-8.
UTF-8 is the default encoding pandas uses when reading files.
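The failure can be reproduced without any file. The following is a minimal sketch, assuming a short byte string in which 0x87 stands in for the offending data; the names `raw` and `error_message` are illustrative only.

```python
# 0x87 has the bit pattern 10000111, which UTF-8 reserves for continuation
# bytes, so it can never begin a character.
raw = b"price\x87"

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # Same exception type that pd.read_csv raises for such a file
    error_message = str(exc)

print(error_message)
```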
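Solution:
Specify a different encoding when calling pd.read_csv().
Common encodings: latin1, iso-8859-1, cp1252, etc. (in Python, latin1 and iso-8859-1 are aliases for the same codec).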
data = pd.read_csv('../data/test.csv', encoding='latin1')
# or
data = pd.read_csv('../data/test.csv', encoding='iso-8859-1')
# or
data = pd.read_csv('../data/test.csv', encoding='cp1252')
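Trying the encodings by hand can be automated. The sketch below uses a hypothetical helper, `first_working_encoding`, that returns the first candidate able to decode the raw bytes. Note that latin1 assigns a character to every byte value, so it never raises; it is placed last as a catch-all. The candidate order is an assumption, not a recommendation from pandas.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None


sample = b"caf\xe9"  # 'café' in cp1252 and latin1, but invalid as UTF-8
chosen = first_working_encoding(sample)
print(chosen)
```

The returned name can be passed as `encoding=` to pd.read_csv(). Decoding without an error does not prove the guess is right, so it is still worth inspecting a few rows for garbled characters.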
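If the file does not use one of the common encodings, detect its encoding first
import chardet
import pandas as pd

# Detect the file encoding
file_path = '../data/test.csv'
with open(file_path, 'rb') as file:
    rawdata = file.read(10000)  # read the first 10,000 bytes of the file
result = chardet.detect(rawdata)
encoding = result['encoding']  # None if chardet cannot make a guess
print(f"Detected encoding: {encoding}")

# Read the file using the detected encoding
data = pd.read_csv(file_path, encoding=encoding)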
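chardet.detect() returns a dictionary that also carries a 'confidence' value, and 'encoding' may be None. One possible way to guard against weak guesses is sketched below; `pick_encoding`, the 0.5 threshold, and the latin1 fallback are all assumptions chosen for illustration.

```python
def pick_encoding(result: dict, default: str = "latin1", min_confidence: float = 0.5) -> str:
    """Choose an encoding from a chardet-style result dict, with a fallback.

    `result` is expected to look like the output of chardet.detect():
    {'encoding': ..., 'confidence': ..., 'language': ...}
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return default  # detection failed or is too uncertain
    return encoding


# Hand-written result dicts standing in for real chardet output
print(pick_encoding({"encoding": "GB2312", "confidence": 0.99, "language": "Chinese"}))
print(pick_encoding({"encoding": None, "confidence": 0.0, "language": ""}))
```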
Convert the file to UTF-8
import chardet
import pandas as pd

# Detect the file encoding
file_path = '../data/test.csv'
with open(file_path, 'rb') as file:
    rawdata = file.read(100000)  # read the first 100,000 bytes of the file
result = chardet.detect(rawdata)
encoding = result['encoding']
print(f"Detected encoding: {encoding}")

# If an encoding was detected
if encoding:
    # Undecodable bytes are replaced with U+FFFD instead of raising an error
    with open(file_path, 'r', encoding=encoding, errors='replace') as file:
        content = file.read()
    # Save the content again as UTF-8
    new_file_path = '../data/test_utf8.csv'
    with open(new_file_path, 'w', encoding='utf-8') as file:
        file.write(content)
    print(f"File has been re-encoded to UTF-8 and saved as '{new_file_path}'.")
    # Read the converted UTF-8 file
    data = pd.read_csv(new_file_path, encoding='utf-8')
    print(data.head())
else:
    print("Failed to detect encoding.")
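The conversion above reads the whole file into memory. For large files, a chunked variant may be preferable. The sketch below is one way to do it, assuming the source encoding is already known; `convert_to_utf8` and `chunk_chars` are hypothetical names, and the demo writes to a temporary directory rather than to ../data.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding, chunk_chars=1 << 16):
    """Re-encode src_path as UTF-8 into dst_path, chunk_chars characters at a time."""
    # newline="" disables newline translation so line endings are preserved
    with open(src_path, "r", encoding=src_encoding, errors="replace", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


# Demo on a small cp1252 file created in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "wb") as f:
        f.write("name\r\ncaf\u00e9\r\n".encode("cp1252"))
    convert_to_utf8(src, dst, "cp1252")
    with open(dst, "rb") as f:
        converted = f.read()
```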
Clean the data file
import pandas as pd

def clean_file_content(file_path, output_path, encoding='utf-8'):
    # Decode the raw bytes, replacing anything that cannot be decoded
    with open(file_path, 'rb') as file:
        content = file.read().decode(encoding, errors='replace')
    # Write the cleaned text back out as UTF-8
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(content)

input_path = '../data/test.csv'
output_path = 'cleaned_file.csv'
clean_file_content(input_path, output_path, encoding='latin1')
# Skip rows that pandas cannot parse
df = pd.read_csv(output_path, on_bad_lines='skip')
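It may help to see what errors='replace' actually does. In the sketch below (sample bytes invented for illustration), decoding as UTF-8 turns the bad byte into the replacement character U+FFFD, whereas decoding as latin1 never triggers a replacement because latin1 maps every byte to a character. This means the call above with encoding='latin1' re-encodes the file but does not remove any bytes.

```python
raw = b"id,name\n1,caf\x87\n"  # 0x87 is not valid UTF-8

# UTF-8: the undecodable byte becomes U+FFFD
as_utf8 = raw.decode("utf-8", errors="replace")

# latin1: every byte is accepted, so 0x87 passes through as U+0087
as_latin1 = raw.decode("latin1", errors="replace")

print(repr(as_utf8))
print(repr(as_latin1))
```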