UnicodeDecodeError

Latest recommended article published 2024-07-12 16:16:27

one优雅的猫


Category: Errors. Tags: python, jupyter

Article link: https://blog.csdn.net/m0_70021830/article/details/139885174


Original code

import pandas as pd

# Read the CSV file into a DataFrame
data = pd.read_csv('../data/test.csv')

The error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte

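Cause

The CSV file being read contains bytes that are not valid UTF-8. In UTF-8, 0x87 can only appear as a continuation byte and never as the first byte of a character, hence "invalid start byte".
UTF-8 is the default encoding pandas uses when reading files.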
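
To see what is actually in the file at the reported position, it can help to look at the raw bytes around it. The following is a minimal sketch using only the standard library; the helper name `find_bad_byte` and the sample bytes are made up for illustration and are not part of pandas.

```python
def find_bad_byte(raw: bytes, encoding: str = "utf-8", context: int = 8):
    """Return (position, nearby bytes) for the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset that appears as "position N" in the message
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.start + context]
    return None


# Hypothetical sample: ten ASCII bytes followed by 0x87, mimicking the
# "byte 0x87 in position 10" message from the error above.
sample = b"id,name\n1," + b"\x87\xd6"
print(find_bad_byte(sample))
```

For a real file, read it in binary mode first (`open(path, 'rb').read()`) and pass the result in. The bytes next to the failing position often hint at the real encoding.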

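Solutions:

Try passing a different encoding when calling pd.read_csv().

Common encodings: latin1, iso-8859-1, cp1252, etc. In Python, latin1 and iso-8859-1 are two names for the same codec. It accepts every byte value, so it never raises a decode error, but it may produce the wrong characters if the file is really in some other encoding.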

data = pd.read_csv('../data/test.csv', encoding='latin1')

# or
data = pd.read_csv('../data/test.csv', encoding='iso-8859-1')

# or
data = pd.read_csv('../data/test.csv', encoding='cp1252')
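
Rather than editing the call by hand for each attempt, the candidates can be tried in a loop. This is a rough sketch, not a pandas feature: the function name `first_working_encoding` and the order of the candidate list are assumptions. latin1 is placed last because it decodes any byte sequence and would otherwise hide the other candidates.

```python
def first_working_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, try the next one
        return name
    return None


# Typical use (path taken from the example above):
# with open('../data/test.csv', 'rb') as fh:
#     encoding = first_working_encoding(fh.read())
```

A successful decode only means the bytes are legal in that encoding, not that the text is correct, so it is still worth checking a few rows by eye after passing the result to `pd.read_csv(..., encoding=encoding)`.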

If the file is not in one of the common encodings, detect the encoding first

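import chardet  # third-party package: pip install chardet
import pandas as pd

# Detect the file encoding
file_path = '../data/test.csv'
with open(file_path, 'rb') as file:
    rawdata = file.read(10000)  # read the first 10,000 bytes of the file
result = chardet.detect(rawdata)
encoding = result['encoding']  # None if detection fails

print(f"Detected encoding: {encoding} (confidence: {result['confidence']})")

# Read the file using the detected encoding
if encoding is None:
    raise ValueError("Could not detect the file encoding")
data = pd.read_csv(file_path, encoding=encoding)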
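
Because detection only looks at the start of the file, the guessed encoding can still fail further in. One way to check before handing the file to pandas is to decode the whole file strictly, chunk by chunk. This is a sketch using the standard `codecs` module; the name `decodes_cleanly` and the chunk size are illustrative choices.

```python
import codecs


def decodes_cleanly(path, encoding, chunk_size=1 << 20):
    """Return True if every byte of the file decodes under `encoding`."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                decoder.decode(chunk)
        # Flush the decoder so a truncated multi-byte sequence at the
        # end of the file is reported as well.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

An encoding name that Python does not recognise raises `LookupError` here instead of returning False, which is usually what you want to see.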

Convert the file to UTF-8

import chardet  # third-party package: pip install chardet
import pandas as pd

# Detect the file encoding
file_path = '../data/test.csv'
with open(file_path, 'rb') as file:
    rawdata = file.read(100000)  # read the first 100,000 bytes of the file
    result = chardet.detect(rawdata)
    encoding = result['encoding']

print(f"Detected encoding: {encoding}")

# If an encoding was detected
if encoding:
    # newline='' keeps the original line endings unchanged;
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising
    with open(file_path, 'r', encoding=encoding, errors='replace', newline='') as file:
        content = file.read()

    # Save the content again as UTF-8
    new_file_path = '../data/test_utf8.csv'
    with open(new_file_path, 'w', encoding='utf-8', newline='') as file:
        file.write(content)

    print(f"File has been re-encoded to UTF-8 and saved as '{new_file_path}'.")

    # Read the converted UTF-8 file
    data = pd.read_csv(new_file_path, encoding='utf-8')
    print(data.head())
else:
    print("Failed to detect encoding.")
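
The conversion above loads the whole file into memory. For large files, the same idea can be sketched as a chunked copy. The function name `convert_to_utf8` and the chunk size are assumptions, and unlike the code above this version decodes strictly, so it raises `UnicodeDecodeError` on bad bytes rather than replacing them.

```python
def convert_to_utf8(src_path, dst_path, src_encoding, chunk_chars=1 << 16):
    """Re-encode src_path as UTF-8 into dst_path, reading in chunks."""
    # newline='' on both sides keeps the original line endings as they are
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # counts decoded characters, not bytes
            if not chunk:
                break
            dst.write(chunk)
```

Failing loudly is a deliberate choice here: a wrong guess for `src_encoding` shows up as an error instead of silently writing replacement characters into the new file.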

Clean the data file

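import pandas as pd

def clean_file_content(file_path, output_path, encoding='utf-8'):
    # Decode the raw bytes, substituting U+FFFD for anything undecodable
    with open(file_path, 'rb') as file:
        content = file.read().decode(encoding, errors='replace')

    # Write the result back out as UTF-8 without altering line endings
    with open(output_path, 'w', encoding='utf-8', newline='') as file:
        file.write(content)

input_path = '../data/test.csv'
output_path = 'cleaned_file.csv'
clean_file_content(input_path, output_path, encoding='latin1')

# Skip rows that cannot be parsed (on_bad_lines requires pandas 1.3 or later)
df = pd.read_csv(output_path, on_bad_lines='skip')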
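
Decoding with errors='replace' hides problems rather than fixing them, so it is worth measuring how much was replaced. A small sketch, where the helper name `count_replacements` is made up for illustration:

```python
def count_replacements(raw: bytes, encoding: str) -> int:
    """Count U+FFFD characters produced by decoding with errors='replace'."""
    text = raw.decode(encoding, errors="replace")
    # Note: this also counts any U+FFFD that was already present in the data
    return text.count("\ufffd")


# Hypothetical sample bytes containing one byte that is invalid in UTF-8
sample = b"ab\x87cd"
print(count_replacements(sample, "utf-8"), count_replacements(sample, "latin1"))
```

With latin1 the count is always zero, because latin1 maps every byte to some character. In the cleaning call above, errors='replace' therefore has no effect, and a wrong choice of latin1 shows up as odd characters in the output rather than as replacement marks.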