read_csv() raises an error

Latest recommended article published on 2024-07-06 20:48:24

山抹微云654

Article link: https://blog.csdn.net/zhaoyin654/article/details/105992245

A note for future reference:
Code: my_data_df = pd.read_csv("./question_data/math_question_information.csv") raises an error.
The error message is:
Error in reading a csv file in pandas[CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file]
Solutions:

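1. First find the line where the error occurs (locate exactly which line of the file triggers it)
import csv

# file_name is the path to the CSV file; opened in text mode as csv.reader requires on Python 3
with open(file_name, newline="") as f:
    reader = csv.reader(f)
    line_number = 1
    try:
        for row in reader:
            line_number += 1
    except Exception as e:
        # Python 3 exceptions have no .message attribute, so format the exception itself
        print("error line: {}, the type of exception: {}, "
              "the message of exception: {}".format(line_number, type(e), e))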
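The loop above only reports a position when csv.reader itself raises, which many malformed files never trigger. As a complementary sketch (not the author's method), the hypothetical helper below flags rows whose field count differs from the header's, which is a common symptom of the same tokenizing errors. The function name find_inconsistent_rows and the default encoding="utf-8" are assumptions made for illustration.

```python
import csv


def find_inconsistent_rows(path, encoding="utf-8"):
    """Return (line_number, field_count) for rows whose field count differs from the header's.

    Hypothetical helper; the utf-8 default is an assumption about the file.
    """
    suspects = []
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            # Empty file: nothing to compare against
            return suspects
        expected = len(header)
        for row in reader:
            if len(row) != expected:
                # reader.line_num is the physical line on which this row ended
                suspects.append((reader.line_num, len(row)))
    return suspects
```

Calling find_inconsistent_rows(file_name) returns an empty list when every row matches the header. Otherwise each tuple gives a line to inspect by hand.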

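2. Check the pandas.read_csv() documentation or source code
    Relevant parameter from the pandas.read_csv() documentation:
        param: 
            lineterminator: str (length 1), optional 
                Character to break file into lines. Only valid with C parser.
    
    In NLP tasks, especially when the data is scraped from web pages, the text is often mixed with HTML tags
    and similar residue such as <\r>. These conflict with the parser, which treats "\r" as a line separator
    when reading. Text that contains regular expressions can cause the same kind of conflict.
    The fix is to set lineterminator="\n" so that lines are split only on the newline character
       
    code:
        df = pandas.read_csv(file_name, lineterminator="\n")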
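To make the conflict concrete, here is a minimal sketch that assumes pandas is installed and uses a made-up in-memory sample (the raw string and the column names id and text are hypothetical). It reads the same data twice, once with the defaults and once with lineterminator="\n", so the two row counts can be compared.

```python
import io

import pandas as pd

# Hypothetical sample: the first record carries a stray "\r" inside its text field
raw = "id,text\n1,hello\rworld\n2,ok\n"

# Default settings: the C parser also accepts a bare "\r" as a line break
df_default = pd.read_csv(io.StringIO(raw))

# Only "\n" ends a line, so the stray "\r" stays inside the field
df_newline = pd.read_csv(io.StringIO(raw), lineterminator="\n")

print(len(df_default), len(df_newline))
print(repr(df_newline.loc[0, "text"]))
```

One caveat: if the file uses Windows "\r\n" line endings, lineterminator="\n" leaves a trailing "\r" on the last column of every row (and on the last column name), which then has to be stripped separately.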

3. Read the file with the Python engine. read_csv() uses the C engine by default; when the file is large,
    switch to the Python engine
    code:
        df = pandas.read_csv(file_name, engine="python")
    
    Note: the Python engine is slower when parsing delimiters
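Since the Python engine is slower, one option is to use it only when the C engine fails. The following is a sketch under that assumption rather than anything the original post prescribes. The helper name read_csv_with_fallback is hypothetical, while pd.errors.ParserError is the exception class pandas raises for tokenizing errors.

```python
import io

import pandas as pd


def read_csv_with_fallback(source, **kwargs):
    """Try the default C engine first; retry with the Python engine on a parser error.

    Hypothetical helper; do not pass engine= in kwargs, since the retry sets it.
    """
    try:
        return pd.read_csv(source, **kwargs)
    except pd.errors.ParserError:
        # File-like objects must be rewound before the second attempt
        if hasattr(source, "seek"):
            source.seek(0)
        return pd.read_csv(source, engine="python", **kwargs)


# Usage with a small, well-formed in-memory sample (the C engine succeeds here)
df = read_csv_with_fallback(io.StringIO("a,b\n1,2\n3,4\n"))
print(df.shape)
```

For a file on disk, pass the path instead of a StringIO object. No rewind is needed in that case.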

All three solutions have been tested and work.