Scrapy: Fixing Garbled CSV Files Generated by CsvItemExporter

Latest recommended article published on 2024-05-02 17:41:07

bladestone

Categories: Data Crawling, Problem Analysis. Tags: python, garbled text, csv, scrapy

Article link: https://blog.csdn.net/blueheart20/article/details/108374817

Environment

Python 3.6.5
Scrapy 2.2

Export logic

from scrapy.exporters import CsvItemExporter

# The exporter writes bytes, so the file is opened in binary mode
self.file = open("/Users/chenjunfeng02/Downloads/enrolldata.csv", "wb")
self.exporter = CsvItemExporter(self.file,
        fields_to_export=["provinceCode", "provinceName", "collegeCode", "collegeName"])
self.exporter.start_exporting()

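The code above exports the data without errors, but when the file is opened in Excel the content is displayed as garbled text.
If you would rather fix this with Excel's own features, see my previous article, "Fixing garbled text in CSV files with Excel", and convert the file manually.
But how can the file be written with the right encoding directly from code?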

Attempt 1: the file open mode

The file is opened with open, so can the encoding be set right there? Let's try:

self.file = open("/Users/chenjunfeng02/Downloads/enrolldata.csv", "w", encoding="utf-8")

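Normally, opening a file in text write mode and setting encoding works fine.
Scrapy's exporters, however, require the file to be opened in binary mode, e.g. "wb" or "w+b". In binary mode open does not accept an encoding argument (it raises ValueError), so this approach is a dead end.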
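To see the restriction in isolation, here is a minimal sketch using only the standard library. The helper name open_binary_with_encoding and the file name demo.csv are made up for this illustration.

```python
import os
import tempfile


def open_binary_with_encoding(path):
    """Try to open `path` in binary mode with an encoding.

    Returns the error text if open() rejects the combination, otherwise None.
    """
    try:
        with open(path, "wb", encoding="utf-8"):
            return None
    except ValueError as exc:
        return str(exc)


with tempfile.TemporaryDirectory() as tmp_dir:
    # demo.csv is a placeholder name used only for this example
    message = open_binary_with_encoding(os.path.join(tmp_dir, "demo.csv"))

print(message)
```

The call fails with a ValueError before any data is written, which is why the encoding has to be chosen somewhere other than in open.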
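Binary mode implies that CsvItemExporter first converts each Item to text and then writes it to the file as a byte stream, so the encoding can be handled during that conversion. The official documentation says: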

class scrapy.exporters.CsvItemExporter(file, include_headers_line=True, join_multivalued=',', **kwargs)
Parameters:
        file – the file-like object to use for exporting the data. Its write method should accept bytes (a disk file opened in binary mode, a io.BytesIO object, etc)
        include_headers_line (str) – If enabled, makes the exporter output a header line with the field names taken from BaseItemExporter.fields_to_export or the first exported item fields.
        join_multivalued – The char (or chars) that will be used for joining multi-valued fields, if found.

There is no suitable parameter here, so let's look at the base class, BaseItemExporter:

class scrapy.exporters.BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0, dont_fail=False)

The encoding parameter is documented as follows:

The encoding that will be used to encode unicode values. 
This only affects unicode values (which are always serialized to str using this encoding).
 Other value types are passed unchanged to the specific serialization library.

The correct approach

self.file = open("/Users/chenjunfeng02/Downloads/enrolldata.csv", "wb")
self.exporter = CsvItemExporter(self.file, encoding='utf-8-sig',
        fields_to_export=["provinceCode", "provinceName", "collegeCode", "collegeName"])
self.exporter.start_exporting()
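The effect of the encoding argument can be checked without Scrapy. The sketch below is a standard-library stand-in, not Scrapy's actual implementation: it assumes an exporter that wraps a binary file in a text layer with the chosen encoding. The function export_rows and the sample rows are made up for illustration.

```python
import codecs
import csv
import io


def export_rows(rows, encoding):
    """Write rows as CSV into an in-memory binary buffer and return the bytes."""
    buffer = io.BytesIO()  # stands in for a file opened with "wb"
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="", write_through=True)
    csv.writer(stream).writerows(rows)
    stream.flush()
    data = buffer.getvalue()
    stream.detach()  # leave the buffer open instead of closing it with the wrapper
    return data


rows = [["provinceCode", "provinceName"], ["11", "北京"]]  # made-up sample rows
plain = export_rows(rows, "utf-8")
signed = export_rows(rows, "utf-8-sig")

# Only the utf-8-sig output starts with the three BOM bytes EF BB BF
print(plain[:3] == codecs.BOM_UTF8)
print(signed[:3] == codecs.BOM_UTF8)
```

Apart from the three leading BOM bytes, the two outputs are identical, so switching to utf-8-sig does not change the CSV data itself.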

Notes on the encodings

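"utf-8" uses the byte as its code unit, so its byte order is the same on every system and there is no endianness problem. It therefore does not need a BOM, and when a file that carries a BOM is read with the "utf-8" codec, the BOM is treated as part of the content, which leads to errors like the FileNotFoundError shown below.
In "utf-8-sig", sig stands for signature, i.e. "UTF-8 with a signature". When a UTF-8 file with a BOM is read with "utf-8-sig", the BOM is handled separately and kept out of the text, which is what we want. When writing, "utf-8-sig" puts the BOM at the start of the output.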
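The difference between the two codecs can be shown on a short string. This is a small standard-library sketch; the sample text is arbitrary.

```python
import codecs

text = "collegeName"  # arbitrary sample text

# Encoding: utf-8-sig prepends the BOM, utf-8 does not
plain_bytes = text.encode("utf-8")
signed_bytes = text.encode("utf-8-sig")

# Decoding BOM-prefixed bytes: utf-8 keeps the BOM as a character, utf-8-sig drops it
decoded_with_utf8 = signed_bytes.decode("utf-8")
decoded_with_sig = signed_bytes.decode("utf-8-sig")

# utf-8-sig also accepts input that has no BOM
no_bom_via_sig = plain_bytes.decode("utf-8-sig")

print(repr(decoded_with_utf8))
print(repr(decoded_with_sig))
```

Because utf-8-sig reads input with or without a BOM, it is a safe choice for reading when the origin of a file is unknown.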

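In my tests, writing with utf-8 still produced garbled text in Excel; utf-8-sig is the encoding that works. Excel uses the BOM to recognize a CSV file as UTF-8; without it, Excel typically falls back to the system's legacy code page.
A common error related to the BOM:
FileNotFoundError: [Errno 2] No such file or directory: '\ufeffA.txt'
Here \ufeff is the signature (the BOM), which ended up at the start of the file name because the text it came from was read without utf-8-sig.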
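One way such a file name can arise is sketched below, under the assumption that the name was read from a text file saved with a BOM. The list file names.txt and the helper read_first_line are hypothetical.

```python
import os
import tempfile


def read_first_line(path, encoding):
    """Return the first line of a text file, without the trailing newline."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.readline().rstrip("\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = os.path.join(tmp_dir, "names.txt")  # hypothetical list of file names
    with open(list_path, "w", encoding="utf-8-sig") as handle:
        handle.write("A.txt\n")

    name_utf8 = read_first_line(list_path, "utf-8")     # BOM stays in the string
    name_sig = read_first_line(list_path, "utf-8-sig")  # BOM is removed

print(repr(name_utf8))
print(repr(name_sig))
```

Passing the first value to open would look for a file whose name begins with \ufeff, which produces the FileNotFoundError above; reading the list with utf-8-sig avoids it.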