Using Scrapy to crawl data and save it as JSON and CSV files, and fixing garbled text

Pinned post by 紫蓝清秋

Categories: Python, scrapy. Tags: python, csv, json

Article link: https://blog.csdn.net/sqitarn/article/details/107022683

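Create a Scrapy project

If you have not installed Scrapy yet, see docs.scrapy.org

scrapy startproject tutorial

After the command succeeds, a tutorial folder is generated in the current directory, with the following structure: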

[Screenshot: directory structure of the generated tutorial project]
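For reference, a project generated by `scrapy startproject tutorial` usually has the layout sketched below. The helper `expected_layout` is a hypothetical name introduced only for illustration, and the file list reflects Scrapy's default project template as I understand it, so it may differ slightly between versions.

```python
from pathlib import PurePosixPath


def expected_layout(project):
    """Return the files the default Scrapy template typically creates.

    Assumption: this mirrors the default template; versions may differ.
    """
    root = PurePosixPath(project)   # outer folder created by startproject
    package = root / project        # inner Python package with the same name
    paths = [
        root / "scrapy.cfg",
        package / "__init__.py",
        package / "items.py",
        package / "middlewares.py",
        package / "pipelines.py",
        package / "settings.py",
        package / "spiders" / "__init__.py",
    ]
    return [str(p) for p in paths]


for path in expected_layout("tutorial"):
    print(path)
```

The files edited in the rest of this post (items.py, pipelines.py, settings.py and the spiders folder) all live in the inner package.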

Define the data you want to scrape in items.py:

items.py

import scrapy

# Define the fields of each scraped item
class ListItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
    order = scrapy.Field()
    score = scrapy.Field()

Create a myspider.py file in the spiders directory.

[Screenshot: the chinaz site ranking page]

myspider.py

import scrapy
from tutorial.items import ListItem

class ZhanzSpider(scrapy.Spider):

    # Spider name, used with "scrapy crawl"
    name = 'wangzhan'
    # Domains the spider is allowed to crawl
    allowed_domains = ['chinaz.com']
    # Start URL: the chinaz site ranking, for learning purposes only
    start_urls = [
        'https://top.chinaz.com/all/',
    ]

    def parse(self, response):
        # Loop over each entry and map it to an Item
        for sel in response.css('.listCentent li'):
            item = ListItem()
            # Extract each required field
            item['order'] = sel.css('.RtCRateCent strong::text').get()
            item['name'] = sel.css('.rightTxtHead a::text').get()
            item['url'] = sel.css('.col-gray::text').get()
            item['score'] = sel.css('.RtCRateCent span::text').get()
            print(item)
            yield item

        # Get the link to the next page
        next_page = response.xpath('//*[@id="content"]/div[3]/div[3]/div/div[2]/a[11]/@href').get()

        # If there is a next page, follow it and keep crawling
        if next_page is not None:
            yield response.follow(next_page, self.parse)

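When using XPath, you can get an element's XPath from the browser's developer tools (F12): right-click the element and choose the option that copies its XPath (in Chrome, Copy > Copy XPath).
![Copying an XPath from the browser developer tools](https://img-blog.csdnimg.cn/20200629163824997.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NxaXRhcm4=,size_16,color_FFFFFF,t_70)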
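A positional XPath copied from the developer tools, such as one ending in `a[11]`, tends to break when the page layout changes. The sketch below shows the idea of selecting the next-page link by its text instead. It uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet rather than Scrapy's selectors; the markup, the `>` link text and the helper name `find_next_href` are all assumptions for illustration, not the real structure of the chinaz page.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; the real page may look different.
SAMPLE = """
<div class="ListPageWrap">
  <a href="/all/index.html">1</a>
  <a href="/all/index_2.html">2</a>
  <a href="/all/index_2.html">&gt;</a>
</div>
"""


def find_next_href(markup, next_text=">"):
    """Return the href of the first link whose text equals next_text, or None."""
    root = ET.fromstring(markup.strip())
    for link in root.iter("a"):
        if (link.text or "").strip() == next_text:
            return link.get("href")
    return None


print(find_next_href(SAMPLE))
```

In a spider the same idea would be expressed as a text-based or class-based selector passed to `response.xpath` or `response.css`, once the real link text or class has been confirmed in the browser.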

Configure pipelines.py to process the scraped data

Save the data to a JSON file or a CSV file

pipelines.py

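import json, os, csv

# Save items to a JSON file
class JsonPipeline(object):
    def __init__(self):
        # Location of the output file
        store_file = os.path.join(os.path.dirname(__file__), 'spiders', 'wangzhan.json')
        # Open the file in text mode with UTF-8 encoding
        self.file = open(store_file, mode='w', encoding='utf-8')
        # Start a JSON array so the finished file is valid JSON
        self.file.write('[\n')
        self.first = True

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False)
        # Write one item per line, separated by commas (no trailing comma)
        self.file.write(line if self.first else ',\n' + line)
        self.first = False
        return item

    def close_spider(self, spider):
        # Close the JSON array, then the file
        self.file.write('\n]\n')
        self.file.close()

# Save items to a CSV file
class Pipiline_ToCSV(object):
    def __init__(self):
        # Location of the output file
        store_file = os.path.join(os.path.dirname(__file__), 'spiders', 'wangzhan.csv')
        # Open the file with UTF-8 encoding; newline='' lets the csv module control line endings
        self.file = open(store_file, mode='w', encoding='utf-8', newline='')

        # CSV writer bound to the file
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        line = (item['name'], item['order'], item['url'], item['score'])
        # Write one row per item
        self.writer.writerow(line)
        return item

    def close_spider(self, spider):
        self.file.close()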
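If the output does not need to be a single JSON document, one alternative is the JSON Lines layout: one JSON object per line with no separators between records, so the file can be appended to and read back line by line. The sketch below uses only the standard library; the function names, the `.jl` file name and the sample records are made up for illustration.

```python
import json
import os
import tempfile


def write_json_lines(path, records):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample records, not real scraped data.
sample = [{"name": "示例网站一", "order": "1"}, {"name": "示例网站二", "order": "2"}]
path = os.path.join(tempfile.mkdtemp(), "wangzhan.jl")
write_json_lines(path, sample)
print(read_json_lines(path) == sample)
```

Each line parses on its own, so a crawl that stops halfway still leaves a readable file.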

Configure settings.py so that the pipeline processes the scraped items

settings.py

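# To store as CSV, enable Pipiline_ToCSV
#ITEM_PIPELINES = {
#    'tutorial.pipelines.Pipiline_ToCSV' : 300,
#}

# To store as JSON, enable JsonPipeline
ITEM_PIPELINES = {
    'tutorial.pipelines.JsonPipeline' : 300,
}
# Encoding for Scrapy's feed exports (the -o option): with UTF-8, Chinese text in exported JSON
# is written as-is rather than as \uXXXX escapes. In the custom pipeline, ensure_ascii=False does this job.
FEED_EXPORT_ENCODING = 'UTF-8'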
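The "garbled" Chinese in a JSON file is often not a wrong encoding but escaped text. This can be seen with the standard library alone: by default `json.dumps` escapes non-ASCII characters as `\uXXXX`, while `ensure_ascii=False` writes the characters themselves. The sample value below is made up.

```python
import json

# Made-up sample record.
record = {"name": "示例网站"}

escaped = json.dumps(record)                       # non-ASCII characters escaped
readable = json.dumps(record, ensure_ascii=False)  # characters written as-is

print(escaped)
print(readable)

# Both strings decode to the same data; only the on-disk appearance differs.
print(json.loads(escaped) == json.loads(readable) == record)
```

Either form is valid JSON, so this choice only matters when the file is meant to be read by a person.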

Run the spider

scrapy crawl wangzhan

After the run succeeds, wangzhan.json or wangzhan.csv is generated in the spiders directory

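[Screenshot: the spiders directory containing the output files]
I switched the pipeline in settings.py between runs, so both files are present here.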

[Screenshots: contents of the JSON file and of the CSV file]

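A reminder: when the saved CSV file is opened in Excel, the Chinese text is garbled, so the file's encoding needs to be changed.

Right-click the file and open it with Notepad, choose File > Save As, keep the file name, set the encoding to ANSI, click Save and confirm the replacement. When you reopen wangzhan.csv, the Chinese text displays correctly.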
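An alternative to re-saving the file by hand is to write the CSV with a byte order mark, which Excel generally uses to detect UTF-8. Python's `utf-8-sig` codec adds that mark automatically. This is a standalone sketch using the standard library, not the pipeline from this post; the helper name and the sample row are made up.

```python
import csv
import os
import tempfile

# Made-up sample row in the order name, order, url, score.
rows = [("示例网站", "1", "example.com", "100")]


def write_csv_for_excel(path, rows):
    """Write rows as UTF-8 with a leading BOM so Excel can detect the encoding."""
    # newline="" lets the csv module control line endings
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "wangzhan.csv")
write_csv_for_excel(path, rows)

# The file starts with the three-byte UTF-8 BOM.
with open(path, "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")
```

In the CSV pipeline this would mean passing `encoding='utf-8-sig'` when opening the file. Unlike ANSI, the file stays UTF-8, so characters outside the local code page are not lost.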

You can also skip configuring pipelines and settings

# Append -o <file name> when running the spider
scrapy crawl wangzhan -o wangzhan.json
scrapy crawl wangzhan -o wangzhan.csv
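As far as I know, recent Scrapy versions also let you declare the same exports in settings.py through the `FEEDS` setting, instead of passing `-o` on every run. The fragment below is a sketch that assumes Scrapy 2.1 or later; the file names are the ones used above, and the `utf-8-sig` encoding for the CSV is my assumption, meant to help Excel detect UTF-8.

```python
# settings.py fragment (assumes Scrapy 2.1+, where the FEEDS setting is available)
FEEDS = {
    # JSON export; utf8 keeps Chinese text readable instead of \uXXXX escapes
    "wangzhan.json": {
        "format": "json",
        "encoding": "utf8",
    },
    # CSV export; utf-8-sig adds a BOM (assumption: this helps Excel detect UTF-8)
    "wangzhan.csv": {
        "format": "csv",
        "encoding": "utf-8-sig",
    },
}
```

With this in place, `scrapy crawl wangzhan` alone should write both files, and no custom pipeline is needed.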

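That is how to use Scrapy to crawl the chinaz site ranking. It is not very hard; the main thing is finding the right CSS or XPath selectors to extract the data. I hope this helps. I ran into many errors along the way and fixed them after looking things up; if anything is still wrong, corrections are welcome!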
