Scrapy practice

梦想不能在远方

Category: python. Tags: scrapy, python3, csdn, json, dict

Article link: https://blog.csdn.net/hutiewei2008/article/details/121344762

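Scrapy version:
I first learned Scrapy two or three years ago, installing it under Python 2; the Scrapy version at the time was 1.7.3.
The installation guide now requires Python 3.6+, and Scrapy has reached version 2.5.1. See the installation guide on GitHub.
Naturally I now install the newer Scrapy. I won't repeat the installation steps here; this post mainly records the process of developing with Scrapy, for later reference. Everything here uses Python 3.

My machine has both a Python 2 and a Python 3 environment, so I need to prefix each command with python3 -m.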

Creating a new Scrapy project

python3 -m scrapy startproject spidertest

The generated project has the following directory structure:
[Image: directory tree of the spidertest project]

cd spidertest
python3 -m scrapy genspider test mp.csdn.net/mp_blog/manage/article?spm=1035.2022.3001.5448

This creates test.py under /spidertest/spidertest/spiders.
The logic for locating specific elements in the page and processing them goes in this file.

Program code

# -*- coding: utf-8 -*-
import scrapy
import json


class TestSpider(scrapy.Spider):
    name = 'test'
    # Domains the spider is allowed to crawl
    allowed_domains = ['mp.csdn.net', 'bizapi.csdn.net']
    # URL(s) the crawl starts from
    start_urls = ['http://mp.csdn.net/mp_blog/manage/article?spm=1035.2022.3001.5448/']
    headers = {'Connection': 'keep-alive',
               # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
               #               'Chrome/96.0.4664.45 Safari/537.36',
               }

    def parse(self, response):
        url = 'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20' \
              '&businessType=blog&orderby=&noMore=false&username=hutiewei2008'
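        # headers: request headers; different headers can be set as needed
        # callback: the method that handles the response
        # dont_filter=True skips the duplicate filter and the allowed_domains check
        req_base = scrapy.Request(url, headers=self.headers,
                                  meta={}, callback=self.page_list, dont_filter=True)
        # Yield the request so that Scrapy schedules it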
        yield req_base

    def page_list(self, response):
        # Parse the JSON text of the response into a Python dict
        pagedict = json.loads(response.text)
        # Pull the article list out of the dict
        pagedata = pagedict['data']
        pagelist = pagedata['list']
        # Loop over the article entries (each one is a dict)
        for everypage in pagelist:
            url = everypage['url']
            print(url)
            title = everypage['title']
            view_count = everypage['viewCount']
            # Pass the title and view count to the next callback through meta
            req_base = scrapy.Request(url, headers=self.headers, meta={'title': title, 'viewCount': view_count},
                                      callback=self.read_page, dont_filter=True)
            yield req_base
        # Append a line break; the with block closes the file automatically
        with open('test.txt', 'a') as f:
            f.write('\r\n')

    @staticmethod
    def read_page(response):
        # Values passed in from the previous request through meta
        title = response.meta['title']
        # title is a str, while view_count is an int
        view_count = response.meta['viewCount']
        # Append the record to the file; the with block closes it automatically
        with open('test.txt', 'a', encoding='utf-8') as f:
            f.write(title + str(view_count) + '\t')
        print(title, view_count)
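
The list URL in parse is written out by hand. As a rough sketch, the same query string can be assembled with urllib.parse.urlencode, which makes it easier to vary the page number. The build_list_url helper and API_BASE constant are hypothetical names introduced here, and whether the endpoint actually serves later pages this way is an assumption.

```python
from urllib.parse import urlencode

# Endpoint taken from the spider above.
API_BASE = "https://blog.csdn.net/community/home-api/v1/get-business-list"


def build_list_url(username, page=1, size=20):
    """Build the article-list URL (hypothetical helper, not part of Scrapy)."""
    # Dict order is preserved, so the parameters come out in this order.
    params = {
        "page": page,
        "size": size,
        "businessType": "blog",
        "orderby": "",
        "noMore": "false",
        "username": username,
    }
    return API_BASE + "?" + urlencode(params)


first_page = build_list_url("hutiewei2008")
# Assumption: later pages are requested by changing only the page parameter.
second_page = build_list_url("hutiewei2008", page=2)
print(first_page)
print(second_page)
```

With the default arguments the helper reproduces the URL used in parse character for character.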

Running the program

python3 -m scrapy list    # list the spiders in the project
python3 -m scrapy crawl test

Running the spider produced the error below, and the crawl did not continue.
Error 1:
Forbidden by robots.txt
In settings.py, change
ROBOTSTXT_OBEY = True
to ROBOTSTXT_OBEY = False
and run again:
python3 -m scrapy crawl test
The program now runs successfully.

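Error 2:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This error appears with Scrapy 1.7.3; Scrapy 2.5.1, run through python3 -m, does not raise it. It is a string encoding problem, which I previously worked around by converting the encoding explicitly.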
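
As a minimal sketch of what this error means, the snippet below reproduces it in Python 3 by forcing a non-ASCII string through the ASCII codec, then encodes the same string as UTF-8 instead. The sample string is made up for illustration, and the older setup that originally triggered the error is not reproduced here.

```python
# Hypothetical sample text: three non-ASCII characters.
text = "爬虫名"

try:
    # The ASCII codec cannot represent these characters.
    text.encode("ascii")
    error_message = ""
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding explicitly as UTF-8 works for any Unicode text.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

print(error_message)
print(len(encoded), decoded == text)
```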

Notes on the program

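headers: sets the request headers; different headers can be used as needed.

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
It should be possible to choose the User-Agent at random, but I no longer remember how.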
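
One possible way to pick the User-Agent at random is sketched below with random.choice from the standard library. USER_AGENTS and random_headers are hypothetical names, not part of Scrapy; the first entry is the User-Agent string from this post and the second is an illustrative placeholder. The returned dict could then be passed as headers= to scrapy.Request.

```python
import random

# Candidate User-Agent strings; the second one is only an illustrative placeholder.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
]


def random_headers(base_headers=None):
    """Return a copy of base_headers with a randomly chosen User-Agent."""
    headers = dict(base_headers or {})
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers


headers = random_headers({"Connection": "keep-alive"})
print(headers["User-Agent"])
```

Calling random_headers once per request gives each request its own choice, whereas a class-level headers dict is fixed for the whole crawl.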

Converting JSON to a dict

import json
pagedict = json.loads(response.text)
The page returns its payload as JSON; json.loads converts the JSON string into a Python dict.
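
A small standalone sketch of the conversion follows. The sample payload only imitates the data/list layout the spider relies on; its values (including the code field) are invented, and the real response contains more fields.

```python
import json

# Invented sample that imitates the shape of the real response.
sample_text = (
    '{"code": 200, "data": {"list": ['
    '{"title": "scrapy practice", "viewCount": 754, "url": "https://example.com/a"}'
    "]}}"
)

pagedict = json.loads(sample_text)
print(type(pagedict).__name__)
print(pagedict["data"]["list"][0]["title"])

# A body that is not JSON (for example an HTML error page) raises JSONDecodeError.
try:
    json.loads("<html>not json</html>")
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)
```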

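Reading data from the dict

pagedata = pagedict['data']
This reads the value stored under the data key directly.
The repeated entries in the list can be read one by one with for everypage in pagelist.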
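
Indexing with ['data'] raises KeyError when a key is missing. As a hedged alternative, the sketch below walks the same structure with dict.get and default values. The iter_articles helper and the sample data are hypothetical, and the defaults chosen here (empty title, zero views) are just one option.

```python
# Invented sample; the second entry deliberately lacks viewCount.
pagedict = {
    "data": {
        "list": [
            {"title": "first", "viewCount": 10, "url": "https://example.com/1"},
            {"title": "second", "url": "https://example.com/2"},
        ]
    }
}


def iter_articles(pagedict):
    """Yield (title, view_count, url) tuples, tolerating missing keys."""
    pagedata = pagedict.get("data") or {}
    pagelist = pagedata.get("list") or []
    for everypage in pagelist:
        yield (
            everypage.get("title", ""),
            everypage.get("viewCount", 0),
            everypage.get("url"),
        )


articles = list(iter_articles(pagedict))
print(articles)
```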

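Passing parameters in Scrapy

Use the meta argument to pass values.
Passing values in the originating function: meta={'title': title, 'viewCount': view_count}
Reading them back:
title = response.meta['title']
view_count = response.meta['viewCount']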

Writing to a file

    with open('test.txt', 'a', encoding='utf-8') as f:
        f.write(title + str(view_count) + '\t')
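
The snippet above puts the title and view count side by side with no separator. As a sketch of a slightly easier format to read back, the example below writes one record per line with a tab between the fields and an explicit UTF-8 encoding. The append_record helper, the temporary output path, and the sample records are assumptions made for illustration.

```python
import os
import tempfile


def append_record(path, title, view_count):
    """Append one tab-separated record per line (hypothetical helper)."""
    # The with block closes the file even if write() raises.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{title}\t{view_count}\n")


# Write into a temporary directory so the example leaves the project untouched.
out_path = os.path.join(tempfile.mkdtemp(), "test.txt")
append_record(out_path, "scrapy 练习", 754)
append_record(out_path, "second post", 12)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```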