I. Introduction to Scrapy
What is Scrapy?
Scrapy is an application framework written for crawling websites and extracting structured data. We only need to implement a small amount of code to start crawling quickly. Scrapy is built on the Twisted asynchronous networking framework, which greatly speeds up downloads.
Official Scrapy documentation
Scrapy at a glance — Scrapy 1.0.5 documentation: http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/overview.html
The difference between asynchronous and non-blocking
Asynchronous: after a call is issued, it returns immediately, whether or not a result is available yet
Non-blocking: describes the state of the program while it waits for the result of a call; until the result can be obtained, the call does not block the current thread
II. Scrapy Workflow
III. Getting Started with Scrapy
1. Create a Scrapy project
scrapy startproject mySpider
cd mySpider
2. Generate a spider
scrapy genspider demo "demo.cn"
3. Extract data
Flesh out the spider, using XPath and similar selectors to pull out the fields you need (see the spider sketch at the end of this section)
4. Save the data
Save the data in a pipeline
Run the spider from the command line
scrapy crawl qb  # qb is the spider's name
Run the spider from PyCharm
from scrapy import cmdline
cmdline.execute("scrapy crawl qb".split())
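Putting steps 2–4 together, here is a minimal sketch of what the generated demo spider might look like after filling in step 3; the page structure and XPath expressions below are placeholder assumptions, not a real site:
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['demo.cn']
    start_urls = ['http://demo.cn/']

    def parse(self, response):
        # extract data with XPath; the selectors are placeholders
        for li in response.xpath('//ul[@class="list"]/li'):
            item = {}
            item['title'] = li.xpath('./a/text()').get()
            item['link'] = li.xpath('./a/@href').get()
            yield item  # yielded items are handed to the pipelines for saving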
IV. Using Pipelines
From the dict form of the ITEM_PIPELINES setting you can see that there can be more than one pipeline, and indeed multiple pipelines can be defined (a minimal sketch follows the notes below).
Why multiple pipelines may be needed:
1. There may be multiple spiders, and different pipelines handle the items of different spiders
2. The items of one spider may need different operations, such as being stored in different databases
Note:
1. The smaller a pipeline's weight value, the higher its priority
2. The process_item method of a pipeline must not be renamed to anything else
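A minimal sketch of a pipelines.py with two pipelines and how they could be enabled in settings.py; the class names, the project name mySpider, and the cleaning logic are illustrative assumptions:
# pipelines.py
class CleanPipeline:
    def process_item(self, item, spider):
        # runs first because it has the lower weight below
        item['title'] = (item.get('title') or '').strip()
        return item  # must return the item so the next pipeline receives it

class SavePipeline:
    def process_item(self, item, spider):
        if spider.name == 'demo':  # different spiders can be handled differently
            print('saving:', item)
        return item

# settings.py
ITEM_PIPELINES = {
    'mySpider.pipelines.CleanPipeline': 300,
    'mySpider.pipelines.SavePipeline': 400,
}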
How to paginate
Key points of scrapy.Request
scrapy.Request(url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, flags=None)
Commonly used parameters (a pagination sketch follows this list):
callback: specifies which parse function the response for this URL is handed to
meta: passes data between different parse functions; by default meta also carries some information of its own, such as the download delay and request depth
dont_filter: tells Scrapy's deduplication not to filter this URL; Scrapy deduplicates URLs by default, so this matters for URLs that need to be requested repeatedly
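A small sketch of pagination inside a spider's parse method; the list XPath and the "next page" link XPath are assumptions about the target page:
def parse(self, response):
    for li in response.xpath('//ul[@class="list"]/li'):
        yield {'title': li.xpath('./a/text()').get()}
    # find the "next page" link and request it with the same callback
    next_url = response.xpath('//a[contains(text(), "下一页")]/@href').get()
    if next_url:
        yield scrapy.Request(
            url=response.urljoin(next_url),
            callback=self.parse,                               # which parse function handles the next page
            meta={'page': response.meta.get('page', 1) + 1},   # pass data to the next callback
        )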
V. Using Items
1. First define the fields you want to use in the items module
items.py
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
2. Import the class from the items module in the spider file
from scrapy.http.response.html import HtmlResponse
# the name '21' makes this package path invalid -- first option: absolute import
# from 21.MySpider.MySpider.items import MyspiderItem
# second option: import relative to the project root you configured
# from MySpider.items import MyspiderItem
3. Instantiate the item; once the object is created, the fields defined on the item can be used
item = MyspiderItem()
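For example, using the TencentItem defined in items.py above, a spider callback might fill the fields and yield the item like this; the table XPath expressions are placeholder assumptions:
def parse(self, response):
    for tr in response.xpath('//table[@class="tablelist"]//tr')[1:]:
        item = TencentItem()
        item['title'] = tr.xpath('./td[1]/a/text()').get()
        item['position'] = tr.xpath('./td[2]/text()').get()
        item['date'] = tr.xpath('./td[5]/text()').get()
        yield item  # delivered to the pipelines, just like a plain dict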
VI. Scrapy settings: Explanation and Configuration
Why a configuration file is needed:
The configuration file holds shared variables (such as the database address, account, and password)
It makes them easy for you and others to modify
Variable names are conventionally written in all uppercase, e.g. SQL_HOST = '192.168.0.1'
Detailed notes on the settings file: Scrapy学习篇(八)之settings - cnkai - 博客园 https://www.cnblogs.com/cnkai/p/7399573.html
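A small sketch of defining a shared variable in settings.py and reading it back inside a spider; the variable names here are illustrative assumptions:
# settings.py
SQL_HOST = '192.168.0.1'
SQL_USER = 'root'

# inside a spider (a pipeline can read the same values via spider.settings)
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse(self, response):
        host = self.settings.get('SQL_HOST')  # read the shared variable
        self.logger.info('connecting to %s', host)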
VII. Scrapy CrawlSpider
In the code so far, we spend a large part of our time hunting for the URL of the next page or the URLs of the detail pages. Can this process be made simpler?
Approach:
1. Extract the URLs of all the matching link tags from the response
2. Automatically build the requests and send them to the engine
Goal: learn how to use CrawlSpider through a spider example
Command to generate a CrawlSpider: scrapy genspider -t crawl <spider_name> <domain>
The LinkExtractor link extractor
With LinkExtractor the programmer no longer has to extract the desired URLs and send the requests manually. That work is handed to the LinkExtractor, which finds the URLs matching its rules on every crawled page, so they are crawled automatically.
class scrapy.linkextractors.LinkExtractor(
allow = (),
deny = (),
allow_domains = (),
deny_domains = (),
deny_extensions = None,
restrict_xpaths = (),
tags = ('a','area'),
attrs = ('href',),
canonicalize = True,
unique = True,
process_value = None
)
Main parameters:
- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: forbidden URLs. Any URL matching this regular expression is not extracted.
- allow_domains: allowed domains. Only URLs on the domains listed here are extracted.
- deny_domains: forbidden domains. URLs on the domains listed here are never extracted.
- restrict_xpaths: restrict extraction to these XPaths; used together with allow to filter links.
The Rule class
The class that defines the spider's crawling rules
class scrapy.spiders.Rule(
link_extractor,
callback = None,
cb_kwargs = None,
follow = None,
process_links = None,
process_request = None
)
Main parameters:
- link_extractor: a LinkExtractor object that defines the crawling rule.
- callback: the callback function to run for URLs that match this rule. Because CrawlSpider itself uses parse internally, do not override parse or use it as your own callback.
- follow: whether links extracted from responses matched by this rule should themselves be followed.
- process_links: the links obtained from link_extractor are passed to this function, which can be used to filter out links you do not want to crawl.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class YgSpider(CrawlSpider):
    name = 'yg'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']
    rules = (
        Rule(LinkExtractor(allow=r'wz.sun0769.com/html/question/201811/\d+\.shtml'), callback='parse_item'),
        Rule(LinkExtractor(allow=r'http:\/\/wz.sun0769.com/index.php/question/questionType\?type=4&page=\d+'), follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['content'] = response.xpath('//div[@class="c1 text14_2"]//text()').extract()
        print(item)
VIII. Simulating Login with Scrapy
1. Send the request to the target URL directly, carrying cookies
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/1097566154']

    # carry the cookies with the first request
    def start_requests(self):
        cookies = '_ga=GA1.2.1725491264.1617184478; pgv_pvid=5477592008; RK=364ATneaHc; ptcz=0a313f2bdd6331665eddc0e722991f9859878a5008817e4a960d5b4ab99357f7; tvfe_boss_uuid=18d15aea2302cf32; iip=0; pac_uid=0_fc2f75bbcf77a; o_cookie=1097566154; eas_sid=e1o6J339o5u7k4V9d3E72580y9; luin=o1097566154; lskey=00010000c14c93171ecc8364633b0f79398985d1df1cac66eafe92bc2524780c1333df21007c48468c0bdc77; LW_sid=s1x6k4X6S2W2j8G0D9v3T8d8H1; LW_uid=Y1V6N426D2W2r8j0v9U3z8h806; _qpsvr_localtk=0.6613989595256926; pgv_info=ssid=s139336820; ptui_loginuin=1097566154; uin=o1097566154; skey=@EOsiO4BQa; p_uin=o1097566154; pt4_token=p5oqxzXFq4WDD6xe7oVRqdnBMBYtaPXs*dOapmDmY7c_; p_skey=IlLf3cuGBcwnXTrY4g7B0eupYMsFK97XUU*Fk8IDKhg_; Loading=Yes; qz_screen=1920x1080; 1097566154_todaycount=0; 1097566154_totalcount=13974; QZ_FE_WEBP_SUPPORT=1; cpu_performance_v8=0; __Q_w_s__QZN_TodoMsgCnt=1'
        # build the dict scrapy.Request expects; split only on the first '=' so values containing '=' stay intact
        cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ')}
        # headers = {
        #     'cookies': cookies
        # }
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            # headers=headers,
            cookies=cookies
        )

    def parse(self, response):
        with open('qzone.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
2. Send a POST request to the target URL, carrying data (account and password)
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # the login form carries hidden fields that must be posted back
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').get()
        login = "1097566154@qq.com"
        password = "wq15290884759."
        timestamp = response.xpath('//input[@name="timestamp"]/@value').get()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').get()
        data = {
            "commit": "Sign in",
            "authenticity_token": authenticity_token,
            "login": login,
            "password": password,
            "webauthn-support": "supported",
            "webauthn-iuvpaa-support": "unsupported",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret,
        }
        # send the POST request carrying the form data
        yield scrapy.FormRequest(
            url='https://github.com/session',
            formdata=data,
            callback=self.after_login
        )

    def after_login(self, response):
        with open('github.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
3. Simulate login with Selenium (locate the input fields and the login button). In the downloader middleware:
import time
from selenium import webdriver
from scrapy.http import HtmlResponse

class SeleniumMiddleware:
    def process_request(self, request, spider):
        url = request.url
        print(url)
        driver = webdriver.Chrome()
        driver.get(url)
        time.sleep(2)
        # fill in the username and password, then click the login button
        driver.find_element_by_css_selector('#user_login').send_keys('1097566154@qq.com')
        driver.find_element_by_css_selector('#user_password').send_keys('wq15290884759.')
        driver.find_element_by_css_selector('#new_user > div > div > div > div:nth-child(4) > input').click()
        html = driver.page_source
        # returning a Response here skips the real download for this request
        return HtmlResponse(url=request.url,
                            body=html,
                            request=request,
                            encoding='utf-8',
                            status=200)
IX. Saving Images and Files with Scrapy's Built-in Pipelines
Scrapy provides reusable item pipelines for downloading the files referenced in an item. These pipelines share some methods and structure; in general you will use either the Files Pipeline or the Images Pipeline.
The Files Pipeline for downloading files
To download files with the Files Pipeline, follow these steps:
- Define an Item with two fields, file_urls and files. file_urls stores the URLs of the files to download and must be a list.
- When the downloads finish, information about them is stored in the item's files field, such as the download path, the source URL, and the file checksum.
- In settings.py configure FILES_STORE, which sets the download directory.
- Enable the pipeline: add scrapy.pipelines.files.FilesPipeline: 1 to ITEM_PIPELINES.
The Images Pipeline for downloading images
To download images with the Images Pipeline (see the sketch after this list):
- Define an Item with two fields, image_urls and images. image_urls stores the URLs of the images to download and must be a list.
- When the downloads finish, information about them is stored in the item's images field, such as the download path, the source URL, and the image checksum.
- In settings.py configure IMAGES_STORE, which sets the download directory.
- Enable the pipeline: add scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
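A minimal sketch of those steps for the Images Pipeline; the item class name is an assumption, and the Files Pipeline works the same way with file_urls/files and FILES_STORE. Note that ImagesPipeline requires the Pillow library to be installed:
# items.py
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # list of image URLs to download
    images = scrapy.Field()       # filled in by the pipeline after downloading

# settings.py
IMAGES_STORE = './images'         # download directory
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# in a spider callback, yield an item whose image_urls is a list, e.g.
# yield ImageItem(image_urls=[img_src])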
X. Scrapy Downloader Middleware
Downloader middleware is provided by Scrapy so that Requests and Responses can be modified during the crawl; it is used to extend Scrapy's functionality.
Usage:
Writing a downloader middleware is like writing a pipeline: define a class, then enable it in settings;
A downloader middleware has two default methods, one for processing requests and one for processing responses:
process_request(self, request, spider):
called for every request that passes through the downloader middleware
process_response(self, request, response, spider):
called when the downloader has finished the HTTP request and is passing the response to the engine
process_request(request, spider)
Called for every Request object that passes through the downloader middleware; the higher a middleware's priority, the earlier it is called. The method should return one of: None, a Response object, a Request object, or raise IgnoreRequest.
- Return None: Scrapy continues running the corresponding methods of the other middlewares;
- Return a Response object: Scrapy does not call the process_request methods of the other middlewares and does not start the download, but returns that Response directly
- Return a Request object: Scrapy does not call the process_request() methods of the other middlewares, and instead puts the Request into the scheduler to be downloaded later
- If this method raises an exception, process_exception is called
process_response(request, response, spider)
Called for every Response that passes through the downloader middleware; the higher a middleware's priority, the later it is called, the opposite of process_request(). The method returns one of: a Response object, a Request object, or raises IgnoreRequest (a small retry sketch follows this list).
- Return a Response object: Scrapy continues calling the process_response methods of the other middlewares;
- Return a Request object: the middleware chain stops and the Request is put into the scheduler to be downloaded later;
- Raise IgnoreRequest: Request.errback is called to handle it; if nothing handles it, it is ignored and not even logged.
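A small sketch of these return values in practice, assuming a middleware that re-schedules requests whose responses look blocked; the status codes chosen are only an illustration:
class RetryBlockedMiddleware:
    def process_response(self, request, response, spider):
        if response.status in (403, 503):
            # returning a Request stops the chain and puts the URL back into the scheduler
            return request.replace(dont_filter=True)
        # returning the Response lets the remaining middlewares and the spider see it
        return response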
Setting a random request header
When a crawler visits a page frequently with a request that never changes, the server can easily notice it and ban that request header. So before visiting the page we randomly change the request header, which helps keep the crawler from being caught. Randomly changing the request header can be done in a downloader middleware: before the request is sent to the server, a request header is picked at random, so the same one is not used every time.
In the middlewares.py file:
import random

class RandomUserAgent(object):
    def process_request(self, request, spider):
        # pick a random User-Agent from the USER_AGENTS list in settings
        useragent = random.choice(spider.settings['USER_AGENTS'])
        request.headers['User-Agent'] = useragent

class CheckUserAgent(object):
    def process_response(self, request, response, spider):
        print(request.headers['User-Agent'])
        return response
USER_AGENTS = [ "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)", "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)", "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0", "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5" ]
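The USER_AGENTS list above belongs in settings.py, and the two middlewares still have to be enabled there; a sketch, assuming the project is named mySpider:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.RandomUserAgent': 543,
    'mySpider.middlewares.CheckUserAgent': 600,
}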
XI. Scrapy Downloader Middleware + Selenium
An example of crawling NetEase News (163.com) with Scrapy + Selenium:
Spider file:
import scrapy
from copy import deepcopy
from selenium import webdriver

class WySpider(scrapy.Spider):
    name = 'wy'
    allowed_domains = ['163.com']
    start_urls = ['https://news.163.com/']
    model_urls = []
    # load the browser driver once for the whole spider
    driver = webdriver.Chrome()

    def parse(self, response):
        li_list = response.xpath('//div[@class="ns_area list"]/ul/li')
        # pick out only the sections we want
        li_index = [2, 3, 5, 6]
        for index in li_index:
            li = li_list[index]
            item = {}
            item['大分类'] = li.xpath('./a/text()').get()
            url = li.xpath('./a/@href').get()
            # print(item, url)
            self.model_urls.append(url)
            # print(self.model_urls)
            yield scrapy.Request(
                url=url,
                callback=self.parse_html,
                meta={'item': deepcopy(item)}
            )

    def parse_html(self, response):
        item = response.meta['item']
        # print(item)
        """List page"""
        # this response was intercepted and rebuilt by the downloader middleware
        # print(response.text)
        # match elements whose class attribute contains a given string
        div_list = response.xpath('//div[contains(@class, "data_row")]')
        for div in div_list:
            detail_title = div.xpath('.//h3/a/text()').extract_first()
            detail_url = div.xpath('.//h3/a/@href').extract_first()
            # item = {}
            item['title'] = detail_title
            item['url'] = detail_url
            print(item)
            yield scrapy.Request(
                url=detail_url,
                callback=self.parse_detail,
                meta={'item': deepcopy(item)}  # copy so each detail request carries its own data
            )

    def parse_detail(self, response):
        """Detail page"""
        print(response.text)

    @staticmethod
    def close(spider, reason):
        # shut down the shared browser when the spider closes
        spider.driver.quit()
Middleware file:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
import time
from scrapy.http.response.html import HtmlResponse
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class SeSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class SeDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class WangYiDownloaderMiddleware:
    def process_request(self, request, spider):
        '''Intercept the requests for the four news sections'''
        # print(spider.model_urls)
        # print(request.url, "request.url")
        if request.url in spider.model_urls:
            # these requests should be intercepted and handled with selenium
            driver = spider.driver
            driver.get(request.url)
            time.sleep(3)
            # record the current height of the page
            current_height = driver.execute_script("return document.body.scrollHeight;")
            # keep scrolling down
            while True:
                # scroll to the bottom
                driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
                time.sleep(3)
                new_height = driver.execute_script("return document.body.scrollHeight;")
                if new_height == current_height:
                    break
                current_height = new_height
            # the page has now been scrolled all the way to the bottom
            try:
                driver.find_element_by_xpath('//div[@class="post_addmore"]/span').click()
                time.sleep(2)
            except Exception:
                pass
            # return the page obtained so far as the response
            return HtmlResponse(url=driver.current_url, body=driver.page_source, request=request, encoding='utf-8')
This makes it convenient to handle data that is loaded dynamically via AJAX.
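The custom middleware also has to be enabled in settings.py for the interception to happen; a sketch, where the project/module name se is only an assumption taken from the class names above:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'se.middlewares.WangYiDownloaderMiddleware': 543,
}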