Incremental crawlers in Scrapy (to be continued...)

Are you ready

Category: Crawlers and Data Analysis. Tags: Scrapy incremental crawler and middleware

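Incremental crawler
1. Incremental crawler (CrawlSpider)

1) Creating an incremental crawler: scrapy genspider -t crawl xxx xxx.xx

2) About the incremental crawler:
Scrapy ships several spider templates (for example crawl, csvfeed and xmlfeed) that extend the basic spider so that more complex tasks are easier to implement. CrawlSpider, which the crawl template generates and which this article calls the incremental crawler, is the most commonly used of them.
3) How the incremental crawler runs:
Basic template: the start URLs are taken from start_urls and put into the scheduler queue.
Incremental template: starting from the URLs in start_urls, a batch of URLs is extracted from each response page according to a set of rules and put into the scheduler queue. Pages fetched from those URLs are matched against the same rules, and any URL that has not been seen before is added to the queue as well. The cycle repeats until no new URL can be matched.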
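
The loop described above can be sketched without Scrapy. This is only an illustration of the idea, not Scrapy's actual scheduler: the PAGES dictionary is a made-up in-memory site, and crawl() is a hypothetical helper that applies one regular-expression rule and skips URLs it has already queued.

```python
from collections import deque
import re

# Hypothetical in-memory "site": each URL maps to the links found on that page.
PAGES = {
    "/book/1002.html": ["/book/1002_2.html", "/book/1002_3.html", "/about.html"],
    "/book/1002_2.html": ["/book/1002.html", "/book/1002_3.html", "/book/1002_4.html"],
    "/book/1002_3.html": ["/book/1002_2.html"],
    "/book/1002_4.html": ["/book/1002_3.html", "/news/1.html"],
}


def crawl(start_urls, allow):
    """Visit pages breadth-first, following only links that match `allow`."""
    pattern = re.compile(allow)
    seen = set(start_urls)      # every URL that was ever queued
    queue = deque(start_urls)   # stand-in for the scheduler queue
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in PAGES.get(url, []):
            # Queue a link only if it matches the rule and is not a duplicate.
            if pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["/book/1002.html"], r"/book/1002_\d+\.html")
print(order)
```

The crawl stops on its own once every matching link has been seen, which is the "until no new URL can be matched" condition from the description.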

Code implementation

spiders

#dushu
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
# Import the link extractor class, which extracts new links from a page according to a set of rules

from scrapy.spiders import CrawlSpider, Rule
# CrawlSpider is a subclass of the basic Spider that adds rule-based link following
# Rule objects describe how matching URLs are extracted, followed and scheduled
from DushuPro.items import DushuproItem

class DushuSpider(CrawlSpider):
    name = 'dushu'
    allowed_domains = ['dushu.com']
    start_urls = ['https://www.dushu.com/book/1002.html']

    rules = (
        Rule(LinkExtractor(allow=r'/book/1002_\d+\.html'), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths="//div[@class='pages']//a"), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_css=".pages a"), callback='parse_item', follow=True),
    )

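    # rules: a tuple containing one or more Rule objects.
    # Each Rule here takes three arguments. The LinkExtractor decides which URLs are
    # matched, extracted and scheduled; callback is the name of the callback method,
    # written as a string, that is called once the matched URL has been downloaded;
    # follow=True means links are also extracted from the pages this rule produces.
    # LinkExtractor: extracts links according to a rule. Three common options:
    # Option 1: allow="xxx" matches link URLs against the regular expression xxx
    # Option 2: restrict_xpaths="xxx" only looks for links inside the region selected by the XPath xxx
    # Option 3: restrict_css="xxx" only looks for links inside the region selected by the CSS selector xxx
    # Note: with XPath or CSS you only need to select the target <a> tag;
    # do not add the href attribute to the XPath or CSS selector.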

    def parse_item(self, response):
        booklist = response.xpath("//div[@class='bookslist']//li")
        for book in booklist:
            item = DushuproItem()
            # extract_first() returns the first matched string, or None if nothing matched
            item["title"] = book.xpath(".//h3/a/text()").extract_first()
            item["author"] = "".join(book.xpath(".//div[@class='book-info']/p[1]//text()").extract())

            # Build the URL of the detail (second-level) page; skip entries without a link
            href = book.xpath(".//h3/a/@href").extract_first()
            if href is None:
                continue
            next_url = response.urljoin(href)

            # Request the detail page and pass the partially filled item along in meta
            yield scrapy.Request(url=next_url, callback=self.parse_info, meta={"item": item})

    # Callback that parses the detail page
    def parse_info(self, response):
        # Retrieve the item passed on from the list page
        item = response.meta["item"]
        # Fill in the remaining fields
        item["price"] = response.xpath("//span[@class='num']/text()").extract_first()
        item["publisher"] = response.xpath("//div[@class='book-details-left']/table//tr[2]//a/text()").extract_first()
        summary_blocks = response.xpath("//div[@class='text txtsummary']")
        summary_texts = summary_blocks.xpath(".//text()").extract()
        # Index defensively: a page with fewer blocks than expected would otherwise raise IndexError
        item["content"] = summary_texts[0] if len(summary_texts) > 0 else None
        item["authorInfo"] = summary_texts[1] if len(summary_texts) > 1 else None
        item["mulu"] = "\n".join(summary_blocks[2].xpath(".//text()").extract()) if len(summary_blocks) > 2 else None
        yield item
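
A note on the allow pattern used in rules: it is a regular expression tested against each link URL. The sketch below uses only the standard re module and assumes search-style matching (the pattern may match anywhere in the URL). The sample URLs, including the page-12 one, are made up for illustration. It shows why a single \d only covers one-digit page numbers while \d+ covers any page number.

```python
import re

# Hypothetical pagination URLs in the style of the start URL above.
urls = [
    "https://www.dushu.com/book/1002_2.html",
    "https://www.dushu.com/book/1002_12.html",
    "https://www.dushu.com/book/1003_2.html",
]

single_digit = re.compile(r"/book/1002_\d\.html")   # exactly one digit after the underscore
any_digits = re.compile(r"/book/1002_\d+\.html")    # one or more digits after the underscore

matched_single = [u for u in urls if single_digit.search(u)]
matched_any = [u for u in urls if any_digits.search(u)]

print(matched_single)
print(matched_any)
```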

items

import scrapy

class DushuproItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    publisher = scrapy.Field()
    authorInfo = scrapy.Field()
    content = scrapy.Field()
    mulu = scrapy.Field()

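Middleware

Downloader middleware

1. Downloader middleware
A downloader middleware sits on the path along which the downloader sends requests to the server. It can intercept those requests and extend or reconfigure them.
There are two kinds of downloader middleware. Built-in ones ship with Scrapy itself, under scrapy.downloadermiddlewares.xxxx.xxxx. Custom ones live in the middlewares file of the current project.
Downloader middlewares are enabled in the settings file:
DOWNLOADER_MIDDLEWARES = {
    'DushuPro.middlewares.DushuproDownloaderMiddleware': 543,
}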
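
The number next to each middleware is its order. In Scrapy, process_request methods run in increasing order and process_response methods run in decreasing order, and a middleware whose value is None is disabled. The following is a small stand-alone simulation of that ordering, not Scrapy's own code: the middleware names are placeholders and call_order() is a hypothetical helper.

```python
# Placeholder table in the same shape as DOWNLOADER_MIDDLEWARES.
MIDDLEWARES = {
    "project.middlewares.A": 100,
    "project.middlewares.B": 543,
    "project.middlewares.C": 900,
    "project.middlewares.Disabled": None,  # None switches a middleware off
}


def call_order(middlewares):
    """Return the order in which request and response hooks would run."""
    enabled = {name: order for name, order in middlewares.items() if order is not None}
    ascending = sorted(enabled, key=enabled.get)
    request_order = ascending                   # process_request: low number first
    response_order = list(reversed(ascending))  # process_response: high number first
    return request_order, response_order


request_order, response_order = call_order(MIDDLEWARES)
print("process_request order: ", request_order)
print("process_response order:", response_order)
```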
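2. Applications of downloader middleware
1) Plugging in Selenium to load dynamic pages
In the settings file, activate the downloader middleware that embeds Selenium. To save system overhead, you can also disable the built-in middlewares whose work the browser takes over.
Then, in the Selenium middleware class, override process_request to intercept each request and let Selenium fetch the page instead. Finally, wrap the rendered page source obtained from Selenium in a response object and return it.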
spiders

import scrapy


class MoguSpider(scrapy.Spider):
    name = 'mogu'
    allowed_domains = ['mogu.com']
    start_urls = ['https://list.mogu.com/book/clothing/50240?acm=3.mce.1_10_1ko4s.132244.0.mtYuRrx6QL5ne.pos_1-m_482170-sd_119&ptp=31.nXjSr._head.0.UvbiJ3IU']

    def parse(self, response):
        goods_list = response.css(".iwf")
        print(len(goods_list))

        # Exercise: parse the content
        pass

middlewares

from scrapy import signals
from selenium import webdriver
from time import sleep
from scrapy.http import HtmlResponse

class MogujieDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        print("A downloader middleware instance has been created!")
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

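    def process_request(self, request, spider):
        print("process_request: request %s from spider %s is passing through this downloader middleware..." % (request, spider.name))
        # Scrapy's built-in downloader does not execute JavaScript, so we intercept
        # the request here and fetch the page with Selenium instead.
        driver = webdriver.Chrome()
        try:
            # Take the URL from the intercepted request
            url = request.url
            print("The browser is visiting:", url)
            driver.get(url)
            sleep(1)
            # Scroll down step by step to trigger lazy loading
            for i in range(100):
                distance = i * 500
                js = "document.documentElement.scrollTop=%d" % distance
                driver.execute_script(js)
                sleep(0.5)
            sleep(3)
            # Grab the rendered page source
            html = driver.page_source
            # Wrap the page source in a response object and return it;
            # returning a Response here makes Scrapy skip its own download
            return HtmlResponse(url=driver.current_url, request=request, body=html, encoding='utf-8')
        finally:
            # Close the browser so that each request does not leave a Chrome process behind
            driver.quit()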

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        print("process_response: response %s for request %s from spider %s is being returned..." % (response, request, spider.name))
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        print("An exception occurred")
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        print("Spider opened!")
        spider.logger.info('Spider opened: %s' % spider.name)

settings

DOWNLOADER_MIDDLEWARES = {
   'Mogujie.middlewares.MogujieDownloaderMiddleware': 543,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware':None,
}

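2) Plugging in a proxy pool
Proxy server: when a proxy is configured between client and server, our request goes to the proxy first, and the proxy sends it to the server on the client's behalf. The server's response is returned to the proxy, which then passes it back to the client.
Ways to obtain proxy servers: 1) build your own (not recommended); 2) scrape free proxies (unreliable); 3) pay for them.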
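
As a rough sketch of how a proxy pool could be plugged in through a downloader middleware: Scrapy's built-in HttpProxyMiddleware uses the proxy found in request.meta["proxy"], so a custom middleware only has to set that key. Everything named below is an assumption made for illustration. RandomProxyMiddleware is a hypothetical class, the proxy addresses are placeholders, and a SimpleNamespace stands in for a real Scrapy request so the snippet runs without Scrapy installed.

```python
import random
from types import SimpleNamespace


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks a random proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # The built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets the request continue through the remaining middlewares.
        return None


# Placeholder proxy addresses; a real pool would come from a paid or self-hosted source.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

# Stand-in for a scrapy.Request object, carrying only the attributes used here.
fake_request = SimpleNamespace(url="https://www.dushu.com/book/1002.html", meta={})

middleware = RandomProxyMiddleware(PROXIES)
result = middleware.process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

In a real project the class would be registered in DOWNLOADER_MIDDLEWARES like the Selenium middleware above, and the proxy list would typically be loaded from settings rather than hard-coded.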
