Scrapy Framework: Introduction and Hands-On Practice

I. Preface

  • Operating system: Windows 10 Professional

  • Virtual environment: Anaconda

  • Python version: 3.7

  • XPath tool: xpath-helper

  • IDE: PyCharm 2020.1

  • References

Scrapy official site: https://scrapy.org/

Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html

Scrapy architecture: https://docs.scrapy.org/en/latest/topics/architecture.html

Scrapy settings: https://docs.scrapy.org/en/latest/topics/settings.html

XPath: https://www.w3school.com.cn/xpath/index.asp

Anaconda tutorial: https://blog.csdn.net/u011424614/article/details/105579502

PyMySQL: https://pypi.org/project/PyMySQL/

II. Main Content

  • Scrapy is an open-source, collaborative framework for extracting the data you need from websites.

1. Architecture Overview

(Scrapy architecture diagram; see the Scrapy architecture documentation linked in the Preface.)

Component overview:

  • engine component: coordinates communication and data transfer among all the other components
  • spiders component: the crawl entry points and page parsing
  • scheduler component: the request queue
  • downloader component: page downloading
  • item pipelines component: data processing and storage
  • downloader middlewares: hook into the requests and responses passing between the engine and the downloader; can be extended, for example to inject proxies or modify HTTP headers
  • spider middlewares: hook into spider output (requests/items) and input (responses); can be extended to adjust both directions

Data flow:

(1) (2) The engine forwards requests from the spiders component to the scheduler component for queuing.

(3) (4) Once queued, the engine forwards the requests to the downloader component to fetch the pages.

(5) (6) The engine forwards the pages fetched by the downloader to the spiders component for parsing.

(7) (8) The engine forwards the data parsed by the spiders to the item pipelines component for processing and storage.

(1) (2) The output parsed by the spiders falls into two parts: items, which the engine forwards to the item pipelines for processing and storage, and new requests, which the engine forwards to the scheduler for queuing.

(3) (4) If a page download fails, the engine forwards the request to the scheduler again to be re-queued.
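In a spider, this two-way split is simply what the parse callback yields: dicts or Items go to the item pipelines, while Request objects go back to the scheduler. A minimal sketch (quotes.toscrape.com and its selectors are only illustrative, not part of this project):

import scrapy

class MinimalSpider(scrapy.Spider):
    name = 'minimal'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yielded dicts/Items are forwarded to the item pipelines (steps 7/8)
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}
        # Yielded Requests are forwarded to the scheduler and queued (steps 1/2)
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)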

2. Hands-On Practice

Scenario: scrape the Douban Movie TOP 250 data.

1) Create the Project

For Anaconda installation and usage, see the Anaconda tutorial link in the Preface references.

  • Create and activate a Scrapy environment from the command line, then generate a template project
#-- Create the scrapy environment
> conda create -n scrapy_env python=3.7 scrapy
#-- Activate the scrapy environment (on newer conda versions: conda activate scrapy_env)
> activate scrapy_env
#-- Create a template project; scrapy startproject [project name]
> scrapy startproject scrapy_douban
  • Import the project into PyCharm and switch the Python interpreter to the scrapy_env environment; the generated project scaffold is sketched below
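For reference, scrapy startproject generates a scaffold roughly like this (the files are templates that the following steps edit):

scrapy_douban/
├── scrapy.cfg                # project configuration / deployment file
└── scrapy_douban/
    ├── __init__.py
    ├── items.py              # item definitions
    ├── middlewares.py        # spider / downloader middlewares
    ├── pipelines.py          # item pipelines
    ├── settings.py           # project settings
    └── spiders/
        └── __init__.py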

2) Configuration File

  • Edit the scrapy_douban/settings.py file
# Scrapy settings for scrapy_douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_douban'

SPIDER_MODULES = ['scrapy_douban.spiders']
NEWSPIDER_MODULE = 'scrapy_douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_douban.middlewares.ScrapyDoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy_douban.middlewares.ScrapyDoubanDownloaderMiddleware': 543,
    # 'scrapy_douban.middlewares.proxy_ip': 544, # proxy IP, priority 544
    'scrapy_douban.middlewares.random_user_agent': 545, # random User-Agent
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy_douban.pipelines.ScrapyDoubanPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Custom MySQL connection settings (read by pipelines.py)
mysql_host = '127.0.0.1'
mysql_port = 3306
mysql_dbname = 'python-db'
mysql_username = 'root'
mysql_pwd = '123456'
Setting descriptions:

  • USER_AGENT: client User-Agent string
  • ROBOTSTXT_OBEY: whether to obey the robots.txt protocol
  • CONCURRENT_REQUESTS: maximum concurrent requests
  • DOWNLOAD_DELAY: delay between downloads
  • CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests per domain
  • CONCURRENT_REQUESTS_PER_IP: concurrent requests per IP
  • COOKIES_ENABLED: whether to use cookies; needed for login flows
  • DEFAULT_REQUEST_HEADERS: default request headers
  • SPIDER_MIDDLEWARES: spider middlewares
  • DOWNLOADER_MIDDLEWARES: downloader middlewares
  • EXTENSIONS: extensions
  • ITEM_PIPELINES: item pipelines components

3) Requests and Parsing

  • Inside the project's spiders directory, generate the spider entry file douban_spider.py from the command line
#-- scrapy genspider [spider name] [start domain]
> scrapy genspider douban_spider movie.douban.com
  • Create main.py in the project root as the program's launch entry point
from scrapy import cmdline

# Run the crawl command, used to launch the spider from the IDE
cmdline.execute('scrapy crawl douban_spider'.split())

# Export to a CSV file (then change the file's encoding to UTF-8 BOM with Notepad++ so Excel displays it correctly)
# cmdline.execute('scrapy crawl douban_spider -o douban.csv'.split())
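(The same crawl can also be started from a terminal in the project root with > scrapy crawl douban_spider; main.py simply makes it convenient to run and debug from PyCharm.)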
  • Edit items.py (the item data class)
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyDoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field() # ranking
    movie_name = scrapy.Field() # movie title
    introduce = scrapy.Field() # description
    star = scrapy.Field() # rating
    evaluate = scrapy.Field() # number of reviews
    slogan = scrapy.Field() # tagline

  • Edit douban_spider.py (implements the crawling and page parsing)
import scrapy
from scrapy_douban.items import ScrapyDoubanItem

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider' # spider name; must not be the same as the project name
    allowed_domains = ['movie.douban.com'] # only follow links under this domain
    start_urls = ['https://movie.douban.com/top250'] # entry URL

    # Parse the response returned by the downloader component
    def parse(self, response):
        # print(response.text)
        # Get the list of li elements
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']//li")
        # Parse the data inside each li element
        for item in movie_list:
            # print(item)
            douban_item = ScrapyDoubanItem()
            # Extract each field via XPath
            douban_item["serial_number"] = item.xpath(".//div[@class='item']//em//text()").extract_first()
            douban_item["movie_name"] = item.xpath(".//div[@class='info']//div[@class='hd']//a//span[1]//text()").extract_first()
            # The description spans several text nodes; strip the whitespace and join them
            introduces = item.xpath(".//div[@class='info']//div[@class='bd']//p[1]//text()").extract()
            douban_item["introduce"] = ";".join("".join(introduce_item.split()) for introduce_item in introduces if introduce_item.strip())
            douban_item["star"] = item.xpath(".//span[@class='rating_num']//text()").extract_first()
            douban_item["evaluate"] = item.xpath(".//div[@class='star']//span[4]//text()").extract_first()
            douban_item["slogan"] = item.xpath(".//p[@class='quote']//span//text()").extract_first()
            # print(douban_item)
            # Yield the item to pipelines.py (requires ITEM_PIPELINES to be enabled in settings.py)
            yield douban_item
        # Get the link to the next page
        next_link = response.xpath("//span[@class='next']//link//@href").extract()
        # If there is no next link, this is the last page
        if next_link:
            next_link = next_link[0]
            # Submit the request to the scheduler; the response is called back into parse
            yield scrapy.Request(self.start_urls[0] + next_link, callback=self.parse)
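One detail worth noting: the "next page" href is relative (e.g. ?start=25&filter=), so concatenating it onto self.start_urls[0] only works because the start URL carries no query string of its own. A slightly more robust sketch lets Scrapy resolve the relative link against the current page:

        # Inside parse(): resolve the relative link against response.url
        if next_link:
            yield response.follow(next_link[0], callback=self.parse)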

4) Data Storage

  • Install PyMySQL from the command line
> conda install pymysql
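Optionally, a quick standalone script can confirm that the connection settings added to settings.py are valid before running the crawl (a sketch using the same credentials as above):

import pymysql

# Connectivity check with the same credentials as settings.py
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='python-db')
with conn.cursor() as cursor:
    cursor.execute('SELECT VERSION()')
    print(cursor.fetchone())
conn.close()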
  • Edit pipelines.py (data processing and storage)
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql
from scrapy_douban.settings import mysql_dbname,mysql_host,mysql_port,mysql_pwd,mysql_username

class ScrapyDoubanPipeline:

    # Process each item
    def process_item(self, item, spider):
        # Insert the record
        self.insert(item)
        return item

    # Insert a record into MySQL
    def insert(self, item):
        # Pack the SQL parameter values
        values = (int(item["serial_number"]), item["movie_name"], item["introduce"]
                  , float(item["star"]), item["evaluate"], item["slogan"])
        # Connect to the database
        conn = pymysql.connect(host=mysql_host, user=mysql_username, password=mysql_pwd, port=mysql_port,
                                  db=mysql_dbname)
        # Get a cursor
        cursor = conn.cursor()
        # Insert statement
        sql = 'INSERT INTO douban(serial_number,movie_name,introduce, star, evaluate, slogan) VALUES (%s, %s, %s, %s, %s, %s)'
        try:
            cursor.execute(sql, values)
            conn.commit()
            print("Insert succeeded: " + str(values))
        except Exception as ex:
            print("Exception occurred: %s" % ex)
            conn.rollback()
            print("Rolled back: " + str(values))
        # Close the database connection
        finally:
            conn.close()
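Note that the douban table with the six columns used above must already exist in the python-db database. The pipeline opens and closes a MySQL connection for every single item, which keeps the code simple but is slow for larger crawls. Scrapy pipelines also provide open_spider/close_spider hooks, so a variant along these lines (a sketch; the class name is illustrative, and whichever class you use is the one to register in ITEM_PIPELINES) reuses one connection for the whole run:

import pymysql
from scrapy_douban.settings import mysql_dbname, mysql_host, mysql_port, mysql_pwd, mysql_username

class ScrapyDoubanMySQLPipeline:
    # Open one connection when the spider starts
    def open_spider(self, spider):
        self.conn = pymysql.connect(host=mysql_host, user=mysql_username, password=mysql_pwd,
                                    port=mysql_port, db=mysql_dbname)
        self.cursor = self.conn.cursor()

    # Close it when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = ('INSERT INTO douban(serial_number, movie_name, introduce, star, evaluate, slogan) '
               'VALUES (%s, %s, %s, %s, %s, %s)')
        values = (int(item["serial_number"]), item["movie_name"], item["introduce"],
                  float(item["star"]), item["evaluate"], item["slogan"])
        try:
            self.cursor.execute(sql, values)
            self.conn.commit()
        except Exception as ex:
            spider.logger.error("Insert failed, rolling back: %s", ex)
            self.conn.rollback()
        return item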

5) Hiding Your Identity

  • Proxy IP (via an Abuyun HTTP tunnel) and random User-Agent strings
  • Edit middlewares.py, then enable the corresponding entries in DOWNLOADER_MIDDLEWARES in settings.py (as already shown in the settings above)
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import base64
import random

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class ScrapyDoubanSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyDoubanDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

# Proxy IP middleware (the proxy host and credentials below are placeholders)
class proxy_ip(object):
    def process_request(self, request, spider):
        # Proxy server address, e.g. 'host:port'
        request.meta['proxy'] = 'aaaaaaaaaa:1234'
        # Proxy username and password, Base64-encoded for HTTP Basic auth
        proxy_name_pwd = b'pppppppppp:xxxxxxxxx'
        encode_name_pwd = base64.b64encode(proxy_name_pwd)
        request.headers['Proxy-Authorization'] = 'Basic '+ encode_name_pwd.decode()

# Random User-Agent middleware
class random_user_agent(object):
    def process_request(self, request, spider):
        USER_AGENT_LIST = [
            'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
            'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
            'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
            'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
            'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
            'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
            'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
            'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
        ]
        user_agent = random.choice(USER_AGENT_LIST)
        # The header name must be 'User-Agent' (with a hyphen) for it to be sent correctly
        request.headers['User-Agent'] = user_agent

III. Miscellaneous

1. Using the XPath Tool

Method 1:

  • Click the xpath-helper plugin icon in the Chrome toolbar (above the bookmarks bar)

  • Enter the XPath expression in the QUERY input box
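For example, pasting //ol[@class='grid_view']//li//div[@class='hd']//a//span[1]//text() (the title expression used in the spider above, made absolute) into the QUERY box should highlight the 25 movie titles on the Douban TOP 250 page.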

Method 2:

  • Press F12 to open the Chrome DevTools panel
  • Locate the target element node
  • Right-click the node - Copy - Copy XPath