Python Data Scraping (the Scrapy Framework)

晚春初夏的你

Article link: https://blog.csdn.net/weixin_42834505/article/details/108249262

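Common data scraping tools

Scraping with third-party libraries
	Requests, lxml
		Flexible and simple
The PySpider crawler framework
	Provides a WebUI for writing and managing crawlers
	Quick to pick up and easy to learn
	Poor support for Windows
The Scrapy crawler framework
	Powerful
	Highly customizable
	Asynchronous (built on Twisted rather than on threads), so crawling is efficient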

Installing and configuring Scrapy

Install Scrapy
	pip install scrapy
Verify the installation
C:\WINDOWS\system32>scrapy
Scrapy 2.3.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Configure the environment variable for Scrapy
	Add Anaconda's Scripts folder to the Path environment variable

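Debugging a crawler project

Start the crawler from a Python script that runs the Scrapy command line
	Add a script file to the project root directory
	Call Scrapy's command-line entry point to start the crawler
		the cmdline module
		the execute() function
			from scrapy.cmdline import execute
			execute('scrapy crawl example_spider'.split())

Debugging the crawler
	Set a breakpoint inside the parse() method
	Run the script in Debug mode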

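Components of a Scrapy project

spiders folder
	Holds the spider files
items.py
	Defines the format of the data passed around inside the framework
pipelines.py
	Data storage module
middlewares.py
	Middleware module
settings.py
	Project configuration module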
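
To make the role of pipelines.py more concrete, here is a minimal sketch of an item pipeline. The class name JsonLinesPipeline and the output file quotes.jl are made up for illustration; the parts Scrapy itself looks for are the hook names open_spider, close_spider, and process_item.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every scraped item to a JSON Lines file."""

    def __init__(self, path="quotes.jl"):
        # Illustrative default file name, not something Scrapy prescribes.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields. Returning the item
        # lets any later pipeline receive it as well.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Assuming the project is named myscrapy, such a pipeline would be switched on in settings.py with ITEM_PIPELINES = {"myscrapy.pipelines.JsonLinesPipeline": 300}. It only receives data if the spider yields items (or dicts) instead of printing them.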

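How Scrapy returns the data of a crawled page

The page data arrives through the parsing callback
	The response parameter of the parse() method
	Commonly used attributes and methods of the response object

Attribute or method	Purpose
url	URL of the page this response belongs to
status	HTTP status code
meta	Passes data from a request to its response
body	Raw page source as bytes (response.text is the decoded string); needed when matching data with plain regular expressions
xpath()	Parses the page with an XPath selector
css()	Parses the page with a CSS selector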

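Ways to extract page data in Scrapy

XPath selectors
	XPath is a language for selecting nodes in an XML document; it also works with HTML
CSS selectors
	CSS is a language for applying styles to HTML documents
	Its selectors associate styles with specific HTML elements
Regular expressions
	Extract content that is not wrapped in its own tag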
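
As a small illustration of the regular-expression case, the sketch below pulls a number out of an inline script, where no tag isolates the value. The HTML fragment and the field name total are invented for the example; inside a spider the same pattern could be applied to response.text.

```python
import re

# Hypothetical page fragment: the value we want has no tag of its own.
html = '<script>var data = {"total": 100, "page": 1};</script>'

# Capture the digits that follow the "total" key.
match = re.search(r'"total":\s*(\d+)', html)
total = int(match.group(1)) if match else None
print(total)  # 100
```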

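XPath syntax:

XPath uses path expressions to select nodes or node sets in an XML document

Expression	Description
nodename	Selects the child elements with this name
/	Selects from the root node
//	Selects all matching nodes, regardless of where they are in the document
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes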

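Predicates: a predicate finds a specific node, or a node that contains a specific value
Predicates are written inside square brackets

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore
//title[@lang]	Selects all title elements that have an attribute named lang
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng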

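Path expressions:

Path expression	Result
/bookstore	Selects the root element bookstore
/bookstore/book	Selects all book elements that are children of bookstore
//book	Selects all book elements, wherever they are in the document
/bookstore//book	Selects all book elements that are descendants of bookstore, at any depth below it
//@lang	Selects all attributes named lang
/bookstore/book/text()	Selects the text nodes of all book elements that are children of bookstore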
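
A few of these expressions can be tried with nothing but the standard library. Note the assumptions: xml.etree.ElementTree implements only a subset of XPath (paths are relative to the element they are called on, and text() and //@lang are not supported), and the bookstore document below is made up. In Scrapy the full expressions from the tables would be passed to response.xpath() instead.

```python
import xml.etree.ElementTree as ET

# Made-up sample document matching the element names used in the tables.
XML = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="en">Learning XML</title></book>
  <book><title>Untitled Draft</title></book>
</bookstore>
"""

root = ET.fromstring(XML.strip())  # root is the <bookstore> element

# Paths are relative to root, so "book" plays the role of /bookstore/book.
all_books = root.findall("book")
first_book = root.findall("book[1]")              # /bookstore/book[1]
with_lang = root.findall(".//title[@lang]")       # //title[@lang]
lang_eng = root.findall(".//title[@lang='eng']")  # //title[@lang='eng']

print(len(all_books))                    # 3
print(first_book[0].find("title").text)  # Harry Potter
print([t.text for t in with_lang])       # ['Harry Potter', 'Learning XML']
print([t.text for t in lang_eng])        # ['Harry Potter']
```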

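XPath selectors

XPath selectors in Scrapy
	Built on the lxml library
Getting the page data out of a selector
	extract()
		Returns the data of every selector in the list, as a list of strings
		If the list is empty, indexing the result with [0] raises an exception
	extract_first()
		Returns the data of the selector at index 0
		If the list is empty, no exception is raised and None is returned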

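Creating a crawler project:

# First change into your local working directory:
e:
cd E:\pycharm\pythonProject

# Create a new crawler project (myscrapy is the project name)
scrapy startproject myscrapy

# Generate a spider from the template (example is the spider name)
cd myscrapy
scrapy genspider example example.com

# Open the newly created project in PyCharm

# Add a script file to the project root directory
For example: run.py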

run.py reference code:

from scrapy.cmdline import execute
# "example" is the spider name, i.e. the name attribute of the spider class
execute("scrapy crawl example".split())

example.py reference code:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
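    # Must match the site in start_urls, otherwise follow-up requests
    # without dont_filter=True are dropped as offsite
    allowed_domains = ['quotes.toscrape.com']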
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        print("start parse")

        # 1 Return the first quote on the page
        # mingyan = response.xpath("//div[@class='quote']/span[@class='text']/text()").extract_first()
        # print(mingyan)

        # 2 Follow the next page (yield turns parse() into a generator)
        # next_page = response.xpath("//ul[@class='pager']/li[@class='next']/a/@href").extract_first()
        # if next_page is not None:
        #     pageNum = self.start_urls[0]+next_page
        #     yield scrapy.Request(pageNum,dont_filter=True)
        #     print(pageNum)

        # 3 Each quote together with its author's name (same page)
        # total = response.xpath("//div[@class='quote']")
        # for i in total:
        #     author_text = i.xpath("./span[@class='text']/text()").extract_first()
        #     author_name = i.xpath("./span/small[@class='author']/text()").extract_first()
        #     print(author_text,author_name)

        # 4 Author details (quote, name, birth date, birthplace) spread over different pages
        #   Key point: when the same page is requested more than once, pass dont_filter=True
        #   so the duplicate filter does not drop the request
        #   To hand data from a request over to its response callback, use meta
        totals = response.xpath("//div[@class='quote']")
        for i in totals:
            author_text = i.xpath("./span[@class='text']/text()").extract_first()
            author_url = i.xpath("./span/a/@href").extract_first()
            # Resolve the relative href against the URL of the current page
            url = response.urljoin(author_url)
            # mingyan ("quote") and url travel to the callback through meta
            yield scrapy.Request(url, callback=self.parse_author_detail, dont_filter=True,
                                 meta={"mingyan": author_text, "url": url})

    # Custom callback that parses the author detail page
    def parse_author_detail(self, response):
        url = response.meta["url"]
        jingju = response.meta["mingyan"]  # the quote collected on the listing page
        author_name = response.xpath("//h3[@class='author-title']/text()").extract_first()
        author_born_date = response.xpath("//span[@class='author-born-date']/text()").extract_first()
        author_born_location = response.xpath("//span[@class='author-born-location']/text()").extract_first()
        print(url, jingju, author_name, author_born_date, author_born_location, sep="\n")
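
The spider has to turn relative hrefs such as the next-page link or the author link into absolute URLs. The sketch below compares plain string concatenation with urllib.parse.urljoin from the standard library, which Scrapy's response.urljoin() builds on. The hrefs are illustrative values, not data scraped from the site.

```python
from urllib.parse import urljoin

start_url = "http://quotes.toscrape.com"

# A root-relative href, the shape used by the pager and author links.
print(urljoin(start_url, "/page/2/"))
# http://quotes.toscrape.com/page/2/

# Joining against the page being parsed still works on later pages.
print(urljoin("http://quotes.toscrape.com/page/2/", "/page/3/"))
# http://quotes.toscrape.com/page/3/

# An href without a leading slash is where concatenation goes wrong.
print(start_url + "page/2/")
# http://quotes.toscrape.compage/2/
print(urljoin(start_url, "page/2/"))
# http://quotes.toscrape.com/page/2/
```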