Scrapy Crawler Framework: Basic Usage

Latest recommended article published on 2024-04-22 15:42:39

一个爬坑的Coder

Categories: Python Learning, Python Crawler Learning. Tags: python, crawler

Article link: https://blog.csdn.net/qq_39583550/article/details/112746805

Personal study notes.

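1. Installing Scrapy

My Python version is 3.8.

First, install Twisted:
```
pip install twisted
```
Problem: the installation fails with the message "Microsoft Visual C++ 14.0 is required".

Solution: either install Twisted from an offline package or install the Microsoft build component. I used the offline package.

Link: https://pan.baidu.com/s/1Ork5rctPg6DsW7izpc_O6w
Extraction code: 6666
Then install the downloaded package with:
```
pip install "path to the downloaded package"
```
Install Scrapy:
```
pip install scrapy
```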
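
To confirm that both packages ended up in the active environment, a small check like the following can help. It is only a sketch using the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("twisted", "scrapy"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```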

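2. The Five Core Components of Scrapy

Scrapy Engine (the core of the framework)

Controls the data flow across the whole system and coordinates the scheduler, the downloader, and the spiders.

Scheduler

Acts as a URL queue: it stores the requests that are waiting to be crawled.

Downloader

Downloads page content. It is fast because it is built on Twisted, an efficient asynchronous networking framework.

Spiders

Extract the required information (the items) from the responses for specific URLs. They can also extract links to be crawled next.

Item Pipeline

Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information.

Workflow: [figure]

Reference: https://blog.csdn.net/ck784101777/article/details/104468780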
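
How the five components cooperate can be sketched with a toy, single-threaded simulation. This is not Scrapy code: the `FAKE_WEB` dictionary, the `example.test` URLs, and every function name are assumptions made for illustration, and the real engine is asynchronous.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
FAKE_WEB = {
    "http://example.test/": ("home", ["http://example.test/a", "http://example.test/b"]),
    "http://example.test/a": ("page a", ["http://example.test/b"]),
    "http://example.test/b": ("page b", []),
}


def download(url):
    """Downloader: fetch the content for a URL (here, a dictionary lookup)."""
    return FAKE_WEB[url]


def spider_parse(url, page):
    """Spider: turn a downloaded page into one item plus follow-up URLs."""
    text, links = page
    item = {"url": url, "text": text}
    return item, links


def pipeline_process(item, store):
    """Item pipeline: clean the item and persist it (here, append to a list)."""
    item["text"] = item["text"].strip().title()
    store.append(item)


def engine(start_url):
    """Engine: move requests and items between the other components."""
    scheduler = deque([start_url])  # Scheduler: queue of URLs still to crawl
    seen = {start_url}
    store = []
    while scheduler:
        url = scheduler.popleft()
        page = download(url)
        item, links = spider_parse(url, page)
        pipeline_process(item, store)
        for link in links:
            if link not in seen:  # skip URLs that were already queued
                seen.add(link)
                scheduler.append(link)
    return store


if __name__ == "__main__":
    for stored in engine("http://example.test/"):
        print(stored)
```

Each function stands for one component, which makes it easy to see that the engine itself does no downloading or parsing; it only routes work.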

3. Getting Started

Create a crawler project:
```
scrapy startproject myspider
```
Create a spider file (run this inside the project directory):
```
scrapy genspider baidu www.baidu.com
```

Write the spider:

import scrapy
class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']


    def parse(self, response):
        with open('test.html', 'w', encoding='utf-8') as f:
            f.write(response.body.decode())

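Run the spider:
```
scrapy crawl baidu
```
Result: test.html is not generated. Baidu presumably identifies the request as automated rather than human, and its robots.txt rules, which Scrapy obeys when ROBOTSTXT_OBEY is True, block the crawl.

Fix: edit settings.py as shown here, then run the spider again.

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#################### Spoof the browser identity ####################
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4381.7 Safari/537.36'

# Obey robots.txt rules
#################### True in the generated settings.py ####################
ROBOTSTXT_OBEY = False

The USER_AGENT value can be copied from the request headers your browser sends when it opens the Baidu page.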

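4. Selectors

Documentation (Chinese): https://www.osgeo.cn/scrapy/topics/selectors.html

Getting the text content of a tag

get() returns the first matching element.

getall() returns all matching elements as a list.

Test page provided by the official documentation: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html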

import scrapy

class ClientSpider(scrapy.Spider):
    name = 'client'
    start_urls = ['https://docs.scrapy.org/en/latest/_static/selectors-sample1.html']

    def parse(self, response):
        a = response.xpath('//a/text()').get()
        a_list = response.xpath('//a/text()').getall() 
        print(a)
        # Name: My image 1
        print(a_list)
        # ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

Getting tag attributes

# Get every href
a_href_list = response.xpath('//a/@href').getall()
print(a_href_list)

# Get the href of a single element, in two ways
# First way
a_href1 = response.xpath('//a/@href').get()
print(a_href1)

# Second way
# attrib returns the attributes of the first matching element
a_href2 = response.css('a').attrib['href']
print(a_href2)

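5. Items

The content of a web page can be treated as an entity (an Item). The purpose of an Item is to turn unstructured data (the messy data pulled from a page) into structured data.

An Item is comparable to a struct in C: it is a container for scraped data, used when structuring page content, that is, when wrapping the content you want to extract into an object. A plain dict ({}) also works if you prefer something more direct.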
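
The difference between a struct-like container and a plain dict can be sketched with the standard library alone. The `LinkRecord` dataclass below is a hypothetical stand-in, not a Scrapy class; scrapy.Item gives a similar guarantee by rejecting keys that were not declared as fields.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LinkRecord:
    """Hypothetical container with a fixed set of fields, like a C struct."""
    a_text: Optional[str] = None   # link text
    a_href: Optional[str] = None   # link target
    img_src: Optional[str] = None  # thumbnail address


if __name__ == "__main__":
    # Structured: the field names are declared once and checked on construction.
    record = LinkRecord(a_text="Name: My image 1 ", a_href="image1.html")
    print(asdict(record))

    # A misspelled field name fails immediately instead of passing silently.
    try:
        LinkRecord(a_herf="image1.html")
    except TypeError as error:
        print("rejected:", error)

    # Unstructured: a plain dict accepts any key, typos included.
    loose = {"a_herf": "image1.html"}
    print(loose)
```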

The items.py file:

import scrapy

class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    a_text = scrapy.Field() # stores the text of the hyperlink
    a_href = scrapy.Field() # stores the address of the hyperlink
    img_src = scrapy.Field() # stores the address of the image

Error when importing items:
[image]
Solution:

The spider file client.py:

import scrapy
from myspider.items import MyspiderItem


class ClientSpider(scrapy.Spider):
    name = 'client'
    start_urls = ['https://docs.scrapy.org/en/latest/_static/selectors-sample1.html']

    def parse(self, response):
        a_list = response.xpath('//a')
        for a in a_list:
            item = MyspiderItem()
            item['a_text'] = a.xpath('./text()').get()
            item['a_href'] = a.xpath('./@href').get()
            item['img_src'] = a.xpath('./img/@src').get()
            print(item)

'''
{'a_href': 'image1.html',
 'a_text': 'Name: My image 1 ',
 'img_src': 'image1_thumb.jpg'}
{'a_href': 'image2.html',
 'a_text': 'Name: My image 2 ',
 'img_src': 'image2_thumb.jpg'}
{'a_href': 'image3.html',
 'a_text': 'Name: My image 3 ',
 'img_src': 'image3_thumb.jpg'}
{'a_href': 'image4.html',
 'a_text': 'Name: My image 4 ',
 'img_src': 'image4_thumb.jpg'}
{'a_href': 'image5.html',
 'a_text': 'Name: My image 5 ',
 'img_src': 'image5_thumb.jpg'}
'''

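6. Item Pipeline

Typical uses:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing scraped items in a database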
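
The cleaning, validation, and duplicate checks listed above could look roughly like this. It is a sketch under assumptions: the class name `ValidateAndDedupePipeline` and the choice of `a_href` as the duplicate key are invented for this example, and the fallback `DropItem` class only exists so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class ValidateAndDedupePipeline:
    """Hypothetical pipeline: clean a_text, require a_href, drop repeated a_href."""

    def __init__(self):
        self.seen_hrefs = set()

    def process_item(self, item, spider):
        href = item.get("a_href")
        if not href:
            raise DropItem("missing a_href")
        if href in self.seen_hrefs:
            raise DropItem(f"duplicate a_href: {href}")
        self.seen_hrefs.add(href)
        if item.get("a_text"):
            item["a_text"] = item["a_text"].strip()  # clean stray whitespace
        return item  # hand the item on to the next pipeline


if __name__ == "__main__":
    pipeline = ValidateAndDedupePipeline()
    samples = [
        {"a_href": "image1.html", "a_text": "Name: My image 1 "},
        {"a_href": "image1.html", "a_text": "Name: My image 1 "},
        {"a_text": "no link"},
    ]
    for sample in samples:
        try:
            print("kept:", pipeline.process_item(sample, spider=None))
        except DropItem as reason:
            print("dropped:", reason)
```

Raising DropItem stops an item from reaching later pipelines, while returning the item passes it along.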

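The first step is to enable the pipeline in settings.py:

# 300 is the priority; pipelines with lower values run first
ITEM_PIPELINES = {
   'myspider.pipelines.MyspiderPipeline': 300,
}

The second step is to write pipelines.py:

class MyspiderPipeline:
    def process_item(self, item, spider):
        print(item)
        # Return the item so that any later pipeline also receives it
        return item

How does pipelines.py receive the items produced in the spider file?

The spider hands each item over with the yield keyword: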

import scrapy
from myspider.items import MyspiderItem

class ClientSpider(scrapy.Spider):
    name = 'client'
    start_urls = ['https://docs.scrapy.org/en/latest/_static/selectors-sample1.html']

    def parse(self, response):
        a_list = response.xpath('//a')
        for a in a_list:
            # Create a new item for each link so that yielded items do not share state
            item = MyspiderItem()
            item['a_text'] = a.xpath('./text()').get()
            item['a_href'] = a.xpath('./@href').get()
            item['img_src'] = a.xpath('./img/@src').get()
            yield item
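
One detail is worth noting when yielding items from a loop: if a single mutable container is reused, everything that keeps a reference to it sees only the latest values. The sketch below shows this with plain dicts and generators; the function names are made up for the illustration.

```python
def reuse_one_container(values):
    """Yield the same dict object every time, overwriting its content."""
    record = {}
    for value in values:
        record["a_href"] = value
        yield record


def fresh_container_each_time(values):
    """Yield a new dict object on every iteration."""
    for value in values:
        yield {"a_href": value}


if __name__ == "__main__":
    hrefs = ["image1.html", "image2.html", "image3.html"]
    # Collecting the results keeps references, so the shared dict shows
    # only the last value, repeated once per element.
    print(list(reuse_one_container(hrefs)))
    print(list(fresh_container_each_time(hrefs)))
```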

7. Example Project: Crawling Articles

Target site: http://www.html-js.com

Content to crawl: article titles and article details, following pagination

Item definition in items.py:

import scrapy

class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    _id = scrapy.Field()  # field that every MongoDB document requires
    title = scrapy.Field()  # article title
    author = scrapy.Field()  # article author
    content = scrapy.Field()  # article content

The spider file client.py:

import scrapy
from myspider.items import MyspiderItem
import time

class ClientSpider(scrapy.Spider):
    name = 'client'
    allowed_domains = ['www.html-js.com']
    start_urls = ['http://www.html-js.com']

    def parse(self, response):
        article_list = response.xpath('//div[@class="articles mod-white"]/article')
        for article in article_list:
            item = MyspiderItem()
            # ".//" searches at any depth below the current element
            item['title'] = article.xpath('.//a[@class="entry-title"]/span/text()').get()
            item['author'] = article.xpath('.//a[@rel="author"]/text()').get()
            content_url = self.start_urls[0] + article.xpath('.//a[@class="entry-title"]/@href').get()

            # Request the detail page and pass the partly filled item along
            yield scrapy.Request(
                url=content_url,
                callback=self.parse_content,
                meta={'item': item}
            )
        # Pause for 3 s before crawling the next page.
        # Note: time.sleep blocks Scrapy's event loop; the DOWNLOAD_DELAY setting is the non-blocking alternative.
        time.sleep(3)
        # Crawl the next page
        next_btn = response.xpath(
            '//div[@class="pagination clearfix"]/ul/li[last()]')
        if next_btn.xpath('./@class').get() != 'disabled':
            # The slash must not be omitted here
            next_url = self.start_urls[0] + '/' + next_btn.xpath('./a/@href').get()
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    def parse_content(self, response):
        item = response.meta['item']
        # The content consists of nested tags, so text() cannot be used
        item['content'] = "".join(response.xpath('//div[@class="entry-content"]/*').getall())
        yield item
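
Building absolute URLs by string concatenation makes it easy to lose or double a slash. A possible alternative is `urllib.parse.urljoin`, sketched below; Scrapy's `response.urljoin(href)` performs the same kind of resolution against the response URL. The helper name `absolute` and the sample href values are assumptions, since the site's real markup is not shown here.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.html-js.com"


def absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against the base URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical href values; both forms resolve to the same absolute URL.
    for href in ("/article/1", "article/1"):
        print(href, "->", absolute(href))
```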

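The item pipeline in pipelines.py, storing items in MongoDB:

from itemadapter import ItemAdapter
import pymongo

mongo = pymongo.MongoClient(host='localhost', port=27017)

class MyspiderPipeline:
    def process_item(self, item, spider):
        db = mongo.scrapy_database  # select the database (scrapy_database); it is created if it does not exist
        collection = db.front_end_collection  # collection that stores the data
        collection.insert_one(item)
        # Return the item so that any later pipeline also receives it
        return item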
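
A pipeline can also open its connection when the spider starts and close it when the spider finishes, using the `open_spider` and `close_spider` hooks. The sketch below swaps MongoDB for the standard library's sqlite3 so that it runs without a database server; the class name, table name, and in-memory path are assumptions made for this example.

```python
import sqlite3


class SqliteArticlePipeline:
    """Hypothetical pipeline that stores title/author/content rows in SQLite."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection here.
        self.connection = sqlite3.connect(self.path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS articles (title TEXT, author TEXT, content TEXT)"
        )

    def process_item(self, item, spider):
        self.connection.execute(
            "INSERT INTO articles (title, author, content) VALUES (?, ?, ?)",
            (item.get("title"), item.get("author"), item.get("content")),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.connection.close()


if __name__ == "__main__":
    pipeline = SqliteArticlePipeline()
    pipeline.open_spider(spider=None)
    pipeline.process_item({"title": "t", "author": "a", "content": "<p>c</p>"}, spider=None)
    print(pipeline.connection.execute("SELECT title, author FROM articles").fetchall())
    pipeline.close_spider(spider=None)
```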

Full code: https://github.com/cjperfect/scrapy-demo

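8. Crawling a Zhihu Profile (with Cookies)

Initialize the project:
```
scrapy startproject zhihu
```

Create the spider file:

scrapy genspider zhihu www.zhihu.com/people/edit

Prepare the cookie and the request headers

Convert the cookie string into key: value form: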

cookie = 'your own cookie'
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4381.7 Safari/537.36'
}
def formate_cookie(self):
    final_cookie = {}
    items = self.cookie.split(';')
    for item in items:
        # Split on the first '=' only, because cookie values may contain '='
        obj = item.split('=', 1)
        key = obj[0].replace(' ', '')  # remove spaces
        value = obj[1]
        final_cookie[key] = value
    # Return after the loop so that every cookie is included
    return final_cookie
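
A somewhat more defensive, standalone variant of the cookie conversion is sketched below. It also tolerates a trailing semicolon and segments without an equals sign. The function name `parse_cookie_string` and the sample cookie are made up for the illustration.

```python
def parse_cookie_string(raw_cookie):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        key, separator, value = pair.partition("=")  # split on the first '=' only
        key = key.strip()
        if not key or not separator:
            continue  # skip empty segments and segments without '='
        cookies[key] = value.strip()
    return cookies


if __name__ == "__main__":
    # Made-up cookie string; the names and values are placeholders.
    sample = "session=abc123; token=dG9rZW4=; theme=dark;"
    print(parse_cookie_string(sample))
```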

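Sending a request with headers and cookies

First, find out which method sends the initial request, so that the headers and cookies can be attached there.

Ctrl+click start_urls in the IDE and look for where start_urls is used: the requests are created in the start_requests method.

So the change goes into start_requests: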

def start_requests(self):
    cookie = self.formate_cookie()
    yield scrapy.Request(
        url=self.start_urls[0],
        cookies=cookie,
        headers=self.headers,
        callback=self.parse
    )

Reading the response

def parse(self, response):
    information = response.xpath('//span[@class="ztext"]/text()').get()
    print(information)

Run the spider:
```
scrapy crawl zhihu
```

Full code of zhihu.py:

import scrapy

class ClientSpider(scrapy.Spider):
    name = 'zhihu'  # must match the name passed to "scrapy crawl"
    start_urls = ['https://www.zhihu.com/people/edit']
    cookie = 'your cookie'
    headers = {
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4381.7 Safari/537.36'
    }

    def start_requests(self):
        cookie = self.formate_cookie()
        yield scrapy.Request(
            url=self.start_urls[0],
            cookies=cookie,
            headers=self.headers,
            callback=self.parse
        )

    def parse(self, response):
        information = response.xpath('//span[@class="ztext"]/text()').get()
        print(information)

    def formate_cookie(self):
        final_cookie = {}
        items = self.cookie.split(';')
        for item in items:
            # Split on the first '=' only, because cookie values may contain '='
            obj = item.split('=', 1)
            key = obj[0].replace(' ', '')  # remove spaces
            value = obj[1]
            final_cookie[key] = value
        return final_cookie

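9. Simulating a POST Login, Method 1 (`scrapy.FormRequest`)

Simulated login to GitHub: https://github.com/login

Check which parameters the login POST needs

Log in successfully once in the browser, then inspect the form's submission URL and the parameters sent with it.
Prepare the data to be submitted

Write the code

Use scrapy.FormRequest() to submit the login form: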

import scrapy

class PostSpider(scrapy.Spider):
    name = 'post'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').get()
        webauthn_support = response.xpath('//input[@name="webauthn-iuvpaa-support"]/@value').get()
        commit = response.xpath('//input[@name="commit"]/@value').get()
        form_data = {
            'login': 'your username',
            'password': 'your password',
            'authenticity_token': authenticity_token,
            'webauthn-iuvpaa-support': webauthn_support,  # same name as the input the value was read from
            'commit': commit
        }

        yield scrapy.FormRequest(
            url='https://github.com/session',  # the form's submission URL
            formdata=form_data,
            callback=self.after_login
        )

    def after_login(self, response):
        with open('test.html', 'w', encoding='utf-8') as f:
            f.write(response.body.decode())
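
For a POST request, the formdata dictionary is sent as an application/x-www-form-urlencoded body. What that body looks like can be sketched with the standard library; the helper name `encode_form` and the sample values are placeholders, and a real token would come from the hidden input on the login page.

```python
from urllib.parse import parse_qs, urlencode


def encode_form(form_data):
    """Encode a dict of form fields the way an HTML form POST body is encoded."""
    return urlencode(form_data)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    body = encode_form({
        "login": "user@example.com",
        "password": "p@ss word",
        "authenticity_token": "TOKEN+/=",
    })
    print(body)
    # Decoding the body gives the original fields back.
    print(parse_qs(body))
```

Special characters such as spaces, `+`, and `=` are escaped in the body, which is why the token should be passed as read from the page and not escaped by hand.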


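10. Simulating a POST Login, Method 2 (`scrapy.FormRequest.from_response`)

From the official documentation: web sites usually provide pre-populated form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (for login pages). When scraping, you will want these fields to be pre-populated automatically and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job.

scrapy.FormRequest.from_response() finds the form in the response and submits it automatically, whereas scrapy.FormRequest needs the submission URL to be given explicitly.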

import scrapy

class PostSpider(scrapy.Spider):
    name = 'post'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
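        form_data = {
            'login': 'your username',
            'password': 'your password',
        }

        yield scrapy.FormRequest.from_response(
            response,  # the form is located in this response, and the request goes to the form's action URL
            formdata=form_data,  # only the fields given here are overridden; the other form fields are filled in automatically
            callback=self.after_login
        )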

    def after_login(self, response):
        with open('test.html', 'w', encoding='utf-8') as f:
            f.write(response.body.decode())

Parameters of scrapy.FormRequest.from_response():

formid=None: locate the form by its id attribute

formname=None: locate the form by its name attribute

formxpath=None: locate the form with an XPath expression
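
The idea behind from_response(), picking one form, keeping its pre-filled fields, and overriding only a few of them, can be sketched with the standard library. This is not Scrapy's implementation: `FormFieldCollector`, `build_form_data`, and the sample page are invented for the illustration, and only selection by id is shown.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements inside the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.inside_target = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.inside_target = attributes.get("id") == self.form_id
        elif tag == "input" and self.inside_target and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.inside_target = False


def build_form_data(html, form_id, overrides):
    """Start from the form's pre-filled values, then apply the caller's overrides."""
    collector = FormFieldCollector(form_id)
    collector.feed(html)
    return {**collector.fields, **overrides}


if __name__ == "__main__":
    # Made-up page with two forms; only the one with id="login" is used.
    page = """
    <form id="search"><input name="q" value=""></form>
    <form id="login" action="/session">
      <input type="hidden" name="authenticity_token" value="TOKEN">
      <input name="login" value="">
      <input name="password" value="">
    </form>
    """
    print(build_form_data(page, "login", {"login": "me", "password": "secret"}))
```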

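11. Downloader Middleware (`Downloader Middleware`)

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

The individual methods are described in the official documentation.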
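
The order in which the hooks run can be sketched with a toy chain: process_request is called in increasing priority order, and process_response in decreasing order. The code below is not Scrapy's implementation; `Tracer`, `run_chain`, the dict-based request and response, and the simplified signatures (no spider argument, no handling of non-None return values) are assumptions made for this example.

```python
class Tracer:
    """Toy middleware that records when its hooks are called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(f"{self.name}.process_request")
        return None  # None: let the next middleware handle the request

    def process_response(self, request, response):
        self.log.append(f"{self.name}.process_response")
        return response  # pass the response on unchanged


def run_chain(middlewares, request, log):
    """Run a request through (priority, middleware) pairs and a fake download."""
    ordered = [mw for _, mw in sorted(middlewares, key=lambda pair: pair[0])]
    for middleware in ordered:  # requests pass in increasing priority order
        middleware.process_request(request)
    log.append("download")
    response = {"url": request["url"], "status": 200}  # canned response
    for middleware in reversed(ordered):  # responses pass in decreasing order
        response = middleware.process_response(request, response)
    return response


if __name__ == "__main__":
    calls = []
    chain = [(543, Tracer("custom", calls)), (100, Tracer("early", calls))]
    run_chain(chain, {"url": "http://example.test/"}, calls)
    print("\n".join(calls))
```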

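12. Downloader Middleware Usage Example

Look up the local IP address, then route requests through a proxy to check whether the downloader middleware is actually invoked.

Look up the local IP

Activate the downloader middleware in settings.py:

# Uncomment
DOWNLOADER_MIDDLEWARES = {
   'scrapyPost.middlewares.ScrapypostDownloaderMiddleware': 543,
}

Write the code in middlewares.py

Site for obtaining proxies

Site for obtaining User-Agent strings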

from scrapy import signals
import random

class ScrapypostDownloaderMiddleware:
    ip_https_list = ['36.25.31.223:9999', '27.8.27.204:9999']
    ip_http_list = ['36.251.141.16:9999', '36.56.100.62:9999']
    user_agent_list = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
        'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
    ]

    # Called for each request before it is downloaded
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        # request.meta['proxy'] = 'http://36.251.141.16:9999'
        return None

    # Called when a response comes back
    def process_response(self, request, response, spider):
        print('The response is also available here')
        return response

    # Called when the download raises an exception
    def process_exception(self, request, exception, spider):
        # Retry through a random proxy; the proxy is given as a URL of the form http://host:port
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.ip_http_list)
        else:
            request.meta['proxy'] = 'http://' + random.choice(self.ip_https_list)
        return request
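
Choosing the proxy pool by URL scheme could be factored into a small helper, sketched here with `urllib.parse.urlsplit` in place of manual string splitting. The helper name `pick_proxy`, the pool layout, the placeholder addresses, and the `http://` prefix on the returned value are all assumptions for this example.

```python
import random
from urllib.parse import urlsplit

# Placeholder proxy pools keyed by the target URL's scheme.
PROXY_POOLS = {
    "http": ["192.0.2.10:9999", "192.0.2.11:9999"],
    "https": ["198.51.100.20:9999", "198.51.100.21:9999"],
}


def pick_proxy(url, pools=PROXY_POOLS, rng=random):
    """Return a proxy URL for the target URL's scheme, or None if no pool matches."""
    scheme = urlsplit(url).scheme.lower()
    candidates = pools.get(scheme)
    if not candidates:
        return None
    return "http://" + rng.choice(candidates)


if __name__ == "__main__":
    print(pick_proxy("http://example.test/page"))
    print(pick_proxy("https://example.test/page"))
    print(pick_proxy("ftp://example.test/file"))
```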
