1. CrawlSpider
Introduction:
In the code we wrote earlier, a lot of time went into hunting for the next-page URL or the detail-page URL. Can that step be made simpler?
Definition:
CrawlSpider is another way of crawling data with Scrapy.
Learning goal:
Understand how to use CrawlSpider.
CrawlSpider is a subclass of the Spider crawler class.
Its key feature:
It extracts links according to the rules and sends them on to the engine.
How to create a CrawlSpider:
scrapy genspider -t crawl xx xx.com
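For the case below, the spider would have been generated with something like this (spider name and domain taken from the example code):

scrapy genspider -t crawl poems gushiwen.cn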
CrawlSpider is quite convenient in some scenarios, provided that the URL pattern is easy to express with a regular expression.
The regular expression must be written correctly; a quick way to sanity-check it is shown below.
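A minimal sketch for checking the allow pattern before putting it into a Rule, using Python's re module (the test URLs here are made-up examples):

import re

pattern = r'https://so.gushiwen.cn/shiwenv_\w+\.aspx'
test_urls = [
    'https://so.gushiwen.cn/shiwenv_45c396367f59.aspx',  # detail page: should match
    'https://www.gushiwen.cn/default_2.aspx',            # list page: should not match
]
for url in test_urls:
    print(url, bool(re.search(pattern, url)))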
Case study:
Requirements: 1) open the home page 2) open each detail page and get the poem title
Code: (D:\python_spider\day22\ancient_poems\ancient_poems\spiders\poems.py)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PoemsSpider(CrawlSpider):
    name = 'poems'
    allowed_domains = ['gushiwen.cn', 'gushiwen.org']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    rules = (
        # list pages: just follow them, no callback needed
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_[12]\.aspx'), follow=True),
        # detail pages: hand each one to parse_item
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+\.aspx'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # each poem block sits in div.left > div.sons; the title is in the <h1>
        gsw_divs = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for gsw_div in gsw_divs:
            title = gsw_div.xpath('.//h1/text()').get()
            print(title)
            yield {'title': title}
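The spider is run from the project directory as usual; the spider name comes from the name attribute above:

scrapy crawl poems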
Further reading:
Understanding the Rule() and LinkExtractor() functions in Scrapy
Using the CrawlSpider class in Python crawlers
Basic usage of LinkExtractor
2. Implementing login in Scrapy
1 Send the request to the target URL carrying a cookie
2 Send a POST request to the target URL carrying the form data (account and password)
3 Simulate the login with Selenium (find the input tags, switch the login mode if needed, locate the username and password boxes, locate the button); a sketch is given below
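A minimal sketch of option 3 (not from the course code): log in with Selenium and collect the cookies for Scrapy to reuse. The element IDs below are hypothetical; a real page (Qzone included) needs its own selectors and often an iframe switch.

from selenium import webdriver
from selenium.webdriver.common.by import By


def login_and_get_cookies(login_url, username, password):
    driver = webdriver.Chrome()
    driver.get(login_url)
    # switch to the account/password form if the page defaults to QR-code login
    # driver.find_element(By.ID, 'switcher_plogin').click()
    driver.find_element(By.ID, 'username').send_keys(username)   # username input box (hypothetical ID)
    driver.find_element(By.ID, 'password').send_keys(password)   # password input box (hypothetical ID)
    driver.find_element(By.ID, 'login_button').click()           # login button (hypothetical ID)
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}
    driver.quit()
    return cookies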
2.1 The cookie approach
2.1.1 In the spider file
First way
Target URL:
https://user.qzone.qq.com/your_QQ_number
Summary:
1. By reading the source code we found the start_requests() method: it can carry the cookies directly when the request for start_urls is sent, so the resulting response already carries the cookie.
2. In Scrapy, cookies must be in key-value (dict) format (a caveat about this one-liner is noted below):
cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
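One caveat: a value that itself contains '=' (for example pgv_info=ssid=... in the cookie string used below) gets truncated, because split('=') cuts at every '='. Splitting on the first '=' only is safer. A small standalone check (the sample string is made up):

# sample cookie string (made up); in practice it is copied from the browser
cookie_str = 'uin=o123456; skey=@AbCdEf; pgv_info=ssid=s703'
cookies = {kv.split('=', 1)[0]: kv.split('=', 1)[1] for kv in cookie_str.split('; ')}
print(cookies)   # {'uin': 'o123456', 'skey': '@AbCdEf', 'pgv_info': 'ssid=s703'}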
Code: (D:\python_spider\day22\qq_zone\qq_zone\spiders\qq.py)
import scrapy


class QqSpider(scrapy.Spider):
    name = 'qq'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/your_QQ_number']

    def start_requests(self):
        # cookie string copied from the browser after logging in manually
        cookies = 'pgv_pvi=350465024; RK=n8qgPcxyTa; ptcz=c219dcd40cf2d30521a04833cdc036c2162182b9168f3f0886df3339dc6df90a; eas_sid=R185f9z9k6F4F9k2h8g8k938y9; pgv_pvid=8981388016; o_cookie=2023203294; pac_uid=1_2023203294; iip=0; LW_sid=s1K6x1T8D863J7z7S5d4U239z5; LW_uid=u1t6D1X8e8R3N7f7S5a432D9W6; tvfe_boss_uuid=6c52faf80d8c4e13; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; luin=o2023203294; lskey=0001000008d7bbb8a7b813d5133af48265d02b770a5f00e5b307d085e61c105c86e592d006ed430f90301efe; Loading=Yes; cpu_performance_v8=16; _qpsvr_localtk=0.7949891033531788; uin=o2023203294; skey=@IpSNpAXPA; p_uin=o2023203294; pt4_token=5YcSMdjZBJlAcnbjIsTSDRVIWqFdDYKPbMOgBRGXTpY_; p_skey=5pkKMqWMVgegrtwhibgQQme4rxPJVH6V4J2vXSRwq6Y_; 2023203294_todaycount=0; 2023203294_totalcount=5222; pgv_info=ssid=s7030297087'
        # convert the "k=v; k=v" string into the dict format Scrapy expects
        cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
        print(cookies)
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_1,
            cookies=cookies
        )

    def parse_1(self, response):
        # save the page so it can be opened in a browser to check whether we are logged in
        with open('qzone.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
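To check the result, run the spider and open the saved file in a browser; if the cookie string is still valid, qzone.html shows the logged-in Qzone page:

scrapy crawl qq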
2.1.2 In the downloader middleware
(D:\python_spider\day22\qq_zone\qq_zone\middlewares.py)
Every parameter of Request() is also available as an attribute of the request object.
The parameters of Request() include: url, callback, meta, headers, cookies.
Of these, headers, cookies and meta are dicts.
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called

    # set the cookies on every outgoing request
    cookies = 'pgv_pvi=350465024; RK=n8qgPcxyTa; ptcz=c219dcd40cf2d30521a04833cdc036c2162182b9168f3f0886df3339dc6df90a; eas_sid=R185f9z9k6F4F9k2h8g8k938y9; pgv_pvid=8981388016; o_cookie=2023203294; pac_uid=1_2023203294; iip=0; LW_sid=s1K6x1T8D863J7z7S5d4U239z5; LW_uid=u1t6D1X8e8R3N7f7S5a432D9W6; tvfe_boss_uuid=6c52faf80d8c4e13; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; luin=o2023203294; lskey=0001000008d7bbb8a7b813d5133af48265d02b770a5f00e5b307d085e61c105c86e592d006ed430f90301efe; Loading=Yes; cpu_performance_v8=16; _qpsvr_localtk=0.7949891033531788; uin=o2023203294; skey=@IpSNpAXPA; p_uin=o2023203294; pt4_token=5YcSMdjZBJlAcnbjIsTSDRVIWqFdDYKPbMOgBRGXTpY_; p_skey=5pkKMqWMVgegrtwhibgQQme4rxPJVH6V4J2vXSRwq6Y_; 2023203294_todaycount=0; 2023203294_totalcount=5222; pgv_info=ssid=s7030297087'
    cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
    print(cookies)
    request.cookies = cookies
    return None
What needs editing in settings.py:
# COOKIES_ENABLED left commented out: the cookies passed in the spider file (Request(cookies=...)) are used (the default is True)
# COOKIES_ENABLED = False: the cookies middleware is disabled, so only the Cookie header in DEFAULT_REQUEST_HEADERS in settings.py is sent
# COOKIES_ENABLED = True: cookies set in the downloader middleware take effect
COOKIES_ENABLED = True
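For the COOKIES_ENABLED = False case, the cookie is carried as a plain header instead; a sketch of the relevant settings.py entry (header values are placeholders):

# settings.py (sketch; values are placeholders)
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Cookie': 'uin=o123456; skey=@AbCdEf',
}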
Extra: setting a proxy IP
In the downloader middleware: (D:\python_spider\day22\qq_zone\qq_zone\middlewares.py)
# at the top of middlewares.py
import random

# placeholder proxy pool; replace with real proxy addresses
PROXIES = ['http://1.2.3.4:8888', 'http://5.6.7.8:8888']

# inside the downloader middleware class
def process_request(self, request, spider):
    # same hook as above; returning None lets processing continue

    # set a proxy IP via request.meta; note the key must be 'proxy'
    # (the original notes mark this part as pseudocode, so the pool above is illustrative)
    proxy = random.choice(PROXIES)
    request.meta['proxy'] = proxy
    return None
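Neither the cookie-setting nor the proxy-setting middleware takes effect unless the middleware class is enabled in settings.py. A minimal sketch, assuming the class name Scrapy generated for this project (adjust if yours differs):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'qq_zone.middlewares.QqZoneDownloaderMiddleware': 543,
}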
2.2 POST requests
Standard approach:
The key is scrapy.FormRequest
import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # read the hidden form fields off the login page
        commit = 'Sign in'
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        login = 'LogicJerry'
        password = '12122121zxl'
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()
        data = {
            'commit': commit,
            'authenticity_token': authenticity_token,
            'login': login,
            'password': password,
            'webauthn-support': 'supported',
            'webauthn-iuvpaa-support': 'unsupported',
            'timestamp': timestamp,
            'timestamp_secret': timestamp_secret,
        }
        # send the form data as a POST request; FormRequest issues a POST
        yield scrapy.FormRequest(
            # target URL
            url='https://github.com/session',
            # data to submit
            formdata=data,
            # callback for the response
            callback=self.after_login
        )

    def after_login(self, response):
        # save the page to a file
        with open('github.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
Simpler approach:
The key is scrapy.FormRequest.from_response, which reads the <form> on the login page and fills in the hidden fields (authenticity_token, timestamp, and so on) automatically, so only login and password need to be supplied.
import scrapy


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            # the response whose login form will be submitted
            response=response,  # note: the login-page response itself is passed in
            # only the fields we fill in ourselves
            formdata={'login': 'LogicJerry', 'password': '12122121zxl'},
            # callback
            callback=self.after_login
        )

    def after_login(self, response):
        # save the page to a file
        with open('github2.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
Summary:
Pay attention to the method names used above: start_requests(), process_request(), scrapy.FormRequest and scrapy.FormRequest.from_response.