2021/6/10 Crawler Lesson 22 (CrawlSpider, logging in with Scrapy)

1. CrawlSpider

Introduction:
Looking back at our earlier code, a large part of the time went into finding the URL of the next page or of the detail page. Can that process be made simpler?

Definition:
CrawlSpider is another way of crawling data with Scrapy.

Learning goals:
Understand how to use CrawlSpider.
CrawlSpider is a subclass of the Spider crawler class.

Its key feature:
It extracts links according to the rules and sends them to the engine.

How to create a CrawlSpider:
scrapy genspider -t crawl xx xx.com
In some scenarios a CrawlSpider is quite convenient; the precondition is that the URL pattern is easy to capture with a regular expression.
The regular expression must be written correctly (a quick check is shown below).
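
Before wiring a pattern into a Rule, it helps to sanity-check it with the re module. A minimal sketch (the sample URL is a made-up placeholder in the detail-page format used in the case below):

import re

pattern = r'https://so\.gushiwen\.cn/shiwenv_\w+\.aspx'
sample_url = 'https://so.gushiwen.cn/shiwenv_abc123.aspx'  # made-up sample URL

# fullmatch returns a match object only when the whole URL fits the pattern
print(bool(re.fullmatch(pattern, sample_url)))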

Case study:
Requirements: 1) open the home page 2) open each detail page and get the poem title
Code: (D:\python_spider\day22\ancient_poems\ancient_poems\spiders\poems.py)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PoemsSpider(CrawlSpider):
    name = 'poems'
    allowed_domains = ['gushiwen.cn','gushiwen.org']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    rules = (
        # follow the pagination links (list pages 1 and 2); no callback, just keep following
        Rule(LinkExtractor(allow=r'https://www\.gushiwen\.cn/default_[12]\.aspx'), follow=True),
        # each detail-page link is handed to parse_item
        Rule(LinkExtractor(allow=r'https://so\.gushiwen\.cn/shiwenv_\w+\.aspx'), callback='parse_item', follow=True)
    )

    def parse_item(self, response):
        # every detail page matched by the second rule lands here
        gsw_divs = response.xpath('//div[@class="left"]/div[@class="sons"]')

        for gsw_div in gsw_divs:
            # the poem title is the <h1> inside each block
            title = gsw_div.xpath('.//h1/text()').get()
            print(title)
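
To try the spider, run it from the project directory:

scrapy crawl poems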

Further reading:
Understanding scrapy's Rule() and LinkExtractor()
Using the CrawlSpider class for Python crawlers
Basic usage of LinkExtractor
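
To see which links a LinkExtractor would actually pick up on a page, one option is the scrapy shell, where a response object is already available. A minimal sketch:

scrapy shell https://www.gushiwen.cn/default_1.aspx

# then, inside the shell:
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'https://so\.gushiwen\.cn/shiwenv_\w+\.aspx')
for link in le.extract_links(response):  # extract_links() returns Link objects
    print(link.url)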

2. Logging in with Scrapy

1 Send the request to the target URL carrying cookies

2 Send a POST request to the target URL carrying data (the username and password)

3 Simulate the login with Selenium (switch the login mode, find the username and password input boxes, locate the login button); only approaches 1 and 2 are covered in code below, but a rough Selenium sketch follows this list
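
A rough sketch of the Selenium approach (every element locator below is a hypothetical placeholder; a real page needs its own selectors):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # hypothetical login page

# switch from QR-code login to the username/password form if necessary
driver.find_element(By.ID, 'switch_to_password_login').click()  # hypothetical element id

# fill in the credentials and submit
driver.find_element(By.ID, 'username').send_keys('your_username')  # hypothetical element id
driver.find_element(By.ID, 'password').send_keys('your_password')  # hypothetical element id
driver.find_element(By.ID, 'login_button').click()                 # hypothetical element id

# the logged-in cookies can then be read and handed over to Scrapy
cookies = {c['name']: c['value'] for c in driver.get_cookies()}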

2.1 The cookie approach

2.1.1 In the spider

First approach.
Target URL:
https://user.qzone.qq.com/<your QQ number>

Summary:
1. From reading the Scrapy source we found the start_requests() method; by overriding it we can attach the cookies directly to the request sent to start_urls, so the response already comes back with the session cookie.
2. In Scrapy, cookies must be passed as a key-value dict:
cookies = {i.split('=')[0]:i.split('=')[1] for i in cookies.split('; ')}
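
A tiny worked example of that conversion (the cookie string here is made up):

raw = 'uin=o123456; skey=abc'  # made-up cookie string copied from the browser
cookies = {i.split('=')[0]: i.split('=')[1] for i in raw.split('; ')}
print(cookies)  # {'uin': 'o123456', 'skey': 'abc'}
# if a cookie value itself contains '=', use i.split('=', 1) instead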

Code: (D:\python_spider\day22\qq_zone\qq_zone\spiders\qq.py)

import scrapy
class QqSpider(scrapy.Spider):
    name = 'qq'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/你的QQ号']

    def start_requests(self):
        cookies='pgv_pvi=350465024; RK=n8qgPcxyTa; ptcz=c219dcd40cf2d30521a04833cdc036c2162182b9168f3f0886df3339dc6df90a; eas_sid=R185f9z9k6F4F9k2h8g8k938y9; pgv_pvid=8981388016; o_cookie=2023203294; pac_uid=1_2023203294; iip=0; LW_sid=s1K6x1T8D863J7z7S5d4U239z5; LW_uid=u1t6D1X8e8R3N7f7S5a432D9W6; tvfe_boss_uuid=6c52faf80d8c4e13; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; luin=o2023203294; lskey=0001000008d7bbb8a7b813d5133af48265d02b770a5f00e5b307d085e61c105c86e592d006ed430f90301efe; Loading=Yes; cpu_performance_v8=16; _qpsvr_localtk=0.7949891033531788; uin=o2023203294; skey=@IpSNpAXPA; p_uin=o2023203294; pt4_token=5YcSMdjZBJlAcnbjIsTSDRVIWqFdDYKPbMOgBRGXTpY_; p_skey=5pkKMqWMVgegrtwhibgQQme4rxPJVH6V4J2vXSRwq6Y_; 2023203294_todaycount=0; 2023203294_totalcount=5222; pgv_info=ssid=s7030297087'
        cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}  # convert the raw cookie string into the dict form Scrapy expects
        print(cookies)

        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_1,
            cookies=cookies

        )

    def parse_1(self, response):
        with open('qzone.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)

2.1.2 In the downloader middleware

(D:\python_spider\day22\qq_zone\qq_zone\middlewares.py)
Every parameter of Request() is also available as an attribute of the request object.
The Request() parameters are: url, callback, meta, headers, cookies.
Of these, headers, cookies and meta are dicts.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # set the cookies on the request so the downloader sends them along
        cookies='pgv_pvi=350465024; RK=n8qgPcxyTa; ptcz=c219dcd40cf2d30521a04833cdc036c2162182b9168f3f0886df3339dc6df90a; eas_sid=R185f9z9k6F4F9k2h8g8k938y9; pgv_pvid=8981388016; o_cookie=2023203294; pac_uid=1_2023203294; iip=0; LW_sid=s1K6x1T8D863J7z7S5d4U239z5; LW_uid=u1t6D1X8e8R3N7f7S5a432D9W6; tvfe_boss_uuid=6c52faf80d8c4e13; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; luin=o2023203294; lskey=0001000008d7bbb8a7b813d5133af48265d02b770a5f00e5b307d085e61c105c86e592d006ed430f90301efe; Loading=Yes; cpu_performance_v8=16; _qpsvr_localtk=0.7949891033531788; uin=o2023203294; skey=@IpSNpAXPA; p_uin=o2023203294; pt4_token=5YcSMdjZBJlAcnbjIsTSDRVIWqFdDYKPbMOgBRGXTpY_; p_skey=5pkKMqWMVgegrtwhibgQQme4rxPJVH6V4J2vXSRwq6Y_; 2023203294_todaycount=0; 2023203294_totalcount=5222; pgv_info=ssid=s7030297087'
        cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}  # same string-to-dict conversion as in the spider
        print(cookies)
        request.cookies = cookies

        return None

What needs to be edited in settings.py:

# COOKIES_ENABLED left commented out (default True): the cookies passed in the spider's Request are used
# COOKIES_ENABLED = False: the Cookie header from DEFAULT_REQUEST_HEADERS in settings.py is used instead
# COOKIES_ENABLED = True: the cookies set in the middleware (or on the Request) are used
COOKIES_ENABLED = True
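
For the COOKIES_ENABLED = False case, a minimal sketch of what settings.py would look like (the cookie value is just a placeholder):

COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # with the cookies middleware disabled, a raw Cookie header set here is sent as-is
    'Cookie': 'uin=o123456; skey=abc',  # placeholder cookie string
}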

Extra: setting a proxy IP.
In the downloader middleware: (D:\python_spider\day22\qq_zone\qq_zone\middlewares.py)

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # a proxy IP is set through request.meta, under the 'proxy' key
        # the lines below are pseudocode: A stands for a list of proxy URLs such as 'http://ip:port'
        proxy = random.choice(A)
        request.meta['proxy'] = proxy

        return None
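
For either middleware snippet above to run, the downloader middleware has to be enabled in settings.py. A sketch, assuming the default class name generated for this project:

DOWNLOADER_MIDDLEWARES = {
    # class name assumed from the template generated for the qq_zone project
    'qq_zone.middlewares.QqZoneDownloaderMiddleware': 543,
}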

2.2 POST requests

The standard approach:
The key is scrapy.FormRequest

import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # pull the hidden form fields off the login page so the POST will be accepted
        commit = 'Sign in'
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        login = 'LogicJerry'
        password = '12122121zxl'
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()
        data = {
            'commit': commit,
            'authenticity_token': authenticity_token,
            'login': login,
            'password': password,
            'webauthn-support': 'supported',
            'webauthn-iuvpaa-support': 'unsupported',
            'timestamp': timestamp,
            'timestamp_secret': timestamp_secret,
        }

        # send the form data as a POST request; FormRequest issues a POST
        yield scrapy.FormRequest(
            # target URL
            url='https://github.com/session',
            # the form data to submit
            formdata=data,
            # callback that handles the response
            callback=self.after_login
        )

    def after_login(self,response):
        # save the response so we can check whether the login succeeded
        with open('github.html','w',encoding='utf-8') as file_obj:
            file_obj.write(response.text)

The simpler approach:
The key is scrapy.FormRequest.from_response

import scrapy


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response reads the <form> on the login page and submits it for us
        yield scrapy.FormRequest.from_response(
            # the response of the login-page request, from which the form and its hidden fields are taken
            response=response,
            # the fields we fill in ourselves
            formdata={'login':'LogicJerry','password':'12122121zxl'},
            # callback that handles the response
            callback=self.after_login
        )

    def after_login(self,response):
        # save the response to a file
        with open('github2.html','w',encoding='utf-8') as file_obj:
            file_obj.write(response.text)

Summary:
Pay attention to which method is used in each approach (start_requests for cookies in the spider, process_request in the middleware, FormRequest vs. FormRequest.from_response for POST logins).
