Three ways to handle the Douban login captcha in Scrapy

Latest recommended article published on 2024-08-04 10:14:49

辉辉咯

Category: Python crawlers

Article link: https://blog.csdn.net/qq_41020281/article/details/79437455

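When you crawl Douban with Scrapy and need to log in, you have to deal with a captcha. This article covers three ways to recognize it: 1) OCR, which has low accuracy but can be improved with training; 2) a third-party captcha-solving platform, which costs money but gives stable results; 3) manual recognition, which is accurate but time-consuming. Finally, the login state is checked to confirm that the login succeeded.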

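When a Python crawler logs in to Douban, it has to get past the login captcha, and there are three ways to handle it. The flow is: fetch the login page, download the captcha image, obtain the captcha text with one of the three recognition methods, and then log in. The spider methods below implement this flow.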
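The "try each recognition method until one works" step can be sketched on its own, outside Scrapy. This is only a minimal illustration, assuming each method can be wrapped as a function that takes the captcha image bytes and returns the text, or `None` on failure. The names `first_recognized`, `by_ocr`, `by_platform` and `by_hand` are hypothetical placeholders, not part of the spider or of any library.

```python
from typing import Callable, Iterable, Optional

# A recognizer takes raw image bytes and returns the captcha text, or None.
Recognizer = Callable[[bytes], Optional[str]]


def first_recognized(captcha_data: bytes,
                     recognizers: Iterable[Recognizer]) -> Optional[str]:
    """Try each recognizer in order and return the first non-empty answer."""
    for recognize in recognizers:
        try:
            text = recognize(captcha_data)
        except Exception:
            # A failing method (for example an OCR error) should not stop the chain.
            continue
        if text and text.strip():
            return text.strip()
    return None


# Hypothetical stand-ins for the three approaches. Real versions would call
# an OCR library, a paid solving service, or prompt a person for input.
def by_ocr(data: bytes) -> Optional[str]:
    return None  # pretend OCR could not read the image


def by_platform(data: bytes) -> Optional[str]:
    raise RuntimeError("service unavailable")  # pretend the service is down


def by_hand(data: bytes) -> Optional[str]:
    return " abcd "  # pretend a person typed the answer


result = first_recognized(b"fake-image-bytes", [by_ocr, by_platform, by_hand])
print(result)  # prints: abcd
```

Passing the functions themselves, rather than their results, means the paid and manual methods are only invoked when the cheaper ones have already failed.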

    def start_requests(self):
        yield scrapy.Request(
            self.login_url,
            callback=self.parse_login
        )
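    # Parse the login page to get the captcha image URL
    def parse_login(self, response):
        captcha_url = response.xpath('//img[@id="captcha_image"]/@src').extract_first()
        # Download the captcha image, carrying the login page along in meta
        yield scrapy.Request(
            captcha_url,
            meta={'login_response': response},
            callback=self.login
        )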

    def login(self,response):
        login_response = response.meta.get('login_response')
        captcha_data = response.body
        # Try the captcha recognition methods in turn until one of them succeeds
        for method in [self.get_captcha_by_self(captcha_data)]:
        #for method in ['self.g