Crawling with Scrapy using an account and password

slibra_L

Category: Web crawlers

Article link: https://blog.csdn.net/slibra_L/article/details/89533109

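I have recently been crawling issues on GitHub and GitLab and ran into quite a few pitfalls. After some investigation I finally got everything working.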

1. Basic settings

In settings.py, set the following. Otherwise the site may not be crawled properly and requests may come back with nothing:

ROBOTSTXT_OBEY = False

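ROBOTSTXT_OBEY is set to True in the settings.py generated by `scrapy startproject`, which means the spider obeys the rules in robots.txt. So what is robots.txt?
robots.txt is a file that follows the Robots protocol. It is stored on the website's server and tells search-engine crawlers which directories of the site should not be crawled and indexed. When ROBOTSTXT_OBEY is on, Scrapy fetches the site's robots.txt first and uses it to decide which parts of the site it may crawl. Of course, we are not building a search engine, and in some cases the content we want is exactly what robots.txt forbids. In those cases we set this option to False and do not follow the Robots protocol.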
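To make the effect of robots.txt concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and URLs are made up for illustration, and this only demonstrates the protocol; it is not how Scrapy performs its check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /users/
Disallow: /bigdata/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The sign-in page falls under a Disallow rule; /explore does not.
for url in ("http://example.com/users/sign_in", "http://example.com/explore"):
    print(url, is_allowed(ROBOTS_TXT, "MySpider", url))
```

With rules like these, a spider that obeys robots.txt would never request the sign-in page, which matches the "no response" symptom described above.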

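2. Simulating login

Pages on GitHub/GitLab only show their data after login, so the spider has to simulate a login. First get the authenticity_token parameter, then use FormRequest.from_response to POST the form data, which submits the username and password.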

    def parse(self, response):
        # First grab the hidden form field authenticity_token
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        print(authenticity_token)

        # Second request: POST the form with the cookies, browser headers and
        # login credentials, so that the cookie session becomes authorized
        return FormRequest.from_response(
            response,
            url='http://your-domain/users/sign_in',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                "utf8": "✓",
                "authenticity_token": authenticity_token,
                "user[login]": "your-account",
                "user[password]": "your-password",
                "user[remember_me]": '0'
            },
            callback=self.gitlab_after,
            dont_click=True
            # With dont_click=True the form data is submitted without clicking any element
        )
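The snippet above hard-codes the account and password. As a minimal sketch of one alternative, the same form fields can be built from environment variables. The function name `build_login_formdata` and the variable names `GITLAB_USER` and `GITLAB_PASSWORD` are hypothetical and not part of the original spider.

```python
import os
from typing import Dict, Mapping, Optional


def build_login_formdata(
    authenticity_token: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the sign-in form fields, reading credentials from env.

    GITLAB_USER and GITLAB_PASSWORD are hypothetical variable names.
    A KeyError is raised if either one is missing.
    """
    env = os.environ if env is None else env
    return {
        "utf8": "✓",
        "authenticity_token": authenticity_token,
        "user[login]": env["GITLAB_USER"],
        "user[password]": env["GITLAB_PASSWORD"],
        "user[remember_me]": "0",
    }
```

The returned dict can then be passed as `formdata=build_login_formdata(authenticity_token)` in the `FormRequest.from_response` call.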

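3. Getting cookies

After a successful login the site redirects to the home page, yet requests for other pages are still redirected back to the login page. How do we fix this? The spider has to keep the cookies it received at login and send them along when it visits other pages.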
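A callback can detect this situation by checking whether the final URL is still the sign-in page. The following is a minimal sketch, assuming the sign-in path is `/users/sign_in` as in this post; `is_login_page` is a hypothetical helper name.

```python
from urllib.parse import urlparse

SIGN_IN_PATH = "/users/sign_in"  # path used in this post's examples


def is_login_page(final_url: str) -> bool:
    """Return True if a response URL points back at the sign-in page."""
    return urlparse(final_url).path.rstrip("/") == SIGN_IN_PATH
```

Inside a spider callback this could be used as `if is_login_page(response.url): ...` to log a warning instead of silently parsing the login form.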

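Cookie handling when sending requests:

meta={'cookiejar': 1} starts a cookie session under the key 1; put it in the first Request().
meta={'cookiejar': response.meta['cookiejar']} reuses the session of the previous response; put it in FormRequest.from_response() when POSTing the login form.
meta={'cookiejar': True} sends the authorized cookies to pages that require login. This works because True equals the key 1 used in the first request; passing response.meta['cookiejar'] again is the more explicit choice.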
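The value under `cookiejar` is simply a key that selects one cookie session. The sketch below imitates that lookup with a plain `defaultdict` to show why the keys 1 and True end up sharing cookies. It is a simplified model, not Scrapy's actual middleware code, and the cookie name `session_id` is made up.

```python
from collections import defaultdict

# One dict of cookies per session key (simplified stand-in for a cookie jar).
jars = defaultdict(dict)


def jar_for(meta: dict) -> dict:
    """Pick the cookie store selected by meta['cookiejar'] (None if absent)."""
    return jars[meta.get("cookiejar")]


# First request: the session with key 1 receives a cookie from the server.
jar_for({"cookiejar": 1})["session_id"] = "abc123"

# Later requests: True and 1 are equal dict keys, so they share one store,
# while a different key (or no key at all) gets a separate, empty store.
print(jar_for({"cookiejar": True}))
print(jar_for({"cookiejar": 2}))
print(jar_for({}))
```

Any other key, such as 2, would start a fresh session without the login cookies, which is why the key has to stay consistent across the whole crawl.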

Request cookies

Cookie = response.request.headers.getlist('Cookie')
print(Cookie)

Response cookies

Cookie2 = response.headers.getlist('Set-Cookie')
print(Cookie2)
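`getlist('Set-Cookie')` returns the raw header values as bytes, which are awkward to read. As a minimal sketch, they can be turned into a name/value dict with the standard library's `http.cookies`. The function name `parse_set_cookie` and the sample header values are made up for illustration.

```python
from http.cookies import SimpleCookie
from typing import Dict, List


def parse_set_cookie(raw_headers: List[bytes]) -> Dict[str, str]:
    """Turn raw Set-Cookie header values into a {name: value} dict."""
    cookies: Dict[str, str] = {}
    for raw in raw_headers:
        parsed = SimpleCookie()
        parsed.load(raw.decode("latin-1"))
        for name, morsel in parsed.items():
            cookies[name] = morsel.value
    return cookies


# Made-up header values in the shape getlist('Set-Cookie') returns (bytes).
sample = [
    b"session_id=abc123; path=/; HttpOnly",
    b"lang=en; path=/",
]
print(parse_set_cookie(sample))
```

In a callback this would be called as `parse_set_cookie(response.headers.getlist('Set-Cookie'))`.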

Other issues

An item can be passed along in a request's meta, so fields scraped from several pages can be collected into the same item.

Full code

import scrapy
from scrapy.http import Request, FormRequest

# Assumption: the item class is defined in the project's items.py
from BigEvent.items import BigeventItem


class BigeventspiderSpider(scrapy.Spider):
    name = 'BigEventSpider'
    allowed_domains = ['your-domain']
    custom_settings = {
        'ITEM_PIPELINES': {
            'BigEvent.pipelines.BigeventPipeline': 300,
        }
    }
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Host': 'your-domain',
        'Connection': 'keep-alive',
        'Referer': 'http://your-domain/users/sign_in',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0',
    }

    def start_requests(self):
        urls = ['http://your-domain/users/sign_in']
        for url in urls:
            # Override start_requests and pass the special meta key 'cookiejar'.
            # The first request fetches the login page with a cookie session
            # enabled so that it receives cookies, then hands over to the callback.
            yield Request(url, meta={'cookiejar': 1}, callback=self.parse)

    # Submit the login form with FormRequest
    def parse(self, response):
        # First grab the hidden form field authenticity_token
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        print(authenticity_token)

        # Second request: POST the form with the cookies, browser headers and
        # login credentials, so that the cookie session becomes authorized
        return FormRequest.from_response(
            response,
            url='http://your-domain/users/sign_in',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                "utf8": "✓",
                "authenticity_token": authenticity_token,
                "user[login]": "your-account",
                "user[password]": "your-password",
                "user[remember_me]": '0'
            },
            callback=self.gitlab_after,
            dont_click=True
            # With dont_click=True the form data is submitted without clicking any element
        )

    def gitlab_after(self, response):
        # Response cookies
        # Cookie1 = response.headers.getlist('Set-Cookie')   # inspect the cookies the server set in this response
        # print(Cookie1)
        url = "http://your-domain/bigdata/ToDoList/issues"
        # After login, request a page that requires login (e.g. the user's
        # own pages), sending the authorized cookies along
        yield Request(url, meta={'cookiejar': True}, callback=self.github_tudo)

    def github_tudo(self, response):
        # Now on the ToDoList issues page
        issues = response.xpath("//div[@class='issue-main-info']")
        print(issues)
        for issue in issues:
            item = BigeventItem()
            # Scrape each issue
            label = issue.xpath(".//span[@class='label color-label has-tooltip']/text()").extract()
            # Only keep issues carrying the "大事件" ("major event") label
            if "大事件" in label:
                num = issue.xpath(".//span[@class='issuable-reference']/text()")[0].extract().strip()
                item["num"] = num
                name = issue.xpath(".//span[@class='issue-title-text']/a/text()")[0].extract().strip()
                item["name"] = name
                # num[1:] drops the first character of the issue reference;
                # the URL needs a scheme, otherwise Scrapy rejects it
                url = "http://your-domain/bigdata/ToDoList/issues/" + num[1:]
                yield Request(url, meta={'cookiejar': True, 'item': item}, callback=self.github_issue)

    def github_issue(self,response):
        item = response.meta['item']
        date = response.xpath("//div[@class='wiki']/p/text()")[0].extract().strip()
        item["date"] = date
        print(item)
        yield item
