Cookie (sometimes the plural, cookies) refers to data, often encrypted, that certain websites store on the user's local machine in order to identify the user and track the session. The definitions in RFC 2109 and RFC 2965 are both obsolete; the current specification that replaces them is RFC 6265 [1]. (Note that cookies are distinct from the browser cache.)
So, when logging in to a site such as Zhihu or Douban, we can carry cookies with our requests to fetch the pages that are only visible after login.
First, open the Zhihu creator page:
OK, the URL of the page whose data we want is:
https://www.zhihu.com/creator/analytics/work/answers
Press F12 to inspect the page: under the Doc filter of the Network tab, check the User-Agent and Cookie headers:
We can see that the cookie string is:
_zap=53633022-7cea-4247-b804-b8598042b63f; d_c0="APCow144Yg-PTiGIGVKbfBkfj3gJnkpFtoQ=|1557069483"; capsion_ticket="2|1:0|10:1557069701|14:capsion_ticket|44:ODRiMzU3ZWYzMDRkNGVmNmFlMWYwNTI1NTM5MDM4MmY=|07a9bf8b2298d9a950f3677a20db4503845dd5ce16a9cd85adcc108c4914e334"; z_c0="2|1:0|10:1557069734|4:z_c0|92:Mi4xRW5TYkF3QUFBQUFBOEtqRFhqaGlEeVlBQUFCZ0FsVk5wazI4WFFEQW9acGdZaUItQ1VvaUdoWk1VTVVnWW5ReWVB|bb7977d97288975d9a7db66cb86334743b18aca07c7504d105f65c14526677a4"; _xsrf=c89036b5-781f-44b9-93be-155c6e4bfe9a; tst=r; q_c1=44a008ab49d7483290d71fd4a6b35ed9|1557104823000|1557104823000
referer:
https://www.zhihu.com/people/liu-zi-hua-66-11/activities
OK, now we generate the Scrapy project and the spider file.
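The scaffolding can be created with Scrapy's command-line tools; the project and spider names below are the ones used in this post:

```shell
# Create the project, then generate a spider named "zh" limited to zhihu.com
scrapy startproject zhihu
cd zhihu
scrapy genspider zh zhihu.com
```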
In settings.py, set the User-Agent and a few other options:
BOT_NAME = 'zhihu'
SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'
LOG_LEVEL = "WARNING"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/66.0'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
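Besides the User-Agent, Scrapy's built-in cookie middleware can be tuned in settings.py as well. The two settings below are part of Scrapy itself and are handy for verifying that the login cookies are really attached to each request:

```python
# Scrapy's built-in cookie settings (optional, in settings.py):
COOKIES_ENABLED = True   # the default; keeps the cookie middleware active
COOKIES_DEBUG = True     # log every Cookie / Set-Cookie header exchanged
```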
The spider file is the main part:
# -*- coding: utf-8 -*-
import scrapy


class ZhSpider(scrapy.Spider):
    name = "zh"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://www.zhihu.com/creator/analytics/work/answers']

    def start_requests(self):
        cookies='_zap=53633022-7cea-4247-b804-b8598042b63f; d_c0="APCow144Yg-PTiGIGVKbfBkfj3gJnkpFtoQ=|1557069483"; capsion_ticket="2|1:0|10:1557069701|14:capsion_ticket|44:ODRiMzU3ZWYzMDRkNGVmNmFlMWYwNTI1NTM5MDM4MmY=|07a9bf8b2298d9a950f3677a20db4503845dd5ce16a9cd85adcc108c4914e334"; z_c0="2|1:0|10:1557069734|4:z_c0|92:Mi4xRW5TYkF3QUFBQUFBOEtqRFhqaGlEeVlBQUFCZ0FsVk5wazI4WFFEQW9acGdZaUItQ1VvaUdoWk1VTVVnWW5ReWVB|bb7977d97288975d9a7db66cb86334743b18aca07c7504d105f65c14526677a4"; tgw_l7_route=66cb16bc7f45da64562a077714739c11; _xsrf=c89036b5-781f-44b9-93be-155c6e4bfe9a; tst=r; q_c1=44a008ab49d7483290d71fd4a6b35ed9|1557104823000|1557104823000'
        # Split each pair on the FIRST "=" only: several cookie values
        # (e.g. z_c0, capsion_ticket) contain "=" characters themselves.
        cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in cookies.split("; ")}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        item = {}
        item["a"] = response.body.decode()
        print(item["a"])
As you can see, we override the start_requests method: we convert the cookie string into a dict (note that the cookies argument must be passed as a dict, not the raw string) and attach it to the Request we yield.
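The string-to-dict conversion has one pitfall: cookie values may themselves contain "=", so each pair must be split only on its first "=". A standalone sketch of the conversion, using placeholder values rather than real Zhihu cookies:

```python
# Sketch: converting a browser "Cookie" header string into the dict
# that scrapy.Request expects. The values below are placeholders.
cookie_string = 'tst=r; token=abc==; q_c1=44a|123'

cookies = {}
for pair in cookie_string.split("; "):
    name, _, value = pair.partition("=")  # split on the FIRST "=" only
    cookies[name] = value

print(cookies["token"])  # abc==
```

A naive `pair.split("=")` would have truncated the token value to "abc", silently breaking the login.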
Finally, let's see whether we can actually fetch the data:
def parse(self, response):
    item = {}
    item["a"] = response.body.decode()
    print(item["a"])
OK, response.body prints successfully, which confirms that we can log in by carrying cookies.
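Eyeballing the printed body works, but checking for a string that only appears after login makes the verification repeatable. A minimal sketch; the "creator" marker below is a hypothetical choice, so substitute any text unique to your logged-in page:

```python
def logged_in(body: str, marker: str = "creator") -> bool:
    """Return True if the decoded page body contains the logged-in marker.

    "creator" is only an assumed marker here; pick a string that appears
    solely on the logged-in version of the page you are crawling.
    """
    return marker in body

# In parse() this would be: logged_in(response.body.decode())
print(logged_in("...creator analytics dashboard..."))  # True
print(logged_in("please sign in"))                     # False
```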