I. Why must the captcha cookies stay consistent?

As everyone knows, HTTP is stateless: each request is a brand-new connection. Put simply, if the cookies under which you submit the login form differ from the cookies under which the captcha was issued, the login will fail.

II. Keeping captcha cookies consistent in requests and scrapy

1. requests

Take logging in to gushiwen.cn (古诗文网) as an example:
import requests
from lxml import etree

url = "https://so.gushiwen.cn/user/login.aspx"
headers = {
    'User-Agent': "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
}

# A single Session object carries the same cookies across all requests.
session = requests.session()
response = session.get(url, headers=headers)
html = etree.HTML(response.content)

# The captcha image URL and the ASP.NET hidden form fields live in the login page.
captcha_url = 'https://so.gushiwen.cn' + html.xpath('//*[@id="imgCode"]/@src')[0]
__VIEWSTATE = html.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = html.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Fetch the captcha with the SAME session, so its cookies match the page's.
code_response = session.get(captcha_url, headers=headers)
with open('code.jpg', 'wb') as f:
    f.write(code_response.content)

code = input('Enter the captcha: ')
username = input('Enter your username: ')
password = input('Enter your password: ')

formdata = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,
    "from": "",
    "email": username,
    "pwd": password,
    "denglu": "登录",  # literal value the form submits; keep as-is
    'code': code,
}

success = session.post("https://so.gushiwen.cn/user/login.aspx", headers=headers, data=formdata)
with open('success.html', 'wb') as f:
    f.write(success.content)
If success.html shows the page title 我的收藏 ("My Favorites"), the login succeeded.

In requests, a single requests.session() object keeps the captcha request and the login request under the same cookies. This part is straightforward, so no further elaboration is needed.
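To see concretely why the shared cookies matter, here is a minimal, self-contained sketch using only the standard library and a hypothetical local server in place of gushiwen.cn: the captcha response sets a session cookie, and the "login" succeeds only when that same cookie is sent back, which is exactly what a shared session gives you.

```python
import http.cookiejar
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/captcha":
            # The captcha response issues a session cookie.
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123")
            self.end_headers()
            self.wfile.write(b"fake-captcha-bytes")
        else:
            # "Login" succeeds only if the captcha's cookie comes back.
            sent = self.headers.get("Cookie", "")
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok" if "session=abc123" in sent else b"fail")

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

# Shared cookie jar (what requests.session() gives you): cookies persist.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.open(base + "/captcha").read()
with_session = opener.open(base + "/login").read().decode()

# Independent requests (no jar): the captcha's cookie is lost.
urllib.request.urlopen(base + "/captcha").read()
without_session = urllib.request.urlopen(base + "/login").read().decode()

print(with_session, without_session)  # ok fail
server.shutdown()
```

The two prints mirror the two situations the article describes: the session login returns "ok", the cookie-less one returns "fail".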
2. scrapy

Again taking gushiwen.cn as the example:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://so.gushiwen.cn/user/login.aspx']

    def parse(self, response):
        # Captcha URL and ASP.NET hidden fields, extracted from the login page.
        captcha_url = response.urljoin(response.xpath('//*[@id="imgCode"]/@src').get())
        __VIEWSTATE = response.xpath('//*[@id="__VIEWSTATE"]/@value').get()
        __VIEWSTATEGENERATOR = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value').get()
        formdata = {
            "__VIEWSTATE": __VIEWSTATE,
            "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,
            "from": "",
            "email": "username",
            "pwd": "password",
            "denglu": "登录",  # literal value the form submits; keep as-is
        }
        with open('gushitest.html', 'wb') as f:
            f.write(response.body)
        # Fetch the captcha through scrapy so the same cookiejar is reused;
        # meta carries the partially built form data to the next callback.
        yield scrapy.Request(url=captcha_url, meta={'info': formdata}, callback=self.packdata)

    def packdata(self, response):
        post_data = response.meta['info']
        with open('captcha.png', 'wb') as f:
            f.write(response.body)
        captcha = input("Enter the captcha: ")
        post_data['code'] = captcha
        yield scrapy.FormRequest(url=self.start_urls[0], formdata=post_data, callback=self.check_login)

    def check_login(self, response):
        print('Check whether the login succeeded:', response.url)
When scrapy sends the first request, the response already comes with the captcha issued for that request and its corresponding cookie state. If at this point you used requests (a separate client) to fetch the captcha image, you would get a second, different captcha under different cookies; the two would not match, so the login would fail.

The captcha and cookie state from the first request therefore have to be preserved. The captcha URL extracted from the first response is handed to packdata() through a new scrapy.Request, which scrapy sends with the same cookiejar, while the meta parameter carries the partially filled formdata built in parse(). packdata() saves the captcha image, reads the code from the user, adds it to formdata, and submits everything with a FormRequest whose callback check_login() checks whether the login worked.

If check_login() prints the 我的收藏 (My Favorites) URL, the login has succeeded.
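Stripped of scrapy itself, the callback chain above is just three functions handing the same dict along. A tiny stand-alone sketch, with hypothetical plain-Python stand-ins mirroring the spider's callbacks, shows how meta lets packdata() complete the form that parse() started:

```python
# Plain-Python stand-ins for the spider's callbacks (no scrapy needed):
# meta hands the SAME dict from parse() to packdata(), which completes it.
def parse(viewstate):
    formdata = {"__VIEWSTATE": viewstate, "email": "username", "pwd": "password"}
    # in the spider: yield scrapy.Request(captcha_url, meta={'info': formdata},
    #                                     callback=self.packdata)
    fake_response = {"meta": {"info": formdata}, "body": b"captcha-image-bytes"}
    return packdata(fake_response)

def packdata(response):
    post_data = response["meta"]["info"]  # the dict built in parse(), not a copy
    post_data["code"] = "AB12"            # would come from input() in the spider
    # in the spider: yield scrapy.FormRequest(login_url, formdata=post_data,
    #                                         callback=self.check_login)
    return post_data

completed = parse("xyz")
print(completed["code"])  # AB12
```

Because meta passes a reference to the same dict rather than a copy, adding the captcha code in packdata() completes the very form that parse() built.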