Simulated Login with a Python Web Crawler (Using Zhihu as an Example)

Latest recommended article published on 2023-12-30 13:28:42

柱子89


Category: Crawlers


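Reference: Web Crawler with Python - 08. Simulated Login (Zhihu)

Three problems came up:

When trying the reference code in practice, the following line raised an error: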

 _xsrf = BeautifulSoup(session.get('https://www.zhihu.com/#signin').content).find('input', attrs={'name': '_xsrf'})['value']  

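After re-analyzing the login flow with the Chrome developer tools (F12), I added a User-Agent to the headers passed to requests, and the _xsrf field could then be retrieved.

The same User-Agent header is added when fetching the captcha and when sending the login request.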
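
As a minimal sketch of the User-Agent fix (written for Python 3, and not part of the original code), the header can also be set once on the session so that every later request carries it. The helper names `make_session` and `preview_headers` are hypothetical. The request is only prepared, never sent, so the sketch runs without network access.

```python
import requests

# Same User-Agent string as in the article's code listing.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36')


def make_session(user_agent=USER_AGENT):
    """Return a session whose default headers include a browser User-Agent."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def preview_headers(session, url):
    """Prepare a GET request without sending it and return its headers."""
    prepared = session.prepare_request(requests.Request('GET', url))
    return dict(prepared.headers)


session = make_session()
sent = preview_headers(session, 'https://www.zhihu.com/#signin')
print(sent['User-Agent'])
```

Setting the header on the session is a design choice rather than a requirement; passing `headers=headers` on each call, as the article does, has the same effect for those calls.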

After that, fetching the captcha produced the following result:

ERR_VERIFY_CAPTCHA_SESSION_INVALID

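I analyzed the captcha request again (the one sent when the captcha is refreshed):

Since the requests go through a requests session, the cookie information was already being sent, so I suspected the URL itself.

Changing the request to the following, with type=login added to the query string, solved the problem: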

 captcha_content = session.get('http://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000), headers=headers).content  
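
The URL above can also be assembled with the standard library instead of string formatting. This is a sketch for Python 3 under the assumption that only the two query parameters shown in the article (`r` and `type`) are needed; `build_captcha_url` is a hypothetical helper name.

```python
import time
from urllib.parse import urlencode

CAPTCHA_ENDPOINT = 'http://www.zhihu.com/captcha.gif'


def build_captcha_url(now=None):
    """Return the captcha URL with a millisecond timestamp and type=login."""
    if now is None:
        now = time.time()
    # r is the current time in milliseconds; type=login is the parameter
    # whose absence led to ERR_VERIFY_CAPTCHA_SESSION_INVALID.
    query = urlencode({'r': int(now * 1000), 'type': 'login'})
    return CAPTCHA_ENDPOINT + '?' + query


print(build_captcha_url(now=1.5))
```

Passing a fixed `now` makes the result reproducible; in real use the argument would be left out so the current time is used.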

Finally, I modified the assertion; the returned result is as follows:

Code:

 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-

 import time
 import requests
 from bs4 import BeautifulSoup


 headers = {
     'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36',
     # 'Referer': 'https://www.zhihu.com/',
     # 'X-Requested-With': 'XMLHttpRequest',
     # 'Origin': 'https://www.zhihu.com'
 }


 def login(username, password, kill_captcha):
     session = requests.session()
     # Read the hidden _xsrf input from the sign-in page; this needs the User-Agent header
     page = session.get('https://www.zhihu.com/#signin', headers=headers).content
     _xsrf = BeautifulSoup(page, 'html.parser').find('input', attrs={'name': '_xsrf'})['value']
     session.headers.update({'_xsrf': str(_xsrf)})
     # type=login is required, otherwise: ERR_VERIFY_CAPTCHA_SESSION_INVALID
     captcha_content = session.get('http://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000), headers=headers).content
     data = {
         '_xsrf': _xsrf,
         'password': password,
         'captcha': kill_captcha(captcha_content),
         'email': username,
         'remember_me': 'true'
         # The order of the key-value pairs in the dict does not matter
     }
     print(data)
     resp = session.post('http://www.zhihu.com/login/email', data=data, headers=headers).text
     # The escaped string below is "登录成功" (login successful)
     print('resp\n', resp)
     assert r'\u767b\u5f55\u6210\u529f' in resp
     return session


 def kill_captcha(data):
     # Save the captcha image to disk and let the user type in what it shows
     with open('1.gif', 'wb') as fp:
         fp.write(data)
     return input('captcha : ')


 if __name__ == '__main__':
     session = login('email', 'password', kill_captcha)
     # Print the logged-in user's name from the home page
     home = session.get('https://www.zhihu.com', headers=headers).content
     print(BeautifulSoup(home, 'html.parser').find('span', class_='name').getText())
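
The assertion in the listing matches the success message in its escaped \uXXXX form. As a hedged sketch for Python 3, the check can be wrapped in a helper that accepts the message either escaped or as plain text. `login_succeeded` is a hypothetical name, and the two sample bodies are made up for illustration; they are not captured server responses.

```python
SUCCESS_MESSAGE = '登录成功'  # "login successful"


def login_succeeded(body):
    """Return True if body contains the success message, either as plain
    text or written with \\uXXXX escapes."""
    if isinstance(body, bytes):
        body = body.decode('utf-8', errors='replace')
    # '登录成功' becomes the ASCII text \u767b\u5f55\u6210\u529f
    escaped = SUCCESS_MESSAGE.encode('unicode_escape').decode('ascii')
    return SUCCESS_MESSAGE in body or escaped in body


# Hypothetical response bodies, for illustration only.
sample_ok = b'{"r": 0, "msg": "\\u767b\\u5f55\\u6210\\u529f"}'
sample_bad = b'{"r": 1, "msg": "ERR_VERIFY_CAPTCHA_SESSION_INVALID"}'
print(login_succeeded(sample_ok), login_succeeded(sample_bad))
```

Checking both forms keeps the test independent of whether the body is inspected raw or after decoding.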
 
   
 
