I. Simulating a login with requests
In the earlier requests chapter we already covered how requests handles cookies, so here we go straight to a working login example:
1. Create a Session object
The requests library's Session object keeps certain parameters across requests and persists cookies across all requests made from the same Session instance. Once we create a Session object, we no longer need to manage cookies by hand.
s = requests.Session()
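As a quick offline sketch of what the Session buys us, we can plant a cookie in the jar by hand and see that the session keeps it; in real use the server's Set-Cookie headers fill this jar automatically, and the cookie name and value below are made up for illustration.

```python
import requests

# A Session keeps cookies between requests. Here we set a cookie
# manually to show that the session-level jar persists it; normally
# the server's Set-Cookie response headers do this for us.
s = requests.Session()
s.cookies.set("logged_in", "yes", domain="github.com")

# Every later s.get()/s.post() to github.com would now carry this cookie.
print(requests.utils.dict_from_cookiejar(s.cookies))
```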
2. Analyze the login endpoint
Before logging in, we first need to analyze the login request, using GitHub as the example. Open the login page, then open the browser's developer tools and switch to the Network tab to watch every request; to keep a redirect from wiping the list, check Preserve log. From there we can find that the login request goes to https://github.com/session using the POST method, and we can inspect all of the submitted form fields:
3. Inspect the page source
From the captured request we can see that commit and utf8 are fixed values, so we can set them directly. If we right-click in the browser and view the page source, we can see that authenticity_token is encrypted data randomly generated on every page load. We just need to fetch the login page first and extract the authenticity_token:
resp = s.get("https://github.com/login")
soup = BeautifulSoup(resp.text, "lxml")
authenticity_token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")
print(authenticity_token)
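The same extraction can be tried offline on a simplified, hypothetical stand-in for the login form (this is not GitHub's actual markup); the stdlib "html.parser" backend is used here so lxml is not required.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the GitHub login page markup:
html = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="rAnD0mT0k3n==">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

# "html.parser" is the stdlib parser; swap in "lxml" if it is installed.
soup = BeautifulSoup(html, "html.parser")
token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")
print(token)
```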
4. Log in to GitHub
Next we simply send the parameters in a POST to the login endpoint. The server sends back no useful response body here; as long as the login request succeeds, we can go on to crawl pages that require login.
payload = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': authenticity_token, 'login': self.login, 'password': self.password}
resp = s.post("https://github.com/session", data=payload)
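To see exactly what this POST will send without touching the network, we can build a PreparedRequest and inspect the encoded form body; the token, login, and password values below are placeholders, since the real token comes from the login page.

```python
import requests

payload = {
    "commit": "Sign in",
    "utf8": "\u2713",
    "authenticity_token": "PLACEHOLDER",  # real value comes from the login page
    "login": "user",        # placeholder username
    "password": "secret",   # placeholder password
}

# Prepare (but do not send) the request to inspect the encoded form body.
prep = requests.Request("POST", "https://github.com/session", data=payload).prepare()
print(prep.headers["Content-Type"])
print(prep.body)
```

Passing a dict as data= makes requests URL-encode it as a standard form submission, which is what the GitHub login form expects.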
5. Fetch information after login
After logging in successfully we can fetch information that requires login; the scraping itself works exactly like the crawling covered earlier. Here we fetch the basic profile information:
{'name': '', 'email': '', 'user_profile_bio': '', 'user_profile_blog': '', 'user_profile_company': ''}
6. Complete code
# coding=utf-8
import time

import requests
from bs4 import BeautifulSoup


class Github:
    def __init__(self, login, password):
        self.request = requests.session()
        self.login = login
        self.password = password

    def main(self):
        s = self.request
        # Fetch the authenticity_token
        authenticity_token = self.getAuthenticityToken(s)
        # Perform the login
        self.loginSubmit(s, authenticity_token)
        # Fetch a page that requires login
        profile = self.profile(s)
        return profile

    def profile(self, s):
        resp = s.get("https://github.com/settings/profile")
        soup = BeautifulSoup(resp.text, "lxml")
        name = soup.find("input", id="user_profile_name").get("value")
        user_profile_email = soup.find("select", id="user_profile_email")
        emails = user_profile_email.find_all("option")
        if len(emails) >= 2:
            email = emails[1].text
        else:
            email = ""
        user_profile_bio = soup.find("textarea", id="user_profile_bio").text
        user_profile_blog = soup.find("input", id="user_profile_blog").get("value")
        user_profile_company = soup.find("input", id="user_profile_company").get("value")
        return {"name": name, "email": email, "user_profile_bio": user_profile_bio,
                "user_profile_blog": user_profile_blog, "user_profile_company": user_profile_company}

    def loginSubmit(self, s, authenticity_token):
        payload = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': authenticity_token,
                   'login': self.login, 'password': self.password}
        s.post("https://github.com/session", data=payload)

    def getAuthenticityToken(self, s):
        try:
            resp = s.get("https://github.com/login")
            soup = BeautifulSoup(resp.text, "lxml")
            return soup.find("input", attrs={"name": "authenticity_token"}).get("value")
        except AttributeError:
            print("Failed to get authenticity_token, retrying in 10 seconds")
            time.sleep(10)
            return self.getAuthenticityToken(s)


if __name__ == '__main__':
    username = ""
    password = ""
    github = Github(username, password)
    profile = github.main()
    print(profile)
II. Saving login cookies and logging in with them
Logging in to the site from scratch every time is tedious. Instead, we can save the cookies from a successful login and attach them directly to later requests for pages that require login.
1. Save the login cookies
We have already seen several ways to read the current cookies; s.cookies gives us the cookies after login. They come back as a RequestsCookieJar, which is awkward to store, so we can convert the CookieJar into a dict as follows.
cookies = requests.utils.dict_from_cookiejar(s.cookies)
For convenience we store the cookies in a JSON file named cookies.json, which makes them easy to reload and inspect later.
with open('cookies.json', 'w') as f:
    f.write(json.dumps(cookies))
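A minimal offline check of this save-and-reload step, writing to a temporary directory instead of the working directory; the cookie names and values are made up.

```python
import json
import os
import tempfile

# Hypothetical login cookies (made-up values):
cookies = {"user_session": "abc123", "logged_in": "yes"}

# Write the cookie dict to a JSON file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "cookies.json")
with open(path, "w") as f:
    f.write(json.dumps(cookies))

with open(path, "r", encoding="utf-8") as f:
    restored = json.loads(f.read())

print(restored == cookies)  # True
```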
2. Log in with cookies
With the cookies in hand, we simply attach them to requests for pages that require login.
Note
This also shows why some sites are painful: their login flows use complicated encryption, and writing a login crawler for them costs a lot of time. For a rarely used or one-off crawler, we can simply log in through the browser and build the crawler on top of the browser's cookies, which is very convenient.
First, read back the cookies.json file we stored above.
with open('cookies.json', 'r', encoding='utf-8') as f:
    logininfo = json.loads(f.read())
Since we stored a JSON file, we need to convert the dict back into a CookieJar:
cookies = requests.utils.cookiejar_from_dict(logininfo, cookiejar=None, overwrite=True)
Then we simply attach the cookies to the request:
req = requests.get("https://github.com/settings/profile", cookies=cookies)
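The dict-to-CookieJar conversion can be sanity-checked offline: for simple name/value pairs the round trip through cookiejar_from_dict and dict_from_cookiejar is lossless. The cookie values here are made up.

```python
import requests

# Hypothetical saved login cookies (made-up values):
logininfo = {"user_session": "abc123", "logged_in": "yes"}

# dict -> CookieJar, exactly as passed to requests.get(..., cookies=...)
jar = requests.utils.cookiejar_from_dict(logininfo, cookiejar=None, overwrite=True)

# CookieJar -> dict again: the round trip preserves the name/value pairs.
print(requests.utils.dict_from_cookiejar(jar) == logininfo)
```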
3. Complete code
Here we fetch the post-login information in both ways; the results are identical.
# coding=utf-8
import json
import time

import requests
from bs4 import BeautifulSoup


class Github:
    def __init__(self, login, password):
        self.request = requests.session()
        self.login = login
        self.password = password

    def main(self):
        s = self.request
        # Fetch the authenticity_token
        authenticity_token = self.getAuthenticityToken(s)
        # Perform the login
        self.loginSubmit(s, authenticity_token)
        # Save the cookies
        self.saveCookies(s)
        # Reload the cookies from disk
        cookies = self.getCookiesFromJson()
        # Fetch the profile through the logged-in session
        profileusesession = self.profile(s)
        print(profileusesession)
        # Fetch the same profile using only the saved cookies
        profileusecookies = self.profileUseCookies(cookies)
        return profileusecookies

    # Save the cookies after login
    def saveCookies(self, s):
        cookies = requests.utils.dict_from_cookiejar(s.cookies)
        with open('cookies.json', 'w') as f:
            f.write(json.dumps(cookies))

    def getCookiesFromJson(self):
        with open('cookies.json', 'r', encoding='utf-8') as f:
            logininfo = json.loads(f.read())
        return requests.utils.cookiejar_from_dict(logininfo, cookiejar=None, overwrite=True)

    # Shared parsing for both fetch paths
    def parseProfile(self, html):
        soup = BeautifulSoup(html, "lxml")
        name = soup.find("input", id="user_profile_name").get("value")
        user_profile_email = soup.find("select", id="user_profile_email")
        emails = user_profile_email.find_all("option")
        if len(emails) >= 2:
            email = emails[1].text
        else:
            email = ""
        user_profile_bio = soup.find("textarea", id="user_profile_bio").text
        user_profile_blog = soup.find("input", id="user_profile_blog").get("value")
        user_profile_company = soup.find("input", id="user_profile_company").get("value")
        return {"name": name, "email": email, "user_profile_bio": user_profile_bio,
                "user_profile_blog": user_profile_blog, "user_profile_company": user_profile_company}

    def profileUseCookies(self, cookies):
        resp = requests.get("https://github.com/settings/profile", cookies=cookies)
        return self.parseProfile(resp.text)

    def profile(self, s):
        resp = s.get("https://github.com/settings/profile")
        return self.parseProfile(resp.text)

    def loginSubmit(self, s, authenticity_token):
        payload = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': authenticity_token,
                   'login': self.login, 'password': self.password}
        s.post("https://github.com/session", data=payload)

    def getAuthenticityToken(self, s):
        try:
            resp = s.get("https://github.com/login")
            soup = BeautifulSoup(resp.text, "lxml")
            return soup.find("input", attrs={"name": "authenticity_token"}).get("value")
        except AttributeError:
            print("Failed to get authenticity_token, retrying in 10 seconds")
            time.sleep(10)
            return self.getAuthenticityToken(s)


if __name__ == '__main__':
    username = ""
    password = ""
    github = Github(username, password)
    profile = github.main()
    print(profile)