[ Python Web Scraping in Practice ] Scraping Pages Behind a Login with Requests and Using Cookies - pytorch中文网...

Latest recommended article published on 2024-04-22 08:15:09

weixin_39939510


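I. Simulating a Login with requests

The earlier requests chapter already covered how requests handles cookies, so we go straight to a worked login example:

1. Create a session object

A requests Session object persists certain parameters across requests, and it keeps cookies across all requests made from the same Session instance. Once a Session object is created, there is no need to manage cookies by hand.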

s = requests.Session()
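To make the cookie persistence concrete, here is a minimal offline sketch. The cookie name `demo_id`, the host `example.com`, and the helper `cookie_header_for` are all made up for illustration; the helper only prepares a request without sending it, so the `Cookie` header the session would attach can be inspected.

```python
import requests


def cookie_header_for(session, url):
    """Return the Cookie header the session would send to `url` (no network I/O)."""
    # prepare_request merges session-level state (cookies, headers) into the request
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


if __name__ == "__main__":
    s = requests.Session()
    # Hypothetical cookie, standing in for one a server would set after login
    s.cookies.set("demo_id", "abc123")
    # Both prepared requests carry the same cookie without any extra bookkeeping
    print(cookie_header_for(s, "https://example.com/first"))
    print(cookie_header_for(s, "https://example.com/second"))
```

In real use the cookie would come from a server response rather than from `s.cookies.set`, but the session attaches it to later requests in the same way.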

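2. Analyze the login endpoint

Before logging in, we need to analyze the login request. We use GitHub as the example. First open the login page, then open the browser's developer tools and click Network to see all requests. Because the redirect after login would otherwise clear the request list, tick Preserve log (marked 2 in the screenshot). The login request goes to https://github.com/session with the POST method, and all of its request parameters are visible in the form data (marked 5 in the screenshot):

3. Inspect the page source

As shown above, commit and utf8 are fixed values, so we set them directly. Right-clicking in the browser and viewing the page source shows that authenticity_token is a value the page generates afresh on every load. We only need to fetch authenticity_token before sending the login request.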

# Name the response `resp` so it does not shadow the `requests` module
resp = s.get("https://github.com/login")
soup = BeautifulSoup(resp.text, "lxml")
authenticity_token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")
print(authenticity_token)
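The extraction step can also be tried offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages; the sample markup, the class `HiddenInputParser`, and the function `extract_hidden_value` are invented for illustration and are not GitHub's actual page. Unlike chaining `.get("value")` onto `soup.find(...)`, which raises `AttributeError` when the input is missing, this version returns `None`.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the value of the first <input> whose name matches `field`."""

    def __init__(self, field):
        super().__init__()
        self.field = field
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <input>
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field:
            self.value = attributes.get("value")


def extract_hidden_value(html, field="authenticity_token"):
    """Return the field's value, or None when the input is absent."""
    parser = HiddenInputParser(field)
    parser.feed(html)
    return parser.value


# Made-up markup standing in for the real login page
SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok-123">
  <input type="text" name="login">
</form>
"""

if __name__ == "__main__":
    print(extract_hidden_value(SAMPLE_HTML))
```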

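4. Log in to GitHub

Next, we POST these parameters to the login URL. The response itself contains nothing we need; once the login request succeeds, the same session can fetch pages that require login.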

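# self.login and self.password are the credentials stored on the Github class in the complete code (section 6)
payload = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': authenticity_token, 'login': self.login, 'password': self.password}
# The keyword argument is `data`, which sends the payload as a form-encoded body
resp = s.post("https://github.com/session", data=payload)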
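Outside a class, the same payload can be assembled by a small helper. This is only a sketch: the function name `build_login_payload` is hypothetical, and the field names are the ones listed in the text above, which may not match what the site expects today.

```python
def build_login_payload(login, password, authenticity_token):
    """Assemble the form fields described above for the POST to /session."""
    if not authenticity_token:
        # Without a token the login request cannot succeed, so fail early
        raise ValueError("authenticity_token is required; fetch the login page first")
    return {
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": authenticity_token,
        "login": login,
        "password": password,
    }
```

It would be used as `s.post("https://github.com/session", data=build_login_payload(login, password, authenticity_token))`.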

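5. Fetch pages after logging in

After a successful login we can fetch information that requires login. The scraping itself works the same way as in the earlier chapters; here we fetch the basic profile information.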

{'name': '', 'email': '', 'user_profile_bio': '', 'user_profile_blog': '', 'user_profile_company': ''}

6. Complete code

# coding=utf-8
import time

import requests
from bs4 import BeautifulSoup


class Github:
    def __init__(self, login, password):
        self.request = requests.session()
        self.login = login
        self.password = password

    def main(self):
        s = self.request
        # Fetch the authenticity_token from the login page
        authenticity_token = self.getAuthenticityToken(s)
        # Submit the login form
        self.loginSubmit(s, authenticity_token)
        # Fetch a page that requires login
        profile = self.profile(s)
        return profile

    def profile(self, s):
        resp = s.get("https://github.com/settings/profile")
        soup = BeautifulSoup(resp.text, "lxml")
        name = soup.find("input", id="user_profile_name").get("value")
        user_profile_email = soup.find("select", id="user_profile_email")
        emails = user_profile_email.find_all("option")
        # Take the second <option> if there is one, otherwise leave the email empty
        if len(emails) >= 2:
            email = emails[1].text
        else:
            email = ""
        user_profile_bio = soup.find("textarea", id="user_profile_bio").text
        user_profile_blog = soup.find("input", id="user_profile_blog").get("value")
        user_profile_company = soup.find("input", id="user_profile_company").get("value")
        return {"name": name, "email": email, "user_profile_bio": user_profile_bio,
                "user_profile_blog": user_profile_blog, "user_profile_company": user_profile_company}

    def loginSubmit(self, s, authenticity_token):
        payload = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': authenticity_token,
                   'login': self.login, 'password': self.password}
        s.post("https://github.com/session", data=payload)

    def getAuthenticityToken(self, s):
        try:
            resp = s.get("https://github.com/login")
            soup = BeautifulSoup(resp.text, "lxml")
            authenticity_token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")
            return authenticity_token
        except Exception:
            print("Failed to get authenticity_token, retrying in 10 seconds")
            time.sleep(10)
            # Return the retried result so the caller never receives None
            return self.getAuthenticityToken(s)


if __name__ == '__main__':
    username = ""
    password = ""
    github = Github(username, password)
    print(github.main())
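The retry inside getAuthenticityToken calls itself on every failure, so it never gives up. A bounded loop is one alternative; the helper below is a generic sketch, and its name `retry` and its parameters are assumptions for illustration, not part of the original code. The `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time


def retry(func, attempts=3, delay=10.0, sleep=time.sleep):
    """Call `func` until it succeeds or `attempts` calls have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

With the class above, a call could look like `retry(lambda: fetch_token(s), attempts=3, delay=10)`, where `fetch_token` stands for a hypothetical function that fetches and parses the token and raises on failure.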

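II. Saving the Login Cookies and Logging In with Them

Logging in on every run is tedious. We can save the cookies from a successful login and send them with later requests to pages that require login.

1. Save the login cookies

We have already seen several ways to get the current cookies. After logging in, s.cookies holds the session's cookies. Its type is RequestsCookieJar, which is awkward to store, so we convert the CookieJar to a dictionary as follows.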

cookies = requests.utils.dict_from_cookiejar(s.cookies)

For convenience, we store the cookies in a JSON file named cookies.json, which is easy to reload and inspect.

with open('cookies.json', 'w') as f:
    f.write(json.dumps(cookies))
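The two steps can be combined and tried without logging in anywhere. In this sketch the function name `save_cookies`, the cookie `demo_id`, and the temporary file location are assumptions for illustration. Note that `dict_from_cookiejar` keeps only each cookie's name and value, so the domain, path, and expiry are not stored.

```python
import json
import os
import tempfile

import requests


def save_cookies(session, path):
    """Write the session's cookies to `path` as a flat name -> value JSON object."""
    cookies = requests.utils.dict_from_cookiejar(session.cookies)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)
    return cookies


if __name__ == "__main__":
    s = requests.Session()
    s.cookies.set("demo_id", "abc123")  # hypothetical cookie instead of a real login
    target = os.path.join(tempfile.mkdtemp(), "cookies.json")
    print(save_cookies(s, target))
```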

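2. Log in with cookies

With the cookies in hand, we just send them with the request to the page that requires login.

Note

This shows that some sites have very complicated login encryption, and writing a login scraper for them takes a lot of time. For a rarely used or one-off scraper, we can log in through the browser and use its cookies directly, which is much more convenient.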
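When cookies are copied from the browser, they usually arrive as a single `Cookie` request header string. One way to turn that string into a dictionary is the standard library's `http.cookies.SimpleCookie`. The sketch below assumes a simple header; the cookie names in the usage line are invented, and `SimpleCookie` is strict about syntax, so unusual values may need manual handling.

```python
from http.cookies import SimpleCookie


def cookies_from_header(raw_header):
    """Turn a 'name=value; name2=value2' string into a plain dict."""
    parsed = SimpleCookie()
    parsed.load(raw_header)
    # Each entry is a Morsel; only its value is needed here
    return {name: morsel.value for name, morsel in parsed.items()}
```

For example, `cookies_from_header("user_session=abc123; logged_in=yes")` gives a dictionary that can be passed to requests through the `cookies=` argument.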

First, read the cookies.json file saved above.

with open('cookies.json', 'r', encoding='utf-8') as f:
    logininfo = json.loads(f.read())

Because the JSON file gives us a dictionary, we need to convert it back to a CookieJar:

cookies = requests.utils.cookiejar_from_dict(logininfo, cookiejar=None, overwrite=True)

Then pass the cookies with the request:

req = requests.get("https://github.com/settings/profile", cookies=cookies)
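Instead of passing `cookies=` on every call, the saved cookies can be loaded into a Session once. The following is a sketch under that assumption; the function name `session_from_cookie_file` and the `demo_id` cookie are hypothetical, and the demonstration writes its own temporary file rather than relying on a real login.

```python
import json
import os
import tempfile

import requests


def session_from_cookie_file(path):
    """Create a Session preloaded with the name -> value cookies stored at `path`."""
    with open(path, "r", encoding="utf-8") as f:
        saved = json.load(f)
    session = requests.Session()
    # Convert the dict back to a CookieJar and merge it into the session's jar
    session.cookies.update(requests.utils.cookiejar_from_dict(saved))
    return session


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "cookies.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"demo_id": "abc123"}, f)  # stand-in for cookies saved after login
    s = session_from_cookie_file(path)
    print(s.cookies.get("demo_id"))
```

Every later `s.get(...)` on that session then sends the loaded cookies, in the same way as a session that performed the login itself.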

3. Complete code

Here we fetch the logged-in information in two ways, through the session and through the saved cookies; the results are identical!

# coding=utf-8
import json
import time

import requests
from bs4 import BeautifulSoup


class Github:
    def __init__(self, login, password):
        self.request = requests.session()
        self.login = login
        self.password = password

    def main(self):
        s = self.request
        # Fetch the authenticity_token from the login page
        authenticity_token = self.getAuthenticityToken(s)
        # Submit the login form
        self.loginSubmit(s, authenticity_token)
        # Save the cookies
        self.saveCookies(s)
        # Load the saved cookies
        cookies = self.getCookiesFromJson()
        # Fetch the logged-in page through the session
        profile_use_session = self.profile(s)
        print(profile_use_session)
        # Fetch the logged-in page using the saved cookies
        profile_use_cookies = self.profileUseCookies(cookies)
        return profile_use_cookies

    # Save the cookies after login
    def saveCookies(self, s):
        cookies = requests.utils.dict_from_cookiejar(s.cookies)
        with open('cookies.json', 'w', encoding='utf-8') as f:
            f.write(json.dumps(cookies))

    def getCookiesFromJson(self):
        with open('cookies.json', 'r', encoding='utf-8') as f:
            logininfo = json.loads(f.read())
        cookies = requests.utils.cookiejar_from_dict(logininfo, cookiejar=None, overwrite=True)
        return cookies

    def profileUseCookies(self, cookies):
        req = requests.get("https://github.com/settings/profile", cookies=cookies)
        soup = BeautifulSoup(req.text, "lxml")
        name = soup.find("input", id="user_profile_name").get("value")
        user_profile_email = soup.find("select", id="user_profile_email")
        emails = user_profile_email.find_all("option")
        # Take the second <option> if there is one, otherwise leave the email empty
        if len(emails) >= 2:
            email = emails[1].text
        else:
            email = ""
        user_profile_bio = soup.find("textarea", id="user_profile_bio").text
        user_profile_blog = soup.find("input", id="user_profile_blog").get("value")
        user_profile_company = soup.find("input", id="user_profile_company").get("value")
        return {"name": name, "email": email, "user_profile_bio": user_profile_bio,
                "user_profile_blog": user_profile_blog, "user_profile_company": user_profile_company}

    def profile(self, s):
        req = s.get("https://github.com/settings/profile")
        soup = BeautifulSoup(req.text, "lxml")
        name = soup.find("input", id="user_profile_name").get("value")
        user_profile_email = soup.find("select", id="user_profile_email")
        emails = user_profile_email.find_all("option")
        if len(emails) >= 2:
            email = emails[1].text
        else:
            email = ""
        user_profile_bio = soup.find("textarea", id="user_profile_bio").text
        user_profile_blog = soup.find("input", id="user_profile_blog").get("value")
        user_profile_company = soup.find("input", id="user_profile_company").get("value")
        return {"name": name, "email": email, "user_profile_bio": user_profile_bio,
                "user_profile_blog": user_profile_blog, "user_profile_company": user_profile_company}

    def loginSubmit(self, s, authenticity_token):
        payload = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': authenticity_token,
                   'login': self.login, 'password': self.password}
        s.post("https://github.com/session", data=payload)

    def getAuthenticityToken(self, s):
        try:
            resp = s.get("https://github.com/login")
            soup = BeautifulSoup(resp.text, "lxml")
            authenticity_token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")
            return authenticity_token
        except Exception:
            print("Failed to get authenticity_token, retrying in 10 seconds")
            time.sleep(10)
            # Return the retried result so the caller never receives None
            return self.getAuthenticityToken(s)


if __name__ == '__main__':
    username = ""
    password = ""
    github = Github(username, password)
    print(github.main())
