Scraping a Renren User Profile Page
The page shown in the figure is the one we want to scrape this time: a user's personal profile page on Renren.
Our first idea is to have the crawler simulate a login and then request the page we want to scrape.
With that in mind, let's write the code.
import requests
from lxml import etree
import base64
import json


def base64_api(uname, pwd, img, typeid):
    # Send the captcha image (base64-encoded) to the ttshitu recognition API
    # and return the recognized text. uname/pwd are the ttshitu account credentials.
    with open(img, 'rb') as f:
        base64_data = base64.b64encode(f.read())
        b64 = base64_data.decode()
        data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
        result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
        if result['success']:
            return result["data"]["result"]
        else:
            return result["message"]
    return ""
if __name__ == "__main__":
    # Disguise the User-Agent
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
    }
    # Target url
    url = "http://www.renren.com/"
    # Fetch the page source
    page_text = requests.get(url=url, headers=header).text
    # Parse with xpath
    tree = etree.HTML(page_text)
    src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]  # captcha image url
    # Save the captcha image
    photo = requests.get(url=src, headers=header).content
    with open("./验证码.jpg", "wb") as fp:
        fp.write(photo)
    img_path = "C:\\Users\\ASUS\\Desktop\\CSDN\\验证码识别\\验证码.jpg"
    result = base64_api(uname='', pwd='', img=img_path, typeid=3)  # fill in your ttshitu account here
    print(result)
    # Login url
    url = "http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=20214120314"
    data = {
        "email": "15157485037",
        "icode": result,
        "origURL": "http://www.renren.com/home",
        "domain": "renren.com",
        "key_id": "1",
        "captcha_type": "web_login",
        "password": "8a62222be07c2cf68e8d68f4617fe01d7dbc488427d0bc61666ab8a6e56e94f0",
        "rkey": "07a9f1810ecf9b507634a45447a628e7",
        "f": "http%3A%2F%2Fwww.renren.com%2F976706166%2Fprofile"
    }
    # Simulate the login
    response = requests.post(url=url, headers=header, data=data)
    # Check the status code: 200 means the login request went through
    print(response.status_code)
    # User profile page
    url = "http://www.renren.com/976706166/profile"
    # Request the profile page
    response = requests.get(url=url, headers=header).text
    # Save it
    with open("人人网用户个人页面.html", "w", encoding="utf-8") as fp:
        fp.write(response)
    print("over!!!")
Now let's run it and look at the result.
When we open the saved page, we find that even though the login apparently succeeded (the status code is 200), the page we scraped tells us something went wrong with the login.
Why is that?
In short, the HTTP protocol is stateless: it keeps no record of the login. The simulated login above and the later request for the profile page are completely independent of each other, so the second request carries no login state at all.
So what can we do?
Normally we could add the cookie data to the request-header (header) dictionary, but that approach has its limits, because some sites rotate their cookies dynamically. So here we introduce a class provided by the requests module: Session. A session object automatically records the login state (cookies) returned by one request and sends it along with the next request, as the short sketch below shows.
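Here is a minimal sketch of the two approaches, purely for illustration: the abbreviated user-agent string, the pasted cookie value and the empty login form data are placeholders, and the URLs are the ones already used above.
import requests

ua = {"user-agent": "Mozilla/5.0 ..."}  # abbreviated UA string; use the full one in practice

# Approach 1: copy the Cookie string out of the browser's developer tools and put it
# straight into the headers. This works until the site rotates the cookie,
# after which it has to be copied again by hand.
header_with_cookie = dict(ua, cookie="PASTE_COOKIE_STRING_HERE")  # placeholder value
page = requests.get("http://www.renren.com/976706166/profile",
                    headers=header_with_cookie).text

# Approach 2: let requests.Session() manage the cookies for us.
# Cookies set by the login response are stored on the session object and are
# sent automatically with every later request made through the same session.
session = requests.Session()
session.post("http://www.renren.com/ajaxLogin/login", headers=ua, data={})  # login form data omitted
page = session.get("http://www.renren.com/976706166/profile", headers=ua).text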
With that understood, we can start writing the new version of the code. It is almost the same as before, with only a few changes.
import requests
from lxml import etree
import base64
import json


def base64_api(uname, pwd, img, typeid):
    # Send the captcha image (base64-encoded) to the ttshitu recognition API
    # and return the recognized text. uname/pwd are the ttshitu account credentials.
    with open(img, 'rb') as f:
        base64_data = base64.b64encode(f.read())
        b64 = base64_data.decode()
        data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
        result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
        if result['success']:
            return result["data"]["result"]
        else:
            return result["message"]
    return ""
if __name__ == "__main__":
    # Create a session that keeps the login state (cookies) between requests
    session = requests.Session()
    # Disguise the User-Agent
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
    }
    # Target url
    url = "http://www.renren.com/"
    # Fetch the page source
    page_text = requests.get(url=url, headers=header).text
    # Parse with xpath
    tree = etree.HTML(page_text)
    src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]  # captcha image url
    # Save the captcha image
    photo = requests.get(url=src, headers=header).content
    with open("./验证码.jpg", "wb") as fp:
        fp.write(photo)
    img_path = "C:\\Users\\ASUS\\Desktop\\CSDN\\验证码识别\\验证码.jpg"
    result = base64_api(uname='', pwd='', img=img_path, typeid=3)  # fill in your ttshitu account here
    print(result)
    # Login url
    url = "http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=20214120314"
    data = {
        "email": "15157485037",
        "icode": result,
        "origURL": "http://www.renren.com/home",
        "domain": "renren.com",
        "key_id": "1",
        "captcha_type": "web_login",
        "password": "8a62222be07c2cf68e8d68f4617fe01d7dbc488427d0bc61666ab8a6e56e94f0",
        "rkey": "07a9f1810ecf9b507634a45447a628e7",
        "f": "http%3A%2F%2Fwww.renren.com%2F976706166%2Fprofile"
    }
    # Simulate the login. Post through the session so the login state is recorded
    response = session.post(url=url, headers=header, data=data)
    # Check the status code: 200 means the login request went through
    print(response.status_code)
    # User profile page
    url = "http://www.renren.com/976706166/profile"
    # Use the same session for this request, so the recorded cookies are sent along
    response = session.get(url=url, headers=header).text
    # Save it
    with open("人人网用户个人页面.html", "w", encoding="utf-8") as fp:
        fp.write(response)
    print("over!!!")
After the run finishes, let's look at the saved page data again.
As shown in the figure:
This shows that we have now successfully scraped the Renren user profile page.