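This is my first crawler, so there is still a lot to improve. For now, here are a few notes on what I learned.
Login
I want to use my own account as the seed, so a session keeps the login cookies. It is also a chance to practice handling a captcha.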
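1. Open Chrome DevTools (F12) and go to the Network tab. Enter the account details, click log in, and check where the request is sent: a phone number is posted to https://www.zhihu.com/login/phone_num, an email address to https://www.zhihu.com/login/email. Click the request and look at Headers, then set the header dict in the code to match. General shows that the method is POST, and Form Data holds the POST body. The browser tools alone are enough to work out roughly what has to be posted. Note that the POST call is given three arguments here (url, data, headers), whereas the GET calls pass only two (url, headers):
response_text = session.post(post_url,data=post_form,headers=header,allow_redirects=False)
Look at Preview: the response is JSON, so json.loads can parse it to get the data needed for the next step.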
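The two observations in step 1 (the endpoint depends on the account type, and the reply is JSON) can be sketched as two small helpers. This is only an illustration: `login_endpoint` and `login_succeeded` are hypothetical names that do not appear in the script, and the strict 11-digit check with `re.fullmatch` is an assumption that is slightly tighter than a prefix match. The success message `'登录成功'` is the one the script compares against.

```python
import json
import re

PHONE_URL = 'https://www.zhihu.com/login/phone_num'
EMAIL_URL = 'https://www.zhihu.com/login/email'


def login_endpoint(account):
    """Return the login URL for a phone number or an email address."""
    # Assumption: a phone account is exactly 11 digits starting with 1.
    if re.fullmatch(r'1\d{10}', account):
        return PHONE_URL
    return EMAIL_URL


def login_succeeded(body):
    """Parse the JSON reply text and report whether the login worked."""
    try:
        reply = json.loads(body)
    except ValueError:
        # Not JSON at all (for example an HTML error page).
        return False
    return isinstance(reply, dict) and reply.get('msg') == '登录成功'


print(login_endpoint('13800000000'))
print(login_succeeded('{"msg": "登录成功"}'))
```

Keeping these checks free of network calls makes them easy to try out before sending a real request.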
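2. Click the captcha and log in again to see the format in which the captcha is submitted:
post_form = {
    '_xsrf': search_xsrf(),
    'password': password,
    'captcha': down_captha(),
    'captcha_type': 'cn',
    'phone_num': account
}
Here down_captha() returns the coordinates of the characters clicked in the captcha, and search_xsrf() reads the _xsrf token from the page HTML.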
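As a rough sketch of what search_xsrf() does with the page, the token can be pulled out of the HTML with a single regular expression. The sample HTML fragment and the name `extract_xsrf` are made up for illustration, and the real page layout may differ.

```python
import re


def extract_xsrf(html):
    """Return the value of the hidden _xsrf input, or '' if it is missing."""
    # re.search scans the whole string, so the pattern needs no prefix.
    found = re.search(r'name="_xsrf" value="(.*?)"', html)
    return found.group(1) if found else ''


# Hypothetical page fragment, only for demonstration.
sample = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
print(extract_xsrf(sample))  # prints abc123
```

Returning an empty string when the token is missing mirrors the fallback used in the script.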
The complete login script:
import json
import re
import time
from getpass import getpass

import requests

# Request headers copied from the browser's Network panel.
header = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36",
    'Host': "www.zhihu.com",
    'Origin': "http://www.zhihu.com",
    'Pragma': "no-cache",
    'Referer': "http://www.zhihu.com/",
    'X-Requested-With': "XMLHttpRequest"
}
# One shared session so the login cookies persist across requests.
session = requests.session()
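def search_xsrf():
    """Fetch the home page and return the hidden _xsrf token ('' if absent)."""
    response = session.get('https://www.zhihu.com/', headers=header)
    # re.search scans the whole page, so no leading [\s\S]* is needed.
    results = re.search(r'name="_xsrf" value="(.*?)"', response.text)
    if results:
        return results.group(1)
    return ''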
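def down_captha():
    """Download the captcha, show it, and return the chosen points as JSON."""
    captha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn' % int(time.time() * 1000)
    response = session.get(captha_url, headers=header)
    with open('captcha.gif', 'wb') as f:  # the with block closes the file
        f.write(response.content)
    try:
        from PIL import Image
        img = Image.open('captcha.gif')
        img.show()
        img.close()
    except Exception:
        # Pillow is missing or no viewer is available: open captcha.gif by hand.
        pass
    captcha = {
        'img_size': [200, 44],
        'input_points': [],
    }
    # Preset click coordinates for the seven character slots in the image.
    points = [[16.875, 28], [32.875, 27], [65.875, 31], [88.875, 24], [106.875, 24], [147.875, 30],
              [174.875, 29]]
    seq = input('Enter the positions of the upside-down characters (digits 1-7, e.g. 25)\n>')
    for i in seq:
        captcha['input_points'].append(points[int(i) - 1])
    return json.dumps(captcha)  # the captcha field is itself sent as a JSON string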
def zhihu_Login(account=None, password=None):
    if account is None:
        print("Enter your account")
        account = input()
        print("Enter your password")
        # password = getpass("Enter your password: ")
        password = input()
    if re.match(r'1\d{10}', account):
        print("Logging in with a phone number")
        post_url = 'https://www.zhihu.com/login/phone_num'
        post_form = {
            '_xsrf': search_xsrf(),
            'password': password,
            'captcha': down_captha(),
            'captcha_type': 'cn',
            'phone_num': account
        }
    else:
        # Only the phone-number branch is implemented so far.
        print("Only phone-number login is implemented")
        return
    response_text = session.post(post_url, data=post_form, headers=header, allow_redirects=False)
    response_text = json.loads(response_text.text)
    # The server reports success with the literal message '登录成功'.
    if 'msg' in response_text and response_text['msg'] == '登录成功':
        print("Login succeeded")
    else:
        print("Login failed, please log in again")


if __name__ == '__main__':
    zhihu_Login()
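The captcha field assembled in down_captha() can also be written as a pure function, which makes it easy to check without downloading an image. A minimal sketch, assuming the same seven preset points and the 200x44 image size used in the script; `build_captcha` is a hypothetical name and the input validation is an addition of mine.

```python
import json

IMG_SIZE = [200, 44]
POINTS = [[16.875, 28], [32.875, 27], [65.875, 31], [88.875, 24],
          [106.875, 24], [147.875, 30], [174.875, 29]]


def build_captcha(seq):
    """Turn a string of 1-based positions such as '25' into the JSON payload."""
    chosen = []
    for ch in seq.strip():
        # Added check: reject anything that is not a digit from 1 to 7.
        if ch not in '1234567':
            raise ValueError('position must be a digit from 1 to 7: %r' % ch)
        chosen.append(POINTS[int(ch) - 1])
    return json.dumps({'img_size': IMG_SIZE, 'input_points': chosen})


print(build_captcha('25'))
```

With the validation in place, a mistyped position raises a clear error instead of an IndexError or a silently wrong point.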