爬虫登录知乎

最新推荐文章于 2024-08-06 04:17:42 发布

昵称诚诚

最新推荐文章于 2024-08-06 04:17:42 发布

阅读量710

点赞数

分类专栏： Python

本文链接：https://blog.csdn.net/jobschen/article/details/73647257

版权

Python 专栏收录该内容

14 篇文章 0 订阅

订阅专栏

要想爬取知乎的内容，必须先登录，网上现在的一些教程有些已经不能用了，这里将自己登录成功的代码记录下，供大家参考：

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import time
from http import cookiejar
import requests
from bs4 import BeautifulSoup

#从自己浏览器的request 请求头里copy过来
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch, br',
'Accept-Language':'en,zh-CN;q=0.8,zh;q=0.6',
'Cache-Control':'no-cache',
'Connection':'keep-alive',
'DNT':'1',
'Host':'www.zhihu.com',
'Pragma':'no-cache',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}


# 使用登录cookie信息
session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    print(session.cookies)
    session.cookies.load(ignore_discard=True)

except:
    print("还没有cookie信息")


def get_captcha():
    """
    把验证码图片保存到当前目录，手动识别验证码
    :return:
    """
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("验证码：")
    return captcha


def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf


def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        #"captcha": get_captcha(),
        'remember_me': 'true'}
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()


if __name__ == '__main__':
login('xxxxx@xxx.xxx', 'your password')