Simulating Sina Weibo Login with a Python 3 Crawler (latest as of 2019-8-3)

ITblz

Last modified 2024-08-21 22:36:48

Category: Python3 Crawler. Tags: Python3, crawler, Weibo, simulated login, latest

First published 2019-08-03 19:46:21

Article link: https://blog.csdn.net/Blz624613442/article/details/98368815

Simulating Sina Weibo login with a Python 3 crawler

I'm a Python 3 beginner, so if you spot any mistakes, corrections are very welcome.

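Process analysis

The Sina Weibo login flow as captured with Fiddler

From filling in the form to landing on the home page, the whole process goes through seven steps:
1. Before login, once the account name has been entered and the input box loses focus, the browser sends two requests. The first one fetches the servertime, nonce, and pubkey needed to encrypt the password (request 3 in the capture).
2. The second AJAX request is for the captcha (requests 4 and 5 in the capture).
3. After the login button is clicked, the encrypted data is POSTed to the server (request 6 in the capture).
4. The server returns a block of data (request 8 in the capture) that contains the redirect address.
5. After several page loads, a JSON packet arrives from the server (request 19 in the capture). It contains the uniqueid that is specific to each Weibo user.
6. Next comes a script that carries the user's information (request 22 in the capture).
7. After a series of redirects, the browser finally lands on the personal home page.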

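Pitfalls

The worst pitfall by far was the captcha.
Analyzing the captcha request took me most of a day. This is the server address the captcha is requested from:
https://login.sina.com.cn/cgi/pin.php?r=18674039&s=0&p=yf-c92f2edb50c21d4bcbfdc3fccfdb94c4c23f
Breaking it down:
Fixed server URL: https://login.sina.com.cn/cgi/pin.php?
Parameters: r = 18674039, p = yf-c92f2edb50c21d4bcbfdc3fccfdb94c4c23f, s = 0
Of these, s and p appeared to be fixed, while r is a number that changes from request to request. I tried to find a pattern in r and could not, so as a last resort I replayed the request in Fiddler without r to see whether the captcha would still come back. Happily, it did. (Note that the p value hard-coded in the function below differs from the one captured here, so p may in fact vary between sessions.)
In fact, as long as the captcha request shares its cookies with the current login attempt, you can fetch the captcha image yourself.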
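The cookie requirement above can be met with the standard library alone. The following is a minimal sketch, not the original author's code: `make_session_opener` and `captcha_url` are hypothetical helper names, and the `p` value is assumed to be supplied by the caller. Installing the opener globally means later `request.urlopen` and `request.urlretrieve` calls go through the same cookie jar.

```python
from http import cookiejar
from urllib import parse, request

PIN_URL = "https://login.sina.com.cn/cgi/pin.php"


def make_session_opener(install=True):
    """Build an opener whose requests all share one cookie jar."""
    jar = cookiejar.CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    if install:
        # After this call, request.urlopen/urlretrieve use the shared jar too.
        request.install_opener(opener)
    return jar, opener


def captcha_url(p, s=0):
    """Build the captcha URL without the r parameter (hypothetical helper)."""
    return PIN_URL + "?" + parse.urlencode({"s": s, "p": p})
```

Calling `make_session_opener()` once at program start, before the prelogin request, should keep the captcha request and the login POST in the same cookie session.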

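import os
from urllib import request

# Fetch the captcha image and ask the user to type it in
def get_verificationcode():
    print("Requesting the captcha...")
    # The r parameter is omitted on purpose; the request must share cookies with the login
    url = "https://login.sina.com.cn/cgi/pin.php?s=0&p=yf-9f5e31626347e127bc21874aa9d6f4d745ca"
    os.makedirs("./img", exist_ok=True)  # urlretrieve fails if the folder is missing
    request.urlretrieve(url=url, filename="./img/code.jpg")
    print("Captcha saved to ./img/code.jpg")
    return input("Enter the captcha: ")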

I handle the captcha semi-manually: each captcha image is saved to the img folder under the current directory, and I look at it and type the code in by hand.

Step one: encrypting the account name and password

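Analysis shows that before the login POST of step two is sent, the browser first sends another request. Its response carries servertime, pubkey, nonce, rsakv, and other values that are essential later for encrypting the password and assembling the POST data. For how to arrive at this analysis, and for the overall structure of the crawler, I also referred to this post: https://www.cnblogs.com/houkai/p/3488468.html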
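The post does not show how these fields are extracted, so here is a minimal sketch of one way to do it. It assumes the prelogin response is a JSON object wrapped in a JavaScript callback (JSONP); the function name `parse_prelogin`, the callback name, and every value in the sample are made up for illustration.

```python
import json
import re


def parse_prelogin(text):
    """Pull servertime, nonce, pubkey and rsakv out of a JSONP-style response."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # Assumption: the payload looks like callbackName({...})
    match = re.search(r"\((\{.*\})\)", text, re.S)
    if match is None:
        raise ValueError("no JSON object found in the prelogin response")
    data = json.loads(match.group(1))
    return data["servertime"], data["nonce"], data["pubkey"], data["rsakv"]


# Made-up sample with the same field names the article mentions
sample = (b'sinaSSOController.preloginCallBack({"retcode":0,'
          b'"servertime":1564805281,"nonce":"ABC123",'
          b'"pubkey":"EB2A38","rsakv":"1330428213"})')
print(parse_prelogin(sample))
```

If the real response uses a different wrapper or extra fields, only the regular expression should need adjusting.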
Below is the code of the encryption module encrypt.py:

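import base64
import binascii
import rsa  # third-party package (pip install rsa)

# Base64-encode the username (an encoding rather than real encryption)
def encryUsername(username):
    print("Encoding the username...")
    text = base64.b64encode(username.encode(encoding="utf-8")).decode()
    return text.replace("=", "")  # strip the base64 padding

# Encrypt the password with RSA
def encryPassword(password, servertime, nonce, pubkey):
    print("Encrypting the password...")
    rsaPublickey = int(pubkey, 16)  # pubkey is the modulus as a hex string
    key = rsa.PublicKey(rsaPublickey, 65537)  # build the public key, exponent 65537
    # Plaintext layout, taken from the site's JS encryption file
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    message = bytes(message, encoding="utf-8")
    passwd = rsa.encrypt(message, key)  # RSA-encrypt the plaintext
    passwd = binascii.b2a_hex(passwd)  # hex-encode the ciphertext (result is bytes)
    return passwd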
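To make the encryption step less of a black box, the sketch below reproduces with the standard library what the `rsa` package does inside `rsa.encrypt`: PKCS#1 v1.5 padding followed by modular exponentiation. This is an illustration under that assumption, not a replacement for the library; `pkcs1_v15_pad` and `rsa_encrypt_hex` are hypothetical names.

```python
import os


def pkcs1_v15_pad(message: bytes, k: int) -> bytes:
    """Pad message to k bytes as 00 02 <nonzero random bytes> 00 <message>."""
    if len(message) > k - 11:
        raise ValueError("message too long for this key size")
    ps_len = k - 3 - len(message)
    ps = b""
    while len(ps) < ps_len:
        # Padding bytes must be nonzero, so drop any zero bytes and retry
        ps += os.urandom(ps_len - len(ps)).replace(b"\x00", b"")
    return b"\x00\x02" + ps + b"\x00" + message


def rsa_encrypt_hex(message: bytes, n: int, e: int = 65537) -> str:
    """Encrypt message under modulus n and return the ciphertext as hex."""
    k = (n.bit_length() + 7) // 8  # key length in bytes
    padded = pkcs1_v15_pad(message, k)
    cipher_int = pow(int.from_bytes(padded, "big"), e, n)
    return cipher_int.to_bytes(k, "big").hex()
```

With the prelogin values, the call would look like `rsa_encrypt_hex(message, int(pubkey, 16))`, where `message` is the same servertime, nonce, and password plaintext built in encryPassword. Because the padding is random, two runs give different ciphertexts for the same input.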

Assembling the login POST data

The relevant parameters can be captured in Chrome's developer tools.

from urllib import parse

import encrypt  # the encrypt.py module shown above

# Assemble the POST data
def get_postData(su, password, servertime, nonce, pubkey, rsakv):
    print("Assembling the POST data...")
    # Encrypt the password
    sp = encrypt.encryPassword(password, servertime, nonce, pubkey)
    # Fetch the captcha and read the user's answer
    door = get_verificationcode()
    # Build the POST parameters
    data = {
        "door": door,
        "entry": "weibo",
        "gateway": 1,
        "from": "",
        "savestate": 7,
        "su": su,
        "sp": sp,
        "servertime": servertime,
        "service": "miniblog",
        "nonce": nonce,
        "rsakv": rsakv,
        "encoding": "UTF-8",
        "domain": "sina.com.cn",
        "returntype": "META",
        "vsnf": 1,
        "useticket": 1,
        "pwencode": "rsa2",
        "prelt": 372,
        "qrcode_flag": "false",
        "url": "https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack"
    }
    # URL-encode the form and convert it to bytes for urllib
    data = parse.urlencode(data).encode("utf-8")
    return data
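One detail worth checking here is that encryPassword returns bytes rather than str, and the dictionary mixes bytes, str, and int values. The short sketch below, with placeholder values, shows that `parse.urlencode` accepts all three, so the hex ciphertext passes through unchanged.

```python
from urllib import parse

# Placeholder values; sp stands in for the hex ciphertext bytes
sample = {"sp": b"0a1b2c", "gateway": 1, "encoding": "UTF-8"}
encoded = parse.urlencode(sample).encode("utf-8")
print(encoded)
```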

How to reach the home page after login

This part draws on: https://www.cnblogs.com/woaixuexi9999/p/9404745.html
The module login.py defines a class Login. Its login method looks like this:

    def login(self):
        # Step 1: fetch the timestamp, public key, nonce and related data
        req = request.Request(url=self.__preloginUrl, headers=self.headers1, method="GET")
        response = request.urlopen(req)
        text = response.read()
        servertime, nonce, pubkey, rsakv = dealdata.get_prelogin(text=text)

        # Step 2: POST the login information to the server
        postdata = dealdata.get_postData(self.__su, self.__password, servertime, nonce, pubkey, rsakv)
        req = request.Request(url=self.__loginUrl, headers=self.__postheaders, data=postdata, method="POST")
        response = request.urlopen(req)
        text = response.read()

        # Step 3: parse the login response and extract the intermediate link
        replaceUrl = dealdata.get_replaceUrl(text=text)

        # Check the login result
        result, retcode, reason = dealdata.get_reason(replaceUrl)
        if not result:
            print("Login failed!")
            print("Reason:", reason)
            return
        else:
            print("Login succeeded!")
            print("Redirecting to the personal home page...")

        # Step 4: load the intermediate link and extract the ticket
        response = request.urlopen(replaceUrl)
        text = response.read()
        ticket = dealdata.get_ticket(text=text)

        # Step 5: append the remaining query parameters to the ticket URL
        # and fetch the JSON data that carries the uniqueid
        uniqueidUrl = ticket + "&callback=sinaSSOController.doCrossDomainCallBack&scriptId=ssoscript0&client=ssologin.js(v1.4.19)&_=1564805281285"
        response = request.urlopen(uniqueidUrl)
        text = response.read()
        uniqueid = dealdata.get_uniqueid(text)

        # Step 6: go to the home page and save it locally
        print("Entering the personal home page...")
        homeUrl = "https://weibo.com/u/" + uniqueid + "/home"
        request.urlretrieve(homeUrl, "./html/home.html")
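
The dealdata helpers called above are not shown in this post. The sketch below is a guess at what three of them might look like, not the author's implementation. It assumes the login response contains a `location.replace("...")` call, that the redirect URL carries `retcode` and `reason` query parameters with `retcode=0` meaning success, and that the uniqueid appears as a quoted number in the JSON. All sample inputs are made up. The `cache_buster` helper shows how the hard-coded `_=1564805281285` value, which looks like a millisecond timestamp, could be generated instead.

```python
import re
import time
from urllib import parse


def _to_text(data):
    """Decode bytes leniently; the parts we need are plain ASCII."""
    return data.decode("utf-8", errors="replace") if isinstance(data, bytes) else data


def get_replace_url(text):
    """Extract the URL passed to location.replace(...) (assumed format)."""
    match = re.search(r"location\.replace\(\s*['\"](.+?)['\"]\s*\)", _to_text(text))
    if match is None:
        raise ValueError("no location.replace(...) call found")
    return match.group(1)


def get_reason(url):
    """Return (success, retcode, reason) from the redirect URL's query string."""
    query = parse.parse_qs(parse.urlsplit(url).query)
    retcode = query.get("retcode", [""])[0]
    reason = query.get("reason", [""])[0]
    return retcode == "0", retcode, reason  # assumption: "0" means success


def get_uniqueid(text):
    """Extract the uniqueid value from the JSON(P) payload."""
    match = re.search(r'"uniqueid"\s*:\s*"(\d+)"', _to_text(text))
    if match is None:
        raise ValueError("uniqueid not found")
    return match.group(1)


def cache_buster():
    """Current time in milliseconds, as a string for the _ parameter."""
    return str(int(time.time() * 1000))
```

If the failure reason arrives in a non-UTF-8 encoding, `parse.parse_qs` accepts an `encoding` argument that could be set accordingly.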