Reverse Engineering a Certain Data Platform: Learning to Reverse a Protobuf Website (Part 2)

princezyj

Last modified: 2023-01-12 12:16:14

Tags: python, crawler security, javascript

First published: 2023-01-11 11:29:40

Article link: https://blog.csdn.net/qq_56881388/article/details/128641558

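Disclaimer

Everything in this article is for learning and discussion only, and the related links have been redacted. If anything here infringes your rights, please contact me and I will remove it immediately!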

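I. Preface

In the previous part we built the proto file for the request and successfully sent it with Python to get data back. The data we received, however, is the same garbled output that F12 (DevTools) shows:

[Screenshot: the garbled response data]

This is protobuf as well. What we send is protobuf and what we receive is protobuf; it travels in binary serialized form, which is why it looks garbled to us.

So how do we make it readable, or at least easy to process?

Two methods follow: (1) convenient but not very intuitive, and (2) intuitive but a bit more involved.

Pick whichever suits you.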

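II. Two ways to handle the response

1. Method one:

Use blackboxprotobuf, a third-party Python library for working with protobuf.

Install command: pip install blackboxprotobuf

Core function: blackboxprotobuf.decode_message(bytes_data)

It decodes protobuf data into something readable and easy to understand.

It returns two values. The first is the data (in a format similar to what we decoded on the command line in the previous part, keyed by field position in the proto file). The second is the type found at each position.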

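Here we simply pass in response.content, but you will find that it raises an error:

[Screenshot: the error raised when decoding response.content]

The reason is that we have not analyzed the response yet. Part of what the server sends is header information that is not part of the protobuf message, so parsing fails.

I'll show the correct result first so you can see the effect. The whole process is explained in detail under method two:

[Screenshot: the correctly decoded result]

We can then analyze the structure of data, find the parts we need, and read the values.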

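Reading a value works just like reading JSON: object[key]. Here data['2'] gives 爬虫 (the search keyword).

To pick out what you want, paste the data into a JSON viewer website and analyze it there.

Recommended site: https://spidertools.cn/#/

[Screenshot: the decoded data in the JSON viewer]

As you can see, we did get the data, but it is only position numbers plus content, so we have to guess what each field means.

That is what I meant by convenient but not intuitive.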
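Since the decoded result is only a nested structure keyed by field numbers, one way to make the guessing easier is to flatten it into path/value pairs. The sketch below is not from the original workflow: the helper name walk_fields and the sample dict are made up for illustration, and the sample is a hand-written stand-in for what decode_message might return rather than real output.

```python
from typing import Any, Iterator, Tuple


def walk_fields(message: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf of a decoded message.

    Nested messages are assumed to be dicts keyed by field number (as
    strings) and repeated fields to be lists, so a path looks like
    "2.1[0].3".
    """
    if isinstance(message, dict):
        for key, value in message.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from walk_fields(value, path)
    elif isinstance(message, list):
        for index, value in enumerate(message):
            yield from walk_fields(value, f"{prefix}[{index}]")
    else:
        yield prefix, message


# Hypothetical sample shaped like a decoded message (not real site data).
sample = {"1": 20, "2": {"1": [{"3": "爬虫"}, {"3": "python"}], "4": 2}}

for path, value in walk_fields(sample):
    print(path, "=", value)
```

Printing every leaf with its numeric path makes it quicker to spot which field numbers hold titles, counts, and so on before deciding which keys to index.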

The blackboxprotobuf library has other API functions as well; see other articles for how to use them.

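2. Method two:

Write a proto file for the response, just as we did for the request. It can of course live in the same file as the request messages.

Then compile it to Python from the command line in the same way:

protoc --python_out=. ./test.proto

Then create an instance as before and call .ParseFromString(bytes_data) on it to parse the data: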

search_response = test_proto2_pb2.SearchService_SearchResponse()
search_response.ParseFromString(response.content[5:])
print(search_response)

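This demo uses the example from 渔歌's earlier article. Here is the result:

[Screenshot: the printed search_response]

Much more readable, isn't it? The core of the work, and the hard part, is still writing the proto file. We also need to understand why only part of the data is parsed here. Next we cover how to build the response proto file and why the data is taken starting from the 6th byte (index 5).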

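Start the analysis in F12. The response is the hardest part to debug and to locate, so we begin with the call stack of the request, which may give us a hint and reveal a suspicious spot.

Set an XHR breakpoint and follow the stack, focusing on JS files whose names start with app. Files starting with chunk are base libraries and are rarely tampered with; encryption and similar processing usually lives in the site's own JS.

[Screenshot: the call stack at the XHR breakpoint]

The h here looks suspicious: it gets a value asynchronously and then calls .toObject on it. toObject is an API function that is generated when a proto file is compiled to JS. You can try compiling a proto file to JS with protoc to see it for yourself.
(Download a protoc version lower than 3.21.0. The JS generator has since been split out into a separate project, so running js_out with the latest protoc fails because the plugin is missing.)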

protoc --js_out=import_style=commonjs,binary:. your_proto_file.proto

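Then set a breakpoint and run. You will see data like this:

[Screenshot: the data at the breakpoint]

It looks a lot like the earlier request, doesn't it? This is in fact already the unpacked result, so the unpacking must happen earlier, but the asynchronous flow is hard to follow. We could set a breakpoint at the then and single-step from there, but I took another route: copy this name and do a global search for it: proto.SearchService.SearchResponse

That leads to the key words of this part, the ones needed to write the proto file for the response:

deserializeBinary------deserializeBinaryFromReader (the key one)

[Screenshot: search results for deserializeBinary and deserializeBinaryFromReader]

They closely resemble the key words for the request, with de added in front, meaning "decode". Credit goes to 志远 for the guidance here. I had no idea where to start when looking for the response handling, many articles do not explain the key words for unpacking, and the async code is hard to follow. After a small hint from him, setting a breakpoint was enough to locate it.

Set a breakpoint here, resume, and click search again:

[Screenshot: the breakpoint being hit]

We land at this position. Compare it with the data captured in Fiddler:

[Screenshot: comparison with the Fiddler capture]

Unpacking starts from the 6th byte, which answers the earlier question of why the data is taken starting from the 6th byte.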

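Remember the two key words mentioned above?

deserializeBinary------deserializeBinaryFromReader (the key one)

[Screenshot: the deserializeBinary and deserializeBinaryFromReader code]

Read it the same way as the request code. The difference is that the field position is now expressed by a case label, and the field type by the type name that follows read (readString, readInt32, and so on).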
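The case labels correspond to protobuf field numbers, which on the wire are packed into a tag together with a wire type. As a rough illustration (my own sketch, not code from the site), the following decodes a tag by hand using the standard protobuf encoding: a base-128 varint whose value is (field_number << 3) | wire_type. The helper names are made up for this example.

```python
from typing import Tuple

# Wire types defined by the protobuf encoding (groups 3 and 4 are omitted).
WIRE_TYPES = {0: "varint", 1: "64-bit", 2: "length-delimited", 5: "32-bit"}


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def read_tag(buf: bytes, pos: int) -> Tuple[int, str, int]:
    """Split a field tag into (field_number, wire_type_name, next_pos)."""
    tag, pos = read_varint(buf, pos)
    return tag >> 3, WIRE_TYPES.get(tag & 0x07, "unknown"), pos


# 0x12 == (2 << 3) | 2: field number 2, length-delimited (string/bytes/message).
data = b"\x12\x06" + "爬虫".encode("utf-8")
field_number, wire_type, pos = read_tag(data, 0)
length, pos = read_varint(data, pos)
print(field_number, wire_type, data[pos:pos + length].decode("utf-8"))
```

The wire type only narrows the choice (length-delimited could be a string, bytes, or a nested message), which is why the read method in the generated JS is still needed to pin down the exact proto type.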

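[Screenshot]

The messages are nested layer by layer and can be traced out slowly. As we just saw, though, this response carries a lot of data and a great many field numbers, so is there a better way to obtain the proto file?

There is: write an AST script and let it process these switch statements. First install the Babel library in the current directory: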

npm install @babel/core --save-dev

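I am not very familiar with AST yet, so I used the script 渔歌 wrote earlier (link at the end of the article). Because the site's JS has had minor updates, the script as originally written now raises an error.

Following that article, first save this JS file. I saved it as 1.js. Then run the script:

[Screenshot: the error from running the script]

It fails, so I debugged it and made a small change.

Original: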

path.parentPath.node.right.body.body[0].body.body[2].cases.length

Changed to:

path.parentPath.node.right.body.body[0].body.body[0].cases.length

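I also wrapped it in a try block, because for some functions length cannot be read. I did not analyze the cause closely; I simply started from the error and made a small fix.

I also changed proto2 to proto3 and removed the optional keyword.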

// Parses source code into an AST
const parser = require("@babel/parser");
// Template engine for building AST nodes
const template = require("@babel/template").default;
// Traverses the AST
const traverse = require("@babel/traverse").default;
// Node utilities: type checks, building new nodes, etc.
const t = require("@babel/types");
// Turns the AST back into source code
const generator = require("@babel/generator");
// File operations
const fs = require("fs");

// Shared helper functions
function wtofile(path, flags, code) {
    var fd = fs.openSync(path,flags);
    fs.writeSync(fd, code);
    fs.closeSync(fd);
}

function dtofile(path) {
    fs.unlinkSync(path);
}

var file_path = '1.js'; // the file to process
var jscode = fs.readFileSync(file_path, {
    encoding: "utf-8"
});

// Parse the source into an AST
let ast = parser.parse(jscode);
let proto_text = `syntax = "proto3";\n\n// protoc --python_out=. app_proto3.proto\n\n`;

traverse(ast, {
    MemberExpression(path){
        if(path.node.property.type === 'Identifier' && path.node.property.name === 'deserializeBinaryFromReader' && path.parentPath.type === 'AssignmentExpression'){
            let id_name = path.toString().split('.').slice(1, -1).join('_');
            path.parentPath.traverse({
                VariableDeclaration(path_2){
                    if(path_2.node.declarations.length === 1){
                        path_2.replaceWith(t.expressionStatement(
                            t.assignmentExpression(
                                "=",
                                path_2.node.declarations[0].id,
                                path_2.node.declarations[0].init
                            )
                        ))
                    }
                },
                SwitchStatement(path_2){
                    for (let i = 0; i < path_2.node.cases.length - 1; i++) {
                        let item = path_2.node.cases[i];
                        let item2 = path_2.node.cases[i + 1];
                        if(item.consequent.length === 0 && item2.consequent[1].expression.type === 'SequenceExpression'){
                            item.consequent = [
                                item2.consequent[0],
                                t.expressionStatement(
                                    item2.consequent[1].expression.expressions[0]
                                ),
                                item2.consequent[2]
                            ];
                            item2.consequent[1] = t.expressionStatement(
                                item2.consequent[1].expression.expressions[1]
                            )
                        }else if(item.consequent.length === 0){
                            item.consequent = item2.consequent
                        }else if(item.consequent[1].expression.type === 'SequenceExpression'){
                            item.consequent[1] = t.expressionStatement(
                                item.consequent[1].expression.expressions[1]
                            )
                        }
                    }
                }
            });
            let id_text = 'message ' + id_name + ' {\n';
            let let_id_list = [];
            try{
                // console.log(path.parentPath.node.right.body.body[0].body.body[0].cases.length);
                for (let i = 0; i < path.parentPath.node.right.body.body[0].body.body[0].cases.length; i++) {
                    let item = path.parentPath.node.right.body.body[0].body.body[0].cases[i];
                    if(item.test){
                        let id_number = item.test.value;
                        let key = item.consequent[1].expression.callee.property.name;
                        let id_st, id_type;
                        if(key.startsWith("set")){
                            id_st = "";
                        }else if(key.startsWith("add")){
                            id_st = "repeated";
                        }else{
                            // map type: not needed in this case, so it is skipped here
                            continue
                        }
                        key = key.substring(3, key.length);
                        id_type = item.consequent[0];
                        if(id_type.expression.right.type === 'NewExpression'){
                            id_type = generator.default(id_type.expression.right.callee).code.split('.').slice(1).join('_');
                        }else{
                            switch (id_type.expression.right.callee.property.name) {
                                case "readString":
                                    id_type = "string";
                                    break;
                                case "readDouble":
                                    id_type = "double";
                                    break;
                                case "readInt32":
                                    id_type = "int32";
                                    break;
                                case "readInt64":
                                    id_type = "int64";
                                    break;
                                case "readFloat":
                                    id_type = "float";
                                    break;
                                case "readBool":
                                    id_type = "bool";
                                    break;
                                case "readPackedInt32":
                                    id_st = "repeated";
                                    id_type = "int32";
                                    break;
                                case "readBytes":
                                    id_type = "bytes";
                                    break;
                                case "readEnum":
                                    id_type = "readEnum";
                                    break;
                                case "readPackedEnum":
                                    id_st = "repeated";
                                    id_type = "readEnum";
                                    break;
                            }
                        }
                        if(id_type === 'readEnum'){
                            id_type = id_name + '_' + key + 'Enum';
                            if(let_id_list.indexOf(id_number) === -1){
                                id_text += '\tenum ' + id_type + ' {\n';
                                for (let j = 0; j < 3; j++) {
                                    id_text += '\t\t' + id_type + 'TYPE_' + j + ' = ' + j + ';\n';
                                }
                                id_text += '\t}\n\n';
                                id_text += '\t' + id_st + ' ' + id_type + ' ' + key + ' = ' + id_number + ';\n';
                                let_id_list.push(id_number)
                            }
                        }else{
                            if(let_id_list.indexOf(id_number) === -1){
                                id_text += '\t' + id_st + ' ' + id_type + ' ' + key + ' = ' + id_number + ';\n';
                                let_id_list.push(id_number)
                            }
                        }
                    }
                }
            }catch(e){
                // cases.length cannot be read for some functions; skip them
            }

            id_text += '}\n\n';
            proto_text += id_text
        }
    }
});

wtofile('app_proto3.proto', 'w', proto_text);

This AST script targets this one site only; other sites can be analyzed in a similar way.
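The core rule the script applies to each case can be summarized in a few lines. The Python sketch below restates it under simplifying assumptions: it covers only scalar readers (no nested messages, enums, or maps), and the function name field_line and the field numbers in the usage lines are illustrative rather than taken from the site.

```python
# Reader method name in the generated JS -> proto scalar type.
READER_TO_PROTO = {
    "readString": "string",
    "readDouble": "double",
    "readInt32": "int32",
    "readInt64": "int64",
    "readFloat": "float",
    "readBool": "bool",
    "readBytes": "bytes",
    "readPackedInt32": "int32",
}


def field_line(accessor: str, reader: str, number: int) -> str:
    """Build one proto3 field line from an accessor name and a reader name.

    accessor is the setXxx/addXxx method called inside the case, reader is
    the readXxx method used to obtain the value, and number is the case label.
    """
    if not accessor.startswith(("set", "add")):
        raise ValueError(f"unsupported accessor: {accessor}")
    repeated = accessor.startswith("add") or reader.startswith("readPacked")
    label = "repeated " if repeated else ""
    # Drop the 3-letter set/add prefix to recover the field name.
    return f"{label}{READER_TO_PROTO[reader]} {accessor[3:]} = {number};"


print(field_line("setSearchWord", "readString", 2))
print(field_line("addSearchFilterList", "readPackedInt32", 6))
```

Keeping the mapping in a table makes it easy to extend when a site uses readers that the script does not cover yet.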

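After the script finishes we have app_proto3.proto. Opening it reveals errors, shown below. 渔歌's article explains the cause: when deserializeBinaryFromReader is called through a variable, the AST script cannot tell which message type that variable refers to, so the type is not filled in.

[Screenshot: the errors in app_proto3.proto]

In the debugger, search for the keyword ExportResponse.deserializeBinaryFromReader

That locates the position. Following the parent/child relationship described above, step into it:

[Screenshot: the ExportResponse.deserializeBinaryFromReader code]

and you can find what s stands for.

Then fix the remaining errors one by one in the same way.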

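With the proto file in hand, the task is essentially done. Let's send a request and see.

[Screenshot: the parsed response]

The output is easy to read and values are easy to pick out. That basically concludes our protobuf journey!

Full code: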

import requests
import app_proto3_pb2 as pb
import blackboxprotobuf

search_request = pb.SearchService_SearchRequest()
search_request.InterfaceType = 1
search_request.Commonrequest.SearchType = 'paper'
search_request.Commonrequest.SearchWord = '爬虫'
search_request.Commonrequest.CurrentPage = 2
search_request.Commonrequest.PageSize = 20
search_request.Commonrequest.SearchFilterList.append(0)
form_data = search_request.SerializeToString()
# with open('me1.bin', mode="wb") as f:
#     f.write(form_data)
# print(SearchRequest.SerializeToString().decode())
# gRPC-Web prefix: 1 flag byte (0 = uncompressed) + 4-byte big-endian length
bytes_head = b'\x00' + len(form_data).to_bytes(4, 'big')
# print(bytes_head+form_data)

headers = {
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.9,zh-TW;q=0.8",
    "Content-Type": "application/grpc-web+proto",
    "Origin": "https://**********",
    "Referer": "https://**********/paper",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
}
url = "https://**********/SearchService.SearchService/search"
response=requests.post(url,headers=headers,data=bytes_head+form_data)
# print(response.content)
# deserialize_data, message_type = blackboxprotobuf.decode_message(response.content[5:])
# data, message_type = blackboxprotobuf.decode_message(response.content[5:])
# print(data)
# print(message_type)
search_response = pb.SearchService_SearchResponse()
search_response.ParseFromString(response.content[5:])
print(search_response)
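The 5-byte header sent in front of form_data follows the gRPC-Web message framing (the request uses Content-Type application/grpc-web+proto): one flag byte followed by a 4-byte big-endian payload length. A header that stores the length in a single byte only works while the serialized request stays under 256 bytes. The helper below is a small sketch of the general form; the name frame_grpc_web is my own.

```python
def frame_grpc_web(payload: bytes, compressed: bool = False) -> bytes:
    """Prefix a serialized message with the 5-byte gRPC-Web frame header.

    Byte 0 is the flag byte (0 = uncompressed data frame, 1 = compressed),
    bytes 1-4 are the payload length as a big-endian unsigned integer.
    """
    flag = b"\x01" if compressed else b"\x00"
    return flag + len(payload).to_bytes(4, "big") + payload


short = frame_grpc_web(b"abc")
long_ = frame_grpc_web(b"x" * 300)
print(short[:5].hex(" "))   # header for a 3-byte payload
print(long_[:5].hex(" "))   # header for a 300-byte payload
```

With this helper the request body would be frame_grpc_web(form_data), which stays correct when the search word or filter list grows.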

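III. Note:

This code still has a small problem: what is the real length of the message inside response.content? We start at the 6th byte, but how many bytes should we take? So far it only works because taking everything to the end happened to be correct. Go back to the Fiddler capture: the body begins with five bytes, 00 00 00 93 E7 in hex. Convert them on the website:

[Screenshot: the hex value converted on the website]

The result (0x93E7 = 37863) is exactly the length we saw at the breakpoint:

[Screenshot: the length seen at the breakpoint]

So the 5-byte prefix tells us the length of the response message we need. In gRPC-Web framing the first byte is a flag byte (00 here) and the following four bytes are the big-endian length of the message.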

Fix:

# Byte 0 is the flag byte; bytes 1-4 hold the big-endian message length
data_len=int.from_bytes(response.content[1:5], 'big') # bytes to int
search_response = pb.SearchService_SearchResponse()
search_response.ParseFromString(response.content[5:5+data_len])
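A gRPC-Web response body can contain more than one frame: the data frame is typically followed by a trailers frame whose flag byte has the high bit (0x80) set. The sketch below generalizes the slicing above into a small frame splitter. The function names are my own, and the sample body is hand-built for illustration rather than captured from the site.

```python
from typing import List, Tuple


def split_grpc_web_frames(body: bytes) -> List[Tuple[int, bytes]]:
    """Split a gRPC-Web body into (flag, payload) frames.

    Each frame is 1 flag byte + 4-byte big-endian length + payload.
    Fewer than 5 leftover bytes at the end are ignored.
    """
    frames = []
    pos = 0
    while pos + 5 <= len(body):
        flag = body[pos]
        length = int.from_bytes(body[pos + 1:pos + 5], "big")
        start, end = pos + 5, pos + 5 + length
        if end > len(body):
            raise ValueError("truncated frame")
        frames.append((flag, body[start:end]))
        pos = end
    return frames


def first_message(body: bytes) -> bytes:
    """Return the payload of the first data frame (high flag bit clear)."""
    for flag, payload in split_grpc_web_frames(body):
        if not flag & 0x80:
            return payload
    raise ValueError("no data frame found")


# Hand-built example: one data frame followed by a trailers frame.
trailers = b"grpc-status: 0\r\n"
body = (b"\x00" + (3).to_bytes(4, "big") + b"abc"
        + b"\x80" + len(trailers).to_bytes(4, "big") + trailers)
print(split_grpc_web_frames(body))
```

In the script above, search_response.ParseFromString(first_message(response.content)) would then parse only the message bytes, regardless of what follows them.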