Scraping a large article site with a Python crawler (including its JS anti-scraping)
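Note: the site has since updated its x-zse-96 algorithm, so the method described in this article no longer works. Please be aware! If you need the latest algorithm, you can contact me by private message.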
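A recent work task required me to write a crawler for a large article site. This post records the problems I ran into while scraping it and how I solved them.
As everyone knows, the site requires you to log in.
After logging in, search for a keyword. Here I search for 奥运会 (Olympic Games).
For each result I want the title and the first user answer (the content shown after clicking "read full text").
And what if I want many results, say 100?
First look at how the site displays content. As you scroll down with the mouse wheel, the content is loaded bit by bit, so it is clearly loaded via ajax.
Knowing how the content is loaded, open the developer tools and find the source of the ajax request: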
The ajax request URL turns out to be: https://www.zhihu.com/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&correction=1&offset=27&limit=20&filter_fields=&lc_idx=27&show_all_topics=0&search_hash_id=22259bd69d7464f7596baea3a0a88cb3&vertical_info=0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C1
Next, look at the JSON returned in the response.
With the content located, the next step is to write the crawler that fetches this response.
First, break down the main parts of the URL:
https://www.zhihu.com/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&correction=1&offset=27&limit=20&filter_fields=&lc_idx=27&show_all_topics=0&search_hash_id=22259bd69d7464f7596baea3a0a88cb3&vertical_info=0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C1
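q: the search keyword (奥运会, URL-encoded)
offset: the index of the first item to load
limit: the number of items requested per load
These are the main fields. The remaining fields do not currently affect the results, so just copy and paste them when scraping.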
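The field breakdown above can be sketched in Python as follows. This is only an illustration: the function names `build_search_path` and `page_offsets` are made up here, the default `vertical_info` is simply copied from the captured request, setting `lc_idx` equal to `offset` is inferred from the captured URLs, and starting pagination at offset 0 is an assumption.

```python
from urllib.parse import quote

HOST = "https://www.zhihu.com"


def build_search_path(keyword, offset, limit=20, search_hash_id="",
                      vertical_info="0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C1"):
    """Rebuild the path and query of the search request (hypothetical helper).

    Only q, offset and limit are meant to vary; the other fields are
    pasted from a request captured in the browser.
    """
    return (
        "/api/v4/search_v3?t=general"
        f"&q={quote(keyword)}"          # percent-encode the UTF-8 keyword
        "&correction=1"
        f"&offset={offset}&limit={limit}"
        "&filter_fields="
        f"&lc_idx={offset}"             # equals offset in the captured URLs
        "&show_all_topics=0"
        f"&search_hash_id={search_hash_id}"
        f"&vertical_info={vertical_info}"
    )


def page_offsets(total, limit=20):
    """Offsets needed to cover `total` items (assumes paging starts at 0)."""
    return list(range(0, total, limit))


if __name__ == "__main__":
    for off in page_offsets(100):
        print(HOST + build_search_path("奥运会", off))
```

Fetching 100 items with `limit=20` would then mean five requests, one per offset.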
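Be sure to send the cookie with every request.
Once the code was written and I started scraping, I found that after changing the keyword q, or offset, or limit, no content came back.
This is the site's anti-scraping measure: the x-zse-96 field sent in the request headers is a signature computed from the request.
Use Ctrl+F to search for where x-zse-96 is set:
It is in this JS file. Click into it:
You can see that x-zse-96 is set as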
"2.0_" + j
j == A.signature
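So what is A.signature?
Continue with Ctrl+F and search for signature.
Here you can see that signature is the result of a series of operations applied to the variable d.
Now we need to know what d is. Set a breakpoint to inspect it. (After setting the breakpoint, scroll down so that the page fires another ajax request.)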
The value of d is: 101_3_2.0+/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&correction=1&offset=47&limit=20&filter_fields=&lc_idx=47&show_all_topics=0&search_hash_id=fd42386d5e76e4ec8fd3d6ea8461359d&vertical_info=0%2C1%2C1%2C0%2C0%2C0%2C0%2C0%2C0%2C1+"AADcrvhTRxGPTuT63H8IufBXUPD5a719Uuo=|1589624518"+3_2.0ae3TnRUTEvOOUCNMTQnTSHUZo02p-HNMZBO8YD_02XtucXYqK6P0E79y-LS9-hp1DufI-we8gGHPgJO1xuPZ0GxCTJHR7820XM20cLRGDJXfgGCBxupMuD_Ie8FL7AtqM6O1VDQyQ6nxrRPCHukMoCXBEgOsiRP0XL2ZUBXmDDV9qhnyTXFMnXcTF_ntRueThR3YzqgLQCN8fDO0GAe08coCi9Nxu9Y95GL8ZbX9JcOsnDc_uwxKeHC_OGLPv_t1EDHKJhgq2GY8_bNC5hxBoJoCkXOYShOCBvLOycc9jUopTbH_bUCBkDwGqB2p-h3msre9IgN0ZCVGNUFOpqoY6JCmZu3mf9XyGvwBsCLLbJx1YvNC4qfzeTc_IbOm6BtqGAefSrVCg9OYfCg0UUNG_DrMzcXBo9eqfC2_yUpCOCOp-9VOZC2ms9LGzvX8tDofpePB6RNm2wxKrcOB19LC0hrGS_xVQqCK1XeL6AUC
d is a string made of the following four fields joined with +:
101_3_2.0
/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&correction=1&offset=47&limit=20&filter_fields=&lc_idx=47&show_all_topics=0&search_hash_id=fd42386d5e76e4ec8fd3d6ea8461359d&vertical_info=0%2C1%2C1%2C0%2C0%2C0%2C0%2C0%2C0%2C1
"AADcrvhTRxGPTuT63H8IufBXUPD5a719Uuo=|1589624518"
3_2.0ae3TnRUTEvOOUCNMTQnTSHUZo02p-HNMZBO8YD_02XtucXYqK6P0E79y-LS9-hp1DufI-we8gGHPgJO1xuPZ0GxCTJHR7820XM20cLRGDJXfgGCBxupMuD_Ie8FL7AtqM6O1VDQyQ6nxrRPCHukMoCXBEgOsiRP0XL2ZUBXmDDV9qhnyTXFMnXcTF_ntRueThR3YzqgLQCN8fDO0GAe08coCi9Nxu9Y95GL8ZbX9JcOsnDc_uwxKeHC_OGLPv_t1EDHKJhgq2GY8_bNC5hxBoJoCkXOYShOCBvLOycc9jUopTbH_bUCBkDwGqB2p-h3msre9IgN0ZCVGNUFOpqoY6JCmZu3mf9XyGvwBsCLLbJx1YvNC4qfzeTc_IbOm6BtqGAefSrVCg9OYfCg0UUNG_DrMzcXBo9eqfC2_yUpCOCOp-9VOZC2ms9LGzvX8tDofpePB6RNm2wxKrcOB19LC0hrGS_xVQqCK1XeL6AUC
From this it is clear that:
101_3_2.0 is a fixed string
/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&correction=1&offset=47&limit=20&filter_fields=&lc_idx=47&show_all_topics=0&search_hash_id=fd42386d5e76e4ec8fd3d6ea8461359d&vertical_info=0%2C1%2C1%2C0%2C0%2C0%2C0%2C0%2C0%2C1 is the path and query string of the URL being requested
"AADcrvhTRxGPTuT63H8IufBXUPD5a719Uuo=|1589624518" is the d_c0 field in the cookie
3_2.0ae3TnRUTEvOOUCNMTQnTSHUZo02p-HNMZBO8YD_02XtucXYqK6P0E79y-LS9-hp1DufI-we8gGHPgJO1xuPZ0GxCTJHR7820XM20cLRGDJXfgGCBxupMuD_Ie8FL7AtqM6O1VDQyQ6nxrRPCHukMoCXBEgOsiRP0XL2ZUBXmDDV9qhnyTXFMnXcTF_ntRueThL39TcV9yULfYUOfuDVYb0pfuqtM6TxOdut1hgLLbh2XL9H1s6xfIrN_thOPv_t1EDHKJhgq2GY8_bNC5hxBoJoCkXOYShOCBvLOycc9jUopTbH_bUCBkDwGqB2p-h3msre9IgN0ZCVGNUFOpqoY6JCmZu3mf9XyGvwBsCLLbJx1YvNC4qfzeTc_IbOm6BtqGAefSrVCg9OYfCg0UUNG_DrMzcXBo9eqfC2_yUpCOCOp-9VOZC2ms9LGzvX8tDofpePB6RNm2wxKrcOB19LC0hrGS_xVQqCK1XeL6AUC is the x-zst-81 field in the request headers
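As a rough sketch, the four fields can be joined in Python like this. The helper name `build_sign_source` is invented for illustration, and the `x_zst_81` value below is a shortened placeholder rather than a real token; the d_c0 value is passed with its surrounding double quotes, as it appears in the captured d.

```python
def build_sign_source(path, d_c0, x_zst_81, prefix="101_3_2.0"):
    """Join the four fields with '+' to form the string d (hypothetical helper).

    path      -- path plus query string of the request, without the host
    d_c0      -- the d_c0 cookie value, including its double quotes
    x_zst_81  -- the x-zst-81 request header value
    prefix    -- the fixed string observed at the breakpoint
    """
    return "+".join([prefix, path, d_c0, x_zst_81])


if __name__ == "__main__":
    d = build_sign_source(
        "/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&offset=47&limit=20",
        '"AADcrvhTRxGPTuT63H8IufBXUPD5a719Uuo=|1589624518"',
        "3_2.0PLACEHOLDER",  # placeholder, not a real x-zst-81 value
    )
    print(d)
```

Because the URL is part of d, changing q, offset or limit changes d, which would explain why a signature copied from the browser stops working once those parameters are edited.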
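Now that we know how d is built, look at what r.default(d) returns:
"599cad31228e54b728bec8c6e72ff82f"
A 32-character hexadecimal string immediately suggests an MD5 digest, so I hashed d with MD5 to compare the result.
The online MD5 tool I used: https://md5jiami.bmcx.com/
The result matched what I expected.
At this point we know what d is and how it is hashed (MD5).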
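The same check can be done locally instead of through an online tool. The sketch below uses Python's standard `hashlib`; `md5_hex` is just an illustrative name. To repeat the comparison, pass in the d string captured at the breakpoint and compare the output with the value of r.default(d) seen in the debugger.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a string as 32 lowercase hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = md5_hex("hello")
    print(digest)       # 5d41402abc4b2a76b9719d911017c592
    print(len(digest))  # 32
```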
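The only piece still unknown is A.signature. It is produced by Zhihu's own encryption routine, __g._encrypt, applied to the MD5 digest (I will not explain that routine in detail here).
Here is the Python code: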
import execjs
import hashlib
# d: fixed prefix + request path and query + d_c0 cookie + x-zst-81 header, joined with '+'
str_to_md5 = '101_3_2.0+/api/v4/search_v3?t=general&q=%E5%A5%A5%E8%BF%90%E4%BC%9A&correction=1&offset=29&limit=20&filter_fields=&lc_idx=29&show_all_topics=0&search_hash_id=fd7606ac4bf1ac1c7e0252021af559a4&vertical_info=0%2C1%2C1%2C0%2C0%2C0%2C0%2C0%2C0%2C1+"AADcrvhTRxGPTuT63H8IufBXUPD5a719Uuo=|1589624518"+3_2.0ae3TnRUTEvOOUCNMTQnTSHUZo02p-HNMZBO8YD_02XtucXYqK6P0E79y-LS9-hp1DufI-we8gGHPgJO1xuPZ0GxCTJHR7820XM20cLRGDJXfgGCBxupMuD_Ie8FL7AtqM6O1VDQyQ6nxrRPCHukMoCXBEgOsiRP0XL2ZUBXmDDV9qhnyTXFMnXcTF_ntRueThLxmNvOy67L1VCxmnDXB_bL8VDLpA92f6BpfUvcY8DCfbweMqBLC3bo8tgCPv_t1EDHKJhgq2GY8_bNC5hxBoJoCkXOYShOCBvLOycc9jUopTbH_bUCBkDwGqB2p-h3msre9IgN0ZCVGNUFOpqoY6JCmZu3mf9XyGvwBsCLLbJx1YvNC4qfzeTc_IbOm6BtqGAefSrVCg9OYfCg0UUNG_DrMzcXBo9eqfC2_yUpCOCOp-9VOZC2ms9LGzvX8tDofpePB6RNm2wxKrcOB19LC0hrGS_xVQqCK1XeL6AUC'
# MD5 hex digest of d
fmd5 = hashlib.md5(str_to_md5.encode(encoding='UTF-8')).hexdigest()
# A.signature: pass the digest through __g._encrypt (exposed as function b in g_encrypt.js)
with open('g_encrypt.js', 'r', encoding='utf-8') as f:
    ctx1 = execjs.compile(f.read())
encrypt_str = ctx1.call('b', fmd5)
print(encrypt_str)  # aXYyNg98gwxxb0O80BtBNcrqrLtYNwSqMLF0nJ9qnG2p
The output is: aXYyNg98gwxxb0O80BtBNcrqrLtYNwSqMLF0nJ9qnG2p
This matches the x-zse-96 value in the request headers.
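To use the signature, it has to be sent back as request headers. Below is a minimal sketch of assembling them. The function name `build_headers` and the default user-agent are assumptions. The "2.0_" prefix follows the JS seen earlier ("2.0_" + j), so it is added only when the signature does not already carry it. Whether the server expects further headers is not covered here.

```python
def build_headers(signature, x_zst_81, cookie, user_agent="Mozilla/5.0"):
    """Assemble request headers for the signed search request (hypothetical helper).

    signature -- output of the __g._encrypt step
    x_zst_81  -- the same x-zst-81 value that was used when building d
    cookie    -- full cookie string from a logged-in browser session
    """
    # The JS sets x-zse-96 to "2.0_" + signature; avoid doubling the prefix.
    if signature.startswith("2.0_"):
        zse_96 = signature
    else:
        zse_96 = "2.0_" + signature
    return {
        "x-zse-96": zse_96,
        "x-zst-81": x_zst_81,
        "cookie": cookie,
        "user-agent": user_agent,  # assumed; copy the browser's value if needed
    }


if __name__ == "__main__":
    headers = build_headers("SIGNATURE", "3_2.0PLACEHOLDER", "d_c0=PLACEHOLDER")
    print(headers["x-zse-96"])
```

The resulting dict can then be passed to an HTTP client, for example as the `headers` argument of `requests.get`, together with the same URL that was used to build d.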
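The JS file below depends on jsdom, so running it from Python (through execjs, which is provided by the PyExecJS package) requires Node.js with jsdom installed.
The rough steps are: ① download and install Node.js from the official site, ② run npm install jsdom, ③ check whether the node_modules folder contains a jsdom folder. If it does, the installation succeeded. Copy that path and use it in your code (the path can also be omitted).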
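Step ③ can also be checked from Python before calling execjs. The sketch below uses only the standard library; `node_on_path` and `jsdom_installed` are illustrative names, and the `node_modules` location is whatever directory you ran npm install in.

```python
import os
import shutil


def node_on_path():
    """Return the full path of the node executable, or None if it is not on PATH."""
    return shutil.which("node")


def jsdom_installed(node_modules_dir):
    """Return True if node_modules_dir contains a jsdom folder (step 3)."""
    return os.path.isdir(os.path.join(node_modules_dir, "jsdom"))


if __name__ == "__main__":
    print("node:", node_on_path())
    print("jsdom installed:", jsdom_installed("node_modules"))
```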
Here is the JS code shared by a more experienced developer, 一只不会爬的虫子:
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
window = dom.window;
document = window.document;
XMLHttpRequest = window.XMLHttpRequest;
var exports = {}
function t(e) {
return (t = "function" == typeof Symbol && "symbol" == typeof Symbol.A ? function(e) {
return typeof e
}
: function(e) {
return e && "function" == typeof Symbol && e.constructor === Symbol && e !== Symbol.prototype ? "symbol" : typeof e
}
)(e)
}
Object.defineProperty(exports, "__esModule", {
value: !0
});
var A = "2.0"
, __g = {};
function s() {}
function i(e) {
this.t = (2048 & e) >> 11,
this.s = (1536 & e) >> 9,
this.i = 511 & e,
this.h = 511 & e
}
function h(e) {
this.s = (3072 & e) >> 10,
this.h = 1023 & e
}
function a(e) {
this.a = (3072 & e) >> 10,
this.c = (768 & e) >> 8,
this.n = (192 & e) >> 6,
this.t = 63 & e
}
function c(e) {
this.s = e >> 10 & 3,
this.i = 1023 & e
}
function n() {}
function e(e) {
this.a = (3072 & e) >> 10,
this.c = (768 & e) >> 8,
this.n = (192 & e) >> 6,
this.t = 63 & e
}
function o(e) {
this.h = (4095 & e) >> 2,
this.t = 3 & e
}
function r(e) {
this.s = e >> 10 & 3,
this.i = e >> 2 & 255,
this.t = 3 & e
}
s.prototype.e = function(e) {
e.o = !1
}
,
i.prototype.e = function(e) {
switch (this.t) {
case 0:
e.r[this.s] = this.i;
break;
case 1:
e.r[this.s] = e.k[this.h]
}
}
,
h.prototype.e = function(e) {
e.k[this.h] = e.r[this.s]
}
,
a.prototype.e = function(e) {
switch (this.t) {
case 0:
e.r[this.a] = e.r[this.c] + e.r[this.n];
break;
case 1:
e.r[this.a] = e.r[this.c] - e.r[this.n];
break;
case 2:
e.r[this.a] = e.r[this.c] * e.r[this.n];
break;
case 3:
e.r[this.a] = e.r[this.c] / e.r[this.n];
break;
case 4:
e.r[this.a] = e.r[this.c] % e.r[this.n];
break;
case 5:
e.r[this.a] = e.r[this.c] == e.r[this.n];
break;
case 6:
e.r[this.a] = e.r[this.c] >= e.r[this.n];
break;
case 7:
e.r[this.a] = e.r[this.c] || e.r[this.n];
break;
case 8:
e.r[this.a] = e.r[this.c] && e.r[this.n];
break;
case 9:
e.r[this.a] = e.r[this.c] !== e.r[this.n];
break;
case 10:
e.r[this.a] = t(e.r[this.c]);
break;
case 11:
e.r[this.a] = e.r[this.c]in e.r[this.n];
break;
case 12:
e.r[this.a] = e.r[this.c] > e.r[this.n];
break;
case 13:
e.r[this.a] = -e.r[this.c];
break;
case 14:
e.r[this.a] = e.r[this.c] < e.r[this.n];
break;
case 15:
e.r[this.a] = e.r[this.c] & e.r[this.n];
break;
case 16:
e.r[this.a] = e.r[this.c] ^ e.r[this.n];
break;
case 17:
e.r[this.a] = e.r[this.c] << e.r[this.n];
break;
case 18:
e.r[this.a] = e.r[this.c] >>> e.r[this.n];
break;
case 19:
e.r[this.a] = e.r[this.c] | e.r[this.n];
break;
case 20:
e.r[this.a] = !e.r[this.c]
}
}
,
c.prototype.e = function(e) {
e.Q.push(e.C),
e.B.push(e.k),
e.C = e.r[this.s],
e.k = [];
for (var t = 0; t < this.i; t++)
e.k.unshift(e.f.pop());
e.g.push(e.f),
e.f = []
}
,
n.prototype.e = function(e) {
e.C = e.Q.pop(),
e.k = e.B.pop(),
e.f = e.g.pop()
}
,
e.prototype.e = function(e) {
switch (this.t) {
case 0:
e.u = e.r[this.a] >= e.r[this.c];
break;
case 1:
e.u = e.r[this.a] <= e.r[this.c];
break;
case 2:
e.u = e.r[this.a] > e.r[this.c];
break;
case 3:
e.u = e.r[this.a] < e.r[this.c];
break;
case 4:
e.u = e.r[this.a] == e.r[this.c];
break;
case 5:
e.u = e.r[this.a] != e.r[this.c];
break;
case 6:
e.u = e.r[this.a];
break;
case 7:
e.u = !e.r[this.a]
}
}
,
o.prototype.e = function(e) {
switch (this.t) {
case 0:
e.C = this.h;
break;
case 1:
e.u && (e.C = this.h);
break;
case 2:
e.u || (e.C = this.h);
break;
case 3:
e.C = this.h,
e.w = null
}
e.u = !1
}
,
r.prototype.e = function(e) {
switch (this.t) {
case 0:
for (var t = [], n = 0; n < this.i; n++)
t.unshift(e.f.pop());
e.r[3] = e.r[this.s](t[0], t[1]);
break;
case 1:
for (var r = e.f.pop(), o = [], i = 0; i < this.i; i++)
o.unshift(e.f.pop());
e.r[3] = e.r[this.s][r](o[0], o[1]);
break;
case 2:
for (var a = [], c = 0; c < this.i; c++)
a.unshift(e.f.pop());
e.r[3] = new e.r[this.s](a[0],a[1])
}
}
;
var k = function(e) {
for (var t = 66, n = [], r = 0; r < e.length; r++) {
var o = 24 ^ e.charCodeAt(r) ^ t;
n.push(String.fromCharCode(o)),
t = o
}
return n.join("")
};
function Q(e) {
this.t = (4095 & e) >> 10,
this.s = (1023 & e) >> 8,
this.i = 1023 & e,
this.h = 63 & e
}
function C(e) {
this.t = (4095 & e) >> 10,
this.a = (1023 & e) >> 8,
this.c = (255 & e) >> 6
}
function B(e) {
this.s = (3072 & e) >> 10,
this.h = 1023 & e
}
function f(e) {
this.h = 4095 & e
}
function g(e) {
this.s = (3072 & e) >> 10
}
function u(e) {
this.h = 4095 & e
}
function w(e) {
this.t = (3840 & e) >> 8,
this.s = (192 & e) >> 6,
this.i = 63 & e
}
function G() {
this.r = [0, 0, 0, 0],
this.C = 0,
this.Q = [],
this.k = [],
this.B = [],
this.f = [],
this.g = [],
this.u = !1,
this.G = [],
this.b = [],
this.o = !1,
this.w = null,
this.U = null,
this.F = [],
this.R = 0,
this.J = {
0: s,
1: i,
2: h,
3: a,
4: c,
5: n,
6: e,
7: o,
8: r,
9: Q,
10: C,
11: B,
12: f,
13: g,
14: u,
15: w
}
}
Q.prototype.e = function(e) {
switch (this.t) {
case 0:
e.f.push(e.r[this.s]);
break;
case 1:
e.f.push(this.i);
break;
case 2:
e.f.push(e.k[this.h]);
break;
case 3:
e.f.push(k(e.b[this.h]))
}
}
,
C.prototype.e = function(A) {
switch (this.t) {
case 0:
var t = A.f.pop();
A.r[this.a] = A.r[this.c][t];
break;
case 1:
var s = A.f.pop()
, i = A.f.pop();
A.r[this.c][s] = i;
break;
case 2:
var h = A.f.pop();
A.r[this.a] = eval(h)
}
}
,
B.prototype.e = function(e) {
e.r[this.s] = k(e.b[this.h])
}
,
f.prototype.e = function(e) {
e.w = this.h
}
,
g.prototype.e = function(e) {
throw e.r[this.s]
}
,
u.prototype.e = function(e) {
var t = this
, n = [0];
e.k.forEach(function(e) {
n.push(e)
});
var r = function(r) {
var o = new G;
return o.k = n,
o.k[0] = r,
o.v(e.G, t.h, e.b, e.F),
o.r[3]
};
r.toString = function() {
return "() { [native code] }"
}
,
e.r[3] = r
}
,
w.prototype.e = function(e) {
switch (this.t) {
case 0:
for (var t = {}, n = 0; n < this.i; n++) {
var r = e.f.pop();
t[e.f.pop()] = r
}
e.r[this.s] = t;
break;
case 1:
for (var o = [], i = 0; i < this.i; i++)
o.unshift(e.f.pop());
e.r[this.s] = o
}
}
,
G.prototype.D = function(e) {
console.log(window.atob(e));
for (var t = window.atob(e), n = t.charCodeAt(0) << 8 | t.charCodeAt(1), r = [], o = 2; o < n + 2; o += 2)
r.push(t.charCodeAt(o) << 8 | t.charCodeAt(o + 1));
this.G = r;
for (var i = [], a = n + 2; a < t.length; ) {
var c = t.charCodeAt(a) << 8 | t.charCodeAt(a + 1)
, s = t.slice(a + 2, a + 2 + c);
i.push(s),
a += c + 2
}
this.b = i
}
,
G.prototype.v = function(e, t, n) {
for (t = t || 0,
n = n || [],
this.C = t,
"string" == typeof e ? this.D(e) : (this.G = e,
this.b = n),
this.o = !0,
this.R = Date.now(); this.o; ) {
var r = this.G[this.C++];
if ("number" != typeof r)
break;
var o = Date.now();
if (500 < o - this.R)
return;
this.R = o;
try {
this.e(r)
} catch (e) {
this.U = e,
this.w && (this.C = this.w)
}
}
}
,
G.prototype.e = function(e) {
var t = (61440 & e) >> 12;
new this.J[t](e).e(this)
}
,
(new G).v("AxjgB5MAnACoAJwBpAAAABAAIAKcAqgAMAq0AzRJZAZwUpwCqACQACACGAKcBKAAIAOcBagAIAQYAjAUGgKcBqFAuAc5hTSHZAZwqrAIGgA0QJEAJAAYAzAUGgOcCaFANRQ0R2QGcOKwChoANECRACQAsAuQABgDnAmgAJwMgAGcDYwFEAAzBmAGcSqwDhoANECRACQAGAKcD6AAGgKcEKFANEcYApwRoAAxB2AGcXKwEhoANECRACQAGAKcE6AAGgKcFKFANEdkBnGqsBUaADRAkQAkABgCnBagAGAGcdKwFxoANECRACQAGAKcGKAAYAZx+rAZGgA0QJEAJAAYA5waoABgBnIisBsaADRAkQAkABgCnBygABoCnB2hQDRHZAZyWrAeGgA0QJEAJAAYBJwfoAAwFGAGcoawIBoANECRACQAGAOQALAJkAAYBJwfgAlsBnK+sCEaADRAkQAkABgDkACwGpAAGAScH4AJbAZy9rAiGgA0QJEAJACwI5AAGAScH6AAkACcJKgAnCWgAJwmoACcJ4AFnA2MBRAAMw5gBnNasCgaADRAkQAkABgBEio0R5EAJAGwKSAFGACcKqAAEgM0RCQGGAYSATRFZAZzshgAtCs0QCQAGAYSAjRFZAZz1hgAtCw0QCQAEAAgB7AtIAgYAJwqoAASATRBJAkYCRIANEZkBnYqEAgaBxQBOYAoBxQEOYQ0giQKGAmQABgAnC6ABRgBGgo0UhD/MQ8zECALEAgaBxQBOYAoBxQEOYQ0gpEAJAoYARoKNFIQ/zEPkAAgChgLGgkUATmBkgAaAJwuhAUaCjdQFAg5kTSTJAsQCBoHFAE5gCgHFAQ5hDSCkQAkChgBGgo0UhD/MQ+QACAKGAsaCRQCOYGSABoAnC6EBRoKN1AUEDmRNJMkCxgFGgsUPzmPkgAaCJwvhAU0wCQFGAUaCxQGOZISPzZPkQAaCJwvhAU0wCQFGAUaCxQMOZISPzZPkQAaCJwvhAU0wCQFGAUaCxQSOZISPzZPkQAaCJwvhAU0wCQFGAkSAzRBJAlz/B4FUAAAAwUYIAAIBSITFQkTERwABi0GHxITAAAJLwMSGRsXHxMZAAk0Fw8HFh4NAwUABhU1EBceDwAENBcUEAAGNBkTGRcBAAFKAAkvHg4PKz4aEwIAAUsACDIVHB0QEQ4YAAsuAzs7AAoPKToKDgAHMx8SGQUvMQABSAALORoVGCQgERcCAxoACAU3ABEXAgMaAAsFGDcAERcCAxoUCgABSQAGOA8LGBsPAAYYLwsYGw8AAU4ABD8QHAUAAU8ABSkbCQ4BAAFMAAktCh8eDgMHCw8AAU0ADT4TGjQsGQMaFA0FHhkAFz4TGjQsGQMaFA0FHhk1NBkCHgUbGBEPAAFCABg9GgkjIAEmOgUHDQ8eFSU5DggJAwEcAwUAAUMAAUAAAUEADQEtFw0FBwtdWxQTGSAACBwrAxUPBR4ZAAkqGgUDAwMVEQ0ACC4DJD8eAx8RAAQ5GhUYAAFGAAAABjYRExELBAACWhgAAVoAQAg/PTw0NxcQPCQ5C3JZEBs9fkcnDRcUAXZia0Q4EhQgXHojMBY3MWVCNT0uDhMXcGQ7AUFPHigkQUwQFkhaAkEACjkTEQspNBMZPC0ABjkTEQsrLQ==");
function b(e) {
console.log(e);
console.log(encodeURIComponent(e));
return __g._encrypt(encodeURIComponent(e))
};
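Once everything checks out, you can happily run the crawler!
Note: the site keeps updating its anti-scraping measures, so be prepared to adjust your approach accordingly.
To summarize, the overall flow for scraping the site is:
1. Inspect the ajax requests to find out how the content is loaded.
2. Write a crawler that fetches the content directly, then compare its requests (cookie and headers) with the browser's to identify the field Zhihu uses for anti-scraping.
3. Use Ctrl+F to search globally for that field and locate where it is set.
4. Set a breakpoint, trigger another ajax load, and work out how x-zse-96 is generated (how the string d is built >> d is hashed with MD5 >> execjs runs __g._encrypt to produce A.signature).
If you would like the source code, feel free to leave a comment.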