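Here is a summary of the problems I ran into. I was scraping a Toutiao user's articles, and the request is encrypted. I read some related articles and figured it out.
The road to success was not all smooth sailing, though.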
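1. window.byted_acrawler.sign({url:url}) takes a URL. When I set a breakpoint, I grabbed the first URL I saw and passed it in, and it did not work: the _signature it generated was unusable.
Solution: I was too impatient. What I want to scrape is the Articles tab, but I hit the breakpoint right after refreshing the page, before that tab had loaded, and passed in whatever URL was there. I should have kept resuming until the breakpoint fired for the request that loads the articles, and passed in the URL from that moment.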
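As a rough sketch of point 1: in the script further down, the URL handed to the signing function is the article-feed request without the _signature parameter, and the signature is appended afterwards. The helper names below (unsigned_feed_url, signed_feed_url) are hypothetical; the query parameters are copied from that script, and the signature value is whatever the signing code returns.

```python
from urllib.parse import urlencode

FEED_API = "https://www.toutiao.com/api/pc/list/user/feed"


def unsigned_feed_url(token, max_behot_time):
    """Build the article-feed URL that is passed to sign() (no _signature yet)."""
    params = {
        "category": "pc_profile_article",  # category the script uses for the Articles tab
        "token": token,
        "max_behot_time": max_behot_time,
        "aid": 24,
        "app_name": "toutiao_web",
    }
    # urlencode keeps the dict's insertion order, so the parameter order is stable.
    return FEED_API + "?" + urlencode(params)


def signed_feed_url(token, max_behot_time, signature):
    """Append the signature that was computed for exactly this unsigned URL."""
    return unsigned_feed_url(token, max_behot_time) + "&_signature=" + signature
```

Building both URLs from one function makes it harder to sign one URL and then request a different one, which is the mistake described above.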
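2. Patching the environment. There are many environment patches online, but I did not know whether any of them were universal, so I patched mine according to the console output.
The console output contains document, location, and navigator; I copied the values from there.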
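One possible way to keep such a patch in sync with the Python side is to generate the stub from the same values the requests use. This is only an illustration under assumptions, not what the script below does: build_env_prelude is a hypothetical helper, it covers just a few of the properties, and whether the signing code actually compares navigator.userAgent with the request header is not verified here.

```python
import json


def build_env_prelude(user_agent, page_url, referrer):
    """Return JavaScript source defining a minimal window/document/location/navigator."""
    document = {"referrer": referrer}
    location = {"href": page_url, "protocol": "https:", "host": "www.toutiao.com"}
    navigator = {"userAgent": user_agent, "platform": "Win32", "language": "zh-CN"}
    lines = [
        "window = global;",
        # json.dumps output is used here as a JavaScript object literal.
        "window.document = " + json.dumps(document) + ";",
        "window.location = " + json.dumps(location) + ";",
        "window.navigator = " + json.dumps(navigator) + ";",
    ]
    return "\n".join(lines) + "\n"
```

The returned string would be prepended to the signing script's source before it is compiled, in place of a hand-copied block.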
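3. Detail pages. The list JSON I got back contains URLs, but they cannot be used as they are. I was careless here.
Do not use the url or article_url fields from it directly: if you request either one as the detail page, you get nothing back, because the host has no www prefix. I recommend requesting https://www.toutiao.com/a......./ as the detail-page link instead; that returns the content.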
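The rewrite in point 3 can be sketched as a small function. It assumes the list JSON gives URLs of the form http://toutiao.com/group/<id>/, which is inferred from the replace calls in the script below rather than documented anywhere; to_detail_url is a hypothetical name.

```python
import re


def to_detail_url(list_url):
    """Turn a list URL containing /group/<id> into the www.toutiao.com/a<id>/ detail URL."""
    match = re.search(r"/group/(\d+)", list_url)
    if match is None:
        # Fail loudly instead of requesting a page that will come back empty.
        raise ValueError("no article id found in " + list_url)
    return "https://www.toutiao.com/a" + match.group(1) + "/"
```

Extracting the id and rebuilding the URL is a little stricter than chained string replaces: an unexpected URL shape raises an error instead of silently producing a link that returns nothing.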
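4. The article list is sometimes empty. Scraping Toutiao was very slow for me, and it had nothing to do with my connection speed. The list often came back with no data, so I added a check: if the data is empty, call the function again.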
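Calling the function again with no limit can recurse forever if the data never arrives. A bounded variant of the same idea might look like the sketch below; fetch_until_nonempty and its parameters are hypothetical, and fetch stands for any zero-argument callable that returns the list data.

```python
import time


def fetch_until_nonempty(fetch, max_attempts=5, delay=1.0):
    """Call fetch() until it returns a non-empty result or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        data = fetch()
        if data:
            return data
        if attempt < max_attempts:
            # Wait a little longer after each empty response.
            time.sleep(delay * attempt)
    return []
```

A caller would wrap the list request in a function or lambda and treat an empty return value as "give up on this page".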
# -*- coding: utf-8 -*-
import json
import random
import time

import execjs
import requests
from lxml import etree
headers = {
"accept": "application/json, text/plain, */*",
"accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9",
"cookie": "_S_IPAD=0; WIN_WH=375_812; FRM=new; PIXIEL_RATIO=3; _ga=GA1.2.116917834.1620437294; csrftoken=7f9d24dc3abc2aeab049c30b8d86ab5f; tt_webid=7023928361337013791; MONITOR_DEVICE_ID=f9c022a2-96f7-402b-a659-887a67bf9de2; tt_webid=7023928361337013791; _S_DPR=1; _S_WIN_WH=1920_937; _S_UA=Mozilla%2F5.0%20(Windows%20NT%2010.0%3B%20Win64%3B%20x64)%20AppleWebKit%2F537.36%20(KHTML%2C%20like%20Gecko)%20Chrome%2F96.0.4664.45%20Safari%2F537.36; passport_csrf_token_default=ef1b86b4eb117d3299bee8368d01ab81; passport_csrf_token=ef1b86b4eb117d3299bee8368d01ab81; __ac_signature=_02B4Z6wo00f010IVpCgAAIDCIR9kQxtOfytCMaCAALFKNgvOj28czTtzYxDw-ogM4jdjKXztAzAfWMK68gbX3B0jTjzqas8pyIqPhuRuZzM6ufZczrVdBY3EBepHCIYbinixJJsCvd.7NMII6d; sso_uid_tt_ss=275b0585fed5f121eb96bc20ab5d0f7f; toutiao_sso_user=89c87ee5448b210895bb6f776878f045; toutiao_sso_user_ss=89c87ee5448b210895bb6f776878f045; sid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; ssid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; sso_uid_tt=275b0585fed5f121eb96bc20ab5d0f7f; uid_tt=02d6a9f1037a25d0463f338f9d289ad3; uid_tt_ss=02d6a9f1037a25d0463f338f9d289ad3; sid_tt=570057f240cce2b4ad32a018872e5faf; sessionid=570057f240cce2b4ad32a018872e5faf; sessionid_ss=570057f240cce2b4ad32a018872e5faf; sid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; ssid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; sid_guard=570057f240cce2b4ad32a018872e5faf%7C1639642541%7C3024000%7CThu%2C+20-Jan-2022+08%3A15%3A41+GMT; MONITOR_WEB_ID=93282678874; s_v_web_id=verify_kx9p6yqz_XRzeZVXJ_RD5E_44BU_820w_smMAHTFIaw5i; 
tt_anti_token=cDSrB8zueOSwHGI-cb679e9b34cc9430a1ff05eb213fc0b8a0de7b46b3b92a4179f66a4f53d29416; ttwid=1%7C_KtNzqt5kI03bn7fBVhj-rQCaJxjAKSak6Iioy2mFYs%7C1639703837%7C0f0c312dd7b0efa93a9569861116940f73f56e64888b66871a8a6d5ce2a0e203; tt_scid=H-LKqsTsVysKe3N.at4iR.QdIlw895hLWBduQA.NjQ-g1h7Kw0jDEC4pDwrC6HyV0023",
"referer": "https://www.toutiao.com/c/user/token/MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc/??tab=article?tab=article?tab=article",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
headers_detail = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-encoding": "gzip, deflate",
"accept-language": "zh-CN,zh;q=0.9",
"cache-control": "max-age=0",
"cookie": "_S_IPAD=0; WIN_WH=375_812; FRM=new; PIXIEL_RATIO=3; _ga=GA1.2.116917834.1620437294; csrftoken=7f9d24dc3abc2aeab049c30b8d86ab5f; tt_webid=7023928361337013791; MONITOR_DEVICE_ID=f9c022a2-96f7-402b-a659-887a67bf9de2; tt_webid=7023928361337013791; _S_DPR=1; _S_WIN_WH=1920_937; _S_UA=Mozilla%2F5.0%20(Windows%20NT%2010.0%3B%20Win64%3B%20x64)%20AppleWebKit%2F537.36%20(KHTML%2C%20like%20Gecko)%20Chrome%2F96.0.4664.45%20Safari%2F537.36; passport_csrf_token_default=ef1b86b4eb117d3299bee8368d01ab81; passport_csrf_token=ef1b86b4eb117d3299bee8368d01ab81; __ac_signature=_02B4Z6wo00f010IVpCgAAIDCIR9kQxtOfytCMaCAALFKNgvOj28czTtzYxDw-ogM4jdjKXztAzAfWMK68gbX3B0jTjzqas8pyIqPhuRuZzM6ufZczrVdBY3EBepHCIYbinixJJsCvd.7NMII6d; sso_uid_tt_ss=275b0585fed5f121eb96bc20ab5d0f7f; toutiao_sso_user=89c87ee5448b210895bb6f776878f045; toutiao_sso_user_ss=89c87ee5448b210895bb6f776878f045; sid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; ssid_ucp_sso_v1=1.0.0-KGY2Yzg2NjgzYjg2M2Y2MmE3ZDAyZGMwNGYxNzYzYmJhZjMxZWZiOGEKDwja8NLA2wIQrevrjQYYGBoCbGYiIDg5Yzg3ZWU1NDQ4YjIxMDg5NWJiNmY3NzY4NzhmMDQ1; sso_uid_tt=275b0585fed5f121eb96bc20ab5d0f7f; uid_tt=02d6a9f1037a25d0463f338f9d289ad3; uid_tt_ss=02d6a9f1037a25d0463f338f9d289ad3; sid_tt=570057f240cce2b4ad32a018872e5faf; sessionid=570057f240cce2b4ad32a018872e5faf; sessionid_ss=570057f240cce2b4ad32a018872e5faf; sid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; ssid_ucp_v1=1.0.0-KDRiNjlmZmNiMjliZjYzMjUxZTEwNDFkM2U5MGVjODYwZjJlZDFiZjgKDwja8NLA2wIQrevrjQYYGBoCbGYiIDU3MDA1N2YyNDBjY2UyYjRhZDMyYTAxODg3MmU1ZmFm; sid_guard=570057f240cce2b4ad32a018872e5faf%7C1639642541%7C3024000%7CThu%2C+20-Jan-2022+08%3A15%3A41+GMT; MONITOR_WEB_ID=93282678874; s_v_web_id=verify_kx9p6yqz_XRzeZVXJ_RD5E_44BU_820w_smMAHTFIaw5i; 
tt_anti_token=nSAlvifb-5d23fbe65f1e5b16e89dc18373d171efd78a4f366e7e426007e86ed2a75b7ba9; ttwid=1%7C_KtNzqt5kI03bn7fBVhj-rQCaJxjAKSak6Iioy2mFYs%7C1639703742%7Cf86a130fd0e0ab9fa0c00e44d6d99ca0dac86881c80929fa300ccf9dab0abca7; tt_scid=1jGPvYY-oOkiHriVys13Wgx50YecYX.wb7eP.BYr20LVZ4MMQvRktJFnndDtXxroe227",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"
}
master_url = "https://www.toutiao.com/api/pc/list/user/feed?category=pc_profile_article&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time={max_behot_time}&aid=24&app_name=toutiao_web&_signature={_signature}"
infos_list = []
def get_signature(max_behot_time):
    """Run get_signature() from answer.js on the unsigned feed URL for this page."""
    with open("answer.js", encoding='utf-8') as f:
        jscode = f.read()
    signature = execjs.compile(jscode).call("get_signature", "https://www.toutiao.com/api/pc/list/user/feed?category=pc_profile_article&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time={max_behot_time}&aid=24&app_name=toutiao_web".format(max_behot_time=max_behot_time))
    print(signature)
    return signature
# https://www.toutiao.com/api/pc/list/user/feed?category=profile_all&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time=0&aid=24&app_name=toutiao_web
# https://www.toutiao.com/api/pc/list/user/feed?category=profile_all&token=MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc&max_behot_time=1639381182067&aid=24&app_name=toutiao_web
def get_detail(url, retries=3):
    """Fetch an article page and return its text, one block per line."""
    time.sleep(random.randint(1, 2))
    print("Entering detail page.....")
    response = requests.get(url=url, headers=headers_detail)
    # Set the encoding before reading response.text, otherwise it has no effect.
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    infos = html.xpath("//div[@class='article-content']/article/*")
    print(len(infos))
    if not infos:
        # The page sometimes comes back empty. Retry a bounded number of times and
        # return the result; the original recursed without returning it.
        return get_detail(url, retries - 1) if retries > 0 else ""
    # As in the original, articles with three or more blocks drop their last three.
    if len(infos) >= 3:
        infos = infos[:-3]
    return "".join("".join(info.xpath(".//text()")) + "\n" for info in infos)
def get_master(max_behot_time, _signature):
    """Fetch one page of the article list, scrape each article, then follow pagination."""
    time.sleep(random.randint(1, 5))
    response = requests.get(url=master_url.format(max_behot_time=max_behot_time, _signature=_signature))
    result = json.loads(response.text)
    infos = result['data']
    if infos:
        for info in infos:
            # log_from: 13 random characters, an underscore, and a millisecond timestamp.
            log_from = "".join(random.sample('abcdefghijklmnopqrstuvwxyz1234567890', 13)) + "_" + str(int(time.time() * 1000))
            # Rewrite the list URL into the https://www.toutiao.com/a<id>/ form.
            article_url = (info['url'].replace("group/", "a") + "?log_from=" + log_from).replace("http://toutiao.com", "https://www.toutiao.com")
            title = info['title']
            print(article_url)
            print(title)
            content = get_detail(article_url)
            infos_list.append({'title': title, 'content': content})
        print(infos_list)
        if result['has_more']:
            # The next page needs its own signature, computed for its own URL.
            max_behot_time = result['next']['max_behot_time']
            new_signature = get_signature(max_behot_time)
            get_master(max_behot_time, new_signature)
    else:
        # The list API intermittently returns no data; request the same page again.
        get_master(max_behot_time, _signature)
if __name__ == "__main__":
    _signature = get_signature(0)
    get_master(0, _signature)
# for a_ in a:
# print(a_['url'].split("?")[0])
# content = get_detail("https://www.toutiao.com/a7042200035093709342/?log_from=3b88f64a8e013_1639703732731")
# content = get_detail("https://www.toutiao.com/a7038850085706318350/?log_from=ekrfb3hcd9xsq_1639705126626")
# print(content)
# infos_list.append({'title':a_['title'],'content':content})
# print(infos_list)
// Environment patch (added content)
window = global;
window.document = {
referrer: "https://so.toutiao.com/"
};
window.location = {
hash: "",
host: "www.toutiao.com",
hostname: "www.toutiao.com",
href: "https://www.toutiao.com/c/user/token/MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc/??tab=article?tab=article?tab=article",
origin: "https://www.toutiao.com",
pathname: "/c/user/token/MS4wLjABAAAAq5QSjn8wn7A7Th30to72qb4CFuEAB8qNo2n0ux15Vhc/",
port: "",
protocol: "https:"
}
window.navigator = {
appCodeName: "Mozilla",
appName: "Netscape",
appVersion: "5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36",
cookieEnabled: true,
deviceMemory: 8,
doNotTrack: null,
hardwareConcurrency: 16,
language: "zh-CN",
languages: ["zh-CN","zh"],
maxTouchPoints: 0,
onLine: true,
platform: "Win32",
product: "Gecko",
productSub: "20030107",
userAgent:"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36",
vendor: "Google Inc.",
vendorSub: ""
}
var glb;
(glb = "undefined" == typeof window ? global : window)._$jsvmprt = function(b, e, f) {
function a() {
if ("undefined" == typeof Reflect || !Reflect.construct)
return !1;
if (Reflect.construct.sham)
return !1;
if ("function" == typeof Proxy)
return !0;
try {
return Date.prototype.toString.call(Reflect.construct(Date, [], (function() {}
))),
!0
} catch (b) {
return !1
}
}
function d(b, e, f) {
return (d = a() ? Reflect.construct : function(b, e, f) {
var a = [null];
a.push.apply(a, e);
var d = new (Function.bind.apply(b, a));
return f && c(d, f.prototype),
d
}
).apply(null, arguments)
}
function c(b, e) {
return (c = Object.setPrototypeOf || function(b, e) {
return b.__proto__ = e,
b
}
)(b, e)
}
function n(b) {
return function(b) {
if (Array.isArray(b)) {
for (var e = 0, f = new Array(b.length); e < b.length; e++)
f[e] = b[e];
return f
}
}(b) || function(b) {
if (Symbol.iterator in Object(b) || "[object Arguments]" === Object.prototype.toString.call(b))
return Array.from(b)
}(b) || function() {
throw new TypeError("Invalid attempt to spread non-iterable instance")
}()
}
for (var i = [], r = 0, t = [], o = 0, l = function(b, e) {
var f = b[e++]
, a = b[e]
, d = parseInt("" + f + a, 16);
if (d >> 7 == 0)
return [1, d];
if (d >> 6 == 2) {
var c = parseInt("" + b[++e] + b[++e], 16);
return d &= 63,
[2, c = (d <<= 8) + c]
}
if (d >> 6 == 3) {
var n = parseInt("" + b[++e] + b[++e], 16)
, i = parseInt("" + b[++e] + b[++e], 16);
return d &= 63,
[3, i = (d <<= 16) + (n <<= 8) + i]
}
}, u = function(b, e) {
var f = parseInt("" + b[e] + b[e + 1], 16);
return f = f > 127 ? -256 + f : f
}, s = function(b, e) {
var f = parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3], 16);
return f = f > 32767 ? -65536 + f : f
}, p = function(b, e) {
var f = parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3] + b[e + 4] + b[e + 5] + b[e + 6] + b[e + 7], 16);
return f = f > 2147483647 ? 0 + f : f
}, y = function(b, e) {
return parseInt("" + b[e] + b[e + 1], 16)
}, v = function(b, e) {
return parseInt("" + b[e] + b[e + 1] + b[e + 2] + b[e + 3], 16)
}, g = g || this || window, h = Object.keys || function(b) {
var e = {}
, f = 0;
for (var a in b)
e[f++] = a;
return e.length = f,
e
}
, m = (b.length,
0), I = "", C = m; C < m + 16; C++) {
var q = "" + b[C++] + b[C];
q = parseInt(q, 16),
I += String.fromCharCode(q)
}
if ("HNOJ@?RC" != I)
throw new Error("error magic number " + I);
m += 16;
parseInt("" + b[m] + b[m + 1], 16);
m += 8,
r = 0;
for (var w = 0; w < 4; w++) {
var S = m + 2 * w
, R = "" + b[S++] + b[S]
, x = parseInt(R, 16);
r += (3 & x) << 2 * w
}
m += 16,
m += 8;
var z = parseInt("" + b[m] + b[m + 1] + b[m + 2] + b[m + 3] + b[m + 4] + b[m + 5] + b[m + 6] + b[m + 7], 16)
, O = z
, E = m += 8
, j = v(b, m += z);
j[1];
m += 4,
i = {
p: [],
q: []
};
for (var A = 0; A < j; A++) {
for (var D = l(b, m), T = m += 2 * D[0], $ = i.p.length, P = 0; P < D[1]; P++) {
var U = l(b, T);
i.p.push(U[1]),
T += 2 * U[0]
}
m = T,
i.q.push([$, i.p.length])
}
var _ = {
5: 1,
6: 1,
70: 1,
22: 1,
23: 1,
37: 1,
73: 1
}
, k = {
72: 1
}
, M = {
74: 1
}
, H = {
11: 1,
12: 1,
24: 1,
26: 1,
27: 1,
31: 1
}
, J = {
10: 1
}
, N = {
2: 1,
29: 1,
30: 1,
20: 1
}
, B = []
, W = [];
function F(b, e, f) {
for (var a = e; a < e + f; ) {
var d = y(b, a);
B[a] = d,
a += 2;
k[d] ? (W[a] = u(b, a),
a += 2) : _[d] ? (W[a] = s(b, a),
a += 4) : M[d] ? (W[a] = p(b, a),
a += 8) : H[d] ? (W[a] = y(b, a),
a += 2) : J[d] ? (W[a] = v(b, a),
a += 4) : N[d] && (W[a] = v(b, a),
a += 4)
}
}
return K(b, E, O / 2, [], e, f);
function G(b, e, f, a, c, l, m, I) {
null == l && (l = this);
var C, q, w, S = [], R = 0;
m && (C = m);
var x, z, O = e, E = O + 2 * f;
if (!I)
for (; O < E; ) {
var j = parseInt("" + b[O] + b[O + 1], 16);
O += 2;
var A = 3 & (x = 13 * j % 241);
if (x >>= 2,
A < 1) {
A = 3 & x;
if (x >>= 2,
A > 2)
(A = x) > 10 ? S[++R] = void 0 : A > 1 ? (C = S[R--],
S[R] = S[R] >= C) : A > -1 && (S[++R] = null);
else if (A > 1) {
if ((A = x) > 11)
throw S[R--];
if (A > 7) {
for (C = S[R--],
z = v(b, O),
A = "",
P = i.q[z][0]; P < i.q[z][1]; P++)
A += String.fromCharCode(r ^ i.p[P]);
O += 4,
S[R--][A] = C
} else
A > 5 && (S[R] = h(S[R]))
} else if (A > 0) {
(A = x) > 8 ? (C = S[R--],
S[R] = typeof C) : A > 6 ? S[R] = --S[R] : A > 4 ? S[R -= 1] = S[R][S[R + 1]] : A > 2 && (q = S[R--],
(A = S[R]).x === G ? A.y >= 1 ? S[R] = K(b, A.c, A.l, [q], A.z, w, null, 1) : (S[R] = K(b, A.c, A.l, [q], A.z, w, null, 0),
A.y++) : S[R] = A(q))
} else {
if ((A = x) > 14)
z = s(b, O),
(U = function e() {
var f = arguments;
return e.y > 0 ? K(b, e.c, e.l, f, e.z, this, null, 0) : (e.y++,
K(b, e.c, e.l, f, e.z, this, null, 0))
}
).c = O + 4,
U.l = z - 2,
U.x = G,
U.y = 0,
U.z = c,
S[R] = U,
O += 2 * z - 2;