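While scraping data for a book on Qidian (起点小说网), I found that XPath could not extract the figures. Opening the page source showed that the font was obfuscated.
Looking at the HTTP response, the data is actually there; it has just been obfuscated by the developers with a custom font file.
So all that is needed is to find the corresponding font file and read the font's character mapping table with fontTools.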
import requests
import re
from fontTools.ttLib import TTFont
dict_data = {
    'one':'1',
    'two':'2',
    'three':'3',
    'four':'4',
    'five':'5',
    'six':'6',
    'seven':'7',
    'eight':'8',
    'nine':'9',
    'zero':'0',
    'period':'.'
}
url = 'https://book.qidian.com/info/1033896966/'
headers = {
'cookie': '_yep_uuid=87677ede-f190-ca71-e511-ae598d86bcdf; hiijack=0; gender=male; COOKIE_BOOKLIST_TIPS=1; _csrfToken=QZ6tHB8qDaRr7JNXlb6lE3cbubY4bmMjQaSTjhaW; newstatisticUUID=1652260205_2015745767; fu=1298287543; _gid=GA1.2.1634780281.1652260206; _gat_gtag_UA_199934072_1=1; e1=%7B%22pid%22%3A%22mqd_P_qidianm%22%2C%22eid%22%3A%22mall_A1%22%2C%22l1%22%3A17%7D; e2=%7B%22pid%22%3A%22mqd_P_qidianm%22%2C%22eid%22%3A%22mqd_A64%22%2C%22l1%22%3A17%7D; _ga_D20NXNVDG2=GS1.1.1652260205.1.1.1652260342.0; _ga_VMQL7235X0=GS1.1.1652260205.1.1.1652260342.0; se_ref=baidu; _gat_gtag_UA_199934072_2=1; _ga_FZMMH98S83=GS1.1.1652260350.1.1.1652260354.0; _ga_PFYW0QLV3P=GS1.1.1652260350.1.1.1652260354.0; _ga=GA1.2.206320242.1652260206',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'
}
res1 = requests.get(url,headers=headers)
# Extract the obfuscated word count (a run of &#NNNNN; entities)
num_str = re.findall('<span class=".*?">(.*?);</span></em><cite>万字',res1.content.decode())[0]
# Extract the URL of the obfuscated font file (the woff source)
zt_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)",res1.content.decode())[0]
# Download the font file in real time and save it
res2 = requests.get(zt_url,headers=headers)
with open('a.woff','wb') as f:
    f.write(res2.content)
# Read the mapping table (code point -> glyph name)
data = TTFont('a.woff')
font_data = data.getBestCmap()
# Decode the data: entity code point -> glyph name -> digit
str_data_list = num_str.replace('&#','').split(';')
num = ''
for i in str_data_list:
    value = font_data[int(i)]
    num = num + dict_data[value]
print(num)
Running the script prints the novel's total word count (in units of 万字, i.e. ten thousand characters).
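The key point in this scrape is the font analysis: the obfuscated font file has to be fetched in real time. After downloading it, use FontCreator to work out the mapping between the digits and the English glyph names, then use a third-party library (fontTools here) to get the table between the English glyph names and the code points.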
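The two-step lookup described above (code point to English glyph name, then glyph name to digit) can be isolated into a small helper. This is only a sketch: the function name `decode_obfuscated_number` and the `sample_cmap` values are made up for illustration. In the real script the cmap would be whatever `TTFont('a.woff').getBestCmap()` returns for the font downloaded in that run.

```python
import re

# English glyph names and the characters they draw (same table as dict_data).
GLYPH_TO_CHAR = {
    'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
    'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
    'period': '.',
}


def decode_obfuscated_number(entity_text, cmap, glyph_to_char=GLYPH_TO_CHAR):
    """Decode a run of '&#NNNNN;' entities into the characters they render as.

    cmap maps code point (int) -> glyph name (str), as getBestCmap() does.
    """
    # Pull out every numeric code point; the trailing ';' is optional so the
    # string captured by the page regex (which drops the last ';') also works.
    code_points = [int(m) for m in re.findall(r'&#(\d+);?', entity_text)]
    # Step 1: code point -> glyph name. Step 2: glyph name -> character.
    return ''.join(glyph_to_char[cmap[cp]] for cp in code_points)


# Hypothetical cmap for illustration only; real values differ per download.
sample_cmap = {100100: 'three', 100101: 'period', 100102: 'seven', 100103: 'one'}
print(decode_obfuscated_number('&#100100;&#100101;&#100102;&#100103', sample_cmap))
```

Keeping the decoding separate from the download makes it easy to check the mapping logic against a known font before pointing it at a live page.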