A crawler for Qidian's romance portal (起点女生网). The site's main anti-scraping measure is font obfuscation, which we defeat with a hex code-point conversion.
Target site: https://www.qidian.com/mm/all
The crawler collects 400,000+ novel records from the site (metadata only, not chapter content) and stores them in MongoDB.
Steps
1. The fields below are the ones we need to decode.
2. Find how those fields are rendered in the page source.
3. Garbled glyphs like these immediately suggest font obfuscation. Search the page source for "woff", download the linked font file, and inspect it with fontTools' TTFont:
from fontTools.ttLib import TTFont

woff_fonts = TTFont('./fonts.woff')
woff_fonts.saveXML("./fonts.xml")  # optional: dump the font tables as XML for inspection
woff_fonts_content = woff_fonts.getBestCmap()  # {code point: glyph name}
print(woff_fonts_content)
{100085: 'period', 100079: 'seven', 100084: 'nine', 100083: 'six', 100078: 'three', 100080: 'two', 100074: 'four', 100076: 'zero', 100077: 'eight', 100082: 'five', 100081: 'one'}
So each private-use code point maps to a glyph whose name spells out a digit; we only need a one-to-one lookup table to recover the real values.
4. The glyph-name-to-digit mapping:
digital_map_english = {
"one": "1",
"two": "2",
"three": "3",
"four": "4",
"five": "5",
"six": "6",
"seven": "7",
"eight": "8",
"nine": "9",
"zero": "0",
"period": ".",
}
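Putting the two tables together gives a code-point-to-digit lookup. A minimal sketch, hard-coding the cmap printed above for illustration (every freshly downloaded .woff uses different code points):

```python
digital_map_english = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "zero": "0",
    "period": ".",
}

# Sample output of TTFont(...).getBestCmap(): {code point: glyph name}.
# These values are from the font inspected above; a fresh .woff will differ.
cmap = {100085: 'period', 100079: 'seven', 100084: 'nine', 100083: 'six',
        100078: 'three', 100080: 'two', 100074: 'four', 100076: 'zero',
        100077: 'eight', 100082: 'five', 100081: 'one'}

# Compose the two tables: code point -> glyph name -> digit
shuzi_map = {code: digital_map_english[name] for code, name in cmap.items()}
print(shuzi_map[100081], shuzi_map[100085])  # -> 1 .
```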
5. Parsing the page, however, the extracted text does not look like glyph names at all:
[u'\U00018803\U000187fa\U00018800\U000187fc\U000187fd\U00018803']
These are the characters' code points, shown as hex escapes.
One way to split them apart is to take the escaped repr of the whole list and split on the escape prefix (ascii() in Python 3 produces the escaped form):
str_shuzi = ascii(shuzis)[2:-2]  # shuzis is the list parsed from the page; strip the surrounding ['...']
list_shuzi = str_shuzi.split("\\U000")
print(list_shuzi)
结果是
['', '18803', '187fa', '18800', '187fc', '187fd', '18803']
6. Each entry is a hexadecimal code point; convert it to decimal and look it up in the map built from the font:
res_list = []
for shuzi in list_shuzi[1:]:  # skip the empty string before the first escape
    sz10 = int(shuzi, 16)  # hex code point -> decimal
    res_list.append(shuzi_map.get(sz10))
print(res_list)
print("".join(res_list))
['3', '1', '9', '.', '6', '3']
319.63
After the decimal conversion, each code point is found in the mapping built earlier, which yields the correct digits.
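Equivalently, the repr round-trip can be skipped: each obfuscated glyph is a single character, so ord() gives its code point directly. A sketch using the code points from the example above (shuzi_map here is abbreviated to just the entries this example needs):

```python
# Abbreviated code-point -> digit map for this example; the real map is
# built from the downloaded .woff as shown earlier.
shuzi_map = {100355: "3", 100346: "1", 100352: "9",
             100348: ".", 100349: "6"}

# The list parsed from the page (same example as above)
shuzis = ['\U00018803\U000187fa\U00018800\U000187fc\U000187fd\U00018803']

# ord() of each character is its decimal code point
decoded = "".join(shuzi_map[ord(ch)] for ch in "".join(shuzis))
print(decoded)  # -> 319.63
```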
The complete code:
# coding=utf8
import re

import requests
from fontTools.ttLib import TTFont
from lxml import etree


class FontConversion(object):
    def __init__(self, antispider_code):
        # glyph name -> digit
        self.digital_map_english = {
            "one": "1",
            "two": "2",
            "three": "3",
            "four": "4",
            "five": "5",
            "six": "6",
            "seven": "7",
            "eight": "8",
            "nine": "9",
            "zero": "0",
            "period": ".",
        }
        self.headers = {
            'Connection': "keep-alive",
            'Pragma': "no-cache",
            'Cache-Control': "no-cache",
            'Upgrade-Insecure-Requests': "1",
            'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            'Accept-Encoding': "gzip, deflate, br",
            'Accept-Language': "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
        }
        self.shuzi_map = self.woff_html(antispider_code)

    def woff_html(self, antispider_code):
        """Download the obfuscation font and build a code-point -> digit map."""
        woff_url = "https://qidian.gtimg.com/qd_anti_spider/%s.woff" % antispider_code
        woff_bytes = requests.get(woff_url, headers=self.headers).content
        # path = "./qidian/utils/fonts"
        path = "./fonts"
        with open(path + ".woff", "wb") as f:
            f.write(woff_bytes)
        woff_fonts = TTFont(path + '.woff')
        woff_fonts.saveXML(path + ".xml")  # optional: dump tables for inspection
        # getBestCmap() returns {code point: glyph name}; invert to {name: code point}
        woff_fonts_content = woff_fonts.getBestCmap()
        woff_fonts_content = dict((value, key) for key, value in woff_fonts_content.items())
        shuzi_map = {}
        for k in self.digital_map_english:
            shuzi_map[woff_fonts_content.get(k)] = self.digital_map_english.get(k)
        return shuzi_map

    def font_conversion(self, font_face_text_list):
        # ascii() escapes the private-use characters as \U000xxxxx sequences
        str_shuzi = ascii(font_face_text_list)[2:-2]  # strip the surrounding ['...']
        res_list = []
        for shuzi in str_shuzi.split("\\U000")[1:]:
            sz16 = shuzi.strip("', ")  # drop repr separators between list elements
            res_list.append(self.shuzi_map.get(int(sz16, 16)))
        return "".join(res_list)


if __name__ == "__main__":
    # when testing, change path in woff_html to a writable location
    url = "https://book.qidian.com/info/1010119186"
    r = requests.get(url)
    antispider = re.findall(r"https://qidian\.gtimg\.com/qd_anti_spider/(.*?)\.eot\?", r.text)[0]
    html = etree.HTML(r.text)
    shuzis = html.xpath("//div[@class='book-info ']/p[3]/em[1]//span/text()")
    font_tools = FontConversion(antispider)
    res = font_tools.font_conversion(shuzis)
    print(res)
The code above only handles breaking the font obfuscation; the actual crawler is built on the Scrapy framework.
The scraped results look like this: