Python Web Scraping: Solving Font Encryption (It Really Is Tricky!)

Latest recommended article published on 2024-09-18 20:39:40

爬遍天下无敌手

Tags: garbled strings, python, java, web

Original link: https://blog.csdn.net/qq_37481877/article/details/107885442?utm_medium=distribute.pc_category.none-task-blog-hot-5.nonecase&depth_1-utm_source=distribute.pc_category.none-task-blog-hot-5.nonecase&request_id=

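To deter scrapers, some websites apply font encryption to certain data on the page. The text looks normal in a browser, but the same text in the scraped page source is garbled.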

Cause

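The page defines a custom font with @font-face in its CSS and displays the protected text through the Unicode code points that this font maps to glyphs. The browser loads the font from the CSS and renders those glyphs, so the page looks normal to a visitor. A scraper is far less lucky: the source it downloads has never been rendered by a browser, so all it sees is code points that look like gibberish.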
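
To make the symptom concrete, here is a minimal sketch of how the protected characters can be spotted in scraped text. It assumes the site maps its glyphs to code points in the Unicode Private Use Area (U+E000 to U+F8FF), which is common but by no means guaranteed; the sample string and the helper name find_obfuscated_chars are made up for illustration.

```python
def find_obfuscated_chars(text):
    """Return the distinct Private Use Area characters found in text."""
    return sorted({ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF})


# Hypothetical scraped value: a browser with the font loaded would draw digits,
# but the source only holds code points that mean nothing without that font.
scraped = "\ue123\ue456\ue123 万"
suspicious = find_obfuscated_chars(scraped)
print([hex(ord(ch)) for ch in suspicious])
```

If a field keeps producing characters in this range, it is a good candidate for the decoding steps that follow.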

Solution

1. Find the encrypted font (@font-face) in the page. There are two common cases.

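Case 1. The simplest one: search the page source for font-face. If you find something like the following, the string after base64, is the encrypted font.

@font-face{font-family:swiper-icons;src:url("data:application/font-woff;charset=utf-8;base64,d09GRgABAAAAAAZgABAAAAAADAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGRAAAfnP/u8EFTQAA") format("woff");font-weight:400;font-style:normal}

Here d09GRgABAAAAAAZgABAAAAAADAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGRAAAfnP/u8EFTQAA is the encrypted font.

Case 2. Some pages do not write the encrypted font into the source in plain text. They encode it one more time and decode it dynamically while the page loads, which amounts to a second layer of encryption. For example: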

<script>
(function font(d) {
    let code = unescape("%64%2E%77%72%69%74%65%28%22%3C%73%74%79%6C%65%3E%40%66%6F%6E%74%2D%66%61%63%65%7B%66%6F%6E%74%");
    eval(code); // run the decoded code
})(document);
</script>

Once decoded, code is a snippet of JavaScript like this:

d.write("<style>@font-face{font-family:'cyzone-secret';src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAA

Again, the string after base64, is the encrypted font.
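
Both cases can be handled by one small helper. The following is only a sketch: the function name extract_font_b64 and both regular expressions are my own assumptions, and it decodes the %XX escapes with urllib.parse.unquote instead of running the JavaScript. That shortcut matches JavaScript's unescape only when the escaped text is plain ASCII, as in the snippet above.

```python
import re
from urllib.parse import unquote

# Matches the payload of a base64 data URI such as the one inside @font-face.
FONT_RE = re.compile(r"base64,\s*([A-Za-z0-9+/=]+)")


def extract_font_b64(html):
    """Return the base64 font string found in a page, or None if there is none."""
    match = FONT_RE.search(html)
    if match:
        # Case 1: the payload is written in the source in plain text.
        return match.group(1)
    # Case 2: the payload is hidden inside unescape("%64%2E...").
    escaped = re.search(r'unescape\("(.*?)"\)', html)
    if escaped:
        match = FONT_RE.search(unquote(escaped.group(1)))
        if match:
            return match.group(1)
    return None
```

Returning None instead of raising lets the caller decide whether a page without a font is an error or simply a page with nothing to decode.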

2. Base64-decode the string and save the font file

Once you have the encrypted font string, base64-decode it and save the result as a .ttf file.

import base64

# font_str is the base64 string extracted in step 1
binData = base64.decodebytes(font_str.encode())
with open('data/xxx.ttf', 'wb') as f:
    f.write(binData)  # the with block closes the file for us
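
The decoded bytes are not always a TrueType file, so it can help to look at the first four bytes before picking an extension. This is a sketch with a made-up helper name, font_extension; the signatures are the standard ones for each container. The two truncated samples in step 1 fit the pattern: d09G decodes to wOFF and AAEAAA to the TrueType header.

```python
import base64

# First four bytes of common font containers.
SIGNATURES = {
    b"\x00\x01\x00\x00": ".ttf",  # TrueType outlines
    b"OTTO": ".otf",              # OpenType with CFF outlines
    b"wOFF": ".woff",
    b"wOF2": ".woff2",
}


def font_extension(font_str):
    """Decode a base64 font string and guess its file extension."""
    data = base64.b64decode(font_str)
    return SIGNATURES.get(data[:4], ".bin"), data


ext, data = font_extension("AAEAAA==")
print(ext)  # .ttf
```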

3. Load the font file to get the mapping, then decode

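from fontTools.ttLib import TTFont

# Load the font and build the mapping: code point -> glyph name
font = TTFont(path)
cmap = font.getBestCmap()


# Decode
def to_num(getText):
    try:
        retList = []
        for ch in getText:
            # ord() takes a character and returns its Unicode code point
            if ord(ch) in cmap:
                # Here the last two characters of the glyph name, minus 1, give the digit
                retList.append(str(int(cmap[ord(ch)][-2:]) - 1))
            else:
                retList.append(ch)
        # Join only after the whole string has been processed
        return ''.join(retList)
    except ValueError:
        # The glyph name did not end in two digits: return the text unchanged
        return getText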
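
The same rule (last two characters of the glyph name, minus 1) can also be turned into a translation table once and then applied with str.translate. The sketch below uses a hypothetical hand-written mapping in place of a real font, shaped like the dict that getBestCmap() returns; the glyph names and the helper name build_table are made up.

```python
def build_table(cmap):
    """Turn a {code point: glyph name} dict into a str.translate table."""
    table = {}
    for code, glyph_name in cmap.items():
        suffix = glyph_name[-2:]
        if suffix.isdigit():
            # Same rule as to_num: numeric suffix minus 1 is the real digit.
            table[code] = str(int(suffix) - 1)
    return table


# Hypothetical mapping standing in for font.getBestCmap().
sample_cmap = {0xE001: "glyph00003", 0xE002: "glyph00001", 0xE003: "glyph00010"}
table = build_table(sample_cmap)
print("\ue001\ue002\ue003万".translate(table))  # 209万
```

Characters that are not in the table pass through untouched, which is the same behaviour as the else branch in to_num.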

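Note

Some sites use one fixed encrypted font, so once you have it you can reuse it. Others serve a different encrypted font with every request, in which case you have to extract and parse the font again after each request. The code below is an example of that second situation.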
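
If you cannot tell in advance which kind of site you are dealing with, one option is to key the decoder on a hash of the font string: a fixed font is then parsed once and a changing font is parsed again automatically. This is just a sketch of that idea; the class name DecoderCache and the build_decoder callable are assumptions, not part of any library.

```python
import hashlib


class DecoderCache:
    """Reuse one decoder per distinct font payload."""

    def __init__(self, build_decoder):
        self._build = build_decoder  # callable: font_str -> decoder object
        self._decoders = {}

    def get(self, font_str):
        key = hashlib.sha256(font_str.encode()).hexdigest()
        if key not in self._decoders:
            # New font payload: parse it once and remember the result.
            self._decoders[key] = self._build(font_str)
        return self._decoders[key]
```

For instance, build_decoder could be a small function that wraps the FontToNum class from the sample code, so cache.get(font_str) returns a ready-to-use decoder for each response.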

Sample code

import base64
import json
import re
import js2py  # used to run JavaScript code
import requests
import xlwt
from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont

# Helper class that decodes the encrypted font
class FontToNum:
    def __init__(self, path, font_str):
        binData = base64.decodebytes(font_str.encode())
        with open(path, 'wb') as f:
            f.write(binData)  # the with block closes the file for us
        self.font01 = TTFont(path)
        self.c = self.font01.getBestCmap()

    def to_num(self, getText):
        try:
            retList = []
            for ch in getText:
                # ord() takes a character and returns its Unicode code point
                if ord(ch) in self.c:
                    retList.append(int(self.c[ord(ch)][-2:]) - 1)
                else:
                    retList.append(ch)
            crackText = ''
            for num in retList:
                crackText += str(num)
            return crackText
        except Exception:
            # Fall back to the raw text if a glyph name does not fit the pattern
            return getText


fw_log = open('data/xxx_log.txt', mode='w', encoding='utf-8')
fw = open('data/xxx.txt', mode='w', encoding='utf-8')

book = xlwt.Workbook()
sheet = book.add_sheet('sheet1')
sheet.write(0, 0, 'Field 1')
sheet.write(0, 1, 'Encrypted field 2')
sheet.write(0, 2, 'Field 3')
sheet.write(0, 3, 'Field 4')
sheet.write(0, 4, 'Field 5')
sheet.write(0, 5, 'Encrypted field 6')
sheet.write(0, 6, 'Field 7')

url = 'https://xxx/xxx/xxx'  # placeholder; the real URL needs a {} slot for the page number filled in by url.format()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
}
raw = 1
for i in range(1, 21):
    for j in range(5):
        try:
            res = requests.get(url.format(str(i)), headers=headers, timeout=(60, 66))
            res.encoding = 'utf-8'
            soup = BeautifulSoup(res.text, 'lxml')
            tr_list = soup.select('[class="table-plate3"]')

            # Get the encrypted font string
            font_js = re.search(r'unescape\("(.*?)"\);', res.text).group(0)  # the JS that produces the font CSS
            font_css = js2py.eval_js(font_js)  # run the JS
            font_str = re.search(r"base64,(.*?)'", font_css).group(1)  # the encrypted font string

            # Decoder built from this response's font
            font_to_num = FontToNum('data/font/' + str(i) + '.ttf', font_str)

            result = {}
            for tr in tr_list:
                name = tr.select_one('[class=tp2_com]').text.strip()
                money = tr.select_one('[class=tp-mean]').text.strip()
                touzilunci = tr.select('td')[3].text.strip()
                touzifang = tr.select('td')[4].text.strip()
                hangye = tr.select_one('[class=tp3]').text.strip()
                time = tr.select('td')[6].text.strip()
                detaio_href = tr.select_one('[class=show-detail]').get('href').strip()

                result['Field 1'] = name
                result['Encrypted field 2'] = font_to_num.to_num(money)
                result['Field 3'] = touzilunci
                result['Field 4'] = touzifang
                result['Field 5'] = ','.join([v for v in hangye.split(' ') if v])
                result['Encrypted field 6'] = font_to_num.to_num(time)
                result['Field 7'] = detaio_href

                sheet.write(raw, 0, name)
                sheet.write(raw, 1, font_to_num.to_num(money))
                sheet.write(raw, 2, touzilunci)
                sheet.write(raw, 3, touzifang)
                sheet.write(raw, 4, ','.join([v for v in hangye.split(' ') if v]))
                sheet.write(raw, 5, font_to_num.to_num(time))
                sheet.write(raw, 6, detaio_href)

                fw.write(json.dumps(result) + '\n')
                raw += 1
            break
        except Exception as e:
            print(e)
            continue
fw_log.close()
fw.close()
book.save('data/xxx.xls')

Did you get all that?