Bypassing Font-Based Anti-Scraping

让我look一下-_-

Category: Spider. Tags: programming languages, python, matplotlib

Article link: https://blog.csdn.net/weixin_56267866/article/details/141157570

Target

url = "aHR0cHM6Ly93d3cubWFveWFuLmNvbS9maWxtcz9zaG93VHlwZT0xJm9mZnNldD0w"

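Anti-scraping technique

[Screenshot: a box-office figure as rendered on the page, next to its obfuscated source]
As the screenshot shows, the box-office figures are protected by font-based obfuscation: the original 30.14亿 (3.014 billion) appears in the page source as &#xe583;&#xeba2;.&#xed8f;&#xf16b;.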

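How to bypass it

Approach

Font-based anti-scraping works by binding the outline of each real character to an arbitrary code point, such as &#xe583;&#xeba2;.&#xed8f;&#xf16b;. The page selects a custom font through CSS; at render time the browser loads the corresponding font file and looks up the glyph for each code point, so users see the real characters while the source only contains meaningless codes. To bypass this, we locate the font file, use its mapping to fetch the glyph data for each code point, and identify the actual character from the outline data.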
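As a first step, the obfuscated text has to be turned into glyph names that can be looked up in the font. The sketch below assumes the raw HTML carries hexadecimal character references of the form `&#xe583;` and that the font names its glyphs `uni` plus the uppercase hex code point (as the glyph names in the test code later in this article do). The helper name `entities_to_glyph_names` is made up for illustration; for fonts with a different naming scheme, the font's cmap table would be the place to look instead.

```python
import re

# Matches hexadecimal character references such as "&#xe583;"
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")


def entities_to_glyph_names(text):
    """Return the glyph names referenced by the hex entities in text.

    Assumption: glyphs are named "uni" + uppercase hex code point.
    Plain characters such as "." are not entities and are skipped.
    """
    return ["uni" + code.upper() for code in ENTITY_RE.findall(text)]


raw = "&#xe583;&#xeba2;.&#xed8f;&#xf16b;"
print(entities_to_glyph_names(raw))
```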

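Font files

Common font file types are woff, woff2, and ttf.
This site uses a woff file. The screenshot shows the font file being loaded and the font file URL referenced by the target page:

[Screenshot: the font file request and the font URL used by the target page]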
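Instead of copying the font URL from the browser's developer tools by hand, it can usually be pulled out of the page's CSS. The following is a minimal sketch under the assumption that the font is declared in an `@font-face` rule with a `url(...)` value; the sample CSS and the `fonts.example.com` address are invented for illustration and are not taken from the target site.

```python
import re

# Matches url(...) values ending in .woff2, .woff or .ttf, with optional quotes
FONT_URL_RE = re.compile(
    r"""url\(\s*['"]?([^'")\s]+\.(?:woff2|woff|ttf))['"]?\s*\)"""
)


def find_font_urls(css_text):
    """Return font file URLs found in css_text, in order of appearance."""
    urls = []
    for url in FONT_URL_RE.findall(css_text):
        # Protocol-relative URLs ("//host/path") need a scheme before downloading
        if url.startswith("//"):
            url = "https:" + url
        urls.append(url)
    return urls


# Hypothetical CSS, only to demonstrate the helper
sample_css = """
@font-face {
  font-family: "custom-font";
  src: url("//fonts.example.com/abc123.woff") format("woff");
}
"""
print(find_font_urls(sample_css))
```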

Identifying characters from the font file

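# 1. Download the font file
# Download it in the browser, or fetch it with requests.get
import requests

font_url = "https://example.com/font.woff"  # placeholder: use the font URL found in the page
save_path = "movie.woff"  # same file name that step 2 loads

resp = requests.get(font_url, timeout=10)
resp.raise_for_status()
with open(save_path, "wb") as f:
    f.write(resp.content)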

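# 2. The fontTools library
# fontTools can dump a font file to XML, which makes the outline data easier to inspect; it can also extract and manipulate font data.
from fontTools.ttLib import TTFont  # note the capital L in ttLib
font = TTFont('movie.woff')
font.saveXML("movie.xml")  # save as an XML file

glyf = font['glyf']  # the glyf table holds the outline data
glyph = glyf.get("uniE583")  # glyph for code point 0xE583; glyph names look like "uniE583"
data = list(glyph.coordinates)  # outline points as (x, y) tuples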

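Solution

The book hashes the raw glyph data directly. However, outline data differs from font to font: the number of points in an outline and their positions both change, so the hashes differ widely even though the rendered glyphs look almost identical. A better option is to draw each glyph from its outline data and compare the resulting images. In Python, imagehash.phash(img) produces a perceptual hash, and subtracting two hashes gives the number of differing bits (the Hamming distance), which indicates how similar the two images are:
1. 0-5: very similar
2. 5-20: somewhat similar
3. > 20: very different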
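To make the "difference between two hashes" concrete, here is a small standard-library sketch that computes the same bit difference directly on the hex strings stored later in base_hash. The function names are made up for illustration, and since the bands above overlap at 5, treating exactly 5 as "very similar" is an assumption of this sketch.

```python
def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def similarity_band(distance):
    """Classify a hash distance using the bands listed above."""
    if distance <= 5:  # boundary choice is an assumption
        return "very similar"
    if distance <= 20:
        return "somewhat similar"
    return "very different"


# Baseline hashes for "0" and "9", taken from the code later in this article
hash_0 = "ea87b598c20f9e70"
hash_9 = "e88f9570c28f9b70"
d = hamming_distance(hash_0, hash_9)
print(d, similarity_band(d))
```

Under this rendering the digits 0 and 9 are not very far apart, which is one reason the lookup in the full code keeps the closest match rather than the first one under the threshold.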

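import io
import imagehash
import matplotlib.pyplot as plt
from PIL import Image

def gen_font_hash(self, coordinates):  # method of the ReflectFont class in the full code
    """
    coordinates: glyph outline points, [(x1, y1), (x2, y2), ...]
    Outline coordinates differ between fonts, but the rendered character is essentially the same,
    so the glyph is drawn and the hash of the image decides whether two glyphs are the same character.
    """
    x = [i[0] for i in coordinates]
    y = [i[1] for i in coordinates]
    plt.fill(x, y, 'g')  # fill the shape in green
    plt.plot(x, y, 'g--')  # draw the polygon outline as a green dashed line
    plt.grid(True)
    # Render into an in-memory buffer so no image file has to be saved
    buf = io.BytesIO()
    plt.savefig(buf, format='png')
    buf.seek(0)  # move the pointer back to the start of the stream
    img = Image.open(buf)
    # plt.show()  # show the image; useful when building base_hash to check which character a hash belongs to
    plt.close()
    # Compute the pHash of the image
    phash = imagehash.phash(img)
    return phash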
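One detail worth knowing about the drawing step: a TrueType glyph can consist of several contours (for example the outer shape and the inner hole of a 0), and the function above draws all points as one continuous polygon. That is consistent across fonts, so the hashes still compare, but if a cleaner rendering is wanted, the points can be split per contour first. The sketch below is an optional refinement, not part of the original method; it assumes the contour end indices come from fontTools' `glyph.endPtsOfContours`, and the sample points are invented. Changing the rendering would also mean regenerating every value in base_hash.

```python
def split_contours(coordinates, end_points):
    """Split a flat list of outline points into one list per contour.

    coordinates: [(x, y), ...] for the whole glyph
    end_points:  index of the last point of each contour, in order
    """
    contours = []
    start = 0
    for end in end_points:
        contours.append(list(coordinates[start:end + 1]))
        start = end + 1
    return contours


# Invented data: a 4-point outer contour followed by a 3-point inner contour
points = [(0, 0), (10, 0), (10, 10), (0, 10), (3, 3), (7, 3), (7, 7)]
for contour in split_contours(points, [3, 6]):
    print(contour)
```

Each contour could then be passed to `plt.fill` separately instead of filling the whole point list at once.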

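Summary

For a single fixed font, the mapping can be labelled directly by inspection.
For sites that randomly rotate among multiple fonts, proceed as follows:
1. Download the font file.
2. Get the outline data of each glyph (list(font['glyf'].get(name).coordinates)).
3. Compute the hash of each glyph.
4. Baseline glyphs: pick any one font, compute the hash of each glyph, and record which character each hash represents.
5. Compare each new hash with the baseline hashes to find the mapping between glyphs and characters.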
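Steps 3 to 5 can be sketched without any third-party library once the hashes are available as hex strings. The example below is only an illustration: `build_mapping`, the glyph names `uniE001` to `uniE003`, and their hash values are hypothetical (the first two are baseline hashes with one bit flipped), while the two baseline entries are copied from the code in this article.

```python
def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def build_mapping(glyph_hashes, base_hash, threshold=20):
    """Map each glyph name to the baseline character with the closest hash.

    Glyphs whose best distance is not below threshold map to None.
    """
    mapping = {}
    for name, glyph_hash in glyph_hashes.items():
        best_char, best_dist = None, threshold
        for char, ref_hash in base_hash.items():
            dist = hamming_distance(glyph_hash, ref_hash)
            if dist < best_dist:
                best_char, best_dist = char, dist
        mapping[name] = best_char
    return mapping


# Two baseline entries from this article's base_hash
base_hash = {"0": "ea87b598c20f9e70", "1": "eb3696c994393466"}

# Hypothetical hashes for glyphs of a newly downloaded font
new_font_hashes = {
    "uniE001": "ea87b598c20f9e71",  # one bit away from "0"
    "uniE002": "eb3696c994393467",  # one bit away from "1"
    "uniE003": "0000000000000000",  # far from both
}
print(build_mapping(new_font_hashes, base_hash))
```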

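Issues

1. Drawing the glyphs requires matplotlib and PIL, two large libraries, and speed suffers when many characters have to be converted. (Better ideas are welcome.)

2. The approach suits simple glyph sets such as digits and a-z. With many complex glyphs, building the baseline means finding the mapping by hand one character at a time, which is tedious.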
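One possible mitigation for issue 1, offered as a suggestion rather than something from the original approach: if the site rotates among a limited set of font files, the expensive drawing and hashing only needs to happen once per distinct file. The sketch below caches the finished mapping under a digest of the font bytes. `MappingCache` and the stub builder are hypothetical names; in practice the builder would be whatever function renders and hashes the glyphs.

```python
import hashlib


class MappingCache:
    """Cache glyph-to-character mappings, keyed by a digest of the font bytes."""

    def __init__(self, build_mapping):
        self._build_mapping = build_mapping  # expensive function: font bytes -> mapping
        self._cache = {}

    def get(self, font_bytes):
        key = hashlib.sha256(font_bytes).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._build_mapping(font_bytes)
        return self._cache[key]


# Stub builder that only counts calls; stands in for the render-and-hash step
calls = []


def slow_build(font_bytes):
    calls.append(len(font_bytes))
    return {"uniE583": "3"}


cache = MappingCache(slow_build)
cache.get(b"font-file-a")
cache.get(b"font-file-a")  # served from the cache, builder is not called again
cache.get(b"font-file-b")
print(len(calls))
```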

Code

from fontTools.ttLib import TTFont
import io
import matplotlib.pyplot as plt
from PIL import Image
import imagehash


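class ReflectFont:
    """
    Outline coordinates differ between fonts, but the rendered glyphs are essentially the same,
    so each glyph is drawn from its outline coordinates and identified by the hash of the image.
    """
    def __init__(self, font_path):
        self.font = TTFont(font_path)
        # Baseline pHash for each digit, labelled by hand from one font
        self.base_hash = {
            "0": "ea87b598c20f9e70",
            "1": "eb3696c994393466",
            "2": "e8c894b5c75a926e",
            "3": "e9a5b4dac3259a49",
            "4": "ede5921a98b3638c",
            "5": "f830a5cbc3349ed2",
            "6": "b8a5c19a97259ed8",
            "7": "e36e98b38684cdc9",
            "8": "f8a1879e87699ac8",
            "9": "e88f9570c28f9b70"
        }

    def gen_base_font(self):
        """
        Generate the baseline glyphs automatically, using machine learning or another method.
        """
        raise NotImplementedError("Not implemented")

    def gen_font_hash(self, coordinates):
        """
        coordinates: glyph outline points, [(x1, y1), (x2, y2), ...]
        Outline coordinates differ between fonts, but the rendered character is essentially the same,
        so the glyph is drawn and the hash of the image decides whether two glyphs are the same character.
        """
        x = [i[0] for i in coordinates]
        y = [i[1] for i in coordinates]
        plt.fill(x, y, 'g')  # fill the shape in green
        plt.plot(x, y, 'g--')  # draw the polygon outline as a green dashed line
        plt.grid(True)
        # Render into an in-memory buffer so no image file has to be saved
        buf = io.BytesIO()
        plt.savefig(buf, format='png')
        buf.seek(0)  # move the pointer back to the start of the stream
        img = Image.open(buf)
        # plt.show()  # show the image; useful when building base_hash to check which character a hash belongs to
        plt.close()
        # Compute the pHash of the image
        phash = imagehash.phash(img)
        return phash

    def search_similar_str(self, char_hash, threshold=5):
        """
        Find the baseline character most similar to the given pHash.
        :param char_hash: pHash of the glyph
        :param threshold: upper bound (exclusive) on the hash difference, default 5
        :return: the most similar character, or "?" if none is below the threshold
        """
        similar_str = "?"
        # Keep the character with the smallest hash difference. This allows a larger
        # threshold, so a match is still found when fonts differ a lot.
        smallest_coefficient = 100
        for k, v in self.base_hash.items():
            # imagehash.hex_to_hash(v) rebuilds a hash object that supports subtraction
            correlation_coefficient = abs(char_hash - imagehash.hex_to_hash(v))
            if correlation_coefficient < threshold:
                if correlation_coefficient < smallest_coefficient:
                    smallest_coefficient = correlation_coefficient
                    similar_str = k
        return similar_str

    def reflect_to_str(self, data):
        """
        :param data: list of glyph names to convert, e.g. ["uniE583", ...]
        :return: list of decoded characters
        """
        result = []
        glyf = self.font['glyf']  # the glyf table only needs to be fetched once
        for pre_char in data:
            # Look up the glyph for this name in the font file
            glyph = glyf.get(pre_char)
            if glyph is not None:
                coordinates = list(glyph.coordinates)
                char_hash = self.gen_font_hash(coordinates)
                char = self.search_similar_str(char_hash)
            else:
                # Names missing from the font fall back to the character at that code point
                char = chr(int(pre_char.replace("uni", "0x"), 16))
            result.append(char)
        return result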
    

def test():
    font1 = ReflectFont('432017e7.woff')
    names1 = ["uniE83F","uniE85F","uniE916","uniED4F","uniED98","uniEDBA","uniEFE9","uniF0F0","uniF70E","uniF7B3"]
    font2 = ReflectFont('75e5b39d.woff')
    names2 = ["uniE1B7","uniE274","uniE317","uniE5AC","uniE6D5","uniEAB3","uniEC68","uniEF74","uniF615","uniF66D"]
    font3 = ReflectFont('2a70c44b.woff')
    names3 = ["uniF05A","uniE132","uniE583","uniE83D","uniE886","uniEBA2","uniEC4B","uniED8F","uniF16B","uniF23F"]
    print(font1.reflect_to_str(names1))
    print(font2.reflect_to_str(names2))
    print(font3.reflect_to_str(names3))

if __name__ == '__main__':
    test()
