Python + TensorFlow + pygame: Cracking Arbitrary-Font Anti-Scraping

What is font anti-scraping?

Every character can be represented by a Unicode code point, and a font file is essentially a mapping from Unicode code points to glyph shapes; that is how a computer turns characters into shapes we humans can read. The key to font anti-scraping is therefore the font file, because it decides what shape (which character) each code point is rendered as.
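These obfuscation fonts typically map their fake characters into the Unicode Private Use Area (U+E000 to U+F8FF), which has no standard meaning. A minimal stdlib sketch confirming this for the code points we will see below:

```python
import unicodedata

# Code points such as U+E309 (seen on Maoyan below) live in the Unicode
# Private Use Area, category 'Co': they have no standard glyph, so only
# the site's custom font file decides what they look like.
for cp in (0xE309, 0xEC3F, 0xF04F):
    print(hex(cp), unicodedata.category(chr(cp)))  # all print 'Co'
```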

1. Examining the anti-scraping effect

Here we take Maoyan (maoyan.com) as the example:

As the second screenshot shows, Maoyan obfuscates the digits: &#x marks a hexadecimal character reference, and e309 is the Unicode code point. The first screenshot compares the glyphs rendered with the site's font file against the browser's default rendering. A scraper only ever receives the raw Unicode references (or the default fallback rendering), so the real digits can only be recovered with the font file.
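For reference, Python's standard `html` module resolves such hexadecimal character references the same way a browser does (the sample string here is a hypothetical fragment of the raw HTML, not copied from Maoyan):

```python
import html

raw = '&#xe309;&#xec3f;'               # hypothetical raw HTML fragment
decoded = html.unescape(raw)           # resolves numeric character references
print([hex(ord(c)) for c in decoded])  # ['0xe309', '0xec3f']
```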

 

2. Retrieving the font file

 

As the screenshot above shows, the font file is the blob between the two arrows; we just need to save it locally:

import base64 
font_face='d09GRgABAAAAAAggAAsAAAAAC7gAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABHU1VCAAABCAAAADMAAABCsP6z7U9TLzIAAAE8AAAARAAAAFZW7ldeY21hcAAAAYAAAAC8AAACTC79iqhnbHlmAAACPAAAA5EAAAQ0l9+jTWhlYWQAAAXQAAAALwAAADYVJRd8aGhlYQAABgAAAAAcAAAAJAeKAzlobXR4AAAGHAAAABIAAAAwGhwAAGxvY2EAAAYwAAAAGgAAABoGmAWgbWF4cAAABkwAAAAfAAAAIAEZADxuYW1lAAAGbAAAAVcAAAKFkAhoC3Bvc3QAAAfEAAAAXAAAAI8LScOueJxjYGRgYOBikGPQYWB0cfMJYeBgYGGAAJAMY05meiJQDMoDyrGAaQ4gZoOIAgCKIwNPAHicY2Bk0mWcwMDKwMHUyXSGgYGhH0IzvmYwYuRgYGBiYGVmwAoC0lxTGBwYKr4/Z9b5r8MQw6zDcAUozAiSAwDoeQvweJzFkrENg0AMRf8FQgKkSJkhsgM1EhKDsAATpKLMFJSpMgcjAOIKKJAoKMm/M00kaBOf3kn+PtmWfQCOABxyJy6g3lAw9qKqrO4gsLqLB/0brlTOyOuq8dusS3TRx0M0plM562Xhi/3Ililm3DomEuLA/lyc4MFnTenE28n0A1P/K/1tF3s/Vy8k+QpbrCvB6I0vcJJoM8HsvEsEs3NdCJwz+ljgxDFEAmePMRW4BUylYP7NrAUEH1DvQih4nD2Ty28aVxTG7x0ixsEY4zKPgBNgGDwzgG3G88LAGAgYEj8pNoMxToixEkLcJnGtOHUSq03oQ0qq/gHpplIX3URdZJ9KVbNqU7Ve9A+o1G13rZSNhXtn7DCLK50rnfN95/fNBRCA43+ABAiAAZCQScJPCAB9mHnAI+w34ABDADAqo0J5RCZZkh+xwULvV1i63GrV/3pRgYc9sfLiCN39eNJ3/B8GsD8AC+Jooh/KUgYmMnAGKrwdt7MhTlU0WfJDknBBNsRzPFQ4NmQnCZqStK8GdTGa5l12HHrj44n1h59uze7p6ftlQ9EcsLMyna5Fog/KP+jqWEb1aaMDZ+xRn+/x9u0vFr7uPvvOmIwbML243lwuRWJr4J0f2EN+gmAcbcTxSAy34y5IZpA16sQF8pRApmiKhoTpWFMVLmSH3zjJsBINRmnnUHBDXjtIXc/febZY+MjQVGfvOV/ktEr5fhWjFHqMDiQvrGpTk9124d7Mt68OmyviZLX3ZtyINZbm1moAWlzOYD+DwCkVzVob5z0MyeCnXkw+iMiXjjktV68VYgVitQiv9/7mg7Ns80my+PHWTGbgdTG/9bzGBRxwp/oTRT+5uXl5TZtuAGDrsyeQygQAHpO1NdfEbUejUZ0gKFlKaOaCNoKiUaGdVC8/3Hm1u50vdv+8mCuJeUVkmUL74vnQWCgSlMlI9ZMK/FzYfv/W3cWOQF3LXz3I6K1S83slGww0C7neU75IeEiCf7xS6XM/OvUCPErCXBEBRm5M+uaeyImkuXkO5RH1+jrLu+nzbrfTNXqjfFMvNSoPV6PCo/AEbHXnl6sb0Zx+O9vml1fn629e3tuDm+mUnAcW13/hMdKJ9blaESdoK0ULsWaF64dI2QKBAue7w5e0jMFHdF/Y4UquZzV51lF3J1PVlDSlSlPZS0871w7O/rKQrx3wgmMJpmfEbCY/3IhP+c7VNxeo4Sulq5/tNKxnYh17yIMTvRR2xAVxVUP7JmS4Vw92hLnpUWEwiYl+3W2EJK9I93veop4wAGMkg/jYzJZ3f0UGSqekcJcNh297/KBjVEhyqTIZWdCzi7Bxdv/3fSZGFERBot8bqFYDfm88rgbF+QvTN+bmS472rV1jYkmiswIzcY4e6mseY6+BByWiMiSabMdZU9WUjsNDtjAre7wDG3DEHUj7cwx2xyiGWw8e5RofRNv6/t3kFQ6A/wHUBeDCAAAAeJxjY
GRgYABiXqF/ofH8Nl8ZuFkYQOAm08tHCPr/GxYGpvNALgcDE0gUACrwCzkAeJxjYGRgYNb5r8MQw8IAAkCSkQEV8AAAM2IBzXicY2EAghQGBiYd4jAAN4wCNQAAAAAAAAAMACgAcAC0AOYBLAFgAaIBvAH2AhoAAHicY2BkYGDgYTBgYGYAASYg5gJCBob/YD4DAA6DAVYAeJxlkbtuwkAURMc88gApQomUJoq0TdIQzEOpUDokKCNR0BuzBiO/tF6QSJcPyHflE9Klyyekz2CuG8cr7547M3d9JQO4xjccnJ57vid2cMHqxDWc40G4Tv1JuEF+Fm6ijRfhM+oz4Ra6eBVu4wZvvMFpXLIa40PYQQefwjVc4Uu4Tv1HuEH+FW7i1mkKn6Hj3Am3sHC6wm08Ou8tpSZGe1av1PKggjSxPd8zJtSGTuinyVGa6/Uu8kxZludCmzxMEzV0B6U004k25W35fj2yNlCBSWM1paujKFWZSbfat+7G2mzc7weiu34aczzFNYGBhgfLfcV6iQP3ACkSaj349AxXSN9IT0j16JepOb01doiKbNWt1ovippz6sVYYwsXgX2rGVFIkq7Pl2PNrI6qW6eOshj0xaSq9mpNEZIWs8LZUfOouNkVXxp/d5woqebeYIf4D2J1ywQB4nG2JOw6AIBQE3+IHRbyLAQJaAsJdbOxMPL7x2TrNZHZJ0IeifzQEGrTo0ENiwAiFCRoz4ZbXeRS7bK+rjZHbu8x2PrGT47+Elfe6uMqdLbuGErjNbogeH2oXtw=='
b=base64.b64decode(font_face)
with open('myfont.otf','wb') as f:
    f.write(b)
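In practice the blob above comes out of the page's `@font-face` CSS rule. A hedged sketch of extracting it with a regex; the CSS string here is a made-up stand-in (real pages may instead reference a `.woff` URL that you would download with `requests`):

```python
import re
import base64

# Made-up @font-face rule; real pages embed the full WOFF as a base64 data URI
css = ("@font-face{font-family:stonefont;"
       "src:url(data:application/font-woff;charset=utf-8;base64,d09GRgABAAAA) format('woff');}")
m = re.search(r"base64,([^)]+)\)", css)
font_bytes = base64.b64decode(m.group(1))
print(font_bytes[:4])  # b'wOFF' -- the WOFF magic number
```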

3. Getting the mapping between Unicode and the font file

from fontTools.ttLib import TTFont
ttffont=TTFont("myfont.otf")
print(ttffont.getBestCmap())

#{120: 'x', 58066: 'uniE2D2', 58121: 'uniE309', 58475: 'uniE46B', 58956: 'uniE64C', 59276: 'uniE78C',60233: 'uniEB49', 60479: 'uniEC3F', 61519: 'uniF04F', 62378: 'uniF3AA', 63463: 'uniF7E7'}

This is the mapping inside the font file.

The characters on the page, however, arrive as hexadecimal character references such as &#xe309;; we only need to convert them to decimal:

import re
s = "&#xe309;&#xec3f;.&#xe309;&#xf04f;"  # the raw HTML as the scraper receives it
n16s = re.findall("&#x(.*?);", s)
for n16 in n16s:
    n10 = int(n16, 16)
    print(n10)
# 58121
# 60479
# 58121
# 61519

Now the values match the Unicode keys in the font file's cmap.
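Putting steps 2 and 3 together: with the decimal values in hand, each page character can be looked up in the cmap printed above (the sample string is again a hypothetical raw-HTML fragment):

```python
import re

# cmap as printed by fontTools above: decimal code point -> glyph name
cmap = {120: 'x', 58066: 'uniE2D2', 58121: 'uniE309', 58475: 'uniE46B',
        58956: 'uniE64C', 59276: 'uniE78C', 60233: 'uniEB49',
        60479: 'uniEC3F', 61519: 'uniF04F', 62378: 'uniF3AA', 63463: 'uniF7E7'}

s = "&#xe309;&#xec3f;.&#xe309;&#xf04f;"   # hypothetical raw HTML fragment
glyphs = [cmap[int(h, 16)] for h in re.findall(r"&#x(.*?);", s)]
print(glyphs)  # ['uniE309', 'uniEC3F', 'uniE309', 'uniF04F']
```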

4. Rendering Unicode with the font file

Here we use pygame to render the Unicode characters into images. First, check the format of the font file the page actually serves.

The screenshot shows that Maoyan serves its font as WOFF. Per the pygame.freetype documentation, pygame can render all font formats supported by FreeType: TTF, Type1, CFF and OpenType fonts, as well as SFNT, PCF, FNT, BDF, PFR and Type42 fonts, and every glyph with a UTF-32 code point is accessible.

pygame does not support WOFF, though, so we first convert it with a helper function:

import pygame.freetype
from PIL import Image
from io import BytesIO
import base64
import struct
import zlib
def convert_streams(infile):
    """Convert WOFF bytes to uncompressed SFNT (TTF/OTF) bytes."""
    infile=BytesIO(infile)
    outfile=BytesIO()
    WOFFHeader = {'signature': struct.unpack(">I", infile.read(4))[0],
                  'flavor': struct.unpack(">I", infile.read(4))[0],
                  'length': struct.unpack(">I", infile.read(4))[0],
                  'numTables': struct.unpack(">H", infile.read(2))[0],
                  'reserved': struct.unpack(">H", infile.read(2))[0],
                  'totalSfntSize': struct.unpack(">I", infile.read(4))[0],
                  'majorVersion': struct.unpack(">H", infile.read(2))[0],
                  'minorVersion': struct.unpack(">H", infile.read(2))[0],
                  'metaOffset': struct.unpack(">I", infile.read(4))[0],
                  'metaLength': struct.unpack(">I", infile.read(4))[0],
                  'metaOrigLength': struct.unpack(">I", infile.read(4))[0],
                  'privOffset': struct.unpack(">I", infile.read(4))[0],
                  'privLength': struct.unpack(">I", infile.read(4))[0]}

    # sfnt offset table: flavor, numTables, then the binary-search helper fields
    outfile.write(struct.pack(">I", WOFFHeader['flavor']))
    outfile.write(struct.pack(">H", WOFFHeader['numTables']))
    maximum = list(filter(lambda x: x[1] <= WOFFHeader['numTables'], [(n, 2**n) for n in range(64)]))[-1]
    searchRange = maximum[1] * 16
    outfile.write(struct.pack(">H", searchRange))
    entrySelector = maximum[0]
    outfile.write(struct.pack(">H", entrySelector))
    rangeShift = WOFFHeader['numTables'] * 16 - searchRange
    outfile.write(struct.pack(">H", rangeShift))

    offset = outfile.tell()

    TableDirectoryEntries = []
    for i in range(0, WOFFHeader['numTables']):
        TableDirectoryEntries.append({'tag': struct.unpack(">I", infile.read(4))[0],
                               'offset': struct.unpack(">I", infile.read(4))[0],
                               'compLength': struct.unpack(">I", infile.read(4))[0],
                               'origLength': struct.unpack(">I", infile.read(4))[0],
                               'origChecksum': struct.unpack(">I", infile.read(4))[0]})
        offset += 4*4
        
    for TableDirectoryEntry in TableDirectoryEntries:   
        outfile.write(struct.pack(">I", TableDirectoryEntry['tag']))
        outfile.write(struct.pack(">I", TableDirectoryEntry['origChecksum']))
        outfile.write(struct.pack(">I", offset))
        outfile.write(struct.pack(">I", TableDirectoryEntry['origLength']))
        TableDirectoryEntry['outOffset'] = offset
        offset += TableDirectoryEntry['origLength']
        if (offset % 4) != 0:
            offset += 4 - (offset % 4)
            
    for TableDirectoryEntry in TableDirectoryEntries:
        infile.seek(TableDirectoryEntry['offset'])
        compressedData = infile.read(TableDirectoryEntry['compLength'])
        if TableDirectoryEntry['compLength'] != TableDirectoryEntry['origLength']:
            uncompressedData = zlib.decompress(compressedData)
        else:
            uncompressedData = compressedData
        outfile.seek(TableDirectoryEntry['outOffset'])
        outfile.write(uncompressedData)
        offset = TableDirectoryEntry['outOffset'] + TableDirectoryEntry['origLength'];
        padding = 0
        if (offset % 4) != 0:
            padding = 4 - (offset % 4)
        outfile.write(bytearray(padding));
    return outfile.getvalue()
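To make the header parsing above concrete, here is a tiny self-contained check of the first two fields `convert_streams` reads: the 4-byte signature (the ASCII bytes `wOFF`) and the `flavor` of the wrapped font. The header bytes here are synthetic, not a real font:

```python
import struct

# Synthetic first 8 bytes of a WOFF header: signature + flavor.
# flavor 0x00010000 means the wrapped font is a TrueType-flavoured sfnt.
header = b'wOFF' + struct.pack(">I", 0x00010000)
signature, flavor = struct.unpack(">II", header)
print(hex(signature), hex(flavor))  # 0x774f4646 0x10000
```

As an aside, fontTools can also perform the WOFF-to-TTF conversion directly: load the WOFF with `TTFont`, set `font.flavor = None`, and `font.save(...)` writes a plain sfnt file.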

font_face='d09GRgABAAAAAAggAAsAAAAAC7gAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABHU1VCAAABCAAAADMAAABCsP6z7U9TLzIAAAE8AAAARAAAAFZW7ldeY21hcAAAAYAAAAC8AAACTC79iqhnbHlmAAACPAAAA5EAAAQ0l9+jTWhlYWQAAAXQAAAALwAAADYVJRd8aGhlYQAABgAAAAAcAAAAJAeKAzlobXR4AAAGHAAAABIAAAAwGhwAAGxvY2EAAAYwAAAAGgAAABoGmAWgbWF4cAAABkwAAAAfAAAAIAEZADxuYW1lAAAGbAAAAVcAAAKFkAhoC3Bvc3QAAAfEAAAAXAAAAI8LScOueJxjYGRgYOBikGPQYWB0cfMJYeBgYGGAAJAMY05meiJQDMoDyrGAaQ4gZoOIAgCKIwNPAHicY2Bk0mWcwMDKwMHUyXSGgYGhH0IzvmYwYuRgYGBiYGVmwAoC0lxTGBwYKr4/Z9b5r8MQw6zDcAUozAiSAwDoeQvweJzFkrENg0AMRf8FQgKkSJkhsgM1EhKDsAATpKLMFJSpMgcjAOIKKJAoKMm/M00kaBOf3kn+PtmWfQCOABxyJy6g3lAw9qKqrO4gsLqLB/0brlTOyOuq8dusS3TRx0M0plM562Xhi/3Ililm3DomEuLA/lyc4MFnTenE28n0A1P/K/1tF3s/Vy8k+QpbrCvB6I0vcJJoM8HsvEsEs3NdCJwz+ljgxDFEAmePMRW4BUylYP7NrAUEH1DvQih4nD2Ty28aVxTG7x0ixsEY4zKPgBNgGDwzgG3G88LAGAgYEj8pNoMxToixEkLcJnGtOHUSq03oQ0qq/gHpplIX3URdZJ9KVbNqU7Ve9A+o1G13rZSNhXtn7DCLK50rnfN95/fNBRCA43+ABAiAAZCQScJPCAB9mHnAI+w34ABDADAqo0J5RCZZkh+xwULvV1i63GrV/3pRgYc9sfLiCN39eNJ3/B8GsD8AC+Jooh/KUgYmMnAGKrwdt7MhTlU0WfJDknBBNsRzPFQ4NmQnCZqStK8GdTGa5l12HHrj44n1h59uze7p6ftlQ9EcsLMyna5Fog/KP+jqWEb1aaMDZ+xRn+/x9u0vFr7uPvvOmIwbML243lwuRWJr4J0f2EN+gmAcbcTxSAy34y5IZpA16sQF8pRApmiKhoTpWFMVLmSH3zjJsBINRmnnUHBDXjtIXc/febZY+MjQVGfvOV/ktEr5fhWjFHqMDiQvrGpTk9124d7Mt68OmyviZLX3ZtyINZbm1moAWlzOYD+DwCkVzVob5z0MyeCnXkw+iMiXjjktV68VYgVitQiv9/7mg7Ns80my+PHWTGbgdTG/9bzGBRxwp/oTRT+5uXl5TZtuAGDrsyeQygQAHpO1NdfEbUejUZ0gKFlKaOaCNoKiUaGdVC8/3Hm1u50vdv+8mCuJeUVkmUL74vnQWCgSlMlI9ZMK/FzYfv/W3cWOQF3LXz3I6K1S83slGww0C7neU75IeEiCf7xS6XM/OvUCPErCXBEBRm5M+uaeyImkuXkO5RH1+jrLu+nzbrfTNXqjfFMvNSoPV6PCo/AEbHXnl6sb0Zx+O9vml1fn629e3tuDm+mUnAcW13/hMdKJ9blaESdoK0ULsWaF64dI2QKBAue7w5e0jMFHdF/Y4UquZzV51lF3J1PVlDSlSlPZS0871w7O/rKQrx3wgmMJpmfEbCY/3IhP+c7VNxeo4Sulq5/tNKxnYh17yIMTvRR2xAVxVUP7JmS4Vw92hLnpUWEwiYl+3W2EJK9I93veop4wAGMkg/jYzJZ3f0UGSqekcJcNh297/KBjVEhyqTIZWdCzi7Bxdv/3fSZGFERBot8bqFYDfm88rgbF+QvTN+bmS472rV1jYkmiswIzcY4e6mseY6+BByWiMiSabMdZU9WUjsNDtjAre7wDG3DEHUj7cwx2xyiGWw8e5RofRNv6/t3kFQ6A/wHUBeDCAAAAeJxjY
GRgYABiXqF/ofH8Nl8ZuFkYQOAm08tHCPr/GxYGpvNALgcDE0gUACrwCzkAeJxjYGRgYNb5r8MQw8IAAkCSkQEV8AAAM2IBzXicY2EAghQGBiYd4jAAN4wCNQAAAAAAAAAMACgAcAC0AOYBLAFgAaIBvAH2AhoAAHicY2BkYGDgYTBgYGYAASYg5gJCBob/YD4DAA6DAVYAeJxlkbtuwkAURMc88gApQomUJoq0TdIQzEOpUDokKCNR0BuzBiO/tF6QSJcPyHflE9Klyyekz2CuG8cr7547M3d9JQO4xjccnJ57vid2cMHqxDWc40G4Tv1JuEF+Fm6ijRfhM+oz4Ra6eBVu4wZvvMFpXLIa40PYQQefwjVc4Uu4Tv1HuEH+FW7i1mkKn6Hj3Am3sHC6wm08Ou8tpSZGe1av1PKggjSxPd8zJtSGTuinyVGa6/Uu8kxZludCmzxMEzV0B6U004k25W35fj2yNlCBSWM1paujKFWZSbfat+7G2mzc7weiu34aczzFNYGBhgfLfcV6iQP3ACkSaj349AxXSN9IT0j16JepOb01doiKbNWt1ovippz6sVYYwsXgX2rGVFIkq7Pl2PNrI6qW6eOshj0xaSq9mpNEZIWs8LZUfOouNkVXxp/d5woqebeYIf4D2J1ywQB4nG2JOw6AIBQE3+IHRbyLAQJaAsJdbOxMPL7x2TrNZHZJ0IeifzQEGrTo0ENiwAiFCRoz4ZbXeRS7bK+rjZHbu8x2PrGT47+Elfe6uMqdLbuGErjNbogeH2oXtw=='
b=base64.b64decode(font_face)
myfont=BytesIO(convert_streams(b))

uni=58121

pygame.freetype.init()
font=pygame.freetype.Font(myfont,64)
rtext=font.render(chr(uni), (0, 0, 0),(255, 255,255))
pil_string_image = pygame.image.tostring(rtext[0], "RGB")
pil_image = Image.frombytes("RGB",rtext[0].get_size(),pil_string_image)
pil_image.show()

Running this code renders the Unicode code point as an image.

5. Train a TensorFlow CNN (convolutional neural network) to recognize the characters in the images

All that remains is getting the computer to recognize the rendered image as a character. Download one or more full-character-set font files from the web, generate training samples with pygame, and train with TensorFlow:

import numpy as np
import tensorflow as tf
import pygame
import random
from PIL import Image
import pygame.freetype
from io import BytesIO
from io import StringIO
from fontTools.ttLib import TTFont
pygame.init()
sjs=[0,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
    
    16, 17, 18, 19, 20, 21, 22,23, 24, 25, 26, 27, 28, 29, 30,
    31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
    43, 44, 45,315, 316, 317, 318, 319, 320, 321, 322, 
    323, 324, 325, 326, 327, 328,329, 
    330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 
    340, 341, 342, 343, 344,

    345, 346,  347, 348, 349, 
    350, 351, 352, 353, 354, 355, 356, 357, 358, 359,
    ]
kb=0.75
ttf_names=["e:syc.otf",
        # "e:syz.otf","e:o1.otf","e:o2.otf","e:o3.otf",
        # "e:o4.otf","e:o5.otf","e:o6.otf","e:o7.otf",
        # "e:t1.ttf","e:t2.ttf","e:t3.ttf","e:t4.ttf",
        ]
ttf_name="e:syc.otf"
def change_ttf(ttf_name):
    global font
    print(ttf_name)
    # font=pygame.freetype.Font(ttf_name,random.randint(60,64))
    font=pygame.freetype.Font(ttf_name,64)
ttffont=TTFont(ttf_name)
gbcs=list(ttffont.getBestCmap())
gbc=[]
for k in gbcs:
    if 33<=k<=126 or 19968<=k<=40869:
        gbc.append(k)
# gbc=gbc[500:507]
sgbc=sorted(gbc)
print("gggggg")

IMAGE_HEIGHT = 64
IMAGE_WIDTH =  64
def k2name(k):
    s=str(hex(k))[2:]
    s="0"*(4-len(s))+s
    return s
def k2im(k):
    rtext=font.render(chr(k), (0, 0, 0),(255, 255,255))
    pil_string_image = pygame.image.tostring(rtext[0], "RGB")
    pil_image = Image.frombytes("RGB",rtext[0].get_size(),pil_string_image).resize((IMAGE_WIDTH,IMAGE_HEIGHT))
    im=np.array(pil_image.convert("1"))
    return im
def gen_captcha_text_and_image(k):
    captcha_text=k2name(k)
    captcha_image=k2im(k)
    return captcha_text, captcha_image

# text, image = gen_captcha_text_and_image(55)
# print("sample image channels:", image.shape)  # (60, 160, 3)
# image size

MAX_CAPTCHA = 1
print("max characters per sample:", MAX_CAPTCHA)  # one glyph per image here (no need for the pad-to-4-chars trick from captcha examples)
 
# convert colour images to grayscale (colour carries no information for this task)
def convert2gray(img):
    if len(img.shape) > 2:
        # gray = np.mean(img, -1)
        # the line above is faster; the standard luminance formula is:
        r, g, b = img[:,:,0], img[:,:,1], img[:,:,2]
        gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
        return gray
    else:
        return img
 
"""
cnn在图像大小是2的倍数时性能最高, 如果你用的图像大小不是2的倍数,可以在图像边缘补无用像素。
np.pad(image,((2,3),(2,2)), 'constant', constant_values=(255,))  # 在图像上补2行,下补3行,左补2行,右补2行
"""
 
# text to one-hot vector
# char_set = number + alphabet + ALPHABET + ['_']  # in captcha examples, '_' pads codes shorter than 4
# CHAR_SET_LEN = len(char_set)
CHAR_SET_LEN=16
CHAR_SET_LEN=len(sgbc)
print(CHAR_SET_LEN)
def text2vec(text): 
    vector = np.zeros(MAX_CAPTCHA*CHAR_SET_LEN)
    idx = int("0x"+text,16)
    vector[sgbc.index(idx)] = 1
    return vector

# generate one training batch
def get_next_batch(kks):
    batch_size=len(kks)
    batch_x = np.zeros([batch_size, IMAGE_HEIGHT*IMAGE_WIDTH])
    batch_y = np.zeros([batch_size, MAX_CAPTCHA*CHAR_SET_LEN])
 
    # in the original captcha example, generated images were sometimes not (60, 160, 3)
    i=0
    for kk in kks:
        text, image = gen_captcha_text_and_image(kk)
        image = convert2gray(image)
        batch_x[i,:] = image.flatten() / 1  # (image.flatten()-128)/128 would give zero mean
        batch_y[i,:] = text2vec(text)
        i+=1
    return batch_x, batch_y
 
####################################################################
 
X = tf.placeholder(tf.float32, [None, IMAGE_HEIGHT*IMAGE_WIDTH])
Y = tf.placeholder(tf.float32, [None, MAX_CAPTCHA*CHAR_SET_LEN])
keep_prob = tf.placeholder(tf.float32) # dropout
 
# define the CNN
def crack_captcha_cnn(w_alpha=0.01, b_alpha=0.1):
    x = tf.reshape(X, shape=[-1, IMAGE_HEIGHT, IMAGE_WIDTH, 1])
 
    #w_c1_alpha = np.sqrt(2.0/(IMAGE_HEIGHT*IMAGE_WIDTH)) #
    #w_c2_alpha = np.sqrt(2.0/(3*3*32)) 
    #w_c3_alpha = np.sqrt(2.0/(3*3*64)) 
    #w_d1_alpha = np.sqrt(2.0/(8*32*64))
    #out_alpha = np.sqrt(2.0/1024)
 
    # 3 conv layer
    w_c1 = tf.Variable(w_alpha*tf.random_normal([3, 3, 1, 16]))
    b_c1 = tf.Variable(b_alpha*tf.random_normal([16]))
    conv1 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w_c1, strides=[1, 1, 1, 1], padding='SAME'), b_c1))
    conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv1 = tf.nn.dropout(conv1, keep_prob)
    print(conv1.shape)
 
    w_c2 = tf.Variable(w_alpha*tf.random_normal([3, 3, 16, 32]))
    b_c2 = tf.Variable(b_alpha*tf.random_normal([32]))
    conv2 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv1, w_c2, strides=[1, 1, 1, 1], padding='SAME'), b_c2))
    conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv2 = tf.nn.dropout(conv2, keep_prob)
    print(conv2.shape)
 
    w_c3 = tf.Variable(w_alpha*tf.random_normal([3, 3, 32, 64]))
    b_c3 = tf.Variable(b_alpha*tf.random_normal([64]))
    conv3 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv2, w_c3, strides=[1, 1, 1, 1], padding='SAME'), b_c3))
    conv3 = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    conv3 = tf.nn.dropout(conv3, keep_prob)
    print(conv3.shape)
 
    # Fully connected layer
    w_d = tf.Variable(w_alpha*tf.random_normal([8*8*64, 1024]))
    b_d = tf.Variable(b_alpha*tf.random_normal([1024]))
    dense = tf.reshape(conv3, [-1, w_d.get_shape().as_list()[0]])
    dense = tf.nn.relu(tf.add(tf.matmul(dense, w_d), b_d))
    dense = tf.nn.dropout(dense, keep_prob)
    
 
    w_out = tf.Variable(w_alpha*tf.random_normal([1024, MAX_CAPTCHA*CHAR_SET_LEN]))
    b_out = tf.Variable(b_alpha*tf.random_normal([MAX_CAPTCHA*CHAR_SET_LEN]))
    out = tf.add(tf.matmul(dense, w_out), b_out)
    #out = tf.nn.softmax(out)
    return out
 
# training
def train_crack_captcha_cnn():
    output = crack_captcha_cnn()
    # loss
    #loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, Y))
    print("shapes:", output.shape, Y.shape)
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=Y))
    # how does softmax differ from sigmoid in the final classification layer?
    # optimizer: to speed up training, learning_rate should start large and then decay
    optimizer = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
 
    predict = tf.reshape(output, [-1, MAX_CAPTCHA, CHAR_SET_LEN])
    max_idx_p = tf.argmax(predict, 2)
    max_idx_l = tf.argmax(tf.reshape(Y, [-1, MAX_CAPTCHA, CHAR_SET_LEN]), 2)
    correct_pred = tf.equal(max_idx_p, max_idx_l)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
 
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # resume from the latest checkpoint; for a fresh run, comment out the
        # restore and uncomment the initializer
        saver.restore(sess, tf.train.latest_checkpoint("e://model5/"))
        # sess.run(tf.global_variables_initializer())
        step = 0
        p=1
        xh=0
        tc=0
        change_ttf(ttf_names[xh%len(ttf_names)])
        while True:
            random.shuffle(gbc)
            kks=[]
            for k in gbc:
                kks.append(k)
                if len(kks)>0 and len(kks)%200==0:
                    batch_x, batch_y = get_next_batch(kks)
                    _, loss_ = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: kb})
                    print(step, loss_)
                    kks=[]
                    step += 1
            else:
                if kks:
                    batch_x, batch_y = get_next_batch(kks)
                    _, loss_ = sess.run([optimizer, loss], feed_dict={X: batch_x, Y: batch_y, keep_prob: kb})
                    print(step, loss_)
                    kks=[]
                    step += 1
                # if xh%20==0:
                if True:

                    xh+=1
                    change_ttf(ttf_names[xh%len(ttf_names)])

                    random.shuffle(gbc)
                    kks=[]
                    for k in gbc:
                        kks.append(k)
                        if len(kks)>0 and len(kks)%200==0:
                            break
                    batch_x_test, batch_y_test = get_next_batch(kks)
                    acc = sess.run(accuracy, feed_dict={X: batch_x_test, Y: batch_y_test, keep_prob: 1.})
                    print("预测数据:",xh, acc)
                    

                    if acc >= p:

                        # xh+=1
                        # change_ttf(ttf_names[xh%len(ttf_names)])

                        p=acc
                        pp=int(str(acc)[2:])

                        saver.save(sess, "e:/model5/good.model", global_step=pp)

                        tc+=1
                        if tc>=20000:
                            return True
train_crack_captcha_cnn()
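At inference time, the network's output is decoded by reversing `text2vec`: take the argmax of the output vector and index back into `sgbc`. A minimal stdlib-only sketch with a toy three-character set (the real `sgbc` holds every code point in the font):

```python
# Toy char set standing in for sgbc: the code points of '0', '1', '2'
sgbc_demo = [48, 49, 50]

# text2vec-style one-hot encoding of the character '1'
vec = [0.0] * len(sgbc_demo)
vec[sgbc_demo.index(ord('1'))] = 1.0

# Reverse mapping, as a prediction step would do with the network's output
decoded = chr(sgbc_demo[vec.index(max(vec))])
print(decoded)  # 1
```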


 

The font anti-scraping I ran into covered the full character set, which is far harder than Maoyan's digits-only scheme. If the site you are targeting only obfuscates digits, train on digits alone and accuracy should reach essentially 100%.

 

 

 
