Training a Model with Tesseract + jTessBoxEditor to Recognize CAPTCHAs Automatically

LLLL96

Last modified: 2024-08-30 15:07:09

Tags: go, python, image processing, computer vision, java

First published: 2024-08-30 15:00:53

Article link: https://blog.csdn.net/jhgj56/article/details/141716083

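Preparation

Installing Tesseract

Official site: https://github.com/tesseract-ocr/tesseract

Language data: https://github.com/tesseract-ocr/tessdata

Configure the Tesseract environment variable (add the install directory to PATH).

Configure the language data environment (point TESSDATA_PREFIX at the tessdata folder).

Verify that Tesseract was installed successfully (for example, run tesseract --version).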
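
The setup can also be checked from a script. The following is only a minimal sketch: it assumes the executable is called tesseract and that the language data location is exposed through the TESSDATA_PREFIX environment variable; the helper name check_tesseract is made up for illustration.

```python
import os
import shutil
import subprocess


def check_tesseract(executable="tesseract"):
    """Return a dict describing the local Tesseract setup (hypothetical helper)."""
    path = shutil.which(executable)  # None when the executable is not on PATH
    info = {
        "path": path,
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
        "version": None,
    }
    if path is not None:
        # `tesseract --version` prints a version banner; some builds write it
        # to stderr, so both streams are checked.
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        output = (result.stdout or result.stderr).strip()
        info["version"] = output.splitlines()[0] if output else None
    return info


if __name__ == "__main__":
    for key, value in check_tesseract().items():
        print(f"{key}: {value}")
```

If path comes back as None, the PATH entry is missing; if tessdata_prefix is None, Tesseract falls back to its built-in default data directory.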

Installing jTessBoxEditor

Repository: nguyenq/jTessBoxEditor on GitHub (Box editor and trainer for Tesseract OCR)

Downloading the CAPTCHAs

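package captcha

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"testing"
	"time"
)

// TestImagegenerateImage downloads one CAPTCHA image per second.
func TestImagegenerateImage(t *testing.T) {
	ticker := time.NewTicker(1 * time.Second) // fires once per second
	defer ticker.Stop()

	// Download 25 CAPTCHA images, one per tick.
	for i := 0; i < 25; i++ {
		<-ticker.C // wait for the next tick
		timestamp := time.Now().Unix()
		url := "http://aaa.bbb.com" // placeholder for the CAPTCHA endpoint
		// Name each file after the current Unix timestamp.
		imageName := fmt.Sprintf("%d.jpg", timestamp)
		if _, err := GetImages(url, imageName); err != nil {
			t.Errorf("failed to download %s: %v", imageName, err)
		}
	}
	// GetImages is synchronous, so there is nothing left to wait for here.
}

// GetImages sends a GET request to url, saves the response body as
// D:\images\<filename>, and returns the image bytes.
func GetImages(url, filename string) ([]byte, error) {
	client := &http.Client{}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to create GET request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("failed to send GET request: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("bad status: %s", resp.Status)
	}

	imageData, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("failed to read response body: %w", err)
	}

	// os.Create truncates an existing file, so no separate remove is needed.
	path := filepath.Join("D:\\images", filename)
	out, err := os.Create(path)
	if err != nil {
		return nil, fmt.Errorf("failed to create file: %w", err)
	}
	defer out.Close()

	if _, err = out.Write(imageData); err != nil {
		return nil, fmt.Errorf("failed to write to file: %w", err)
	}

	return imageData, nil
}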

Slicing the CAPTCHAs with Python


import os

import cv2
import numpy as np
from PIL import Image

# Raw strings keep the backslashes in Windows paths literal.
input_folder = r'D:\images'
output_folder = r'D:\images\split'


class SplitImages:
    @staticmethod
    def splitall():
        os.makedirs(output_folder, exist_ok=True)

        # Iterate over every image file in the input folder.
        for filename in os.listdir(input_folder):
            if filename.lower().endswith(('.jpg', '.png', '.bmp')):
                img_path = os.path.join(input_folder, filename)
                # Convert to RGB so every pixel has exactly three channels.
                img = Image.open(img_path).convert('RGB')

                # (left, upper, right, lower) box of each of the four characters.
                boxes = [
                    (15, 2, 30, 19),
                    (30, 2, 45, 19),
                    (45, 2, 60, 19),
                    (60, 2, 75, 19),
                ]
                name, ext = os.path.splitext(filename)
                # Crop each character and save it as its own image.
                for i, box in enumerate(boxes):
                    cropped_img = np.array(img.crop(box))
                    # Build a mask of the near-black pixels,
                    # i.e. colors close to (0, 0, 0).
                    mask = cv2.inRange(cropped_img, (0, 0, 0), (15, 15, 15))  # adjust the threshold as needed
                    # Replace the black pixels with white.
                    cropped_img[mask > 0] = [255, 255, 255]
                    # Convert to grayscale (arrays from PIL are RGB, not BGR).
                    cropped_img = cv2.cvtColor(cropped_img, cv2.COLOR_RGB2GRAY)
                    output_path = os.path.join(output_folder, f"{name}_{i + 1}{ext}")
                    Image.fromarray(cropped_img).save(output_path)


if __name__ == '__main__':
    SplitImages.splitall()
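
The four crop boxes in the script are evenly spaced: each is 15 pixels wide, starts 15 pixels from the left edge, and spans rows 2 to 19. As a small sketch, they could be generated instead of typed by hand. The function name char_boxes and its parameters are hypothetical, and the defaults only fit this particular CAPTCHA layout.

```python
def char_boxes(left=15, top=2, width=15, height=17, count=4):
    """Return (left, upper, right, lower) crop boxes for `count` characters
    laid out side by side, in the format expected by PIL's Image.crop."""
    return [
        (left + i * width, top, left + (i + 1) * width, top + height)
        for i in range(count)
    ]


if __name__ == "__main__":
    for box in char_boxes():
        print(box)
```

For a CAPTCHA with a different size or character count, only the arguments need to change.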

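Calibrating the model with jTessBoxEditor

Click Tools -> Merge TIFF.

Browse to the folder with the sliced images, set the file type to All Image Files, select all the images, and click Open.

Enter num.font.exp0.tif as the file name and click Save.

A num.font.exp0.tif file is now generated in the split folder.

Open cmd in the split folder.

Run the command: tesseract num.font.exp0.tif num.font.exp0 --psm 10 batch.nochop makebox

Note: do not add -l num on this first run; recognition uses the default eng language data.

Open the result of the first recognition pass.

Correct the recognition results by hand.

Click through to the next page.

Remember to save after making changes.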
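
Before training, the saved box file can be sanity-checked with a short script. In Tesseract's box file format each line is "symbol left bottom right top page". The sketch below parses those lines and lists entries whose symbol is not a digit; that the CAPTCHA contains only digits is an assumption based on the model name num, and the helper names are hypothetical.

```python
import os
import string
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse 'symbol left bottom right top page' lines from a box file."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        symbol, *numbers = parts
        boxes.append(Box(symbol, *map(int, numbers)))
    return boxes


def unexpected_symbols(boxes, allowed=string.digits):
    """Return the boxes whose symbol is outside the allowed character set.

    The digits-only default is an assumption; pass another set if needed."""
    allowed_set = set(allowed)
    return [box for box in boxes if box.symbol not in allowed_set]


if __name__ == "__main__":
    box_path = "num.font.exp0.box"
    if os.path.exists(box_path):
        with open(box_path, encoding="utf-8") as f:
            for box in unexpected_symbols(parse_box_lines(f)):
                print(f"page {box.page}: unexpected symbol {box.symbol!r}")
```

Any page it reports is worth revisiting in the box editor before moving on.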

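Training the model

Create a new folder and manually create a font_properties file in it.

First create font_properties.txt, write font 0 0 0 0 0 in it, then remove the .txt extension.

Here font must match the font part of num.font.exp0.tif; if you change it in the file name, change it here as well.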
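
To keep the two names in sync, the font_properties line can be derived from the TIFF file name. This is a minimal sketch; the function names are hypothetical, and it assumes the file name follows the lang.font.expN.tif pattern used above.

```python
import os


def font_properties_line(tif_name):
    """Build the font_properties line for a <lang>.<font>.exp<N>.tif file."""
    parts = os.path.basename(tif_name).split(".")
    if len(parts) != 4 or not parts[2].startswith("exp"):
        raise ValueError(f"expected <lang>.<font>.exp<N>.tif, got {tif_name!r}")
    font = parts[1]
    # The five flags are italic, bold, fixed, serif, fraktur; all 0 here,
    # matching the line used in the article.
    return f"{font} 0 0 0 0 0"


def write_font_properties(directory, tif_name):
    """Write the extension-less font_properties file and return its path."""
    path = os.path.join(directory, "font_properties")
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(font_properties_line(tif_name) + "\n")
    return path


if __name__ == "__main__":
    print(font_properties_line("num.font.exp0.tif"))
```

Writing the file from a script also avoids the rename step, since no .txt extension is ever added.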

Creating the batch script

First create a do.txt file by hand, then save it as do.bat.

echo Run Tesseract for Training.. 
tesseract.exe num.font.exp0.tif num.font.exp0 --psm 10 nobatch box.train 


echo Compute the Character Set.. 
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr


echo Clustering.. 
cntraining.exe num.font.exp0.tr
echo Rename Files.. 
rename normproto num.normproto 
rename inttemp num.inttemp 
rename pffmtable num.pffmtable 
rename shapetable num.shapetable  

echo Create Tessdata.. 
combine_tessdata.exe num. 

echo. & pause

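Copy num.font.exp0.tif and num.font.exp0.box from the split folder into the same folder as font_properties.

Double-click do.bat to produce num.traineddata, which is our own trained language data.

Place num.traineddata in the Tesseract language data (tessdata) directory configured earlier; after that you can pass -l num when calling tesseract.
For example: tesseract -l num --psm 10 1724997839_1.jpg 1.txt (Tesseract appends .txt to the output base, so this command writes the result to 1.txt.txt.)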
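
The same call can be made from Python to read a whole CAPTCHA from its four slices. This is a sketch under a few assumptions: tesseract is on PATH, num.traineddata is installed, and the helper names are made up for illustration. It passes stdout as the output base so that Tesseract prints the text instead of writing a file.

```python
import subprocess
import sys


def build_command(image_path, lang="num", psm=10):
    """Build the tesseract command line for one single-character image."""
    # --psm 10 treats the image as a single character.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def recognize(image_path, lang="num", psm=10):
    """Run tesseract on one slice and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def recognize_captcha(slice_paths, lang="num", psm=10):
    """Concatenate the results of all slices, in the order given."""
    return "".join(recognize(path, lang, psm) for path in slice_paths)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # e.g. python recognize.py 1724997839_1.jpg 1724997839_2.jpg ...
        print(recognize_captcha(sys.argv[1:]))
```

The slices must be passed in left-to-right order, which the _1 to _4 suffixes from the slicing script make easy to sort.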

Training multiple times

Repeat the model calibration steps above.

Put the generated num.font.exp1.tif and num.font.exp1.box in the same directory as do.bat.

Modify do.bat.

Example:
tesseract.exe num.font.exp0.tif num.font.exp0 -l num --psm 10 nobatch box.train 

tesseract.exe num.font.exp1.tif num.font.exp1 -l num --psm 10 nobatch box.train 

unicharset_extractor.exe num.font.exp0.box num.font.exp1.box

mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr num.font.exp1.tr

cntraining.exe num.font.exp0.tr num.font.exp1.tr
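
As the number of rounds grows, these lines become tedious to edit by hand. The following sketch generates them for any number of exp files; the function name training_commands is hypothetical, and it reproduces only the lines shown in the example above.

```python
def training_commands(lang="num", font="font", rounds=2):
    """Return the do.bat lines that change when training on exp0..exp<rounds-1>."""
    bases = [f"{lang}.{font}.exp{i}" for i in range(rounds)]
    # One box.train pass per TIFF, using the previously trained language data.
    lines = [
        f"tesseract.exe {base}.tif {base} -l {lang} --psm 10 nobatch box.train"
        for base in bases
    ]
    box_files = " ".join(f"{base}.box" for base in bases)
    tr_files = " ".join(f"{base}.tr" for base in bases)
    lines.append(f"unicharset_extractor.exe {box_files}")
    lines.append(
        f"mftraining -F font_properties -U unicharset -O {lang}.unicharset {tr_files}"
    )
    lines.append(f"cntraining.exe {tr_files}")
    return lines


if __name__ == "__main__":
    print("\n".join(training_commands(rounds=2)))
```

The rename and combine_tessdata lines of the original do.bat are not generated here; presumably they stay as they are.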
