Tesseract 5 notes: training a new font

Original post. First published 2024-07-29 15:59:29; last modified 2024-07-29 16:01:00.

License: CC 4.0 BY-SA

Tags: #notes

On Windows, install Git, then run the following in Git Bash:
git clone https://github.com/tesseract-ocr/tesstrain

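Inside the cloned tesstrain folder, create a data folder.
Inside data, create a new_tha-ground-truth folder
and put the generated .tif, .box, and .gt.txt files in it.
The files that belong to one sample must share the same base name:
.tif: the image
.box: the coordinates of the characters in the image
.gt.txt: the text shown in the image
Also create a new_tha folder under data.
~~Files to place in it: to be added~~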
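
Because every sample needs three files with the same base name, a quick consistency check before training can save a failed run. The following is a minimal sketch, not part of tesstrain: check_ground_truth is a hypothetical helper name, and it only assumes the layout described above (one .box and one .gt.txt next to each .tif).

```shell
#!/bin/bash
# Hypothetical helper (not part of tesstrain): report every .tif in a
# ground-truth directory that lacks a matching .box or .gt.txt file.
check_ground_truth() {
    local dir="$1" tif base ext missing=0
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue   # the directory holds no .tif files
        base="${tif%.tif}"
        for ext in box gt.txt; do
            if [ ! -f "$base.$ext" ]; then
                echo "missing: $base.$ext"
                missing=1
            fi
        done
    done
    # Exit status 0 when every sample is complete, 1 otherwise.
    return "$missing"
}
```

From the tesstrain root it could be called as check_ground_truth data/new_tha-ground-truth; no output means every image has both companion files.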

Command

make training MODEL_NAME=new_tha START_MODEL=tha

Warnings:

Setting unichar properties
Setting script properties
Warning: properties incomplete for index 18 = ึ
Warning: properties incomplete for index 20 = ุ
Warning: properties incomplete for index 25 = ็
Warning: properties incomplete for index 27 = ิ
Warning: properties incomplete for index 29 = ั
Warning: properties incomplete for index 44 = ี
Warning: properties incomplete for index 49 = ้
Warning: properties incomplete for index 51 = ์
Warning: properties incomplete for index 53 = ื
Warning: properties incomplete for index 55 = ู
Warning: properties incomplete for index 59 = ่
Warning: properties incomplete for index 69 = ๊
Warning: properties incomplete for index 71 = ํ
Warning: properties incomplete for index 74 = ๋
Warning: properties incomplete for index 119 = ~

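These warnings mean that some characters in the unicharset do not have their properties fully set.
To fix a character by hand, find the box files that contain it
and edit the character in those box files.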
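
Locating the box files that contain one of the reported characters can be scripted. This is a rough sketch under stated assumptions: find_boxes_with_char is a hypothetical helper name, GNU grep is assumed (it ships with Git Bash), and the match is a literal search anywhere in a line, so it also finds combining marks that are attached to a base character.

```shell
#!/bin/bash
# Hypothetical helper: list the .box files under a directory that
# mention a given character anywhere in a line.
find_boxes_with_char() {
    local char="$1" dir="$2"
    # -r: recurse, -l: print file names only, -F: literal match.
    # --include is a GNU grep option that restricts the search to .box files.
    grep -rlF --include='*.box' -- "$char" "$dir" | sort
}
```

For example, find_boxes_with_char '~' data/new_tha-ground-truth would list the box files that contain the tilde reported at index 119. Searching for a digit would also match the coordinate columns, so this sketch is only meaningful for non-digit characters.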

Solution

1. First, extract the character set.
Create a generate_unicharset.sh file in the repository root directory:

#!/bin/bash

# Define the directory containing the .box files
BOX_DIR="data/new_tha-ground-truth"
MERGED_BOX="all_boxes.box"
UNICHARSET_OUTPUT="unicharset"
NEW_UNICHARSET_OUTPUT="new_unicharset"
LANG_CONFIG="lang.config"

# Merge all .box files into one
find "$BOX_DIR" -name '*.box' -exec cat {} + > "$MERGED_BOX"

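# Extract the unicharset from the merged .box file.
# The path below is the author's Tesseract install location; adjust it to yours.
/d/program/Tesseract-OCR/unicharset_extractor --output_unicharset "$UNICHARSET_OUTPUT" "$MERGED_BOX"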

Run it in Git Bash:

chmod +x generate_unicharset.sh
./generate_unicharset.sh
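
After the script has run, it can help to see which distinct symbols the merged box file contains and compare them with the characters named in the warnings. A minimal sketch, assuming the usual box line layout where the symbol is the first space-separated field and the coordinates follow; list_box_symbols is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print each distinct symbol that appears in the
# first field of a box file, one per line.
list_box_symbols() {
    # Field 1 of each box line is the symbol; the remaining fields are
    # coordinates and the page number.
    cut -d' ' -f1 "$1" | sort -u
}
```

It could be called as list_box_symbols all_boxes.box. Lines whose symbol is itself whitespace will show up as an empty or blank entry, which this sketch does not handle specially.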

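Copy the generated unicharset into the tha folder of a separate project, langdata_lstm (langdata_lstm\tha).
Then run the following in Git Bash from that folder: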

$ set_unicharset_properties -U unicharset -O new_unicharset -X tha.config --script_dir ..

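--script_dir ..: the parent directory, which contains the two files Thai.unicharset and Latin.unicharset.
The command produces a new_unicharset file.
Rename new_unicharset to unicharset and put it back into tesstrain\data\new_tha.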
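
The final rename-and-copy step can be wrapped so that an existing unicharset is not lost. This is only a sketch: install_unicharset is a hypothetical helper name, and the destination path depends on where tesstrain and langdata_lstm live on your machine.

```shell
#!/bin/bash
# Hypothetical helper: copy a generated unicharset into a model
# directory under the name "unicharset".
install_unicharset() {
    local src="$1" dest_dir="$2"
    mkdir -p "$dest_dir"
    # Keep a copy of any existing unicharset before overwriting it.
    if [ -f "$dest_dir/unicharset" ]; then
        cp "$dest_dir/unicharset" "$dest_dir/unicharset.bak"
    fi
    cp "$src" "$dest_dir/unicharset"
}
```

Called from langdata_lstm\tha, it would take new_unicharset as the first argument and the path to tesstrain's data/new_tha folder as the second.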
