Install Git on Windows, then run the following in Git Bash:
git clone https://github.com/tesseract-ocr/tesstrain
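Inside the tesstrain folder, create a data folder.
Inside data, create a new_tha-ground-truth folder
and put the generated .tif, .box, and .gt.txt files in it.
The three files for each sample must share the same base name:
.tif: the image
.box: the coordinates of the text in the image
.gt.txt: the text shown in the image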
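Because the three files must share a base name, a quick check before training can save a failed run. The following is a minimal sketch, not part of tesstrain; check_ground_truth is a hypothetical helper name, and it only assumes the .tif / .box / .gt.txt naming described above.

```shell
#!/bin/bash
# Sketch: report every .tif that lacks a matching .box or .gt.txt file.
# check_ground_truth is a hypothetical helper, not a tesstrain command.
check_ground_truth() {
    local dir="$1" missing=0 tif base ext
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue        # directory has no .tif files
        base="${tif%.tif}"               # strip the extension
        for ext in box gt.txt; do
            if [ ! -f "$base.$ext" ]; then
                echo "missing: $base.$ext"
                missing=1
            fi
        done
    done
    return "$missing"                    # 0 = complete, 1 = something missing
}

# Usage: check_ground_truth data/new_tha-ground-truth
```

The function prints one line per missing file and returns a non-zero status if anything is missing, so it can be chained in front of the training command.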
Under data, also create a new_tha folder.
The files that go in it are still to be documented.
Command:
make training MODEL_NAME=new_tha START_MODEL=tha
Warning output:
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 18 = ึ
Warning: properties incomplete for index 20 = ุ
Warning: properties incomplete for index 25 = ็
Warning: properties incomplete for index 27 = ิ
Warning: properties incomplete for index 29 = ั
Warning: properties incomplete for index 44 = ี
Warning: properties incomplete for index 49 = ้
Warning: properties incomplete for index 51 = ์
Warning: properties incomplete for index 53 = ื
Warning: properties incomplete for index 55 = ู
Warning: properties incomplete for index 59 = ่
Warning: properties incomplete for index 69 = ๊
Warning: properties incomplete for index 71 = ํ
Warning: properties incomplete for index 74 = ๋
Warning: properties incomplete for index 119 = ~
This means that some characters in the unicharset do not have all of their properties set.
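If the training output has been saved to a file, the flagged characters can be pulled out of it rather than copied by hand. This is a small sketch that assumes the warning lines have exactly the format shown above; flagged_chars is a hypothetical helper name and train.log is an assumed file name.

```shell
#!/bin/bash
# Sketch: read a training log on stdin and print one flagged character per line.
# Assumes lines of the form "Warning: properties incomplete for index N = X".
flagged_chars() {
    sed -n 's/^Warning: properties incomplete for index [0-9][0-9]* = //p'
}

# Usage (train.log is an assumed file name):
#   flagged_chars < train.log
```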
Find the box files that contain the affected characters
and correct those characters in the box files.
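One way to locate the box files for a flagged character is to search the first column of each file, which holds the glyph text. This is a sketch under that assumption; find_boxes_with_char is a hypothetical helper, not a Tesseract tool.

```shell
#!/bin/bash
# Sketch: print the .box files whose glyph column contains a given character.
# find_boxes_with_char is a hypothetical helper, not a Tesseract tool.
find_boxes_with_char() {
    local ch="$1" dir="$2" box
    for box in "$dir"/*.box; do
        [ -e "$box" ] || continue
        # The first whitespace-separated field of a box line is the glyph text;
        # the remaining fields are coordinates and the page number.
        if awk -v ch="$ch" 'index($1, ch) { found = 1; exit } END { exit !found }' "$box"; then
            echo "$box"
        fi
    done
}

# Usage: find_boxes_with_char '~' data/new_tha-ground-truth
```

Searching only the first field avoids false matches against the coordinate columns when the character being looked up is a digit.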
Solution
1. First, extract the character set:
Create a generate_unicharset.sh file in the tesstrain root directory:
#!/bin/bash
# Stop at the first failing command
set -e
# Directory containing the .box files
BOX_DIR="data/new_tha-ground-truth"
MERGED_BOX="all_boxes.box"
UNICHARSET_OUTPUT="unicharset"
# Merge all .box files into one
find "$BOX_DIR" -name '*.box' -exec cat {} + > "$MERGED_BOX"
# Extract the unicharset from the merged .box file
/d/program/Tesseract-OCR/unicharset_extractor --output_unicharset "$UNICHARSET_OUTPUT" "$MERGED_BOX"
Run it in Git Bash:
chmod +x generate_unicharset.sh
./generate_unicharset.sh
Copy the generated unicharset into the langdata_lstm\tha folder of the separate langdata_lstm project.
Then, in Git Bash, run:
$ set_unicharset_properties -U unicharset -O new_unicharset -X tha.config --script_dir ..
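--script_dir ..: the parent directory, which contains the two files Thai.unicharset and Latin.unicharset.
This generates a new_unicharset file.
Rename new_unicharset to unicharset and put it back in tesstrain\data\new_tha.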
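The last two steps can be guarded with a couple of small checks. The sketch below is only an illustration: check_script_dir and install_unicharset are hypothetical helper names, and the paths in the usage comment are assumptions about where the two projects sit relative to each other.

```shell
#!/bin/bash
# check_script_dir and install_unicharset are hypothetical helper names.

# Succeeds only if both script-level unicharset files are present in the directory.
check_script_dir() {
    local dir="$1" f status=0
    for f in Thai.unicharset Latin.unicharset; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Copies the regenerated file into the model directory under the name unicharset.
install_unicharset() {
    local src="$1" dest_dir="$2"
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    mkdir -p "$dest_dir"
    cp "$src" "$dest_dir/unicharset"
}

# Usage (paths are assumptions; adjust to your own layout):
#   check_script_dir ..
#   install_unicharset new_unicharset ../../tesstrain/data/new_tha
```

Running check_script_dir before set_unicharset_properties shows up front whether the directory passed to --script_dir actually holds the two files.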