Training a Tesseract OCR sample library and using it

Latest recommended article published on 2025-03-17 13:22:51

beizishaizi

Article tags: c#, android, programming languages

Article link: https://blog.csdn.net/beizishaizi/article/details/130229227

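This article describes how to train a Tesseract OCR sample library on macOS, aimed at recognizing the cross (close button) that appears after a delay in malicious ads inside Android apps. Training a library that recognizes only X speeds up OCR. It also walks through the training process: installing Tesseract and jTessBoxEditor, building and adjusting the samples, and using the trained library for text recognition in an Android app.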

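Preface

This write-up covers two topics:
1. The tools and commands for training a Tesseract OCR sample library
2. Using the trained sample library in an Android application
The goal is to quickly recognize the cross (close button) that appears after a delay in malicious ads. The approach is simply to treat the cross as the letter X: every training image is a cross taken from a real app, and in the end only X has to be recognized. A sample library trained this way makes OCR recognition much faster.
The same approach can also compensate for the weakness of an ad-SDK whitelist. An ad view usually carries one of a few keywords such as "广告" (Chinese for "advertisement"), "ad", or "Ad"; combined with network traffic as the trigger for when to check, this is enough to decide whether the current view is an ad view.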
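
As a rough illustration of the keyword check described above, the sketch below tests whether text recognized from a view contains one of the ad markers. The class and method names are hypothetical. Matching the Latin marker as a whole token and ignoring case is an assumption made here so that words such as "download" do not count as a hit; the traffic-based trigger is left out.

```java
import java.util.Locale;

final class AdKeywordCheck {
    private AdKeywordCheck() {}

    // "\u5e7f\u544a" is the Chinese word for "advertisement".
    private static final String CHINESE_AD_MARKER = "\u5e7f\u544a";

    // Returns true when the OCR text carries one of the ad markers.
    // Assumption: "ad" / "Ad" must appear as a separate token.
    static boolean looksLikeAd(String ocrText) {
        if (ocrText == null) {
            return false;
        }
        if (ocrText.contains(CHINESE_AD_MARKER)) {
            return true;
        }
        for (String token : ocrText.trim().split("\\s+")) {
            if (token.toLowerCase(Locale.ROOT).equals("ad")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeAd("Ad"));
        System.out.println(looksLikeAd("\u5e7f\u544a"));
        System.out.println(looksLikeAd("download now"));
    }
}
```

A real check would probably need tuning against the labels that actually appear in the target apps.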

I. Training the Tesseract OCR sample library

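There is plenty of documentation on this topic; searching for the keywords in the title will turn it up. The following records the process on macOS.
Step 1: Download and install the Tesseract OCR engine and jTessBoxEditor
Because samples have to be trained, I first tried brew install --with-training-tools tesseract (if no training is needed, brew install tesseract is enough). The installation failed with a message that the --with-training-tools option is unknown.
Following the official documentation:
https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos
run the commands below to install: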

# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc

git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
# The configure step has to be run once more here (from inside build/, so the script is ../configure).
../configure
make training
sudo make training-install

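After installation, also download eng.traineddata (https://github.com/tesseract-ocr/tessdata) and copy it into /usr/local/share/tessdata. This directory may differ from machine to machine; the installation output indicates the actual location.

Download jTessBoxEditor; see https://tesseract-ocr.github.io/tessdoc/AddOns.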

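Step 2: Train the samples
Create a folder named xunlian that contains the sample images and a subfolder named new.
The sample images need to be converted to TIFF; an online converter such as https://onlineconvertfree.com/zh/convert/jpeg/ can be used.
Steps:

Run java -jar jTessBoxEditor.jar. Once it starts, choose Tools -> Merge TIFF from the menu, select the sample images (both TIFF files and the original images seem to work), and save the merged file into the new folder. Pay attention to the file name when saving: sll.normal.exp0.tif (the commands below use sll.normal.exp1.tif, so keep the name consistent across all steps). The meaning of each part of the name is explained in the official documentation.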
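
For reference, Tesseract's legacy training workflow names training images [lang].[fontname].exp[num].tif, so in sll.normal.exp0.tif the language is sll, the font name is normal, and the page number is 0. The sketch below only splits such a name into its parts; the class and method names are hypothetical and not part of any Tesseract tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TrainingImageName {
    // Pattern for [lang].[fontname].exp[num].tif
    private static final Pattern NAME =
            Pattern.compile("^([^.]+)\\.([^.]+)\\.exp(\\d+)\\.tif$");

    final String lang;
    final String fontName;
    final int expNumber;

    private TrainingImageName(String lang, String fontName, int expNumber) {
        this.lang = lang;
        this.fontName = fontName;
        this.expNumber = expNumber;
    }

    // Parses a training image file name or throws if it does not follow the scheme.
    static TrainingImageName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a training image name: " + fileName);
        }
        return new TrainingImageName(m.group(1), m.group(2), Integer.parseInt(m.group(3)));
    }

    public static void main(String[] args) {
        TrainingImageName name = parse("sll.normal.exp0.tif");
        System.out.println(name.lang + " / " + name.fontName + " / " + name.expNumber);
    }
}
```

The font name part matters later because it has to match the entry in font_properties.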
Training commands:

../../tesseract-5.3.1/build/tesseract sll.normal.exp1.tif sll.normal.exp1 --psm 7 -l eng batch.nochop makebox
../../tesseract-5.3.1/build/tesseract sll.normal.exp1.tif sll.normal.exp1 --psm 7 nobatch box.train
../../tesseract-5.3.1/build/unicharset_extractor sll.normal.exp1.box
../../tesseract-5.3.1/build/shapeclustering -F font_properties -U unicharset sll.normal.exp1.tr
../../tesseract-5.3.1/build/mftraining -F font_properties -U unicharset -O unicharset sll.normal.exp1.tr
../../tesseract-5.3.1/build/cntraining sll.normal.exp1.tr
../../tesseract-5.3.1/build/combine_tessdata normal.

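After the first command, the generated box file may need adjusting. In jTessBoxEditor, choose Box Editor -> Open, open sll.normal.exp1.tif, and insert or delete boxes and correct their coordinates as needed.
In addition, a font_properties file has to be created beforehand, with the content normal 0 0 0 0 0. Here normal corresponds to the normal in sll.normal.exp1.tif and presumably denotes the font.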
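
In the legacy training format, each line of font_properties has the form fontname italic bold fixed serif fraktur, with every flag being 0 or 1, so normal 0 0 0 0 0 declares a font called normal with no special properties. The sketch below builds and writes such a line; the class and method names are hypothetical, and creating the file by hand in a text editor works just as well.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FontProperties {
    private FontProperties() {}

    // Builds one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
    static String line(String fontName, boolean italic, boolean bold,
                       boolean fixed, boolean serif, boolean fraktur) {
        return String.join(" ", fontName, flag(italic), flag(bold),
                flag(fixed), flag(serif), flag(fraktur));
    }

    private static String flag(boolean value) {
        return value ? "1" : "0";
    }

    // Writes a single-font font_properties file with all flags set to 0.
    static Path write(Path dir, String fontName) throws IOException {
        String content = line(fontName, false, false, false, false, false) + "\n";
        return Files.write(dir.resolve("font_properties"),
                content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(line("normal", false, false, false, false, false));
    }
}
```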

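Also note the --psm 7 added to the first and second commands: without it, the generated files are 0 bytes in size and the tool keeps reporting an "Empty page!!" error.
Before running the last command, the five generated files have to be renamed by prefixing each name with "normal.". The screenshot below shows the result; every file in it except normal.traineddata is one of the renamed files.
[Screenshot: the training output files with the normal. prefix]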
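
The renaming step can be sketched as follows. The file list is an assumption: it contains the usual outputs of the legacy tools (unicharset, shapetable, inttemp, pffmtable, normproto), which may not be exactly the five files referred to above. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class PrefixTrainingFiles {
    private PrefixTrainingFiles() {}

    // Assumed outputs of unicharset_extractor, shapeclustering, mftraining and cntraining.
    static final String[] OUTPUTS =
            {"unicharset", "shapetable", "inttemp", "pffmtable", "normproto"};

    // Renames each existing output file to "<prefix>.<name>" and returns the new paths.
    static List<Path> addPrefix(Path dir, String prefix) throws IOException {
        List<Path> renamed = new ArrayList<>();
        for (String name : OUTPUTS) {
            Path source = dir.resolve(name);
            if (Files.exists(source)) {
                Path target = dir.resolve(prefix + "." + name);
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
                renamed.add(target);
            }
        }
        return renamed;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PrefixTrainingFiles <training-dir>; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path path : addPrefix(dir, "normal")) {
            System.out.println("renamed to " + path.getFileName());
        }
    }
}
```

After this, combine_tessdata normal. picks up every file that starts with the normal. prefix.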
Finally, copy the generated normal.traineddata into /usr/local/share/tessdata and verify it.

% sudo cp normal.traineddata /usr/local/share/tessdata
% ../../tesseract-5.3.1/build/tesseract -l normal --psm 7 cha1.png stdout
x
% ../../tesseract-5.3.1/build/tesseract -l normal --psm 7 cha10.png stdout
X

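II. Using the sample library

This part describes how to use the trained sample library in an Android application.
I followed the process described at https://juejin.cn/post/7209882068636172349.
However, it failed at runtime because the required .so files were missing.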

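Build Tesseract4Android (https://github.com/adaptech-cz/Tesseract4Android) to produce an APK, then extract the .so files from the APK.
Add the dependency:
In build.gradle, add implementation "com.rmtheis:tess-two:8.0.0", together with
android {
    defaultConfig {
        …
        ndk {
            abiFilters 'arm64-v8a', 'armeabi-v7a'
        }
    }
}
Place the generated sample library in the assets directory and copy it to the sdcard directory.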

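import android.content.Context;
import android.os.Environment;
import android.util.Log;

import com.googlecode.tesseract.android.TessBaseAPI;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class TessOcr {
    // Tesseract expects the file at <mDataPath>/tessdata/normal.traineddata.
    public String mDataPath = Environment.getExternalStorageDirectory().getAbsolutePath();
    public String mFilePath = mDataPath + File.separator + "tessdata" + File.separator + "normal.traineddata";

    // Copies normal.traineddata from assets into the tessdata directory, overwriting any old copy.
    private void copyFile(Context context) {
        File file = new File(mFilePath);
        File parent = file.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs();
        }
        // FileOutputStream truncates an existing file, so no explicit delete is needed.
        // try-with-resources closes both streams even if the copy fails.
        try (InputStream is = context.getAssets().open("normal.traineddata");
             OutputStream os = new FileOutputStream(file)) {
            byte[] buffer = new byte[1024];
            int len;
            while ((len = is.read(buffer)) != -1) {
                os.write(buffer, 0, len);
            }
            os.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }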
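
Since copyFile is invoked on every ocridentify call, the whole file is rewritten each time. If that overhead matters, one option is to copy only when the target is missing or empty. The sketch below shows the idea with plain java.nio; the class and method names are hypothetical, the InputStream on Android would come from context.getAssets().open("normal.traineddata"), and java.nio.file is assumed to be available (Android API 26 or later).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TrainedDataInstaller {
    private TrainedDataInstaller() {}

    // Copies source to target only when target is missing or empty.
    // Returns true if a copy was made. Trade-off: an updated traineddata
    // with the same name is not picked up until the old file is removed.
    static boolean copyIfAbsent(InputStream source, Path target) throws IOException {
        if (Files.isRegularFile(target) && Files.size(target) > 0) {
            return false;
        }
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tessdemo");
        Path target = dir.resolve("tessdata").resolve("normal.traineddata");
        byte[] fake = {1, 2, 3}; // stand-in bytes, not real traineddata
        System.out.println(copyIfAbsent(new ByteArrayInputStream(fake), target));
        System.out.println(copyIfAbsent(new ByteArrayInputStream(fake), target));
    }
}
```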

tess-two initialization and image processing

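    // Runs OCR on one image file and returns the text with spaces removed, in lower case.
    public String ocridentify(Context context, File imageFile) {
        copyFile(context);
        TessBaseAPI baseApi = new TessBaseAPI();
        Log.i("result", mDataPath);
        // mDataPath must contain a "tessdata" directory holding normal.traineddata.
        if (!baseApi.init(mDataPath, "normal")) {
            Log.e("result", "TessBaseAPI.init failed for " + mDataPath);
            return "";
        }
        try {
            // Equivalent to --psm 7 on the command line; without it nothing is recognized.
            baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
            Log.i("result time1:", Long.toString(System.currentTimeMillis()));
            baseApi.setImage(imageFile);
            Log.i("result time2:", Long.toString(System.currentTimeMillis()));
            String result = baseApi.getUTF8Text().replace(" ", "").toLowerCase();
            Log.i("result time3:", Long.toString(System.currentTimeMillis()));
            Log.i("result", result);
            return result;
        } finally {
            baseApi.end(); // release native resources
        }
    }
}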
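
Because the sample library only ever produces X, the caller ends up asking one question: is the recognized text a cross? A minimal sketch of that final check is below. The class and method names are hypothetical, and stripping all whitespace (not just spaces) is an assumption that goes slightly beyond the replace(" ", "") used above.

```java
import java.util.Locale;

final class CrossDetector {
    private CrossDetector() {}

    // Removes all whitespace and lower-cases the OCR output; null becomes "".
    static String normalize(String rawOcrText) {
        if (rawOcrText == null) {
            return "";
        }
        return rawOcrText.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // True when the recognized text is exactly one "x" after normalization.
    static boolean isCross(String rawOcrText) {
        return normalize(rawOcrText).equals("x");
    }

    public static void main(String[] args) {
        System.out.println(isCross("X"));
        System.out.println(isCross(" x\n"));
        System.out.println(isCross("ad"));
    }
}
```

With this helper, the return value of ocridentify could be passed straight to isCross to decide whether the view contains a close button.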