tesseract ocr + vs2012 + win10 + c++

cwj_sunshine

Article link: https://blog.csdn.net/jbh_sunshine/article/details/102679742

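I. Preparing Tesseract OCR

(1) Installing tesseract

Installer
Look for the tesseract ocr download in my resources; it contains the installer and the library files. Click Next through the installer to finish. My install directory is C:\Program Files (x86)\Tesseract-OCR\setup
Configure the environment variables
Add C:\Program Files (x86)\Tesseract-OCR\setup to Path in both the user variables and the system variables
Also add a system variable TESSDATA_PREFIX set to C:\Program Files (x86)\Tesseract-OCR\setup, without a trailing semicolon
Verify the installation
Run the tesseract command in cmd. If the following output appears, the installation is correct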
Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile…]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.
Download location for the language data files
Link
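
The environment-variable step above can also be checked from code before calling into the library. The following is a minimal sketch, not part of the original setup; `readEnv` and `looksLikeValidPrefix` are hypothetical helper names, and the check only covers what the text states (the variable is set and does not end with a semicolon).

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of an environment variable,
// or an empty string when the variable is not set.
std::string readEnv(const char* name)
{
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string();
}

// Hypothetical helper: true when the value is non-empty and does not end
// with ';' (the text above says TESSDATA_PREFIX must not carry a semicolon).
bool looksLikeValidPrefix(const std::string& value)
{
    return !value.empty() && value[value.size() - 1] != ';';
}
```

A caller would test `looksLikeValidPrefix(readEnv("TESSDATA_PREFIX"))` and print a hint when it returns false.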

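(2) Installing jTessBoxEditor, the training tool

My resources contain the jTessBoxEditor installer and the installer for the Java virtual machine it runs on, jdk-8u231-windows-i586.exe
https://download.csdn.net/download/jbh_sunshine/11896467
Note: the JVM install path must not be changed afterwards, and no folder along that path may be renamed, otherwise jTessBoxEditor will no longer open. On first use, after installing the JVM, run train.bat; after that jTessBoxEditor can be opened by double-clicking jTessBoxEditor.jar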

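II. Training a language file

Training images must follow a naming rule of the form num.font.exp0, where font is the font name and num is the name of the language file that will be generated.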
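
As an illustration of the naming rule, the sketch below splits a file name such as num.font.exp0.tif into its parts. `TrainingImageName` and `parseTrainingName` are hypothetical names introduced here, and the parser assumes the pattern is exactly name.font.expN with an optional extension.

```cpp
#include <string>

// Hypothetical result type for a parsed training image name.
struct TrainingImageName {
    std::string lang;  // name of the language file to generate, e.g. "num"
    std::string font;  // font name, e.g. "font"
    int exp;           // index after "exp", e.g. 0
    bool valid;        // false when the name does not match the pattern
};

// Hypothetical helper: parses "<lang>.<font>.exp<N>[.<ext>]".
TrainingImageName parseTrainingName(const std::string& fileName)
{
    TrainingImageName r;
    r.exp = -1;
    r.valid = false;

    const std::string::size_type d1 = fileName.find('.');
    if (d1 == std::string::npos) return r;
    const std::string::size_type d2 = fileName.find('.', d1 + 1);
    if (d2 == std::string::npos) return r;
    const std::string::size_type d3 = fileName.find('.', d2 + 1);  // extension dot, optional

    const std::string expPart = (d3 == std::string::npos)
        ? fileName.substr(d2 + 1)
        : fileName.substr(d2 + 1, d3 - d2 - 1);
    if (expPart.size() <= 3 || expPart.compare(0, 3, "exp") != 0) return r;

    // Everything after "exp" must be decimal digits.
    int n = 0;
    for (std::string::size_type i = 3; i < expPart.size(); ++i) {
        if (expPart[i] < '0' || expPart[i] > '9') return r;
        n = n * 10 + (expPart[i] - '0');
    }

    r.lang = fileName.substr(0, d1);
    r.font = fileName.substr(d1 + 1, d2 - d1 - 1);
    r.exp = n;
    r.valid = !r.lang.empty() && !r.font.empty();
    return r;
}
```

Checking names this way before training catches files like 1.jpg, which would otherwise fail later in the pipeline.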
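Recognize the text in the image and generate a txt file
Open a cmd window in the image's directory and run tesseract num.font.exp0.tif 222 -l eng to generate 222.txt, which contains the recognition result. If the result is accurate enough, no training is needed and the bundled eng language data can be used directly.
If the error read_params_file: Can't open eng appears, the command is malformed: the -l flag is missing. It should be tesseract num.font.exp0.tif 222 -l num or tesseract cnf_chi_tra.font.exp0.tif 1111 -l chi_tra -psm 7 nobatch (-psm 7 tells tesseract that the image, e.g. code.jpg, contains a single line of text, which can reduce the error rate; the default is 3)
Generate the .box file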
tesseract num.font.exp0.tif num.font.exp0 -l eng batch.nochop makebox
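Correct the recognized characters
Open the num.font.exp0.box file generated above in jTessBoxEditor (Box Editor > Open, then select the image that belongs to the box file), correct the characters one by one, and click Save. The box file then contains the corrected recognition result
If no character at all was recognized in the image, the box file cannot be edited in jTessBoxEditor (clicking "Insert" shows: please select the box to insert after. Since the box file is empty, there is nothing to select). In that case, right-click the box file, open it in Notepad, and write a line such as P 100 100 1000 1000 0. The first column of each line is the character, the following columns are its coordinates in the order left, bottom, right, top (measured from the bottom-left corner of the image), and the last column is the image (page) index, counted from 0, so the n-th image has index n-1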
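
To make the box line layout concrete, here is a small sketch that parses and formats one line. Tesseract box files store one entry per line as symbol, left, bottom, right, top, page; `BoxEntry`, `parseBoxLine` and `formatBoxLine` are hypothetical names, and the sketch assumes the symbol contains no spaces.

```cpp
#include <sstream>
#include <string>

// Hypothetical type for one line of a .box file.
struct BoxEntry {
    std::string symbol;  // the character (UTF-8)
    int left;            // coordinates use a bottom-left origin
    int bottom;
    int right;
    int top;
    int page;            // 0-based image index
};

// Hypothetical helper: parses "P 100 100 1000 1000 0"; returns false on a malformed line.
bool parseBoxLine(const std::string& line, BoxEntry& out)
{
    std::istringstream in(line);
    BoxEntry e;
    if (!(in >> e.symbol >> e.left >> e.bottom >> e.right >> e.top >> e.page)) {
        return false;
    }
    out = e;
    return true;
}

// Hypothetical helper: formats an entry back into a box-file line.
std::string formatBoxLine(const BoxEntry& e)
{
    std::ostringstream os;
    os << e.symbol << ' ' << e.left << ' ' << e.bottom << ' '
       << e.right << ' ' << e.top << ' ' << e.page;
    return os.str();
}
```

With these, a hand-written line can be validated before it is pasted into an empty box file.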
Generate the .tr file
tesseract num.font.exp0.tif num.font.exp0 nobatch box.train
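Running the command above on the command line generates a .txt and a .tr file in the current directory; the .tr file is needed in later steps.
At this step I once got no .tr file. Searching turned up this suggestion: font_properties is not a txt file, must not have a file extension, and should not be created by hand. Instead, run echo font 0 0 0 0 0 >font_properties in cmd and then repeat this step, and the required file should be generated. That did not solve my problem, though
Later I found that the image itself was the cause

This image did not produce a .tr file, but after cropping the "pass" region smaller, as in the figure below, the .tr file was generated. The exact reason is unknown
Generate the character set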
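
The font_properties file mentioned above holds one line per font: the font name followed by five 0/1 flags, which in Tesseract's training format stand for italic, bold, fixed, serif and fraktur. As a sketch, the line can be produced from code instead of echo; `FontProperties`, `fontPropertiesLine` and `writeFontProperties` are hypothetical names.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical type describing one font_properties entry.
struct FontProperties {
    std::string name;
    bool italic;
    bool bold;
    bool fixed;
    bool serif;
    bool fraktur;
};

// Hypothetical helper: formats "<name> <italic> <bold> <fixed> <serif> <fraktur>".
std::string fontPropertiesLine(const FontProperties& f)
{
    std::ostringstream os;
    os << f.name << ' ' << (f.italic ? 1 : 0) << ' ' << (f.bold ? 1 : 0) << ' '
       << (f.fixed ? 1 : 0) << ' ' << (f.serif ? 1 : 0) << ' ' << (f.fraktur ? 1 : 0);
    return os.str();
}

// Hypothetical helper: writes the line to a file; pass a path without an
// extension, e.g. "font_properties", as the text above requires.
bool writeFontProperties(const std::string& path, const FontProperties& f)
{
    std::ofstream out(path.c_str());
    if (!out) return false;
    out << fontPropertiesLine(f) << '\n';
    return out.good();
}
```

For the font used in this article, all flags are 0, which gives the same line as the echo command.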
unicharset_extractor num.font.exp0.box
Generate the inttemp (shape prototypes), shapetable and pffmtable (expected number of features per character) files
mftraining -U unicharset -O num.unicharset num.font.exp0.tr
Generate the normproto file
cntraining num.font.exp0.tr
Rename the files
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable
Combine everything into num.traineddata
combine_tessdata num.
A new language file has been generated successfully only if the Offset entries 2, 4, 5 and 6 in the output are not -1
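
The command sequence above can be collected in one place, which helps when repeating the training for another name or font. The sketch below only builds the command strings exactly as they appear in this section and does not run them; `trainingCommands` is a hypothetical helper, and correcting the box file in jTessBoxEditor remains a manual step between the first and second command.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the training commands from this section,
// in order, for the image <lang>.<font>.exp0.tif.
std::vector<std::string> trainingCommands(const std::string& lang, const std::string& font)
{
    const std::string base = lang + "." + font + ".exp0";
    std::vector<std::string> cmds;
    // 1. Generate the .box file (correct it in jTessBoxEditor afterwards).
    cmds.push_back("tesseract " + base + ".tif " + base + " -l eng batch.nochop makebox");
    // 2. Generate the .tr file.
    cmds.push_back("tesseract " + base + ".tif " + base + " nobatch box.train");
    // 3. Extract the character set.
    cmds.push_back("unicharset_extractor " + base + ".box");
    // 4. Generate inttemp, shapetable and pffmtable.
    cmds.push_back("mftraining -U unicharset -O " + lang + ".unicharset " + base + ".tr");
    // 5. Generate normproto.
    cmds.push_back("cntraining " + base + ".tr");
    // 6. Prefix the generated files with the language name.
    cmds.push_back("rename normproto " + lang + ".normproto");
    cmds.push_back("rename inttemp " + lang + ".inttemp");
    cmds.push_back("rename pffmtable " + lang + ".pffmtable");
    cmds.push_back("rename shapetable " + lang + ".shapetable");
    // 7. Combine into <lang>.traineddata.
    cmds.push_back("combine_tessdata " + lang + ".");
    return cmds;
}
```

Printing the returned strings for `trainingCommands("num", "font")` reproduces the commands used in this article, ready to paste into cmd.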
Problems I ran into
a. Failed to load font_properties from font_properties
https://blog.csdn.net/dragoo1/article/details/8439272 and https://blog.csdn.net/sinat_28891771/article/details/71440547?locationNum=5&fps=1
b. jTessBoxEditor cannot save or cannot merge tif files
Check whether the image folder requires administrator rights, and run jTessBoxEditor as administrator

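III. Recognizing text with Tesseract OCR in VS

Download the files from the link above; they include the required .lib and .h files. Add them to the project (note: the library files must match the version installed on your machine, otherwise errors will occur)
The code is as follows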

#include <io.h>
#include <stdio.h>
#include <string.h>
#include <windows.h>
#include <atlstr.h>                      // CString, CStringA
#include <tesseract/baseapi.h>
#include <tesseract/strngs.h>
#include <leptonica/allheaders.h>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>

// Convert a wide-character string (wchar_t*) to the ANSI code page (char*).
// The caller releases the result with delete[].
char* UnicodeToAnsi(const wchar_t* szStr)
{
    int nLen = WideCharToMultiByte(CP_ACP, 0, szStr, -1, NULL, 0, NULL, NULL);
    if (nLen == 0)
    {
        return NULL;
    }
    char* pResult = new char[nLen];
    WideCharToMultiByte(CP_ACP, 0, szStr, -1, pResult, nLen, NULL, NULL);
    return pResult;
}

// Convert a UTF-8 string to wide characters. The original post calls
// Utf_8ToUnicode without showing it; this body is an assumed reconstruction.
wchar_t* Utf_8ToUnicode(const char* szU8)
{
    int nLen = MultiByteToWideChar(CP_UTF8, 0, szU8, -1, NULL, 0);
    if (nLen == 0)
    {
        return NULL;
    }
    wchar_t* pResult = new wchar_t[nLen];
    MultiByteToWideChar(CP_UTF8, 0, szU8, -1, pResult, nLen);
    return pResult;
}

int main()
{
	char Language[100];
	sprintf_s(Language, "eng");
	char PicFilename[10] = "1.jpg";
	CString outText;

	// Check that the language data exists before initializing
	char TraineddataPath[MAX_PATH];
	sprintf_s(TraineddataPath, "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata\\%s.traineddata", Language);
	if (_access(TraineddataPath, 0) == -1)
	{
		printf("Could not find %s file.\n", TraineddataPath);
		return -1;
	}

	tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI;
	tesseract::PageSegMode pagesegmode = tesseract::PSM_AUTO;
	// Initialize tesseract-ocr with Language, without specifying the tessdata path:
	// pass NULL (not the string "NULL") so that TESSDATA_PREFIX is used
	if (api->Init(NULL, Language))
	{
		printf("Could not initialize tesseract.\n");
		delete api;
		return -1;
	}
	// Chinese and English are handled slightly differently
	if (strstr(Language, "chi") == NULL)
	{
		if (api->GetPageSegMode() == tesseract::PSM_SINGLE_BLOCK)
			api->SetPageSegMode(pagesegmode);
		// Make sure leptonica can read the image type
		PIX *pixs = pixRead(PicFilename);
		if (pixs == NULL)
		{
			printf("Unsupported image type\n");
			return -1;
		}
		pixDestroy(&pixs);
		STRING text;
		if (!api->ProcessPages(PicFilename, NULL, 0, &text))
		{
			printf("Error during processing\n");
			return -1;
		}
		outText = text.string();
		outText.Replace(_T("\n"), _T(""));
	}
	else
	{
		cv::Mat Image = cv::imread(PicFilename);
		if (Image.empty())
		{
			printf("Could not read %s\n", PicFilename);
			return -1;
		}
		api->SetImage(Image.data, Image.cols, Image.rows, Image.channels(), static_cast<int>(Image.step));
		// GetUTF8Text returns UTF-8; convert UTF-8 -> wide -> ANSI for CString
		char* utf8 = api->GetUTF8Text();
		wchar_t* tempchar = Utf_8ToUnicode(utf8);
		char* resulttemp = UnicodeToAnsi(tempchar);
		outText = resulttemp;
		outText.Replace(_T("\n"), _T(""));
		delete[] utf8;
		delete[] tempchar;
		delete[] resulttemp;
	}

	// Print through a narrow copy so the call works in both ANSI and Unicode builds
	CStringA ansi(outText);
	printf("OCRText:%s\n", ansi.GetString());
	api->End();
	delete api;
	return 0;
}
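
Both branches of the program above finish by removing line breaks from the recognized text. That step does not need CString; the sketch below does the same with std::string. `stripLineBreaks` is a hypothetical helper, and also removing '\r' is an assumption added here (the program above removes only "\n").

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of the OCR output with all line breaks removed.
std::string stripLineBreaks(std::string text)
{
    // erase-remove idiom: drop every '\n', then every '\r'
    text.erase(std::remove(text.begin(), text.end(), '\n'), text.end());
    text.erase(std::remove(text.begin(), text.end(), '\r'), text.end());
    return text;
}
```

It could be applied to the string returned by GetUTF8Text before any code-page conversion, since the bytes of '\n' and '\r' never occur inside a multi-byte UTF-8 sequence.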
