Notes on developing and training OCR on Ubuntu
install tesseract
install dependency libs
sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install autoconf-archive
sudo apt-get install pkg-config
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
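Before running ./configure it can help to confirm that the build tools are actually on PATH. A minimal sketch; missing_tools is a hypothetical helper name used only for illustration, not something shipped with tesseract:

```shell
# missing_tools: print each named command that is NOT found on PATH.
# Hypothetical helper for illustration only.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, missing_tools g++ autoconf automake pkg-config prints nothing when all four are installed, and prints the name of each one that is missing otherwise.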
install tesseract and training tools
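wget -O tesseract-3.04.zip https://codeload.github.com/tesseract-ocr/tesseract/zip/3.04
unzip tesseract-3.04.zip
cd tesseract-3.04
./autogen.sh
./configure
make && sudo make install
sudo ldconfig
make training && sudo make training-install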
- If ./configure fails with the error below, build and install leptonica first, then re-run ./configure:
checking for leptonica... configure: error: leptonica not found
wget http://www.leptonica.org/source/leptonica-1.72.tar.gz
tar xvzf leptonica-1.72.tar.gz
cd leptonica-1.72/
./configure
make && sudo make install
- It is also useful, but not required, to build ScrollView.jar
make ScrollView.jar
export SCROLLVIEW_PATH=$PWD/java
- Test recognition:
tesseract test.png output_1 -l eng
Language data files (<lang>.traineddata) go in: /usr/local/share/tessdata
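If recognition fails because a language cannot be loaded, a quick check is whether its .traineddata file is present in the tessdata directory. A small sketch, assuming the install prefix noted above; has_traineddata is a hypothetical helper name:

```shell
# has_traineddata: succeed if <lang>.traineddata exists in the tessdata dir.
#   $1 = language code (e.g. eng, chi_sim)
#   $2 = tessdata directory (assumed default: /usr/local/share/tessdata)
# Hypothetical helper for illustration only.
has_traineddata() {
    [ -f "${2:-/usr/local/share/tessdata}/$1.traineddata" ]
}
```

For example, has_traineddata chi_sim || echo "chi_sim.traineddata is missing" reports a missing language file before a makebox or training run that uses -l chi_sim.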
training
merge tif
jTessBoxEditor -> Tools -> Merge TIFF...
makebox
tesseract test.exp0.tif test.exp0 -l eng -psm 7 batch.nochop makebox
I used: tesseract lang.font.exp0.tif lang.font.exp0 -l chi_sim -psm 3 batch.nochop makebox
correct the box file
use jTessBoxEditor to fix wrong characters and box coordinates
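After editing, a quick sanity check of the box file can catch broken lines before training. In Tesseract 3.x each box line has six space-separated fields: symbol, left, bottom, right, top, page. A rough sketch under that assumption; check_box_file is a hypothetical helper name:

```shell
# check_box_file: report malformed lines in a Tesseract 3.x box file.
# Expects "<symbol> <left> <bottom> <right> <top> <page>" on every line.
# Returns non-zero if any line is malformed. Hypothetical helper.
check_box_file() {
    awk '
        NF != 6 {
            printf "line %d: expected 6 fields, got %d\n", NR, NF
            bad++
            next
        }
        # left must be smaller than right, bottom smaller than top
        $2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0 {
            printf "line %d: empty or inverted box\n", NR
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, check_box_file lang.font.exp0.box prints one message per suspicious line and exits with a non-zero status if it found any.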
make font_properties
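Create a text file named font_properties (note: the file has no extension). Each line contains the font name (font in this example) followed by five 0/1 flags describing the font: italic, bold, fixed, serif, fraktur. Here they are all 0.
Create font_properties from the command line: echo font 0 0 0 0 0 > font_properties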
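When training with more than one font, font_properties needs one line per font. A small sketch that formats such a line, with unspecified flags defaulting to 0; font_properties_line is a hypothetical helper name and timesitalic below is a made-up font name:

```shell
# font_properties_line: print one font_properties entry.
#   $1     = font name (must match the font part of lang.font.exp0)
#   $2..$6 = italic bold fixed serif fraktur flags, each 0 or 1 (default 0)
# Hypothetical helper for illustration only.
font_properties_line() {
    printf '%s %s %s %s %s %s\n' \
        "$1" "${2:-0}" "${3:-0}" "${4:-0}" "${5:-0}" "${6:-0}"
}
```

For example, font_properties_line font > font_properties writes the same line as the echo command above, and font_properties_line timesitalic 1 0 0 1 0 >> font_properties would append an italic serif font.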
make train file
tesseract lang.font.exp0.tif lang.font.exp0 -l chi_sim -psm 3 nobatch box.train
unicharset_extractor lang.font.exp0.box
shapeclustering -F font_properties -U unicharset -O lang.unicharset lang.font.exp0.tr
mftraining -F font_properties -U unicharset -O lang.unicharset lang.font.exp0.tr
cntraining lang.font.exp0.tr
- Rename the output files, adding the font. prefix:
mv normproto font.normproto
mv inttemp font.inttemp
mv pffmtable font.pffmtable
mv unicharset font.unicharset
mv shapetable font.shapetable
combine_tessdata font.
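The training steps above can be collected into one script. This is only a sketch that replays the same commands in the same order; run and train_lang are hypothetical helper names, and DRY_RUN=1 is an assumed convention that prints the commands instead of executing them:

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# train_lang: replay the training steps from these notes.
#   $1 = base name, e.g. lang.font.exp0 (expects $1.tif and $1.box)
#   $2 = output prefix, e.g. font (produces font.traineddata)
#   $3 = language passed to -l (default chi_sim, as used above)
# Expects font_properties in the current directory. Hypothetical helper.
train_lang() {
    base=$1
    prefix=$2
    init=${3:-chi_sim}

    run tesseract "$base.tif" "$base" -l "$init" -psm 3 nobatch box.train || return 1
    run unicharset_extractor "$base.box" || return 1
    run shapeclustering -F font_properties -U unicharset -O lang.unicharset "$base.tr" || return 1
    run mftraining -F font_properties -U unicharset -O lang.unicharset "$base.tr" || return 1
    run cntraining "$base.tr" || return 1
    # Prefix the generated files so combine_tessdata can pick them up
    for f in normproto inttemp pffmtable unicharset shapetable; do
        run mv "$f" "$prefix.$f" || return 1
    done
    run combine_tessdata "$prefix."
}
```

Running DRY_RUN=1 first, then train_lang lang.font.exp0 font, prints the commands for review without touching any files. Stopping at the first failing step avoids renaming or combining stale outputs from an earlier run.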