mac
brew install tesseract
This is a short writeup of the process I settled on for command-line OCR on OS X: it takes a non-OCR’d PDF and produces a searchable PDF. I arrived at it after running into a thousand little gotchas.
Software Installation
- Install homebrew (if you haven’t already).
- Install ImageMagick with TIFF and Ghostscript support:
  brew install --with-libtiff --with-ghostscript imagemagick
- Install Tesseract with all languages:
  brew install --all-languages tesseract
- Install pdftk server from the package installer.
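Before moving on, it can help to confirm that all three tools actually ended up on your PATH. Here is a minimal sketch, assuming a POSIX shell and that the ImageMagick binary is called convert (as in the commands below). The function name missing_tools is made up for illustration.

```shell
# missing_tools: print the name of every given command that is NOT on PATH.
# (Hypothetical helper name; not part of Homebrew or any tool above.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: empty output means everything was found.
# missing_tools convert tesseract pdftk
```

If this prints a name, that install step probably did not complete.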
Processing Workflow
I’m going to assume you have a non-OCR’d PDF you want to convert into a searchable PDF.
- Split and convert the PDF with ImageMagick convert:
  convert -density 300 input.pdf -type Grayscale -compress lzw -background white +matte -depth 32 page_%05d.tif
- OCR the pages with Tesseract:
  for i in page_*.tif; do echo "$i"; tesseract "$i" "$(basename "$i" .tif)" pdf; done
- Join your individual PDF files into a single, searchable PDF with pdftk:
  pdftk page_*.pdf cat output merged.pdf
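The three steps above can be wrapped into one shell function so the intermediate page files do not pile up next to the input. This is only a sketch under a few assumptions: the name ocr_pdf and the scratch directory are my own additions, and the commands are the same ones used in the steps, just pointed at a temporary folder.

```shell
# ocr_pdf INPUT.pdf OUTPUT.pdf
# Runs split -> OCR -> merge in a scratch directory.
# (ocr_pdf is a hypothetical name chosen for this sketch.)
ocr_pdf() {
  input=$1
  output=$2
  workdir=$(mktemp -d) || return 1

  # Step 1: rasterize every PDF page to a 300 dpi grayscale TIFF.
  convert -density 300 "$input" -type Grayscale -compress lzw \
    -background white +matte -depth 32 "$workdir/page_%05d.tif" || return 1

  # Step 2: OCR each page; tesseract appends ".pdf" to the output base name.
  for tif in "$workdir"/page_*.tif; do
    tesseract "$tif" "${tif%.tif}" pdf || return 1
  done

  # Step 3: merge the per-page PDFs; the zero-padded names keep them in order.
  pdftk "$workdir"/page_*.pdf cat output "$output" || return 1

  # Clean up only on success, so a failed run can still be inspected.
  rm -rf "$workdir"
}

# Usage:
# ocr_pdf input.pdf merged.pdf
```

The zero-padded page_%05d pattern matters here: without the padding, the shell glob would sort page_10 before page_2 and the merged PDF would come out shuffled.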
convert 9.png -resize 3000% -type Grayscale input9.tif (the image is low resolution, so it has to be upscaled and converted first)
tesseract input9.tif output9 -l eng
tesseract input9.png output9 (the default language is eng, English)
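Tesseract's -l option also accepts several language codes joined with "+", which is handy for pages that mix languages. Here is a small sketch of building that argument; lang_arg is a made-up helper name, and the example assumes the chi_sim language data is installed (it should be if you installed all languages).

```shell
# lang_arg: join language codes with "+" for tesseract's -l option.
# (Hypothetical helper name.) The subshell keeps the IFS change local.
lang_arg() {
  ( IFS=+; printf '%s\n' "$*" )
}

# Usage (assumes chi_sim traineddata is installed):
# tesseract input9.tif output9 -l "$(lang_arg eng chi_sim)"
```

Running tesseract --list-langs shows which codes are available on your machine.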