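Open-source gosseract
Repository: gosseract
1. First install docker and docker-compose; see the official site for details.
Note: make sure the docker and docker-compose versions are compatible with each other.
2. Find the Dockerfile in the gosseract source. My version is below (slightly modified so that it only installs golang and tesseract-ocr):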
##### docker build -f docker_file -t otiai10:0.1 .
# This is a working example of setting up tesseract/gosseract,
# and also works as an example runtime to use gosseract package.
# You can just hit `docker run -it --rm otiai10/gosseract`
# to try and check it out!
#####
FROM golang:latest
LABEL maintainer="Hiromu Ochiai <otiai10@gmail.com>"
RUN apt-get update -qq
# You need library files and headers of tesseract and leptonica.
# When you miss these or LD_LIBRARY_PATH is not set to them,
# you would face an error: "tesseract/baseapi.h: No such file or directory"
RUN apt-get install -y -qq libtesseract-dev libleptonica-dev
RUN apt update \
&& apt install -y \
ca-certificates \
libtesseract-dev=4.1.1-2+b1 \
tesseract-ocr=4.1.1-2+b1
# In case you face TESSDATA_PREFIX error, you might need to set env vars
# to specify the directory where "tessdata" is located.
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
# Load languages.
# These {lang}.traineddata would be located under ${TESSDATA_PREFIX}/tessdata.
# Install Chinese
RUN apt-get install -y -qq \
tesseract-ocr-eng \
tesseract-ocr-deu \
#tesseract-ocr-chi_sim \
tesseract-ocr-jpn
# Check which languages tesseract supports:
# tesseract --list-langs
# See https://github.com/tesseract-ocr/tessdata for the list of available languages.
# https://github.com/tesseract-ocr/tessdata/tree/4.00 or download from https://tesseract-ocr.github.io/tessdoc/Data-Files
# If you want to download these traineddata via `wget`, don't forget to locate
# downloaded traineddata under ${TESSDATA_PREFIX}/tessdata.
RUN go env -w GO111MODULE=on
RUN go env -w GOPROXY=https://goproxy.cn,direct
#RUN go get -t github.com/otiai10/gosseract
WORKDIR /go/src
#RUN cd ${GOPATH}/src/github.com/otiai10/gosseract && go test
# Now, you've got complete environment to play with "gosseract"!
# For other OS, check https://github.com/otiai10/gosseract/tree/main/test/runtimes
3. Build the image from the directory that contains the Dockerfile (if the file is not named Dockerfile, pass -f xxxxfile to point at it):
docker build -t orc .
4. When the build finishes, list the images:
docker images
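5. Download the source from the official repository, then create a docker-compose.yml file.
Create the network:
docker network create go_app_orc
Compose file: mount your project directory orc_test into the container at /go/src/orc_test.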
version: '3'
# docker network create go_app_orc
# docker cp /Users/gitxuzan/Downloads/chi_sim.traineddata orc:/usr/share/tesseract-ocr/4.00/tessdata
# docker exec -it orc bash
services:
  golang:
    image: orc
    container_name: orc
    networks:
      - web
    volumes:
      - "../orc_test:/go/src/orc_test"
    tty: true
    environment:
      TZ: "Asia/Shanghai"
networks:
  web:
    external:
      name: go_app_orc
6. Start the container:
docker-compose up -d
7. Enter the container:
docker exec -it orc bash
8. Go to the mounted project and run it.
The code is as follows:
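package main

import (
	"fmt"
	"log"

	"github.com/otiai10/gosseract/v2"
)

func main() {
	client := gosseract.NewClient()
	defer client.Close()
	// Uncomment one of these to choose the recognition languages.
	//client.SetLanguage("eng", "chi_sim")
	//client.SetLanguage("chi_sim")
	if err := client.SetImage("test2.png"); err != nil {
		log.Println(err)
		return
	}
	text, err := client.Text()
	if err != nil {
		log.Println(err)
		return
	}
	fmt.Println(text)
	// Hello, World!
}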
I manage dependencies with go mod: run go mod tidy, then go run main.go. The recognition result is as follows:
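9. Finally, to recognize Chinese, download the Chinese language pack and copy it into the container.
Chinese language pack download link 1
Chinese language pack download link 2
Copy it into the container: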
docker cp /Users/gitxuzan/Downloads/chi_sim.traineddata orc:/usr/share/tesseract-ocr/4.00/tessdata
# Check whether the Chinese pack is present
tesseract --list-langs
Set the recognition language:
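```go
package main

import (
	"fmt"
	"log"

	"github.com/otiai10/gosseract/v2"
)

func main() {
	client := gosseract.NewClient()
	defer client.Close()
	if err := client.SetLanguage("eng", "chi_sim"); err != nil {
		log.Println(err)
		return
	}
	if err := client.SetImage("test2.png"); err != nil {
		log.Println(err)
		return
	}
	text, err := client.Text()
	if err != nil {
		log.Println(err)
		return
	}
	fmt.Println(text)
	// Hello, World!
}
```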
Chinese recognition result:
Current issue [1]
Update: a 4.1.1 image
I filed an issue with the author as feedback.
FROM debian:bullseye-slim
RUN apt-get update -qq
RUN apt-get install -y \
git \
golang \
libtesseract-dev=4.1.1-2.1 \
tesseract-ocr-eng
ENV GOPATH=/root/go
ADD . ${GOPATH}/src/github.com/otiai10/gosseract
WORKDIR ${GOPATH}/src/github.com/otiai10/gosseract
RUN tesseract --version
CMD ["go", "test", "-v", "./..."]
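Another approach: skip cgo and call the tesseract command directly.
Download: https://sourceforge.net/projects/tesseract-ocr.mirror/
On Windows, click Next through the graphical installer, and make sure to select the Chinese language pack for download.
For example, my install directory is C:\Program Files\Tesseract-OCR
Then add this directory to the system Path.
Verify: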
tesseract --version
Download the language packs:
# Simplified Chinese
curl -LO https://github.com/tesseract-ocr/tessdata/raw/main/chi_sim.traineddata
# Traditional Chinese
curl -LO https://github.com/tesseract-ocr/tessdata/raw/main/chi_tra.traineddata
# English
curl -LO https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
Put the downloaded language packs in the matching location.
For example, on macOS:
brew --prefix tesseract
/opt/homebrew/opt/tesseract/share/tessdata
Windows:
C:\Program Files\Tesseract-OCR\tessdata
Test
Using English and Chinese:
tesseract example.png stdout -l chi_sim+eng    # print to standard output
tesseract example.png output -l chi_sim+eng    # write to a file (output.txt)
Call it from a program:
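package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
	"runtime"
)

func main() {
	// Set the input image and language parameters.
	inputImage := "example.png"
	lang := "chi_sim+eng"

	// Pick the executable name for the current operating system.
	var cmd *exec.Cmd
	if runtime.GOOS == "windows" {
		cmd = exec.Command("tesseract.exe", inputImage, "stdout", "-l", lang)
	} else {
		cmd = exec.Command("tesseract", inputImage, "stdout", "-l", lang)
	}

	// Create buffers to capture the command's stdout and stderr.
	var outBuffer, errBuffer bytes.Buffer
	cmd.Stdout = &outBuffer
	cmd.Stderr = &errBuffer

	// Run the command; include stderr in the log when it fails.
	if err := cmd.Run(); err != nil {
		log.Printf("Error executing command: %v\n%s", err, errBuffer.String())
		return
	}

	// Print the standard output.
	fmt.Println(outBuffer.String())
}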
Recognition with this approach is somewhat inaccurate on overly complex images.
[1] Tesseract 4.0.0 cannot set a whitelist; you need to upgrade to 4.1.1.