Essential for Image CAPTCHA Recognition: Installing Tesseract on Linux in an Intranet Environment

Latest recommended article published on 2024-01-18 10:51:10

夏至海

Tags: linux, java, machine learning

Article link: https://blog.csdn.net/guoyanyang/article/details/105482112

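The previous article on web crawlers used Tesseract for CAPTCHA recognition. My production environment runs GNU/Linux and has no access to the public internet, which makes installing Tesseract somewhat troublesome. CAPTCHA recognition itself boils down to invoking a system command that parses the image content; once it is wrapped as a service, anyone can use it.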

**The Tesseract installation steps are as follows:**

Perform all steps as the root user
Check package dependencies

Check one by one whether each of the following packages is already installed on the Linux system:

autoconf automake libtool libjpeg libpng libtiff zlib libjpeg-devel libpng-devel libtiff-devel zlib-devel

# Check whether a package is installed (repeat for each package name)
rpm -qa | grep autoconf

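Download the missing RPM packages separately

A site for looking up common RPM packages: http://rpmfind.net/linux/rpm2html/search.php?query=rpm

Download the packages on a machine with internet access, then upload them to the server.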

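Install the dependency Leptonica

tar -zxvf leptonica-*.tar.gz
# Enter the extracted source directory (its name depends on the version):
cd leptonica-*/
./configure
make
make install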

Install Tesseract (run the following from the extracted Tesseract source directory)

./autogen.sh
./configure
make 
make install
ldconfig

Once everything is in place, copy the trained data files

cp chi_sim.traineddata /usr/local/share/tessdata/
cp eng.traineddata /usr/local/share/tessdata/
cp eng.traineddata.part /usr/local/share/tessdata/

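Test

# Go to the testing directory under the Tesseract directory
cd ./testing
# Upload an image to this directory, set the language to English, and use "result" as the output base name
tesseract validcode.jpg result -l eng
# View the recognition result (tesseract appends .txt to the output base name)
cat result.txt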

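Troubleshooting

If the test fails, a likely cause is a missing dependency package; install the missing package and try again.

Another possible cause is image quality. The test image should be preprocessed so that the text is in a solid color: remove the noise colors and leave only black and white.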

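// Requires: java.awt.Color, java.awt.image.BufferedImage, java.io.File, javax.imageio.ImageIO
// Binarizes the image: near-white and gray pixels become white, all others become black.
// isWhite(int) is a helper of the same class that is not listed here; it returns 1 for near-white pixels.
public BufferedImage removeBackgroud(String picFile) throws Exception {
	BufferedImage img = ImageIO.read(new File(picFile));
	if (img == null) {
		throw new IllegalArgumentException("Unsupported or unreadable image: " + picFile);
	}
	int white = Color.WHITE.getRGB();
	int black = Color.BLACK.getRGB();
	int gray = Color.GRAY.getRGB();
	for (int x = 0; x < img.getWidth(); x++) {
		for (int y = 0; y < img.getHeight(); y++) {
			int rgb = img.getRGB(x, y); // read each pixel only once
			img.setRGB(x, y, (this.isWhite(rgb) == 1 || rgb == gray) ? white : black);
		}
	}
	return img;
}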
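
The `isWhite` helper called by `removeBackgroud` is not shown in this post. A minimal sketch of what it might look like, assuming a pixel counts as "white" when the sum of its red, green, and blue components exceeds a threshold. The class name `WhiteCheckSketch` and the threshold of 600 are assumptions for illustration, not values from the original code, and the threshold would likely need tuning for a given CAPTCHA style.

```java
class WhiteCheckSketch {
    // Hypothetical threshold: the maximum possible sum is 3 * 255 = 765.
    static final int WHITE_THRESHOLD = 600;

    // Returns 1 if the packed ARGB pixel is close to white, otherwise 0.
    static int isWhite(int rgb) {
        int red = (rgb >> 16) & 0xFF;
        int green = (rgb >> 8) & 0xFF;
        int blue = rgb & 0xFF;
        return (red + green + blue > WHITE_THRESHOLD) ? 1 : 0;
    }
}
```

With this threshold, mid-gray (component sum 384) is not classified as white, which would explain why `removeBackgroud` checks for `Color.GRAY` separately.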

Recognition service code

// Requires: java.io.BufferedReader, java.io.File, java.io.FileInputStream,
// java.io.InputStreamReader, java.util.ArrayList, java.util.List
public String recPic(File imageFile) throws Exception {
	File workDir = imageFile.getAbsoluteFile().getParentFile();
	// tesseract appends ".txt" to the output base name by itself
	File outputFile = new File(workDir, "output-" + System.nanoTime());
	List<String> cmd = new ArrayList<String>();
	cmd.add("tesseract");
	cmd.add(imageFile.getName()); // input image
	cmd.add(outputFile.getName()); // output base name
	cmd.add("-l");
	cmd.add("eng");
	ProcessBuilder pb = new ProcessBuilder(cmd);
	pb.directory(workDir); // run the system command in the image's directory
	pb.redirectErrorStream(true);
	pb.redirectOutput(ProcessBuilder.Redirect.INHERIT); // avoid blocking on an unread pipe
	Process process = pb.start();
	int exitCode = process.waitFor(); // wait for the command to finish
	if (exitCode != 0) {
		throw new RuntimeException("Recognition failed, tesseract exit code " + exitCode);
	}
	StringBuilder result = new StringBuilder();
	// try-with-resources closes the reader even if reading fails
	try (BufferedReader in = new BufferedReader(new InputStreamReader(
			new FileInputStream(outputFile.getAbsolutePath() + ".txt"), "UTF-8"))) {
		String line;
		while ((line = in.readLine()) != null) {
			result.append(line).append(System.lineSeparator()); // read the result
		}
	}
	return result.toString();
}
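
`removeBackgroud` returns a `BufferedImage`, while `recPic` expects a `File`, so the cleaned image has to be written to disk between the two calls. The following is a sketch of that intermediate step, not part of the original code: the class name `OcrInputSketch`, the method name `saveForOcr`, and the choice of PNG as the output format are all assumptions.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

class OcrInputSketch {
    // Writes the preprocessed image as a PNG file in the given directory
    // and returns the file so it can be handed to the recognition method.
    static File saveForOcr(BufferedImage img, File dir) throws IOException {
        File out = File.createTempFile("captcha-", ".png", dir);
        // ImageIO.write returns false when no suitable writer is found
        if (!ImageIO.write(img, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out;
    }
}
```

Under these assumptions the whole flow would read roughly as `recPic(OcrInputSketch.saveForOcr(removeBackgroud(path), workDir))`, where `path` and `workDir` are placeholders for the CAPTCHA image path and a writable directory.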