How to Do OCR from the Linux Command Line Using Tesseract

culinxia2707

Original article: https://www.howtogeek.com/682389/how-to-do-ocr-from-the-linux-command-line-using-tesseract/

A terminal window on a Linux laptop. (Image: Fatmawati Achmad Zaenuri/Shutterstock)

You can extract text from images on the Linux command line using the Tesseract OCR engine. It’s fast, accurate, and works in about 100 languages. Here’s how to use it.

Optical Character Recognition

Optical character recognition (OCR) is the ability to look at and find words in an image, and then extract them as editable text. This simple task for humans is very difficult for computers to do. Early efforts were clunky, to say the least. Computers were often confused if the typeface or size was not to the OCR software’s liking.

Nevertheless, the pioneers in this field were still held in high esteem. If you lost the electronic copy of a document, but still had a printed version, OCR could re-create an electronic, editable version. Even if the results weren’t 100 percent accurate, this was still a great time-saver.

With some manual tidying up, you’d have your document back. People were forgiving about the mistakes it made because they understood the complexity of the task facing an OCR package. Plus, it was better than retyping the entire document.

Things have improved significantly since then. The Tesseract OCR application, written by Hewlett Packard, started in the 1980s as a commercial application. It was open-sourced in 2005, and it’s now supported by Google. It has multi-language capabilities, is regarded as one of the most accurate OCR systems available, and you can use it for free.

Installing Tesseract OCR

To install Tesseract OCR on Ubuntu, use this command:

sudo apt-get install tesseract-ocr

On Fedora, the command is:

sudo dnf install tesseract

On Manjaro, you need to type:

sudo pacman -Syu tesseract
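
Whichever package manager you use, it can be worth confirming that the tesseract binary is on your PATH before going further. The sketch below is only an illustration: require_cmd is a hypothetical helper name, not something Tesseract provides.

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found; install it with your package manager" >&2
    return 1
}

# Print the installed Tesseract version, if there is one.
if require_cmd tesseract; then
    tesseract --version
fi
```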

Using Tesseract OCR

We’re going to pose a set of challenges to Tesseract OCR. Our first image that contains text is an extract from Recital 63 of the General Data Protection Regulation (GDPR). Let’s see if OCR can read this (and stay awake).

It’s a tricky image because each sentence starts with a faint superscript number, which is typical in legislative documents.

We need to give the tesseract command some information, including:

The name of the image file we want it to process.
The name of the text file it will create to hold the extracted text. We don’t have to provide the file extension (it will always be .txt). If a file already exists with the same name, it will be overwritten.
We can use the --dpi option to tell tesseract what the dots per inch (dpi) resolution of the image is. If we don’t provide a dpi value, tesseract will try to figure it out.

Our image file is named “recital-63.png,” and its resolution is 150 dpi. We’re going to create a text file from it called “recital.txt.”

Our command looks like this:

tesseract recital-63.png recital --dpi 150

The results are very good. The only issue is the superscripts—they were too faint to be read correctly. A good quality image is vital to get good results.

tesseract has interpreted the superscript numbers as quotation marks (“) and degree symbols (°), but the actual text has been extracted perfectly (the right side of the image had to be trimmed to fit here).

The final character is a byte with the hexadecimal value of 0x0C, which is a form feed. Tesseract writes it as a page separator.
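
If the trailing 0x0C byte gets in the way, one possible cleanup is to delete it with tr. This is a minimal sketch: strip_page_separator is a hypothetical helper name, and “recital-clean.txt” is an assumed output file name.

```shell
# Hypothetical helper: copy stdin to stdout, dropping every 0x0C byte
# (octal 014), the byte found at the end of the extracted text.
strip_page_separator() {
    tr -d '\014'
}

# Assumed file names: clean recital.txt into recital-clean.txt.
if [ -f recital.txt ]; then
    strip_page_separator < recital.txt > recital-clean.txt
fi
```

The helper reads standard input, so it can also sit at the end of a pipeline.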

Below is another image with text in different sizes, and both bold and italics.

Image with different sizes of text in bold and italics.

The name of this file is “bold-italic.png.” We want to create a text file called “bold.txt,” so our command is:

tesseract bold-italic.png bold --dpi 150

This one didn’t pose any problems, and the text was extracted perfectly.

Using Different Languages

Tesseract OCR supports around 100 languages. To use a language, you must first install it. When you find the language you want in the list of supported languages, note its abbreviation. We’re going to install support for Welsh. Its abbreviation is “cym,” which is short for “Cymraeg,” the Welsh name for the Welsh language.

The installation package is called “tesseract-ocr-” with the language abbreviation tagged onto the end. To install the Welsh language file in Ubuntu, we’ll use:

sudo apt-get install tesseract-ocr-cym
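
To confirm that a language pack is in place, you can ask tesseract for its language list with the --list-langs option and look for the abbreviation. The following is a sketch rather than a definitive check: lang_in_list is a hypothetical helper name.

```shell
# Hypothetical helper: read a language list on stdin (one code per line)
# and succeed if the code given as $1 appears as a whole line.
lang_in_list() {
    grep -q -x -F -e "$1"
}

# --list-langs prints a header line followed by one language code per line.
# Older releases write the list to stderr, hence the 2>&1.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | lang_in_list cym; then
        echo "Welsh (cym) is installed"
    else
        echo "Welsh (cym) is not installed"
    fi
fi
```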

The image with the text is below. It’s the first verse of the Welsh national anthem.

Image containing text of the first verse of the Welsh national anthem.

Let’s see if Tesseract OCR is up to the challenge. We’ll use the -l (language) option to let tesseract know the language in which we want to work:

tesseract hen-wlad-fy-nhadau.png anthem -l cym --dpi 150

tesseract copes perfectly, as shown in the extracted text below. Da iawn, Tesseract OCR.

If your document contains two or more languages (like a Welsh-to-English dictionary, for example), you can use a plus sign (+) to tell tesseract to add another language, like so:

tesseract image.png textfile -l eng+cym+fra
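
When the set of languages comes from a script rather than being typed by hand, the plus-separated value can be built from a list. A minimal sketch, assuming a hypothetical helper named join_langs:

```shell
# Hypothetical helper: join its arguments with "+" to form a value for -l.
# The parentheses run the body in a subshell, so the caller's IFS is untouched.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

langs=$(join_langs eng cym fra)
echo "$langs"   # prints eng+cym+fra

# Assumed file names, matching the command above:
# tesseract image.png textfile -l "$langs"
```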

Using Tesseract OCR with PDFs

The tesseract command is designed to work with image files, but it’s unable to read PDFs. However, if you need to extract text from a PDF, you can use another utility first to generate a set of images. A single image will represent a single page of the PDF.

The pdftoppm utility you need, which is part of the Poppler tools, should already be installed on your Linux computer. The PDF we’ll use for our example is a copy of Alan Turing’s seminal paper on artificial intelligence, “Computing Machinery and Intelligence.”

PDF of the title page of "Computing Machinery and Intelligence" by A.M. Turing.

We use the -png option to specify that we want to create PNG files. The file name of our PDF is “turing.pdf.” We’ll call our image files “turing-01.png,” “turing-02.png,” and so on:

pdftoppm -png turing.pdf turing
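
pdftoppm also accepts an -r option that sets the rendering resolution in DPI. Small print often benefits from a higher resolution than the default, so a variant like the one below may be worth trying. This is a sketch under assumptions: pdf_to_png is a hypothetical wrapper name, and 300 DPI is an assumed starting point, not a value from the article.

```shell
# Hypothetical wrapper: render each page of a PDF to PNG at a chosen resolution.
# Usage: pdf_to_png FILE.pdf PREFIX [DPI]
pdf_to_png() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    # -r sets the resolution in DPI; 300 is the assumed default in this sketch.
    pdftoppm -png -r "${3:-300}" "$1" "$2"
}

# Assumed file name from the example above.
if [ -f turing.pdf ]; then
    pdf_to_png turing.pdf turing 300
fi
```

If you render at a different resolution, pass the same value to tesseract with --dpi.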

To run tesseract on each image file using a single command, we need to use a for loop. For each of our “turing-nn.png” files, we run tesseract and create a text file whose name is “text-” followed by the full image file name, so “turing-01.png” produces “text-turing-01.png.txt”:

for i in turing-??.png; do tesseract "$i" "text-$i" -l eng; done;
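
If you would rather not have “.png” in the middle of the output file names, shell parameter expansion can strip the extension before the name is passed to tesseract. A minimal sketch, assuming a hypothetical helper named out_base:

```shell
# Hypothetical helper: build the output base name for one page image,
# dropping the .png suffix so tesseract writes text-turing-01.txt.
out_base() {
    printf 'text-%s\n' "${1%.png}"
}

for i in turing-??.png; do
    # If nothing matches, the shell leaves the pattern as-is, so skip non-files.
    [ -f "$i" ] || continue
    tesseract "$i" "$(out_base "$i")" -l eng
done
```

The text-turing* pattern used with cat matches these shorter names as well.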

To combine all the text files into one, we can use cat:

cat text-turing* > complete.txt
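
The shell expands the wildcard in sorted order, so the zero-padded page numbers keep the pages in sequence. If you want to see where each page starts in the combined file, a marker line can be printed before each one. This is only a sketch: cat_with_markers is a hypothetical helper name, and the marker format is arbitrary.

```shell
# Hypothetical helper: concatenate the given text files, printing a
# marker line with the file name before each one.
cat_with_markers() {
    for f in "$@"; do
        printf '===== %s =====\n' "$f"
        cat "$f"
    done
}

# Combine the per-page files only if at least one of them exists.
for first in text-turing*; do
    if [ -f "$first" ]; then
        cat_with_markers text-turing* > complete.txt
    fi
    break
done
```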

So, how did it do? Very well, as you can see below. The first page looks quite challenging, though. It has different text styles and sizes, and decoration. There’s also a vertical “watermark” on the right edge of the page.

However, the output is close to the original. Obviously, the formatting was lost, but the text is correct.

First page of extracted text from the Turing PDF.

The vertical watermark was transcribed as a line of gibberish at the bottom of the page. The text was too small to be read by tesseract accurately, but it would be easy enough to find and delete it. The worst result would have been stray characters at the end of each line.

Curiously, the single letters at the start of the list of questions and answers on page two have been ignored. The section from the PDF is shown below.

A list of questions and answers from the PDF of the Turing paper.

As you can see below, the questions remain, but the “Q” and “A” at the start of each line were lost.

Extracted text from the question and answer page of the Turing PDF.

Diagrams also won’t be transcribed correctly. Let’s look at what happens when we try to extract the one shown below from the Turing PDF.

A diagram of "Input" and "Last State" from the Turing PDF.

As you can see in our result below, the characters were read, but the format of the diagram was lost.

Extracted text from a diagram in the Turing PDF.

Again, tesseract struggled with the small size of the subscripts, and they were rendered incorrectly.

In fairness, though, it was still a good result. We weren’t able to extract straightforward text, but then, this example was deliberately chosen because it presented a challenge.

A Good Solution When You Need It

OCR isn’t something you’ll need to use daily. However, when the need does arise, it’s good to know you have one of the best OCR engines at your disposal.
