Virtual File Cabinet: Part 2

Before we can classify our scanned documents, we must extract the text from them.

Update: Part 3 has been posted and ties it all together with a functioning application that uses Natural Language Processing to classify the documents and file them into the file cabinet automatically.

In part one of our small business office automation series we created a virtual file cabinet and began populating it with scanned documents. We wrote code to download the scanned documents sent to an IMAP-based email account. We copy those attachments into an _inbox directory for later review and classification. As far as our system knows, though, we have plain image files. The system has no context to say what each image represents. And we cannot trust the filename, because these are often generated by the scanning system and amount to “scanned document”. We need a method to generate some context for each file. We will explore using OCR (Optical Character Recognition) to provide it.

We will make use of Tesseract.js to extract the text from our scanned documents. There are other options but Tesseract gets a lot of mention when you Google for “node js extract text from image”. We’ll explore this and analyze how effective it is after we have used it for a bit. Tesseract.js wraps up the Tesseract API for use in Node.

To keep things simple we will configure the scanner to generate a .JPG file. We do this because Tesseract works with image files, not PDFs, Word Documents, PowerPoint files, etc.

There is a whole process to Tesseract that includes “training” the system for your language, pre-processing your images, and more. But we are not the first to tackle this problem, and a team of dedicated coders has created Tesseract.js to take care of the deep-dive stuff for us. You can run npm install tesseract.js to install this package, giving us access to the Tesseract API. This includes the necessary sub-packages and does much of the setup work automatically when you call the process.

With that we have enough to extract our text. We can get a list of the target files from our _inbox folder, and then extract text from each file. We will tackle this code as a separate package/script, just to keep things simple. Also, we will simply dump the extracted text to the console for now. We will make use of that information in our next posting on this topic.

Set up our project with the following commands:

mkdir myProject2
cd myProject2
mkdir src
npm init -y
npm install tesseract.js

Now create the src/index.js file and copy this code into it:

const path = require('path')
const fs = require('fs')
const { createWorker } = require('tesseract.js')


// where is our _inbox directory (use an absolute path)
const INBOX_DIR = '/path/to/file_cabinet/_inbox'


// the extension for the files we will attempt to extract text from
const FILE_FILTER = '.jpg'


// Retrieve a list of the files in our _inbox directory
// Return an absolute path for each of the target files
function getFileList() {
  return new Promise((resolve, reject) => {
    fs.readdir(INBOX_DIR, (err, files) => {
      if (err) return reject(err)


      return resolve(
        files
          .filter(
            (file) =>
              path.extname(file).toLowerCase() === FILE_FILTER.toLowerCase()
          )
          .map((file) => path.resolve(INBOX_DIR, file))
      )
    })
  })
}


// Use the specified Tesseract worker to extract text from the specified image file.
// Return an object with the file name and the extracted text.
function extractText(worker, img) {
  // returning the promise chain directly means a failed recognize()
  // rejects our promise too, instead of being silently swallowed
  return worker.recognize(img).then((results) => ({
    file: img,
    text: results.data.text,
  }))
}


// The "main" method that initiates the work
(async () => {
  // get a list of the target files
  const files = await getFileList().catch((err) => {
    throw err
  })


  // set up the Tesseract worker
  const worker = createWorker({
    // logger: (m) => console.log(m),
    errorHandler: (err) => console.log(err),
  })


  // load the worker
  await worker.load()


  // set the language Tesseract will use
  await worker.loadLanguage('eng')


  // initialize the worker with the desired language
  // This will download an `eng.traineddata` file.
  // Later calls to this routine will use the already downloaded file (if it exists)
  await worker.initialize('eng')


  // create an output variable to hold our data
  // ** this will likely be replaced when the classification step is implemented
  const output = []


  // loop over each of the files
  for (const file of files) {
    // indicate which file we are processing
    console.log(`extracting : ${file}`)


    // extract the text from the current file
    const results = await extractText(worker, file).catch((err) => {
      throw err
    })


    // store the resulting data into our output variable
    output.push(results)
  }


  // terminate the worker
  worker.terminate().then(() => {
    // then dump the output variable
    console.log({ output })


    // and exit our code
    process.exit(0)
  })
})()

We create two main functions to do the heavy lifting for us. First we get a list of all the files in our target directory. Then we extract the text from each file. We’ll elaborate on this step a little below as there are some gotchas here.

The getFileList() method does a simple readdir() to get a flat list of the contents of our directory. Then we filter that list to only include files with our target extension. And finally we use the .map() call to ensure we have an absolute path to each file.

The text extraction routine is basically the sample code found in the Tesseract.js README file. We have tweaked this code though to handle each file one at a time, and avoid various errors that could result when the process is called multiple times in a parallel fashion.

If we have multiple files in our folder, the extract routine may be called multiple times before the previous instance(s) have completed. It turns out this is a problem for the .loadLanguage() step. This step downloads the language training data (or uses the existing file that was previously downloaded). Because we are asynchronous, it is likely that we call the extract routine a second time before the first training data download finishes. In that case a second download would be started. And then, when THAT download completes, the file already exists from the previous call. So we end up in a condition where file operations may throw errors.
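The failure mode can be sketched in plain Node, with no Tesseract.js at all; ensureTrainingData() below is an invented stand-in for the library's check-then-download step, not real Tesseract.js code:

```javascript
// Hypothetical stand-in for the "download training data unless it exists"
// step. Two concurrent callers both check the cache before either download
// finishes, so both start a download and the loser collides with the winner.

const events = []
let fileExists = false

async function ensureTrainingData(caller) {
  if (!fileExists) {
    events.push(`${caller}: cache miss, downloading`)
    // simulate the time the download takes
    await new Promise((resolve) => setTimeout(resolve, 10))
    if (fileExists) {
      // the other caller finished first; writing the file now can throw
      events.push(`${caller}: file already exists!`)
    } else {
      fileExists = true
      events.push(`${caller}: download complete`)
    }
  }
}

// call the routine twice in parallel, as a naive implementation might
Promise.all([ensureTrainingData('A'), ensureTrainingData('B')]).then(() => {
  console.log(events.join('\n'))
})
```

Both callers log a cache miss, but only one download can “win”; the other finds the file already present. Setting the worker up once, before any extraction starts, removes the check-then-download gap entirely.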

To address this file operations issue we need to ensure that we set up the Tesseract worker once and only once. Then we can call the .recognize() method as many times as we need. The code above handles this by setting up the Tesseract worker, then using a for…of loop with an await call to our extractText() method. This ensures that only one extractText() call is ever running at a time. As a result we process the files sequentially, one after another. My tests suggest parallel processing is possible, and it would speed us up a little, but the file operations issues came into play here. I’ll leave it to the more adventurous to explore parallel processing.
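A tiny stand-alone sketch shows the effect of the for…of/await pattern; fakeExtract() is a made-up placeholder for the real extraction, with exaggerated timings:

```javascript
// Each loop iteration awaits the previous "extraction" before starting
// the next, so the tasks complete in list order even when the first
// task is the slowest one.

const order = []

function fakeExtract(name, ms) {
  return new Promise((resolve) =>
    setTimeout(() => {
      order.push(name)
      resolve(name)
    }, ms)
  )
}

async function sequential() {
  for (const [name, ms] of [['a', 30], ['b', 10], ['c', 1]]) {
    await fakeExtract(name, ms)
  }
  return order.join(',')
}

// with Promise.all the completion order would be "c,b,a" instead
sequential().then((result) => console.log(result)) // logs "a,b,c"
```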

We can run this code with the command `node src/index.js` in our project directory. In my testing this was taking approximately 3.5 minutes to run for three test images. The output is a dump of the extracted text for each of our files. This output is not our final product, but it does give us what we need to move onto the Classification steps.

{
file: '/path/to/file_cabinet/_inbox/Scanned_from_a_Lexmark_Multifunction_Product09-15-2020-015639-1.jpg',
text: 'Period 1 Packages\n' +
'During the staggered entry days, all students will receive a new bell schedule, hard copy of their\n' +
'timetable, and 2 reusable masks. We would ask that students wear their own mask to school to\n' +
'begin and take the masks we give them home to wash prior to wearing them. Students are\n' +
'asked to have a mask on when they come into the school tomorrow.\n' +
'Staggered Entry Days\n' +
'A reminder of our staggered entry day schedule is below:\n' +
'\n' +
'a. Wednesday, Sep 2", 2020 — Last Name A-G\n' +
'\n' +
'b. Thursday, Sep 39, 2020 - Last Name H- M\n' +
'\n' +
'c. Friday, Sep 4%, 2020 — Last Name N - Z\n' +
'Please see the attached door map for the assigned entry point to the school. A reminder that\n' +
'there will be directional signage in the school for students to follow to their period 1 classes.\n' +
'School Board Scenario 1\n' +
'All planning for safe re-entry at Anonymous High School has followed School Board’\n' +
'Scenario 1 guide. Please read it for questions specific to what is allowed/not allowed in\n' +
'classrooms and other important items. This plan is updated frequently so referring to the web\n' +
'link is better than downloading the document.\n' +
'Anonymous High School’s Daily Practices Template\n' +
'We have been asked to fill in specific information about safe re-entry at Anonymous High\n' +
'School using the Daily Practice Template. Please refer to the attached document for school\n' +
'specific questions. In addition to the template we have created a 1-page quick reference guide\n' +
'to highlight some more significant differences between last year and this year.\n' +
'A reminder if you need to book an appointment at the school, please call 403 555 5555 to\n' +
'speak with our office staff. All appointments are being done via phone currently.\n' +
'Once again, thank you for your continued support and cooperation. This is a monumental task\n' +
'when compared to a regular year coming back to school. It will take a united effort from staff,\n' +
"students, and parents to have success. We're here to help as we transition back to school.\n" +
'Have a great evening,\n'
}

Above we have an anonymized sample letter scanned and OCR’d with our code. (This was a letter sent to us regarding our children’s first day at school after/during the pandemic.) The text itself doesn’t matter here; its only purpose is to assist in determining where to place the image file, and perhaps how to name that file.

As you can see from the output, the text is very raw and does not include formatting or context of what each line is. It is even more confusing for an invoice or receipt. We can still use this, though, as the overall copy does contain the text we might use to classify the document. Receipts would have the name of the company listed, or some other identifying information. A policy may have a title or author mentioned. We can use these keywords to do our classification, or we can apply the whole text to an AI-based system. At the very least we have a start here: we can now process text and maybe derive enough context to help identify the document.
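As a rough sketch of the keyword idea (the categories and keyword lists below are invented for illustration and are not part of the project yet):

```javascript
// Hypothetical categories, each with a few keywords that might appear
// in the OCR text for that kind of document.
const CATEGORIES = {
  school: ['school', 'students', 'classes'],
  receipts: ['invoice', 'receipt', 'total due'],
  insurance: ['policy', 'coverage', 'premium'],
}

// Score each category by counting keyword hits in the extracted text,
// and return the best-scoring category (or 'unclassified' if none hit).
function classify(text) {
  const haystack = text.toLowerCase()
  let best = { category: 'unclassified', score: 0 }
  for (const [category, keywords] of Object.entries(CATEGORIES)) {
    const score = keywords.filter((kw) => haystack.includes(kw)).length
    if (score > best.score) best = { category, score }
  }
  return best.category
}

console.log(classify('All students will receive a new bell schedule'))
// "school"
```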

In my tests, I noticed that Tesseract sometimes does not properly detect the text. With one document I was given “Frday” instead of “Friday”. And there were other issues, mostly related to an “i” or an “o”. So far only one document has had problems like this, but it is something to keep an eye on, and it suggests that it may not always be possible to classify a document.
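One possible way to soften such misreads, offered here only as an idea rather than something this series implements, is to match keywords by edit distance instead of exact equality:

```javascript
// Classic Levenshtein edit distance, computed with a single rolling row.
// dp[j] holds the distance between a.slice(0, i) and b.slice(0, j).
function levenshtein(a, b) {
  const dp = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]
    dp[0] = i
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      )
      prev = tmp
    }
  }
  return dp[b.length]
}

// treat a word as a match if it is within one edit of the keyword
const isCloseMatch = (word, keyword) => levenshtein(word, keyword) <= 1

console.log(isCloseMatch('frday', 'friday')) // true (one missing letter)
console.log(isCloseMatch('receipt', 'friday')) // false
```

A tolerance of one edit catches single dropped or swapped letters like “Frday” without letting unrelated words through.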

We will leave our project at this stage though. We have extracted the text from each of the documents in our _inbox directory. In our next article we will examine how we could classify the documents. If you are still reading, thanks for your time and effort. I hope this project is proving useful to you.

Resources

Translated from: https://medium.com/swlh/virtual-file-cabinet-part-2-9ce163c05abc
