by Tom


How to parse PDFs at scale in NodeJS: what to do and what not to do

Take a step into program architecture, and learn how to make a practical solution for a real business problem with NodeJS Streams with this article.


A Detour: Fluid Mechanics

One of the greatest strengths of software is that we can develop abstractions which let us reason about code, and manipulate data, in ways we can understand. Streams are one such class of abstraction.


In simple fluid mechanics, the concept of a streamline is useful for reasoning about the way fluid particles will move, and the constraints applied to them at various points in a system.


For example, say you’ve got some water flowing through a pipe uniformly. Halfway down the pipe, it branches. Generally, the water flow will split evenly into each branch. Engineers use the abstract concept of a streamline to reason about the water’s properties, such as its flow rate, for any number of branches or complex pipeline configurations. If you asked an engineer what he assumed the flow rate through each branch would be, he would rightly reply with “one half”, intuitively. This expands out to an arbitrary number of streamlines mathematically.


Streams, conceptually, are to code what streamlines are to fluid mechanics. We can reason about data at any given point by considering it as part of a flow, rather than worrying about the implementation details of how it's stored. Arguably you could generalize this to some universal concept of a pipeline that we can use between disciplines. A sales funnel comes to mind, but that's tangential and we'll cover it later. The best example of streams, and one you absolutely must familiarise yourself with if you haven't already, is UNIX pipes:


cat server.log | grep 400 | less

We affectionately call the | character a pipe. Based on its function we’re piping the output of one program as the input of another program. Effectively setting up a pipeline.


(Also, it looks like a pipe.)


If you’re like me and wonder at this point why this is necessary, ask yourself why we use pipelines in real life. Fundamentally, it’s a structure that eliminates storage between processing points. We don’t need to worry about storing barrels of oil if it’s pumped.


Go figure that in software. The clever developers and engineers who wrote the code for piping data set it up such that it never occupies too much memory on a machine. No matter how big the logfile is above, it won’t hang the terminal. The entire program is a process handling infinitesimal data points in a stream, rather than containers of those points. The logfile never gets loaded into memory all at once, but rather in manageable parts.

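To make the parallel concrete, here's a rough Node analogue of the cat | grep pipeline above. It's only a sketch: it assumes a local server.log exists, and readline stands in for grep. The point is that lines flow through as a stream, so the file is never loaded into memory all at once.

const fs = require('fs');
const readline = require('readline');

// Stream the log file line by line; nothing is buffered in full.
const lines = readline.createInterface({
  input: fs.createReadStream('server.log'), // 'server.log' is a placeholder path
  crlfDelay: Infinity,
});

// Our poor man's `grep 400`.
lines.on('line', (line) => {
  if (line.includes('400')) console.log(line);
});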

I don’t want to reinvent the wheel here. So now that I’ve covered a metaphor for streams and the rationale for using them, Flavio Copes has a great blog post covering how they’re implemented in Node. Take as long as you need to cover the basics there, and when you’re ready come back and we’ll go over a use case.


The Situation

So, now that you’ve got this tool in your toolbelt, picture this:


You’re on the job and your manager / legal / HR / your client / (insert stakeholder here) has approached you with a problem. They spend way too long poring over structured PDFs. Of course, normally people won’t tell you such a thing. You’ll hear, “I spend 4 hours doing data entry.” Or “I look through price tables.” Or, “I fill out the right forms so we get our company branded pencils every quarter”.


Whatever it is, if their work happens to involve both (a) the reading of structured PDF documents and (b) the bulk usage of that structured information, then you can step in and say, "Hey, we might be able to automate that and free up your time to work on other things".


So for the sake of this article, let’s come up with a dummy company. Where I come from, the term “dummy” refers to either an idiot or a baby’s pacifier. So let’s imagine up this fake company that manufactures pacifiers. While we’re at it let’s jump the shark and say they’re 3D printed. The company operates as an ethical supplier of pacifiers to the needy who can’t afford the premium stuff themselves.


(I know how dumb it sounds, suspend your disbelief please.)


Todd sources the printing materials that go into DummEth’s products, and has to ensure that they meet three key criteria:


  • they’re food-grade plastic, to preserve babies’ health,

  • they’re cheap, for economical production, and

  • they’re sourced as close as possible, to support the company’s marketing copy stating that their supply chain is also ethical and pollutes as little as possible.


The Project

To make it easier to follow along, I've set up a GitLab repo you can clone and use. Make sure your installations of Node and NPM are up to date too.


Basic Architecture: Constraints

Now, what are we trying to do? Let’s assume that Todd works well in spreadsheets, like a lot of office workers. For Todd to sort the proverbial 3D printing wheat from the chaff, it’s easier for him to gauge materials by food grade, price per kilogram, and location. It’s time to set some project constraints.


Let’s assume that a material’s food grade is rated on a scale from zero to three. With zero meaning banned-in-California BPA-rich plastics. Three meaning commonly used non-contaminating materials, like low density polyethylene. This is purely to simplify our code. In reality we’d have to somehow map textual descriptions of these materials (e.g.: “LDPE”) to a food grade.

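If we did need that mapping, it could be as small as a lookup table. The grades below are invented purely for illustration and aren't real safety ratings:

// Hypothetical mapping from material codes to our 0-3 food grade scale.
const FOOD_GRADES = {
  'BPA-PC': 0, // banned-in-California, BPA-rich polycarbonate
  'PVC': 1,
  'PET': 2,
  'LDPE': 3,   // low density polyethylene
};

console.log(FOOD_GRADES['LDPE']); // 3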

Price per kilogram we can assume to be a property of the material given by its manufacturer.


Location, we’re going to simplify and assume to be a simple relative distance, as the crow flies. At the opposite end of the spectrum there’s the overengineered solution: using some API (e.g.: Google Maps) to discern the rough travel distance a given material would travel to reach Todd’s distribution center(s). Either way, let’s say we’re given it as a value (kilometres-to-Todd) in Todd’s PDFs.


Also, let’s consider the context we’re working in. Todd effectively operates as an information gatherer in a dynamic market. Products come in and out, and their details can change. This means we’ve got an arbitrary number of PDFs that can change — or more aptly, be updated — at any time.


So based on these constraints, we can finally figure out what we want our code to accomplish. If you’d like to test your design ability, pause here and consider how you’d structure your solution. It might not look the same as what I’m about to describe. That’s fine, as long as you’re providing a sane workable solution for Todd, and something you wouldn’t tear your hair out later trying to maintain.


Basic Architecture: Solutions

So we’ve got an arbitrary number of PDFs, and some rules for how to parse them. Here’s how we can do it:


  1. Set up a Stream object that can read from some input. Like a HTTP client requesting PDF downloads. Or a module we’ve written that reads PDF files from a directory in the file system.

  2. Set up an intermediary Buffer. This is like the waiter in a restaurant delivering a finished dish to its intended customer. Every time a full PDF gets passed into the stream, we flush those chunks into the buffer so it can be transported.


  3. The waiter (Buffer) delivers the food (PDF data) to the customer (our Parsing function). The customer does what they please (convert to some spreadsheet format) with it.

  4. When the customer (Parser) is done, let the waiter (Buffer) know that they’re free and can work on new orders (PDFs).


You’ll notice that there’s no clear end to this process. As a restaurant, our Stream-Buffer-Parser combo never finishes, until of course there’s no more data — no more orders — coming in.


Now I know there’s not a lick of code just yet. This is crucial. It’s important to be able to reason about our systems prior to writing them. Now, we won’t get everything right the first time even with a priori reasoning. Things always break in the wild. Bugs need to be fixed.


That said, it’s a powerful exercise in restraint and foresight to plan out your code prior to writing it. If you can simplify systems of increasing complexity into manageable parts and analogies, you’ll be able to increase your productivity exponentially, as the cognitive stress from those complexities fades into well-designed abstractions.


So in the grand scheme of things, it looks something like this:


Introducing Dependencies

Now as a disclaimer, I should add that there is a whole world of thought around introducing dependencies into your code. I’d love to cover this concept in another post. In the meantime let me just say that one of the fundamental conflicts at play is the one between our desire to get our work done quickly (i.e.: to avoid NIH syndrome), and our desire to avoid third-party risk.


Applying this to our project, I opted to offload the bulk of our PDF processing to the pdfreader module. Here are a few reasons why:


  • It was published recently, which is a good sign that the repo is up-to-date.

  • It has one dependency — that is, it’s just an abstraction over another module — which is regularly maintained on GitHub. This alone is a great sign. Moreover, the dependency, a module called pdf2json, has hundreds of stars, 22 contributors, and plenty of eyeballs keeping a close eye on it.


  • The maintainer, Adrian Joly, does good bookkeeping in GitHub’s issue tracker and actively tends to users and developers’ questions.


  • When auditing via NPM (6.4.1), no vulnerabilities are found.


So all in all, it seems like a safe dependency to include.


Now, the module works in a fairly straightforward way, although its README doesn’t explicitly describe the structure of its output. The cliff notes:


  1. It exposes the PdfReader class to be instantiated


  2. This instance has two methods for parsing a PDF. They return the same output and only differ in the input: PdfReader.parseFileItems for a filename, and PdfReader.parseBuffer from data that we don’t want to reference from the filesystem.


  3. The methods ask for a callback, which gets called each time the PdfReader finds what it denotes as a PDF item. There are three kinds. First, the file metadata, which is always the first item. Second is page metadata. It acts as a carriage return for the coordinates of text items to be processed. Last is text items which we can think of as simple objects / structs with a text property, and floating-point 2D AABB coordinates on the page.


  4. It’s up to our callback to process these items into a data structure of our choice and also to handle any errors thrown to it.


Here’s a code snippet as an example:


const { PdfReader } = require('pdfreader');

// Initialise the reader
const reader = new PdfReader();

// Read some arbitrarily defined buffer
reader.parseBuffer(buffer, (err, item) => {

  if (err)
    console.error(err);

  else if (!item)
    /* pdfreader queues up the items in the PDF and passes them to
     * the callback. When no item is passed, it's indicating that
     * we're done reading the PDF. */
    console.log('Done.');

  else if (item.file)
    // File items only reference the PDF's file path.
    console.log(`Parsing ${item.file && item.file.path || 'a buffer'}`)

  else if (item.page)
    // Page items simply contain their page number.
    console.log(`Reached page ${item.page}`);

  else if (item.text) {

    // Text items have a few more properties:
    const itemAsString = [
      item.text,
      'x: ' + item.x,
      'y: ' + item.y,
      'w: ' + item.width,
      'h: ' + item.height,
    ].join('\n\t');

    console.log('Text Item: ', itemAsString);
  }

});
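For completeness, here's one way you might feed parseBuffer in practice, by reading a file into memory first. The file name is a placeholder, and the callback is a trimmed-down version of the one above:

const fs = require('fs');
const { PdfReader } = require('pdfreader');

// 'sample.pdf' is a placeholder path for illustration.
fs.readFile('sample.pdf', (err, buffer) => {
  if (err) return console.error(err);

  new PdfReader().parseBuffer(buffer, (err, item) => {
    if (err) console.error(err);
    else if (!item) console.log('Done.');
    else if (item.text) console.log(item.text);
  });
});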

Todd's PDFs

Let's return to the Todd situation, just to provide some context. We want to store the data on pacifiers based on three key criteria:


  • their food-grade, to preserve babies’ health,

  • their cost, for economical production, and

  • their distance to Todd, to support the company’s marketing copy stating that their supply chain is also ethical and pollutes as little as possible.


I’ve hardcoded a simple script that randomizes some dummy products, and you can find it in the /data directory of the companion repo for this project. That script writes that randomized data to JSON files.


There’s also a template document in there. If you’re familiar with templating engines like Handlebars, then you’ll understand this. There are online services — or if you’re feeling adventurous, you can roll your own — that take JSON data and fill in the template, and give it back to you as a PDF. Maybe for completeness’ sake, we can try that out in another project. Anyway: I’ve used such a service to generate the dummy PDFs we’ll be parsing.


Here’s what one looks like (extra whitespace has been cropped out):


We’d like to yield from this PDF some JSON that gives us:


  • the requisition ID and date, for bookkeeping purposes,

  • the SKU of the pacifier, for unique identification, and

  • the pacifier’s properties (name, food grade, unit price, and distance), so Todd can actually use them in his work.


How do we do this?


Reading the Data

First let’s set up the function for reading data out of one of these PDFs, and extracting pdfreader’s PDF items into a usable data structure. For now, let’s have an array representing the document. Each item in the array is an object representing a collection of all text elements on the page at that object’s index. Each property in the page object has a y-value for its key, and an array of the text items found at that y-value for its value. Here’s the diagram, so it’s simpler to understand:


The readPDFPages function in /parser/index.js handles this, similarly to the example code written above:


/* Accepts a buffer (e.g.: from fs.readFile), and parses
 * it as a PDF, giving back a usable data structure for
 * application-specific, second-level parsing.
 */
function readPDFPages (buffer) {
  const reader = new PdfReader();

  // We're returning a Promise here, as the PDF reading
  // operation is asynchronous.
  return new Promise((resolve, reject) => {

    // Each item in this array represents a page in the PDF
    let pages = [];

    reader.parseBuffer(buffer, (err, item) => {

      if (err)
        // If we've got a problem, eject!
        reject(err)

      else if (!item)
        // If we're out of items, resolve with the data structure
        resolve(pages);

      else if (item.page)
        // If the parser's reached a new page, it's time to
        // work on the next page object in our pages array.
        pages.push({});

      else if (item.text) {

        // If we have NOT got a new page item, then we need
        // to either retrieve or create a new "row" array
        // to represent the collection of text items at our
        // current Y position, which will be this item's Y
        // position.

        // Hence, this line reads as,
        // "Either retrieve the row array for our current page,
        //  at our current Y position, or make a new one"
        const row = pages[pages.length - 1][item.y] || [];

        // Add the item to the reference container (i.e.: the row)
        row.push(item.text);

        // Include the container in the current page
        pages[pages.length - 1][item.y] = row;
      }
    });
  });
}
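To try it out, a quick test harness might look something like the following. The path is a placeholder, and it assumes readPDFPages is exported from the parser module:

const fs = require('fs');
const { readPDFPages } = require('./parser'); // hypothetical export

fs.readFile('./data/sample-requisition.pdf', (err, buffer) => {
  if (err) return console.error(err);

  readPDFPages(buffer)
    .then((pages) => console.log(JSON.stringify(pages, null, 2)))
    .catch(console.error);
});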

So now passing a PDF buffer into that function, we’ll get some organized data. Here’s what I got from a test run, and printing it to JSON:


[ { '3.473': [ 'PRODUCT DETAILS REQUISITION' ],
    '4.329': [ 'Date: 23/05/2019' ],
    '5.185': [ 'Requsition ID: 298831' ],
    '6.898': [ 'Pacifier Tech', 'Todd Lerr' ],
    '7.754': [ '123 Example Blvd', 'DummEth Pty. Ltd.' ],
    '8.61': [ 'Timbuktu', '1337 Leet St' ],
    '12.235': [ 'SKU', '6308005' ],
    '13.466': [ 'Product Name', 'Square Lemon Qartz Pacifier' ],
    '14.698': [ 'Food Grade', '3' ],
    '15.928999999999998': [ '$ / kg', '1.29' ],
    '17.16': [ 'Location', '55' ] } ]

If you look carefully you’ll notice that there’s a spelling error in the original PDF. “Requisition” is misspelled as “Requsition”. The beauty of our parser is that we don’t particularly care for errors like these in our input documents. As long as they’re structured correctly, we can extract data from them accurately.


Now we just need to organize this into something a bit more usable (as if we’d expose it via API). The structure we’re looking for is something along the lines of this:


{
  reqID: '000000',
  date: 'DD/MM/YYYY', // Or something else based on geography
  sku: '000000',
  name: 'Some String We Have Trimmed',
  foodGrade: 'X',
  unitPrice: 'D.CC',  // D for Dollars, C for Cents
  location: 'XX',
}

An Aside: Data Integrity

Why are we including the numbers as strings? It's based on the risk of parsing. Let's just say that we coerced all of our number-like fields to actual numbers:


The unit price and location would be fine — they are supposed to be countable numbers after all.


The food grade, for this very limited project, is technically safe. No data gets lost when we coerce it — but since it's effectively a classifier, like an Enum, it's better off kept as a string.


The requisition ID and SKU however, if coerced to numbers, could lose important data. If the ID for a given requisition starts with three zeros and we coerce that to a number, well, we've just lost those zeros and we've garbled the data.


So because we want data integrity when reading the PDFs, we just leave everything as a String. If the application code wants to convert some fields to numbers to make them usable for arithmetic or statistical operations, then we’ll let the coercion occur at that layer. Here we just want something that parses PDFs consistently and accurately.

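In other words, coercion is an application-layer decision. A sketch of what that might look like downstream, with a made-up record:

// `record` stands in for one parsed object like the example above.
const record = {
  reqID: '000123',
  unitPrice: '1.29',
  location: '55',
};

const unitPrice = Number(record.unitPrice); // 1.29, safe to use in arithmetic
const location = Number(record.location);   // 55, also safe
const reqID = record.reqID;                 // left alone: '000123' is not 123

console.log(reqID, unitPrice * 2, location);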

Restructuring the Data

So now we’ve got Todd’s information, we just need to organize it in a usable way. We can use a variety of array and object manipulation functions, and here MDN is your friend.


This is the step where everyone has their own preferences. Some prefer the method that just gets the job done and minimizes dev time. Others prefer to scout for the best algorithm for the job (e.g.: cutting down iteration time). It’s a good exercise to see if you can come up with a way to do this and compare it to what I got. I’d love to see better, simpler, faster, or even just different ways to accomplish the same goal.


Anyway, here’s how I did it: the parseToddPDF function in /parser/index.js.


function parseToddPDF (pages) {

  const page = pages[0]; // We know there's only going to be one page

  // Declarative map of PDF data that we expect, based on Todd's structure
  const fields = {
    // "We expect the reqID field to be on the row at 5.185, and the
    //  first item in that array"
    reqID: { row: '5.185', index: 0 },
    date: { row: '4.329', index: 0 },
    sku: { row: '12.235', index: 1 },
    name: { row: '13.466', index: 1 },
    foodGrade: { row: '14.698', index: 1 },
    unitPrice: { row: '15.928999999999998', index: 1 },
    location: { row: '17.16', index: 1 },
  };

  const data = {};

  // Assign the page data to an object we can return, as per
  // our fields specification
  Object.keys(fields)
    .forEach((key) => {

      const field = fields[key];
      const val = page[field.row][field.index];

      // We don't want to lose leading zeros here, and can trust
      // any application / data handling to worry about that. This is
      // why we don't coerce to Number.
      data[key] = val;
    });

  // Manually fixing up some text fields so they're usable
  data.reqID = data.reqID.slice('Requsition ID: '.length);
  data.date = data.date.slice('Date: '.length);

  return data;
}

The meat and potatoes here is in the forEach loop, and how we’re using it. After retrieving the Y positions of each text item previously, it’s simple to specify each field we want as a position in our pages object. Effectively providing a map to follow.


All we have to do then is declare a data object to output, iterate over each field we specified, follow the route as per our spec, and assign the value we find at the end to our data object.


After a few one-liners to tidy up some string fields, we can return the data object and we’re off to the races. Here’s what it looks like:


{ reqID: '298831',
  date: '23/05/2019',
  sku: '6308005',
  name: 'Square Lemon Qartz Pacifier',
  foodGrade: '3',
  unitPrice: '1.29',
  location: '55' }
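For reference, wiring the two helpers together might look something like this. It's a sketch: the export names and the sample path are assumptions, and the companion repo may organise things differently.

const { readFile } = require('fs').promises;
const { readPDFPages, parseToddPDF } = require('./parser'); // hypothetical exports

async function parseRequisition (filePath) {
  const buffer = await readFile(filePath);
  const pages = await readPDFPages(buffer);
  return parseToddPDF(pages);
}

parseRequisition('./data/sample-requisition.pdf') // placeholder path
  .then(console.log)
  .catch(console.error);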

Putting it all together

Now we’ll move on to building out some concurrency for this parsing module so we can operate at scale, and recognize some important barriers to doing so. The diagram above is great for understanding the context of the parsing logic. It doesn’t do much for understanding how we’re going to parallelize it. We can do better:


Trivial, I know, and arguably way too textbook-y generalized for us to practically use, but hey, it’s a fundamental concept to formalize.


Now first and foremost we need to think about how we’re going to handle the input and output of our program, which will essentially be wrapping the parsing logic and then distributing it amongst parser worker processes. There are many questions we can ask here and many solutions:


  • Is it going to be a command line application?

  • Is it going to be a consistent server, with a set of API endpoints? This has its own host of questions — REST or GraphQL, for example?

  • Maybe it’s just a skeleton module in a broader codebase — for example, what if we generalized our parsing across a suite of binary documents and wanted to separate the concurrency model from the particular source file type and parsing implementation?


For simplicity’s sake, I’m going to wrap the parsing logic in a command-line utility. This means it’s time to make a bunch of assumptions:


  • Does it expect file paths as input, and are they relative or absolute?

  • Or instead, does it expect concatenated PDF data, to be piped in?

  • Is it going to output data to a file? Because if it is, then we’re going to have to provide that option as an argument for the user to specify…


Handling Command Line Input

Again, keeping things as simple as possible: I’ve opted for the program to expect a list of file paths, either as individual command line arguments:


node index file-1.pdf file-2.pdf … file-n.pdf

Or piped to standard input as a newline-separated list of file paths:


# read lines from a text file with all our paths
cat files-to-parse.txt | node index

# or perhaps just list them from a directory
find ./data -name "*.pdf" | node index

This allows the Node process to manipulate the order of those paths in any way it sees fit, which allows us to scale the processing code later. To do this, we’re going to read the list of file paths, whichever way they were provided, and divvy them up by some arbitrary number into sub-lists. Here’s the code, the getTerminalInput method in ./input/index.js:


function getTerminalInput (subArrays) {

  return new Promise((resolve, reject) => {

    const output = [];

    if (process.stdin.isTTY) {

      const input = process.argv.slice(2);

      const len = Math.min(subArrays, Math.ceil(input.length / subArrays));

      while (input.length) {
        output.push(input.splice(0, len));
      }

      resolve(output);

    } else {

      let input = '';
      process.stdin.setEncoding('utf-8');

      process.stdin.on('readable', () => {
        let chunk;
        while (chunk = process.stdin.read())
          input += chunk;
      });

      process.stdin.on('end', () => {
        input = input.trim().split('\n');

        const len = Math.min(input.length, Math.ceil(input.length / subArrays));

        while (input.length) {
          output.push(input.splice(0, len));
        }

        resolve(output);
      });
    }
  });
}

Why divvy up the list? Let’s say that you have an 8-core CPU on consumer-grade hardware, and 500 PDFs to parse.


Unfortunately for Node, even though it handles asynchronous code fantastically thanks to its event loop, it only runs on one thread. To process those 500 PDFs, if you’re not running multithreaded (i.e.: multiple process) code, you’re only using an eighth of your processing capacity. Assuming that memory efficiency isn’t a problem, you could process the data up to eight times faster by taking advantage of Node’s built-in parallelism modules.


Splitting up our input into chunks allows us to do that.

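As a back-of-the-envelope illustration of the splitting (the values below are made up, not from the repo): with 500 paths and 8 cores we want 8 sub-lists of roughly 63 paths each.

const os = require('os');

// 500 fake file paths, purely for illustration.
const filePaths = Array.from({ length: 500 }, (_, i) => `file-${i}.pdf`);

const numCPUs = os.cpus().length;                         // e.g. 8
const chunkSize = Math.ceil(filePaths.length / numCPUs);  // 500 / 8 -> 63

const chunks = [];
while (filePaths.length) {
  chunks.push(filePaths.splice(0, chunkSize));
}

console.log(chunks.length, chunks[0].length); // e.g. 8 63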

As an aside, this is essentially a primitive load balancer and clearly assumes that the workloads presented by parsing each PDF are interchangeable. That is, that the PDFs are the same size and hold the same structure.


This is obviously a trivial case, especially since we’re not taking into account error handling in worker processes and which worker is currently available to handle new loads. In the case where we would have set up an API server to handle incoming parsing requests, we would have to consider these extra needs.


Clustering our code

Now that we have our input split into manageable workloads, admittedly in a contrived way — I’d love to refactor this later — let’s go over how we can cluster it. So it turns out Node has two separate modules for setting up parallel code.


The one we’re going to use, the cluster module, basically allows a Node process to spawn copies of itself and balance processing between them as it sees fit.


This is built on top of the child_process module, which is less tightly coupled with parallelizing Node programs themselves and allows you to spawn other processes, like shell programs or another executable binary, and interface with them using standard input, output, et cetera.

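For contrast, a minimal child_process example might look like this. It isn't used in this project, and it assumes a Unix-like system with grep on the PATH, but it shows the "spawn another program and talk to it over stdin/stdout" style that cluster builds on.

const { spawn } = require('child_process');

// Spawn `grep 400` and pipe its output straight to our own stdout.
const grep = spawn('grep', ['400']);
grep.stdout.pipe(process.stdout);

// Feed it some lines over stdin, then close the stream.
grep.stdin.write('GET /index.html 400\nGET /style.css 200\n');
grep.stdin.end();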

I highly recommend reading through the API docs for each module, since they're fantastically written. And even if you're like me and find purposeless manual reading boring and total busy-work, at least familiarising yourself with the introductions to each module will help you ground yourself in the topic and expand your knowledge of the Node ecosystem.


So let’s walk through the code. Here it is in bulk:


const cluster = require('cluster');
const numCPUs = require('os').cpus().length;

const { getTerminalInput } = require('./input');

(async function main () {

  if (cluster.isMaster) {

    const workerData = await getTerminalInput(numCPUs);

    for (let i = 0; i < workerData.length; i++) {

      const worker = cluster.fork();
      const params = { filenames: workerData[i] };

      worker.send(params);
    }

  } else {

    require('./worker');
  }

})();

So our dependencies are pretty simple. First, there’s the cluster module as described above. Second, we’re requiring the os module for the express purpose of figuring out how many CPU cores there are on our machine — which is a fundamental parameter of splitting up our workload. Finally, there’s our input handling function which I’ve outsourced to another file for completeness’ sake.


Now the main method is actually rather simple. In fact, we could break it down into steps:


  1. If we’re the main process, split up the input sent to us evenly per the number of CPU cores for this machine

  2. For each worker-to-be's load, spawn a worker by cluster.fork and set up an object which we can send to it by the cluster module's inter-process RPC message channel, and send the damn thing to it.


  3. If we’re not in fact the main module, then we must be a worker — just run the code in our worker file and call it a day.


Nothing crazy is going on here, and it allows us to focus on the real lifting, which is figuring out how the worker is going to use the list of filenames we give to it.


Messaging, Async, and Streams, all the elements of a nutritious diet

First, as above let me dump the code for you to refer to. Trust me, looking through it first will let you skip through any explanation you’d consider trivial.


const Bufferer = require('../bufferer');
const Parser = require('../parser');
const { createReadStream } = require('fs');

process.on('message', async (options) => {

  const { filenames } = options;
  const parser = new Parser();

  const parseAndLog = async (buf) => console.log(await parser.parse(buf) + ',');

  const parsingQueue = filenames.reduce(async (result, filename) => {

    await result;

    return new Promise((resolve, reject) => {

      const reader = createReadStream(filename);
      const bufferer = new Bufferer({ onEnd: parseAndLog });

      reader
        .pipe(bufferer)
        .once('finish', resolve)
        .once('error', reject);
    });
  }, true);

  try {
    await parsingQueue;
    process.exit(0);
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
});

Now there are some dirty hacks in here so be careful if you’re one of the uninitiated (only joking). Let’s look at what happens first:


Step one is to require all the necessary ingredients. Mind you, this is based on what the code itself does. So let me just say we’re going to use a custom-rolled Writable stream I’ve endearingly termed Bufferer, a wrapper for our parsing logic from last time, also intricately named, Parser, and good old reliable createReadStream from the fs module.


Now here’s where the magic happens. You’ll notice that nothing’s actually wrapped in a function. The entire worker code is just waiting for a message to come to the process — the message from its master with the work it has to do for the day. Excuse the medieval language.


So we can see first of all that it’s asynchronous. First, we extract the filenames from the message itself — if this were production code I’d be validating them here. Actually, hell, I’d be validating them in our input processing code earlier. Then we instantiate our parsing object — only one for the whole process — this is so we can parse multiple buffers with one set of methods. A concern of mine is that it’s managing memory internally, and on reflection, this is a good thing to review later.


Then there's a simple wrapper around parsing, parseAndLog, which logs the JSON-ified result of parsing a PDF buffer with a comma appended to it, just to make life easier when concatenating the results of parsing multiple PDFs.


Finally the meat of the matter, the asynchronous queue. Let me explain:


This worker’s received its list of filenames. For each filename (or path, really), we need to open a readable stream through the filesystem so we can get the PDF data. Then, we need to spawn our Bufferer, (our waiter, following along from the restaurant analogy earlier), so we can transport the data to our Parser.


The Bufferer is custom-rolled. All it really does is accept a function to call when it’s received all the data it needs — here we’re just asking it to parse and log that data.

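The actual Bufferer lives in the companion repo, so treat the following as a sketch of the idea rather than the real implementation: a Writable that collects chunks and hands the concatenated buffer to an onEnd callback once the stream finishes.

const { Writable } = require('stream');

class Bufferer extends Writable {
  constructor ({ onEnd }) {
    super();
    this.chunks = [];   // incoming chunks pile up here
    this.onEnd = onEnd; // called once, with the full buffer
  }

  _write (chunk, encoding, callback) {
    this.chunks.push(chunk);
    callback();
  }

  _final (callback) {
    // All the data has arrived: hand it over in one piece.
    Promise.resolve(this.onEnd(Buffer.concat(this.chunks)))
      .then(() => callback())
      .catch(callback);
  }
}

module.exports = Bufferer;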

So, now we have all the pieces, we just pipe them together:


  1. The readable stream — the PDF file, pipes to the Bufferer

  2. The Bufferer finishes and calls our worker-wide parseAndLog method


This entire process is wrapped in a Promise, which itself is returned to the reduce function it sits inside. When it resolves, the reduce operation continues.

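Stripped of the PDF specifics, the pattern looks like this: reduce over the items, and make each step await the accumulator (the chain so far) before doing its own work.

// A distilled version of the sequential queue, with stand-in async work.
const jobs = ['a', 'b', 'c'];

const queue = jobs.reduce(async (previous, job) => {
  await previous; // wait for everything queued before us

  console.log('starting', job);
  await new Promise((resolve) => setTimeout(resolve, 100)); // pretend work
  console.log('finished', job);
}, Promise.resolve());

queue.then(() => console.log('all done'));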

This asynchronous queue is actually a really useful pattern, so I’ll cover it in more detail in my next post, which will probably be more bite-sized than the last few.


Anyway, the rest of the code just ends the process based on error-handling. Again, if this were production code, you can bet there’d be more robust logging and error handling here, but as a proof of concept, this seems alright.


So it works, but is it useful?

So there you have it. It was a bit of a journey, and it certainly works, but like any code, it’s important to review what its strengths and weaknesses are. Off the top of my head:


  • Streams have to be piled up in buffers. This, unfortunately, defeats the purpose of using streams, and memory efficiency suffers accordingly. This is a necessary duct-tape fix to work with the pdfreader module. I'd love to see if there's a way to stream PDF data and parse it on a finer-grained level. Especially if modular, functional parsing logic can still be applied to it.


  • In this baby stage, the parsing logic is also annoyingly brittle. Just think, what if I have a document that’s longer than a page? A bunch of assumptions fly out the window and make the need for streaming PDF data even stronger.

  • Finally, it would be great to see how we could build out this functionality with logging and API endpoints to provide to the public — for a price, or pro bono, depending on the contexts in which it’s used.


If you've got any specific criticisms or concerns I'd love to hear them too, since spotting weaknesses in the code is the first step to fixing them. And, if you're aware of any better method for streaming and parsing PDFs concurrently, let me know so I can leave it here for anyone reading through this post for an answer. Either way — or for any other purpose — send me an email or get in touch on Reddit.


Originally published at: https://www.freecodecamp.org/news/how-to-parse-pdfs-at-scale-in-nodejs-what-to-do-and-what-not-to-do-541df9d2eec1/
