by Tom

How to parse PDFs at scale in NodeJS: what to do and what not to do

Take a step into program architecture and learn how to build a practical solution to a real business problem with NodeJS Streams.

A Detour: Fluid Mechanics

One of the greatest strengths of software is that we can develop abstractions which let us reason about code, and manipulate data, in ways we can understand. Streams are one such class of abstraction.

In simple fluid mechanics, the concept of a streamline is useful for reasoning about the way fluid particles will move, and the constraints applied to them at various points in a system.

For example, say you’ve got some water flowing through a pipe uniformly. Halfway down the pipe, it branches. Generally, the water flow will split evenly into each branch. Engineers use the abstract concept of a streamline to reason about the water’s properties, such as its flow rate, for any number of branches or complex pipeline configurations. If you asked an engineer what flow rate they would assume through each branch, they would intuitively, and rightly, reply “one half”. Mathematically, this extends to an arbitrary number of streamlines.

Streams, conceptually, are to code what streamlines are to fluid mechanics. We can reason about data at any given point by considering it as part of a flow, rather than worrying about the implementation details of how it’s stored. Arguably you could generalize this to some universal concept of a pipeline that we can use across disciplines. A sales funnel comes to mind, but that’s tangential and we’ll cover it later. The best example of streams, and one you absolutely must familiarise yourself with if you haven’t already, is UNIX pipes:

cat server.log | grep 400 | less

We affectionately call the | character a pipe. Based on its function, we’re piping the output of one program in as the input of another, effectively setting up a pipeline.

(Also, it looks like a pipe.)

If you’re like me and wonder at this point why this is necessary, ask yourself why we use pipelines in real life. Fundamentally, it’s a structure that eliminates storage between processing points. We don’t need to worry about storing barrels of oil if it’s pumped.

Now apply that to software. The clever developers and engineers who wrote the code for piping data set it up such that it never occupies too much memory on a machine. No matter how big the logfile above is, it won’t hang the terminal. The entire program is a process handling infinitesimal data points in a stream, rather than containers of those points. The logfile never gets loaded into memory all at once, but rather in manageable parts.
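To make that concrete, here is a minimal sketch of what the `cat server.log | grep 400` part of the pipeline might look like in Node. It is not taken from the article's repo: the `grepLines` helper is a hypothetical name, and the sketch assumes plain-text, newline-delimited input. It uses Node's built-in `readline` module to walk a readable stream one line at a time, so only the current chunk sits in memory.

```javascript
const readline = require('node:readline');

// Hypothetical helper (not from the original article): lazily yields every
// line of `input` that contains `needle`. `input` can be any readable stream.
async function* grepLines(input, needle) {
  // readline splits the incoming chunks into lines as they arrive;
  // crlfDelay: Infinity treats \r\n as a single line break.
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  for await (const line of rl) {
    if (line.includes(needle)) {
      yield line;
    }
  }
}

// Example usage, assuming a file named server.log exists:
//
//   const fs = require('node:fs');
//   (async () => {
//     for await (const line of grepLines(fs.createReadStream('server.log'), '400')) {
//       console.log(line);
//     }
//   })();
```

Because the helper is an async generator, the caller pulls matching lines one by one instead of receiving the whole result at once, which mirrors how each stage of a UNIX pipeline hands data to the next.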

I don’t want to reinvent the wheel here. Now that I’ve covered a metaphor for streams and the rationale for using them, I’ll point you to Flavio Copes, who has a great blog post covering how they’re implemented in Node. Take as long as you need to cover the basics there, and when you’re ready, come back and we’ll go over a use case.

The Situation

So, now that you’ve got this tool in your toolbelt, picture this:

You’re on the job and your manager / legal / HR / your client / (insert stakeholder here) has approached you with a problem. They spend way too long poring over structured PDFs. Of course, normally people won’t tell you such a thing. You’ll hear, “I spend 4 hours doing data entry.” Or “I look through price tables.” Or, “I fill out the right forms so we get our company branded pencils every quarter”.

Whatever it is, if their work happens to involve both (a) the reading of structured PDF documents and (b) the bulk usage of that structured information, then you can step in and say, “Hey, we might be able to automate that and free up your time to work on other things”.

So for the sake of this article, let’s come up with a dummy company. Where I come from, the term “dummy” refers to either an idiot or a baby’s pacifier. So let’s imagine up this fake company that manufactures pacifiers. While we’re at it let’s jump the shark and say they’re 3D printed. The company operates as an ethical supplier of pacifiers to the needy who can’t afford the premium stuff themselves.

(I know how dumb it sounds; suspend your disbelief, please.)

Todd sources the printing materials that go into DummEth’s products, and has to ensure that they meet three key criteria:

  • they’re food-grade plastic, to preserve babies’ health,

  • they’re cheap, for economical production, and

  • they’re sourced as close as possible, to support the company’s marketing copy stating that their supply chain is also ethical and pollutes as little as possible.


The Project

To make it easier to follow along, I’ve set up a GitLab repo you can clone and use. Make sure your installations of Node and NPM are up to date too.

Basic Architecture: Constraints

Now, what are we trying to do? Let’s assume that Todd works well in spreadsheets, like a lot of office workers. To sort the proverbial 3D printing wheat from the chaff, Todd needs to gauge materials by food grade, price per kilogram, and location. It’s time to set some project constraints.

Let’s assume that a material’s food grade is rated on a scale from zero to three, with zero meaning banned-in-California BPA-rich plastics and three meaning commonly used non-contaminating materials, like low-density polyethylene. This is purely to simplify our code. In reality we’d have to somehow map textual descriptions of these materials (e.g. “LDPE”) to a food grade.

Price per kilogram we can assume to be a property of the material given by its manufacturer.

Location, we’re going to simplify and assume to be a simple relative distance, as the crow flies. At the opposite end of the spectrum there’s the overengineered solution: using some API (e.g.: Google Maps) to discern the rough travel distance a given material would travel to reach Todd’s distribution center(s). Either way, let’s say we’re given it as a value (kilometres-to-Todd) in Todd’s PDFs.
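As a rough sketch of how these three constraints could be captured in code, the snippet below models one material as a plain object and turns it into a spreadsheet-friendly CSV row. Everything here is an assumption for illustration: the function names (`makeMaterial`, `toCsvRow`) and field names (`name`, `foodGrade`, `pricePerKg`, `distanceKm`) are hypothetical and do not come from the article's repo.

```javascript
// Hypothetical record shape for one material, based on the constraints above.
// Field names are illustrative assumptions, not taken from the original project.
function makeMaterial({ name, foodGrade, pricePerKg, distanceKm }) {
  // Food grade is assumed to be an integer from 0 (worst) to 3 (best).
  if (!Number.isInteger(foodGrade) || foodGrade < 0 || foodGrade > 3) {
    throw new RangeError('foodGrade must be an integer from 0 to 3');
  }
  // Price and distance are assumed to be finite, non-negative numbers.
  if (!Number.isFinite(pricePerKg) || pricePerKg < 0) {
    throw new RangeError('pricePerKg must be a non-negative number');
  }
  if (!Number.isFinite(distanceKm) || distanceKm < 0) {
    throw new RangeError('distanceKm must be a non-negative number');
  }
  return { name, foodGrade, pricePerKg, distanceKm };
}

// Serialises a material as one CSV row. The name is quoted, and embedded
// double quotes are doubled, so commas in names do not break the columns.
function toCsvRow(material) {
  const quotedName = `"${String(material.name).replace(/"/g, '""')}"`;
  return [
    quotedName,
    material.foodGrade,
    material.pricePerKg,
    material.distanceKm,
  ].join(',');
}
```

Validating at construction time keeps bad values, such as a food grade of 7, from ever reaching Todd's spreadsheet.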

Also, let’s consider the context we’re working in. Todd effectively operates as an information gatherer in a dynamic market. Products come in and out, and their details can change. This means we’ve got an arbitrary number of PDFs that can change — or more aptly, be updated — at any time.

So based on these constraints, we can finally figure out what we want our code to accomplish. If you’d like to test your design ability, pause here and consider how you’d structure your solution. It might not look the same as what I’m about to describe. That’s fine, as long as you’re providing a sane workable solution for Todd, and something you wouldn’t tear your hair out later trying to maintain.
