pdf 复制文本乱码_如何在保留格式的同时从PDF复制文本？

最新推荐文章于 2021-06-03 05:08:02 发布

culintai3473

最新推荐文章于 2021-06-03 05:08:02 发布

阅读量1.3k

点赞数

文章标签： python java 大数据正则表达式字符串

原文链接：https://www.howtogeek.com/136698/how-can-i-copy-text-from-a-pdf-while-preserving-the-formatting/

版权

pdf 复制文本乱码

PDF, the ubiquitous document format, is great for sharing documents while preserving fonts, images, and the general layout across platforms. Is there an easy way, however, to preserve that very formatting when copying and pasting text out of the document?

PDF是无处不在的文档格式，非常适合共享文档，同时保留跨平台的字体，图像和总体布局。但是，在从文档中复制和粘贴文本时，是否有一种简单的方法来保留这种格式？

Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

今天的“问答”环节由SuperUser提供，它是Stack Exchange的一个分支，该社区是由社区驱动的Q＆A网站分组。

问题 (The Question)

SuperUser reader Colen is searching for a way to extract text from PDFs while preserving the formatting:

超级用户阅读器Colen正在寻找一种在保留格式的同时从PDF提取文本的方法：

When I copy text out of a PDF file and into a text editor, it ends up mangled in a variety of ways. Formatting like bold and italics are lost; soft line breaks within a paragraph of text are converted to hard line breaks; dashes to break a word over two lines are preserved even when they shouldn’t be; and single and double quotes are replaced with ? signs.

当我将文本从PDF文件复制到文本编辑器中时，它最终会以各种方式被破坏。像粗体和斜体这样的格式会丢失；文本段落中的软换行符转换为硬换行符；即使在不应该使用破折号的情况下也保留了两行破折号；单引号和双引号替换为？迹象。

Ideally, I’d like to be able to copy text from a PDF and have formatting converted to HTML codes, “smart quotes” converted to ” and ‘, and line breaks done properly. Is there any way to do this?

理想情况下，我希望能够从PDF复制文本，并将格式转换为HTML代码，将“智能引号”转换为“和”，并正确完成换行符。有什么办法吗？

Is there a quick and easy way for Colen (and the rest of us) to get grab text without sacrificing the formatting?

Colen(还有我们其他人)是否有一种快速简便的方法来获取抓取文本而不牺牲格式？

答案 (The Answer)

SuperUser contributor Frabjous offers a solution combined with a heavy dose of caution:

超级用户贡献者Frabjous提供了一种解决方案，并需要特别注意：

Firstly, you have to understand what a PDF is. PDFs are designed to mimic a printed page, and they are designed only as an output format, not an input format. a PDF is basically a map containing the exact location of characters (individual letters or punctuation, etc.) or images. In most cases, a PDF does not even store information about where one word ends and another begins, much less things like soft breaks vs. hard breaks for paragraph endings.

首先，您必须了解什么是PDF。 PDF旨在模仿打印的页面，并且它们仅被设计为输出格式，而不是输入格式。 PDF基本上是一张包含字符(各个字母或标点符号等)或图像的确切位置的地图。在大多数情况下，PDF甚至不存储有关一个单词的结尾和另一个单词的开头的信息，少了诸如段落结尾的软中断与硬中断之类的信息。

(A few recent PDFs do store some information about this stuff, but that’s a new technology, and you’d be lucky to find PDFs like that. Even if you did, your PDF viewer might not know about it.)

(最近的一些PDF确实存储了有关此内容的一些信息，但这是一项新技术，您很幸运能够找到这样的PDF。即使您这样做，您的PDF查看器也可能不知道它。)

Anyway, it’s up to your software to implement some kind of “artificial intelligence” to extract merely from the locations of individual characters what is a word, what is a paragraph, and so on. Different software is going to do this better than others, and it’s also going to depend on how the PDF was made. In any case, you should never expect perfect results. Having the output PDF is not the same as having the source document. Far better to try to obtain that if you can.

无论如何，要由软件来实现某种“人工智能”，以仅从单个字符的位置提取什么是单词，什么是段落等。不同的软件将比其他软件做得更好，而且还取决于PDF的制作方式。无论如何，您永远都不应期望获得完美的结果。具有输出PDF与具有源文档是不同的。如果可以的话，尝试获得更好的选择。

The standard solution to your kind of problem is to use Adobe Acrobat Professional (the expensive one, not the free reader) to convert the PDF to HTML. Even that is not going to get perfect results.

解决此类问题的标准方法是使用Adobe Acrobat Professional(价格昂贵，而不是免费的阅读器)将PDF转换为HTML。即使那样也不会取得完美的结果。

There is free software that can be used to extract text from PDFs with some of formatting intact, but again, don’t expect perfect results. See, e.g., calibre (which can convert to RTF format), pdftohtml/pdfreflow, or the AbiWord word processor (with all import/export plugins enabled). There’s also a PDF import plugin for OpenOffice.

有一些免费软件可用于从PDF中提取格式完整的文本，但同样，不要指望完美的结果。请参见例如口径(可以转换为RTF格式) ， pdftohtml / pdfreflow或AbiWord文字处理器 (启用所有导入/导出插件)。还有一个用于OpenOffice的PDF导入插件。

But please don’t expect perfection with any of these results. You’re going against the grain here. PDF just is not meant as an editable input format.

但是，请不要指望这些结果中的任何一个都是完美的。你在这里反对谷物。 PDF并不意味着它是可编辑的输入格式。

If you are having trouble deciding which tool to start with, Calibre is a veritable document Swiss Army knife. You can also use it to convert PDF files for use on your ebook reader and organize your ebook/document library.

如果您在决定使用哪种工具时遇到麻烦，Calibre是名副其实的瑞士军刀。您还可以使用它来转换PDF文件以在电子书阅读器上使用，以及整理电子书/文档库。

Have something to add to the explanation? Sound off in the the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.

有什么补充说明吗？在评论中听起来不对。是否想从其他精通Stack Exchange的用户那里获得更多答案？在此处查看完整的讨论线程。