Getting Started with Lucene: Parsing PDFs (a Detailed Walkthrough of Parsing Chinese PDFs with xpdf)

deepfuture

Article link: https://blog.csdn.net/deepfuture/article/details/5063971

Download xpdf and xpdf-chinese-simplified.tar.gz, then extract xpdf-chinese-simplified.tar.gz into the directory where xpdf is installed so that it forms a subdirectory.

http://www.foolabs.com/xpdf/download.html

The following packages are available:

Arabic: xpdf-arabic.tar.gz (1058 bytes)
Chinese/simplified: xpdf-chinese-simplified.tar.gz (835807 bytes)
Chinese/traditional: xpdf-chinese-traditional.tar.gz (794568 bytes)
Cyrillic: xpdf-cyrillic.tar.gz (1326 bytes)
Greek: xpdf-greek.tar.gz (1164 bytes)
Hebrew: xpdf-hebrew.tar.gz (1314 bytes)
Japanese: xpdf-japanese.tar.gz (494624 bytes)
Korean: xpdf-korean.tar.gz (470166 bytes)
Latin2: xpdf-latin2.tar.gz (1435 bytes)
Thai: xpdf-thai.tar.gz (1873 bytes)
Turkish: xpdf-turkish.tar.gz (1140 bytes)

Configuration instructions from the Chinese language package:

Xpdf: Chinese Simplified support package
========================================

Xpdf project: http://www.foolabs.com/xpdf/
2004-jul-27

If this package includes CMap files, they contain their own copyright
notices and distribution conditions. All other files in the package
are Copyright 2002-2004 Glyph & Cog, LLC, and are licensed under the
GNU General Public License (GPL), version 2.

This package provides support files needed to use the Xpdf tools with
Chinese (Simplified) PDF files.

Contents:
- Adobe-GB1 character collection support
- ISO-2022-CN encoding
- EUC-CN encoding
- GBK encoding

Place all of these files in a directory, typically:

Unix - /usr/local/share/xpdf/chinese-simplified
Win32 - C:/Program Files/xpdf/chinese-simplified

Add the contents of the "add-to-xpdfrc" file to your system-wide
xpdfrc config file, which is typically:

Unix - /usr/local/etc/xpdfrc
Win32 - C:/Program Files/xpdf/xpdfrc

Alternatively, on Unix systems you can add these lines to your
personal xpdfrc file in $HOME/.xpdfrc.

xpdf runs on the following platforms:

Precompiled binaries are available for the following machines:

x86, Linux (statically linked to Motif, t1lib, and FreeType):
xpdf-3.02pl4-linux.tar.gz (11985186 bytes)
SPARC, Solaris 10 (statically linked to t1lib and FreeType):
not currently available
x64, Solaris 10 (statically linked to t1lib and FreeType):
not currently available
x86, DOS/Win32 -- pdftops, pdftotext, pdfimages, pdfinfo, and pdffonts only:
Win32 (built with MSVC): xpdf-3.02pl4-win32.zip (2046671e bytes)
DOS6 (built with djgpp, with DPMI support from csdpmi5b): xpdf-3.02pl4-dos6.zip (1754621 bytes)

I've received reports of xpdf compiling successfully on the following systems (but binaries are not available on the net):

x86 and MIPS, SINIX V5.4 (email f.miane@opengroup.org for binaries) (xpdf 0.5)
Apollo 425e, DomainOS 10.4.1.2 (xpdf 0.5)
m68k (HP-9000/425), HP-UX 9.0 (xpdf 0.5)
Alpha, Linux (xpdf 0.7)
POWER, AIX 4.2.1, gcc 2.8.1 (xpdf 0.7a)
UltraSPARC 2, Linux 2.2.5 (xpdf 0.80)
SPARC, Solaris 2.7, gcc 2.8.1 (xpdf 0.90)
DG/UX (xpdf 0.90)
LynxOS 2.5.1 (xpdf 0.90)
HP-UX 10.20 and 11.00 (xpdf 0.90)
MacOS X / Darwin (xpdf 0.92)
QNX / X11 (xpdf 0.93)
x86, OpenBSD 3.0 (xpdf 1.00)
MacOS X / Darwin (xpdf 2.03)

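xpdf is more adaptable than PDFBox: it can parse English PDFs as well as PDFs in other languages, including Chinese. However, xpdf is a set of command-line tools, so it has to be run from the command line.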

To parse an English PDF from the command line, run the following command:

D:/workspace/testsearch2/xpdf>pdftotext ../htmls/xxxx.pdf xxxx.txt

Edit the xpdfrc file:

cidToUnicode Adobe-GB1 D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/CMap
toUnicodeDir D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/CMap

fontDir c:/windows/Fonts
displayCIDFontTT Adobe-GB1 c:/windows/fonts/SimHei.ttf

textEOL dos

On Linux, look at the add-to-xpdfrc file shipped with the language package and copy its contents into xpdfrc.
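
The language-package entries in xpdfrc all point into one directory, so they can be generated instead of typed by hand. The following is a minimal Java sketch under that assumption. The class name XpdfrcLines and the method chineseSimplifiedEntries are hypothetical names introduced for illustration and are not part of xpdf. It produces only the six language-package lines; fontDir, displayCIDFontTT and textEOL still have to be added by hand.

```java
import java.util.ArrayList;
import java.util.List;

class XpdfrcLines {

    /**
     * Builds the xpdfrc entries for the Simplified Chinese package.
     * packageDir is the directory the package was extracted to
     * (a hypothetical parameter, chosen for this sketch).
     */
    static List<String> chineseSimplifiedEntries(String packageDir) {
        // Use forward slashes and drop a trailing slash, if any
        String dir = packageDir.replace('\\', '/');
        if (dir.endsWith("/")) {
            dir = dir.substring(0, dir.length() - 1);
        }
        List<String> lines = new ArrayList<>();
        lines.add("cidToUnicode Adobe-GB1 " + dir + "/Adobe-GB1.cidToUnicode");
        // One unicodeMap entry per encoding shipped in the package
        for (String enc : new String[] {"ISO-2022-CN", "EUC-CN", "GBK"}) {
            lines.add("unicodeMap " + enc + " " + dir + "/" + enc + ".unicodeMap");
        }
        lines.add("cMapDir Adobe-GB1 " + dir + "/CMap");
        lines.add("toUnicodeDir " + dir + "/CMap");
        return lines;
    }

    public static void main(String[] args) {
        String dir = "D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified";
        // Print the entries so they can be pasted into xpdfrc
        chineseSimplifiedEntries(dir).forEach(System.out::println);
    }
}
```

Run with the directory used in this article, the sketch prints the same six entries as listed above.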

To parse a Chinese PDF, an extra option is required (the same -enc GBK option also works for English documents):

D:/workspace/testsearch2/xpdf>pdftotext -layout -enc GBK ../htmls/readme.pdf
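
Because -enc GBK makes pdftotext write GBK-encoded text, a program that later reads the output (for example, to feed it to an indexer) should decode it as GBK rather than rely on the platform default charset. The following is a minimal Java sketch of that step. The class name GbkTextReader and the method readGbk are hypothetical, and it assumes the Java runtime provides the GBK charset, which standard JDK distributions normally do.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class GbkTextReader {

    /** Reads a text file produced by "pdftotext -enc GBK" into a Java String. */
    static String readGbk(Path txtFile) throws IOException {
        // The charset must match the value passed to -enc
        Charset gbk = Charset.forName("GBK");
        return new String(Files.readAllBytes(txtFile), gbk);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path of the text file written by pdftotext
        System.out.println(readGbk(Paths.get(args[0])));
    }
}
```

If a different encoding is passed to -enc, the charset name in the sketch has to change accordingly.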

The main options are as follows:

OPTIONS
       Many of the following options can be set with configuration file com-
       mands. These are listed in square brackets with the description of the
       corresponding command line option.

       -f number
              Specifies the first page to convert.

       -l number
              Specifies the last page to convert.

       -layout
              Maintain (as best as possible) the original physical layout of
              the text. The default is to 'undo' physical layout (columns,
              hyphenation, etc.) and output the text in reading order.

       -fixed number
              Assume fixed-pitch (or tabular) text, with the specified charac-
              ter width (in points). This forces physical layout mode.

       -raw   Keep the text in content stream order. This is a hack which
              often "undoes" column formatting, etc. Use of raw mode is no
              longer recommended.

       -htmlmeta
              Generate a simple HTML file, including the meta information.
              This simply wraps the text in <pre> and </pre> and prepends the
              meta headers.

       -enc encoding-name
              Sets the encoding to use for text output.   The encoding-name
              must be defined with the unicodeMap command (see xpdfrc(5)).
              The encoding name is case-sensitive. This defaults to "Latin1"
              (which is a built-in encoding). [config file: textEncoding]

Note: the Simplified Chinese package defines only the following three encodings: ISO-2022-CN, EUC-CN, and GBK.

       -eol unix | dos | mac
              Sets the end-of-line convention to use for text output. [config
              file: textEOL]

       -nopgbrk
              Don't insert page breaks (form feed characters) between pages.
              [config file: textPageBreaks]

       -opw password
              Specify the owner password for the PDF file. Providing this
              will bypass all security restrictions.

       -upw password
              Specify the user password for the PDF file.

       -q     Don't print any messages or errors. [config file: errQuiet]

       -cfg config-file
              Read config-file in place of ~/.xpdfrc or the system-wide config
              file.

       -v     Print copyright and version information.

       -h     Print usage information. (-help and --help are equivalent.)
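
To show how these options combine into a single invocation, here is a minimal Java sketch that assembles a pdftotext argument list from a few of them (-f, -l, -layout, -enc). The class name PdftotextCommand, the method build and its parameters are hypothetical names chosen for this example. The sketch only builds the list and does not run anything.

```java
import java.util.ArrayList;
import java.util.List;

class PdftotextCommand {

    /**
     * Builds a pdftotext argument list.
     * A firstPage or lastPage of 0 or less means "option not set";
     * a null encoding leaves -enc out, so pdftotext uses its default.
     */
    static List<String> build(String exe, int firstPage, int lastPage, boolean layout,
                              String encoding, String pdfFile, String txtFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        if (firstPage > 0) {
            cmd.add("-f");
            cmd.add(String.valueOf(firstPage));
        }
        if (lastPage > 0) {
            cmd.add("-l");
            cmd.add(String.valueOf(lastPage));
        }
        if (layout) {
            cmd.add("-layout");
        }
        if (encoding != null) {
            cmd.add("-enc");
            cmd.add(encoding);
        }
        // Options come first, then the input PDF and the output text file
        cmd.add(pdfFile);
        cmd.add(txtFile);
        return cmd;
    }

    public static void main(String[] args) {
        // Pages 1 to 3 of readme.pdf, layout preserved, GBK output
        System.out.println(build("pdftotext", 1, 3, true, "GBK", "readme.pdf", "readme.txt"));
    }
}
```

Keeping each option and its value as separate list elements avoids quoting problems when the list is later passed to ProcessBuilder or Runtime.exec.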

Next, we wrap the command line in a Java class:

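package extract;

import java.io.IOException;

public class ExtractorCJKPDF {

    /**
     * Converts a PDF file to text by running xpdf's pdftotext.
     *
     * @param pdffile path of the PDF file to read
     * @param txtfile path of the text file to write
     */
    public static void pdf2text(String pdffile, String txtfile) throws IOException, InterruptedException {
        String xpdfpath = "D:/workspace/testsearch2/xpdf/";
        // -layout keeps the original layout, -enc sets the output encoding,
        // -nopgbrk suppresses page breaks
        String[] cmd = new String[]{xpdfpath + "pdftotext", "-layout", "-enc", "GBK", "-nopgbrk", pdffile, txtfile};
        // inheritIO forwards pdftotext's messages to this process's console
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        // Wait until pdftotext has finished writing the text file
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
    }

    public static void main(String[] args) {
        try {
            pdf2text("D:/workspace/testsearch2/htmls/123.pdf", "D:/workspace/testsearch2/htmls/123.txt");
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}

Running the class writes the extracted text to D:/workspace/testsearch2/htmls/123.txt.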