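Getting Started with Lucene: Parsing PDFs (A Detailed Walkthrough of Parsing Chinese PDFs with xpdf)

Download xpdf and xpdf-chinese-simplified.tar.gz, then extract xpdf-chinese-simplified.tar.gz into the xpdf directory so that it forms a subdirectory.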

http://www.foolabs.com/xpdf/download.html

Configuration notes from the Chinese language package:

Xpdf: Chinese Simplified support package
========================================

Xpdf project: http://www.foolabs.com/xpdf/
2004-jul-27

If this package includes CMap files, they contain their own copyright
notices and distribution conditions.  All other files in the package
are Copyright 2002-2004 Glyph & Cog, LLC, and are licensed under the
GNU General Public License (GPL), version 2.

This package provides support files needed to use the Xpdf tools with
Chinese (Simplified) PDF files.

Contents:
- Adobe-GB1 character collection support
- ISO-2022-CN encoding
- EUC-CN encoding
- GBK encoding

Place all of these files in a directory, typically:

    Unix - /usr/local/share/xpdf/chinese-simplified
    Win32 - C:/Program Files/xpdf/chinese-simplified

Add the contents of the "add-to-xpdfrc" file to your system-wide
xpdfrc config file, which is typically:

    Unix - /usr/local/etc/xpdfrc
    Win32 - C:/Program Files/xpdf/xpdfrc

Alternatively, on Unix systems you can add these lines to your
personal xpdfrc file in $HOME/.xpdfrc.

xpdf runs on the following platforms:

Precompiled binaries are available for the following machines:

  • x86, Linux (statically linked to Motif, t1lib, and FreeType):
    xpdf-3.02pl4-linux.tar.gz (11985186 bytes)
  • SPARC, Solaris 10 (statically linked to t1lib and FreeType):
    not currently available
  • x64, Solaris 10 (statically linked to t1lib and FreeType):
    not currently available
  • x86, DOS/Win32 -- pdftops, pdftotext, pdfimages, pdfinfo, and pdffonts only:
    Win32 (built with MSVC): xpdf-3.02pl4-win32.zip (2046671 bytes)
    DOS6 (built with djgpp, with DPMI support from csdpmi5b): xpdf-3.02pl4-dos6.zip (1754621 bytes)


I've received reports of xpdf compiling successfully on the following systems (but binaries are not available on the net):

  • x86 and MIPS, SINIX V5.4 (email f.miane@opengroup.org for binaries) (xpdf 0.5)
  • Apollo 425e, DomainOS 10.4.1.2 (xpdf 0.5)
  • m68k (HP-9000/425), HP-UX 9.0 (xpdf 0.5)
  • Alpha, Linux (xpdf 0.7)
  • POWER, AIX 4.2.1, gcc 2.8.1 (xpdf 0.7a)
  • UltraSPARC 2, Linux 2.2.5 (xpdf 0.80)
  • SPARC, Solaris 2.7, gcc 2.8.1 (xpdf 0.90)
  • DG/UX (xpdf 0.90)
  • LynxOS 2.5.1 (xpdf 0.90)
  • HP-UX 10.20 and 11.00 (xpdf 0.90)
  • MacOS X / Darwin (xpdf 0.92)
  • QNX / X11 (xpdf 0.93)
  • x86, OpenBSD 3.0 (xpdf 1.00)
  • MacOS X / Darwin (xpdf 2.03)


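xpdf is more adaptable than PDFBox: it can parse English PDFs as well as PDFs in other languages, including Chinese. Note, however, that xpdf is a set of tools run from the command line.

The following command, run from the command line, parses an English PDF: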

D:/workspace/testsearch2/xpdf>pdftotext ../htmls/xxxx.pdf xxxx.txt

Edit the xpdfrc file:

cidToUnicode Adobe-GB1 D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/CMap
toUnicodeDir D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified/CMap

fontDir c:/windows/Fonts
displayCIDFontTT Adobe-GB1 c:/windows/fonts/SimHei.ttf

textEOL dos

On Linux, see the add-to-xpdfrc file and copy its contents into xpdfrc.

To parse a Chinese PDF, add the -enc option (the same -enc GBK option also works for English documents):
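
Because every path in these entries must match the directory where the language package was extracted, it can help to generate the lines instead of typing them by hand. The following is a minimal sketch under that assumption. The class name XpdfrcEntries and its method are hypothetical helpers, not part of xpdf, and the sketch does not quote paths that contain spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of xpdf): builds the xpdfrc entries for the
// Simplified Chinese package, given the directory it was extracted to.
final class XpdfrcEntries {

    static List<String> forChineseSimplified(String packageDir) {
        // Use forward slashes and drop a trailing slash so paths join cleanly
        String dir = packageDir.replace('\\', '/');
        if (dir.endsWith("/")) {
            dir = dir.substring(0, dir.length() - 1);
        }
        List<String> lines = new ArrayList<>();
        lines.add("cidToUnicode Adobe-GB1 " + dir + "/Adobe-GB1.cidToUnicode");
        // One unicodeMap entry per encoding shipped in the package
        for (String enc : new String[] {"ISO-2022-CN", "EUC-CN", "GBK"}) {
            lines.add("unicodeMap " + enc + " " + dir + "/" + enc + ".unicodeMap");
        }
        lines.add("cMapDir Adobe-GB1 " + dir + "/CMap");
        lines.add("toUnicodeDir " + dir + "/CMap");
        return lines;
    }

    public static void main(String[] args) {
        // The directory below is the one used in this article; adjust as needed
        String dir = "D:/workspace/testsearch2/xpdf/xpdf-chinese-simplified";
        for (String line : forChineseSimplified(dir)) {
            System.out.println(line);
        }
    }
}
```

The printed lines can be pasted into xpdfrc. The font-related entries (fontDir, displayCIDFontTT) and textEOL are left out because they depend on the machine rather than on the package directory.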

D:/workspace/testsearch2/xpdf>pdftotext -layout -enc GBK  ../htmls/readme.pdf

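With no output file given, pdftotext writes readme.txt next to the PDF. The result looks like this:

The main options are: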

OPTIONS
       Many  of  the following options can be set with configuration file com-
       mands.  These are listed in square brackets with the description of the
       corresponding command line option.

       -f number
              Specifies the first page to convert.

       -l number
              Specifies the last page to convert.

       -layout
              Maintain  (as  best as possible) the original physical layout of
              the text.  The default is to 'undo'  physical  layout  (columns,
              hyphenation, etc.) and output the text in reading order.

       -fixed number
              Assume fixed-pitch (or tabular) text, with the specified charac-
              ter width (in points).  This forces physical layout mode.

       -raw   Keep the text in content stream order.  This  is  a  hack  which
              often  "undoes"  column  formatting, etc.  Use of raw mode is no
              longer recommended.

       -htmlmeta
              Generate a simple HTML file,  including  the  meta  information.
              This  simply wraps the text in <pre> and </pre> and prepends the
              meta headers.

       -enc encoding-name
              Sets the encoding to use for  text  output.   The  encoding-name
              must  be  defined  with  the unicodeMap command (see xpdfrc(5)).
              The encoding name is case-sensitive.  This defaults to  "Latin1"
              (which is a built-in encoding).  [config file: textEncoding]

              Note: the Simplified Chinese package provides only the following
              three encodings: ISO-2022-CN, EUC-CN, and GBK.

       -eol unix | dos | mac
              Sets the end-of-line convention to use for text output.  [config
              file: textEOL]

       -nopgbrk
              Don't insert page breaks (form feed characters)  between  pages.
              [config file: textPageBreaks]

       -opw password
              Specify  the  owner  password  for the PDF file.  Providing this
              will bypass all security restrictions.

       -upw password
              Specify the user password for the PDF file.

       -q     Don't print any messages or errors.  [config file: errQuiet]

       -cfg config-file
              Read config-file in place of ~/.xpdfrc or the system-wide config
              file.

       -v     Print copyright and version information.

       -h     Print usage information.  (-help and --help are equivalent.)
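
The options above map directly onto an argument list for Runtime.exec or ProcessBuilder. As a rough sketch, the hypothetical helper PdfToTextCommand below (not part of xpdf or Lucene) assembles such a list from a few of the listed options. The parameter names, and the convention that a page number of 0 or less means "not set", are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles a pdftotext argument list from a few options.
final class PdfToTextCommand {

    static List<String> build(String exe, String encoding, boolean layout,
                              int firstPage, int lastPage,
                              String pdfFile, String txtFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        if (firstPage > 0) {          // -f number: first page to convert
            cmd.add("-f");
            cmd.add(String.valueOf(firstPage));
        }
        if (lastPage > 0) {           // -l number: last page to convert
            cmd.add("-l");
            cmd.add(String.valueOf(lastPage));
        }
        if (layout) {                 // -layout: keep the physical layout
            cmd.add("-layout");
        }
        if (encoding != null) {       // -enc: output encoding, e.g. GBK
            cmd.add("-enc");
            cmd.add(encoding);
        }
        cmd.add(pdfFile);
        cmd.add(txtFile);
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the arguments for converting pages 1 to 3 with GBK output
        System.out.println(String.join(" ",
                build("pdftotext", "GBK", true, 1, 3, "in.pdf", "out.txt")));
    }
}
```

Keeping each option and its value as separate list elements avoids quoting problems when a file name contains spaces.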

Next, we wrap the command line in a Java class:

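package extract;

import java.io.IOException;

public class ExtractorCJKPDF {

 // Directory that contains the xpdf command-line tools
 private static final String XPDF_PATH = "D:/workspace/testsearch2/xpdf/";

 /**
  * Converts a PDF file to a text file by running pdftotext.
  *
  * @param pdffile path of the PDF file to read
  * @param txtfile path of the text file to write
  */
 public static void pdf2text(String pdffile, String txtfile) throws IOException, InterruptedException {
  // -layout keeps the original layout, -enc sets the output encoding,
  // -nopgbrk suppresses page breaks
  String[] cmd = new String[]{XPDF_PATH + "pdftotext", "-layout", "-enc", "GBK", "-nopgbrk", pdffile, txtfile};
  // inheritIO() sends pdftotext's messages to this console, so unread output cannot block it
  Process p = new ProcessBuilder(cmd).inheritIO().start();
  // Wait until pdftotext has finished writing the text file
  int exitCode = p.waitFor();
  if (exitCode != 0) {
   throw new IOException("pdftotext exited with code " + exitCode);
  }
 }

 public static void main(String[] args) {
  try {
   pdf2text("D:/workspace/testsearch2/htmls/123.pdf", "D:/workspace/testsearch2/htmls/123.txt");
  } catch (IOException | InterruptedException e) {
   e.printStackTrace();
  }
 }

}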
The result looks like this:
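
Since pdftotext was run with -enc GBK, the text file has to be decoded as GBK when it is read back, for example before the text is handed to Lucene for indexing. The following is a minimal sketch, assuming the JVM supports the GBK charset. ExtractedTextReader is a hypothetical helper name, not part of xpdf or Lucene.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads the file written by pdftotext into a String.
final class ExtractedTextReader {

    // encodingName should match the value passed to -enc, e.g. "GBK"
    static String read(Path txtFile, String encodingName) throws IOException {
        byte[] bytes = Files.readAllBytes(txtFile);
        // Decode with the same encoding pdftotext used when writing the file
        return new String(bytes, Charset.forName(encodingName));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: ExtractedTextReader <text-file>");
            return;
        }
        System.out.println(read(Paths.get(args[0]), "GBK"));
    }
}
```

Reading the file with the platform default charset instead would garble the Chinese characters on any system whose default is not GBK.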
