Extracting Chinese Text from PDFs with XPDF

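1. Download XPDF from: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip

2. Download the fonts Gbsn00lp.ttf and gkai00mp.ttf from: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-simplified.tar.gz

3. Extract XPDF and the fonts, then place the fonts in the xpdf\chinese-simplified\CMap directory.

4. Edit the paths in the add-to-xpdfrc file so that they point to the install location on your machine: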

#----- begin Chinese Simplified support package (2004-jul-27)
cidToUnicode Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
toUnicodeDir E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
displayCIDFontTT Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap\gkai00mp.ttf
#----- end Chinese Simplified support package

5. Edit the xpdfrc file in the same way, changing the paths to those on your machine:

cidToUnicode Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
toUnicodeDir E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
displayCIDFontTT Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified\CMap\gkai00mp.ttf
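
Since steps 4 and 5 both come down to hand-editing paths, it may help to check that every path in the edited file actually exists before running pdftotext. The following is only a rough sketch: the class and method names (XpdfrcPathCheck, FindMissingPaths) are made up for illustration, it looks only at the five directives shown in the configuration above, and it assumes the paths contain no spaces.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of XPDF.
static class XpdfrcPathCheck
{
    // Directives from the configuration above whose last field is a file or directory path.
    static readonly HashSet<string> PathDirectives = new HashSet<string>
    {
        "cidToUnicode", "unicodeMap", "cMapDir", "toUnicodeDir", "displayCIDFontTT"
    };

    // Returns every path named by those directives that does not exist on disk.
    // Assumption: paths contain no spaces, as in the example configuration.
    public static List<string> FindMissingPaths(IEnumerable<string> lines)
    {
        var missing = new List<string>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line[0] == '#') continue; // skip blank lines and comments

            string[] fields = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (fields.Length < 2 || !PathDirectives.Contains(fields[0])) continue;

            string path = fields[fields.Length - 1]; // the path is the last field
            if (!File.Exists(path) && !Directory.Exists(path)) missing.Add(path);
        }
        return missing;
    }
}
```

It could be called as XpdfrcPathCheck.FindMissingPaths(File.ReadAllLines("xpdfrc")), printing whatever comes back; an empty list means every referenced file and directory was found.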

6. Write a simple program that runs pdftotext and captures its output:

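using System;
using System.Diagnostics;

string xpdfPath = @"E:\Study\Flex\xpdf-chinese-simplified\xpdf\pdftotext.exe";
string filename = @"E:\Work\FlashViewer\FlashViewer\Flex\Pdf\mayun.pdf";
// -cfg selects the config file, -q suppresses messages, and the final "-" writes the text to stdout.
// The filename is quoted so that a path containing spaces is passed as one argument.
string strCmd = " -cfg xpdfrc -q \"" + filename + "\" - ";

Process p = new Process();
p.StartInfo.FileName = xpdfPath; // exe, bat and so on
p.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
p.StartInfo.Arguments = strCmd;
p.StartInfo.RedirectStandardOutput = true; // required in order to read StandardOutput
p.StartInfo.UseShellExecute = false;       // required when redirecting output
try
{
    p.Start();
    // Read all output before waiting, so the child cannot block on a full pipe.
    string strmsg = p.StandardOutput.ReadToEnd();
    // path (the output file) and IOHelp (a helper class) are not defined in this snippet;
    // they presumably come from the author's project.
    IOHelp.WriteFile(path, strmsg, false);
    p.WaitForExit();
    p.Close();
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
}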
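
The program above can also be wrapped in a small reusable method. The sketch below is one possible arrangement, not the author's code: the names PdfTextExtractor, BuildArguments and Extract are invented for illustration. It also adds pdftotext's -enc option with UTF-8, on the assumption that UTF-8 output is wanted; pdftotext otherwise defaults to Latin1 unless the config file selects another encoding, and -enc GBK should also be possible with the unicodeMap entries configured earlier.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical wrapper around pdftotext; names are illustrative only.
static class PdfTextExtractor
{
    // Builds the pdftotext argument string. Both paths are quoted so that
    // spaces survive; the trailing "-" sends the extracted text to stdout.
    public static string BuildArguments(string configPath, string pdfPath)
    {
        return "-cfg \"" + configPath + "\" -enc UTF-8 -q \"" + pdfPath + "\" -";
    }

    // Runs pdftotext and returns its standard output decoded as UTF-8.
    public static string Extract(string exePath, string configPath, string pdfPath)
    {
        var info = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(configPath, pdfPath),
            UseShellExecute = false,            // required for output redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            StandardOutputEncoding = Encoding.UTF8 // must match the -enc value above
        };
        using (Process p = Process.Start(info))
        {
            // Drain stdout before waiting so the child cannot block on a full pipe.
            string text = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return text;
        }
    }
}
```

With the paths from the example, a call might look like PdfTextExtractor.Extract(xpdfPath, "xpdfrc", filename). Because a relative config name such as xpdfrc is resolved against the process's working directory, passing a full path to the config file is probably the safer choice.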

Reposted from: https://www.cnblogs.com/jiang1984j/archive/2010/07/23/1986758.html
