Extracting Chinese Text from PDFs with XPDF
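1. Download XPDF from: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
2. Download the fonts Gbsn00lp.ttf and gkai00mp.ttf. Download address: ftp://ftp.foolabs.com/pub/xpdf/xpdf-chinese-simplified.tar.gz
3. Unpack XPDF and the fonts, then put the fonts in the xpdf\chinese-simplified\CMap directory.
4. Edit the paths in the add-to-xpdfrc file so that they point to the installation path on your machine: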
#----- begin Chinese Simplified support package (2004-jul-27)
cidToUnicode Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
toUnicodeDir E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
displayCIDFontTT Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap\gkai00mp.ttf
#----- end Chinese Simplified support package
5. Edit the xpdfrc file in the same way, changing the paths to the ones on your machine:
cidToUnicode Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
toUnicodeDir E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified/CMap
displayCIDFontTT Adobe-GB1 E:\Study\Flex\xpdf-chinese-simplified\xpdf\chinese-simplified\CMap\gkai00mp.ttf
6. Write a simple program:
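// Requires: using System; using System.Diagnostics;
string xpdfPath = @"E:\Study\Flex\xpdf-chinese-simplified\xpdf\pdftotext.exe";
string filename = @"E:\Work\FlashViewer\FlashViewer\Flex\Pdf\mayun.pdf";
// -cfg: config file to read; -q: suppress messages; the trailing "-" sends the text to stdout.
// The file name is quoted so that a path containing spaces stays a single argument.
string strCmd = " -cfg xpdfrc -q \"" + filename + "\" - ";

Process p = new Process();
p.StartInfo.FileName = xpdfPath; // exe, bat and so on
p.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
p.StartInfo.Arguments = strCmd;
p.StartInfo.RedirectStandardOutput = true; // capture the extracted text
p.StartInfo.UseShellExecute = false;       // required for output redirection
try
{
    p.Start();
    // Read all output before waiting, so a full pipe cannot block pdftotext.
    string strmsg = p.StandardOutput.ReadToEnd();
    // IOHelp.WriteFile and path are not defined in this snippet; presumably a
    // file-writing helper from the author's project and the output file path.
    IOHelp.WriteFile(path, strmsg, false);
    p.WaitForExit();
    p.Close();
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
}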
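If the text captured in step 6 comes back garbled, one possible cause is a mismatch between the encoding pdftotext writes and the encoding the .NET process reader expects. The following is a minimal sketch, not the original author's code, that requests UTF-8 output with pdftotext's -enc option and reads stdout as UTF-8. The class name PdfTextExtractor and the methods BuildArguments and Extract are hypothetical names introduced for illustration, and the sketch assumes the xpdfrc from step 5 can be found at the path you pass in.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper (not part of XPDF or of the original post).
static class PdfTextExtractor
{
    // Builds the pdftotext argument string. Both paths are quoted so that
    // spaces in directory names do not split them into several arguments.
    public static string BuildArguments(string configPath, string pdfPath)
    {
        // -enc UTF-8 asks for UTF-8 text output; "-" as the output file means stdout.
        return "-cfg \"" + configPath + "\" -enc UTF-8 -q \"" + pdfPath + "\" -";
    }

    // Runs pdftotext and returns the extracted text.
    public static string Extract(string pdftotextPath, string configPath, string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = pdftotextPath,
            Arguments = BuildArguments(configPath, pdfPath),
            UseShellExecute = false,               // required for redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            StandardOutputEncoding = Encoding.UTF8 // must match the -enc option above
        };

        using (Process process = Process.Start(startInfo))
        {
            // Drain stdout first, then wait, to avoid blocking on a full pipe.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

With the variables from step 6, a call might look like PdfTextExtractor.Extract(xpdfPath, "xpdfrc", filename); the returned string can then be written out with File.WriteAllText and Encoding.UTF8. Keeping the argument construction in its own method makes the quoting easy to check without launching the external process.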