Reading PDF Text in .NET (Part 1)

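The hardest part was converting PDFs to text. I started with XPDF, but with so many languages and such a mix of encodings it was hard to find an approach that worked. It also requires calling an .EXE file at run time, which I expected to throw plenty of exceptions.

So I turned to PDFBox, although rumor had it that, worse still, it does not support Chinese. It is an open source Java project, so the code is naturally Java. How could I call it from .NET?

Just as I was getting frustrated, I came across a post on a foreign blog: studentclub.ro/lucians_weblog/archive/2007/03/22/read-from-a-pdf-file-using-c.aspx

The post reads as follows: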

I know, this may seem like a simple task, and you will probably find references on the web about how to do this. But I’ll also write a blog post on this topic, as I came across this problem today.

So, if you have a PDF file and don’t know how to read data from it, here is what you could do.

 

First of all, you’ll need some DLLs that will help you manipulate the PDF files. I came across the PDFBox. What is PDFBox? I’ll cite from their website: PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.

 

Oh, nice, you’ll say, but I need a .NET solution. Don’t worry. Even though PDFBox is written in Java, there is also a .NET version that is available. It utilizes IKVM (also, a very interesting project: an implementation of the Java language for .NET Framework and Mono) to create a fully functioning PDF library for the .NET framework. The released version contains a bin directory with all of the required DLL files.

 

So you’ll have to download the PDFBox package. In this package you’ll find a bin directory. To read your PDF file, you’ll need the following files:

  • IKVM.GNU.Classpath.dll
  • PDFBox-0.7.3.dll
  • FontBox-0.1.0-dev.dll
  • IKVM.Runtime.dll

 

You’ll have to add a reference to the first two in your project. You’ll also have to copy the last two into your project’s bin directory.

The program will look something like this (if you’re working with a Console application):

 

using System;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace PDFReader
{
    class Program
    {
        static void Main(string[] args)
        {
            PDDocument doc = PDDocument.load("lopreacamasa.pdf");
            PDFTextStripper pdfStripper = new PDFTextStripper();
            Console.Write(pdfStripper.getText(doc));
        }
    }
}

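Ha, there was hope!

It turns out that an open source tool called IKVM can convert a Java library into a .NET version.

Even better, PDFBox 0.7.3 now supports Chinese, and supports it well.

So development became quite easy.

PS: one correction to the foreign blogger's post.

bcprov-jdk14-132.dll must also be placed in the bin directory; otherwise an error is raised. It took me a long time to work out that this library was what was missing.

In other words, the steps for converting a PDF to text are:

1. Download PDFBox 0.7.3: sourceforge.net/project/showfiles.php

2. Copy the following five DLL files into the bin directory and load them: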

IKVM.GNU.Classpath.dll PDFBox-0.7.3.dll FontBox-0.1.0-dev.dll IKVM.Runtime.dll bcprov-jdk14-132.dll
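
Since a single missing DLL cost me a lot of time, a quick check before running any extraction code may help. The following is only a minimal sketch: the class name PdfBoxDependencyCheck and the method FindMissing are made up for illustration, and it assumes the DLLs are expected next to the executable. It uses only System.IO, so it does not depend on PDFBox itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PdfBoxDependencyCheck
{
    // File names taken from the list of five DLLs in step 2.
    static readonly string[] RequiredDlls =
    {
        "IKVM.GNU.Classpath.dll",
        "PDFBox-0.7.3.dll",
        "FontBox-0.1.0-dev.dll",
        "IKVM.Runtime.dll",
        "bcprov-jdk14-132.dll"
    };

    // Returns the names of the required DLLs that are not present in binDir.
    public static List<string> FindMissing(string binDir)
    {
        List<string> missing = new List<string>();
        foreach (string name in RequiredDlls)
        {
            if (!File.Exists(Path.Combine(binDir, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }

    static void Main()
    {
        // Assumption: the DLLs are deployed in the application's base directory.
        string binDir = AppDomain.CurrentDomain.BaseDirectory;
        List<string> missing = FindMissing(binDir);

        if (missing.Count == 0)
        {
            Console.WriteLine("All PDFBox DLLs found in " + binDir);
        }
        else
        {
            Console.WriteLine("Missing: " + string.Join(", ", missing.ToArray()));
        }
    }
}
```

If the check reports a missing file, copy it from the bin directory of the PDFBox package before going any further.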

The sample code is then as follows:

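using org.pdfbox.pdmodel;
using org.pdfbox.util;
using org.pdfbox;

    // DocPath is a field of the enclosing class holding the folder that contains the PDF files.
    public string PdfReader(string filename)
    {
        string fullname = DocPath + filename;
        PDDocument doc = PDDocument.load(fullname);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // release the PDF file once the text has been extracted
        }
    }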
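
For Chinese documents it may be safer to write the extracted text to a UTF-8 file than to print it, because a Windows console does not always display Chinese characters correctly. The sketch below is an illustration rather than part of the original method: the class name PdfToTextFile and the command-line arguments are made up, and it assumes the project references the PDFBox DLLs listed above. The only PDFBox calls it uses are PDDocument.load, PDFTextStripper.getText, and PDDocument.close.

```csharp
using System;
using System.IO;
using System.Text;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

class PdfToTextFile
{
    // Hypothetical usage: PdfToTextFile.exe input.pdf output.txt
    static int Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: PdfToTextFile <input.pdf> <output.txt>");
            return 1;
        }

        PDDocument doc = PDDocument.load(args[0]);
        try
        {
            // Extract the text of the whole document as one string.
            PDFTextStripper stripper = new PDFTextStripper();
            string text = stripper.getText(doc);

            // Write as UTF-8 so that Chinese characters survive the round trip.
            File.WriteAllText(args[1], text, Encoding.UTF8);
        }
        finally
        {
            // Close the document even if extraction or writing fails.
            doc.close();
        }
        return 0;
    }
}
```

Opening the output file in an editor that understands UTF-8 is a simple way to confirm that the Chinese text came through intact.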

 

It is that simple, but finding this approach was hard work!
