itextsharp get text: Text extraction with iTextSharp

Latest recommended article published on 2022-08-09 17:55:25

weixin_39942492


Article tags: itextsharp get text

Article link: https://blog.csdn.net/weixin_39942492/article/details/112818924


I'm using iTextSharp in VB.NET to get the text content from a PDF file. The solution works fine for some files but not for others, even quite simple ones. The problem is that the token's StringValue comes back with no readable text (it shows as a run of empty square boxes).

token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
    ' Read the type and the raw string value of each token in the page content
    tknType = token.TokenType
    tknValue = token.StringValue
End While

I can measure the length of the content, but I cannot get the actual string content.

I realized that this depends on the font used in the PDF. If I create a PDF with either Acrobat or PDFCreator using Courier (which, by the way, is the default font in my Visual Studio editor), I can get all the text content. If the same PDF is built with a different font, I get the empty square boxes.

Now the question is: how can I extract text regardless of the font setting?

Thanks

Solution

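This complements Mark's answer, which helped me a lot. The iTextSharp namespaces and classes are a bit different from the Java version. PdfTextExtractor decodes each string through the font's encoding, which the raw PRTokeniser loop does not do, so the result no longer depends on which font the PDF uses.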

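using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string GetTextFromAllPages(string pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    StringWriter output = new StringWriter();
    // Page numbers are 1-based in iTextSharp.
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    reader.Close(); // release the underlying file
    return output.ToString();
}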
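
For anyone who wants the text page by page rather than as one string, here is a minimal sketch of a variation on the method above. It assumes iTextSharp 5.x, where `LocationTextExtractionStrategy` lives in `iTextSharp.text.pdf.parser` alongside `SimpleTextExtractionStrategy`. The names `PdfTextDemo` and `GetTextPerPage` and the path `sample.pdf` are made up for illustration and do not come from the original answer.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, not part of iTextSharp.
public static class PdfTextDemo
{
    // Returns the text of each page separately (index 0 holds page 1).
    public static List<string> GetTextPerPage(string pdfPath)
    {
        var pages = new List<string>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Create a fresh strategy for every page so that text
                // collected on one page does not carry over to the next.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
        return pages;
    }

    public static void Main(string[] args)
    {
        // "sample.pdf" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.pdf";
        List<string> pages = GetTextPerPage(path);
        for (int i = 0; i < pages.Count; i++)
            Console.WriteLine("--- page {0} ---\n{1}", i + 1, pages[i]);
    }
}
```

`LocationTextExtractionStrategy` orders text chunks by their position on the page, which may read better for PDFs whose content stream is not written in reading order. `SimpleTextExtractionStrategy` keeps the order in which the text appears in the stream. Which one gives better output depends on the file, so it is worth trying both.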
