Python reading PDF CID: How to handle CIDs in text extracted by PDFMiner?

Latest recommended article published on 2022-10-25 09:58:38

weixin_39576294

Article tags: python, reading PDF, cid

Article link: https://blog.csdn.net/weixin_39576294/article/details/111442862

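A problem comes up when using pdfminer.six on Python 3.6 to extract Hindi text that contains CIDs. A CID is a character identifier in a PDF that maps to a glyph index. A PDF viewer can display the glyphs through the CMap table, but how are the character codes related to Unicode values? The question may touch on font licensing, but its main concern is understanding why the CID problem occurs and whether working around it is legal.

Summary generated automatically by CSDN.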

I have some PDFs that are in Hindi and contain extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like:

As one can see, a number of characters are converted into the form "(cid:number)".
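As a minimal sketch of working with such output, the snippet below finds the "(cid:number)" tokens in already-extracted text and substitutes the ones for which a replacement is known. It uses only the standard library. The names `find_cids`, `replace_cids`, and the sample mapping are hypothetical and made up for illustration; a real mapping would have to be built for each font in the PDF.

```python
import re

# Matches the literal "(cid:N)" placeholder that appears in the extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the sorted, de-duplicated CID numbers found in the text."""
    return sorted({int(n) for n in CID_TOKEN.findall(text)})


def replace_cids(text, cid_to_char):
    """Replace each (cid:N) token using a caller-supplied mapping.

    Tokens without an entry are left untouched so nothing is silently lost.
    """
    def substitute(match):
        return cid_to_char.get(int(match.group(1)), match.group(0))

    return CID_TOKEN.sub(substitute, text)


# Hypothetical sample text and mapping, for illustration only.
sample = "abc(cid:12)def(cid:7)(cid:12)"
mapping = {12: "X"}

print(find_cids(sample))
print(replace_cids(sample, mapping))
```

Listing the distinct CIDs first shows how many glyphs are unmapped, which helps decide whether building a mapping by hand is practical.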

On further analysis, I found out that a PDF contains CMaps, which map character codes to glyph indices. So a CID is a character identifier for the glyph it maps to inside the CMap table.

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?
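One piece of background that bears on this question: a PDF font may carry an optional ToUnicode CMap, which lists character codes next to their Unicode values encoded as UTF-16BE hex strings. When that table is missing or incomplete, a text extractor has no Unicode value to report for a glyph, which is one common reason for "(cid:number)" placeholders. The sketch below parses only the simple `beginbfchar` ... `endbfchar` entries of such a CMap from a string. It is a simplified illustration (it ignores `bfrange` sections), not how pdfminer.six does it; `parse_bfchar` and the sample CMap text are hypothetical.

```python
import re

# A bfchar section holds pairs of hex strings: <source code> <UTF-16BE text>.
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in HEX_PAIR.findall(block):
            # The destination may hold more than one UTF-16 code unit.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# Hypothetical ToUnicode fragment: code 0x0003 is a space,
# code 0x0024 is DEVANAGARI LETTER KA (U+0915).
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0024> <0915>
endbfchar
"""

to_unicode = parse_bfchar(SAMPLE_CMAP)
print(to_unicode)
```

A viewer does not need this table to draw the page, because drawing only requires going from character code to glyph; the Unicode mapping matters for copying, searching, and extracting text.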

Moreover, according to a comment on a similar question, this process may not be legal. But I'm not trying to steal someone's font; I just want the text. How does this process become illegal?

Since there are many questions like this one, I want to
