Reading PDF content with PHP: extracting text from PDFs using Xpdf


weixin_39826984


[root@localhost ~]# mkdir -p /lcf/upan

[root@localhost ~]# mkdir -p /lcf/cdrom

[root@localhost ~]# mkdir -p /lcf/xpdf

[root@localhost ~]# cd /lcf/upan/

[root@localhost upan]# cp xpdf/* ../xpdf/   # copy the downloaded files into /lcf/xpdf

[root@localhost upan]# cd ../xpdf/

[root@localhost xpdf]# tar -zxvf xpdfbin-linux-3.03.tar.gz

[root@localhost xpdf]# cd xpdfbin-linux-3.03

[root@localhost xpdfbin-linux-3.03]# cat INSTALL

[root@localhost xpdfbin-linux-3.03]# cd bin32/

[root@localhost bin32]# cp ./* /usr/local/bin/

[root@localhost bin32]# cd ../doc/

[root@localhost doc]# mkdir -p /usr/local/man/man1

[root@localhost doc]# mkdir -p /usr/local/man/man5

[root@localhost doc]# cp *.1 /usr/local/man/man1

[root@localhost doc]# cp *.5 /usr/local/man/man5
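Before going further, it can help to confirm that the copied binaries are actually reachable on PATH. The following is a minimal sketch; `check_xpdf_tools` is a hypothetical helper name introduced here for illustration and is not part of Xpdf.

```shell
# Hypothetical helper (not part of Xpdf): report which of the named tools
# can be found on PATH. Returns non-zero if any of them is missing.
check_xpdf_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call, left commented out so that sourcing this file has no side effects:
# check_xpdf_tools pdftotext pdfinfo
```

If a tool is reported as missing, check that /usr/local/bin is on PATH and that the copy from bin32 succeeded.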

If you do not need to read Chinese text, you can stop here. If you do, continue with the steps below.

[root@localhost doc]# cp sample-xpdfrc /usr/local/etc/xpdfrc

[root@localhost xpdf]# cd /lcf/xpdf

[root@localhost xpdf]# tar -zxvf xpdf-chinese-simplified.tar.gz

[root@localhost xpdf]# cd xpdf-chinese-simplified

[root@localhost xpdf-chinese-simplified]# mkdir -p /usr/local/share/xpdf/chinese-simplified

[root@localhost xpdf-chinese-simplified]# cp -r Adobe-GB1.cidToUnicode ISO-2022-CN.unicodeMap EUC-CN.unicodeMap GBK.unicodeMap CMap /usr/local/share/xpdf/chinese-simplified/

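Copy the contents of the add-to-xpdfrc file from the chinese-simplified package into /usr/local/etc/xpdfrc, and make sure the paths in it match the directory used above. (Note: the Simplified Chinese package provides three encodings: ISO-2022-CN, EUC-CN and GBK. The package itself does not include a UTF-8 map, so the approach here is to extract the text as GBK first and then convert it.)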
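Since wrong paths in xpdfrc are the most likely mistake at this step, a quick check can be sketched as follows. `check_xpdfrc_paths` is a hypothetical helper, not an Xpdf tool; it assumes the path is the last field on each `cidToUnicode`, `unicodeMap`, `cMapDir` and `toUnicodeDir` line and that paths contain no spaces.

```shell
# Hypothetical helper: list every path referenced by the mapping directives
# in an xpdfrc file and report whether it exists on disk.
# Commented-out lines (starting with '#') are ignored.
check_xpdfrc_paths() {
  rc_file=$1
  rc=0
  for target in $(awk '$1 ~ /^(cidToUnicode|unicodeMap|cMapDir|toUnicodeDir)$/ { print $NF }' "$rc_file"); do
    if [ -e "$target" ]; then
      echo "ok: $target"
    else
      echo "missing: $target"
      rc=1
    fi
  done
  return "$rc"
}

# Example call, left commented out so that sourcing this file has no side effects:
# check_xpdfrc_paths /usr/local/etc/xpdfrc
```

Any line reported as missing points at a path in xpdfrc that needs to be corrected.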

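3. Implementation

With the configuration complete, we can start using it.

For a simple PDF read, the following statement is enough. The file name is passed through escapeshellarg() so that spaces or shell metacharacters in it cannot break or alter the command.

$content = shell_exec('/usr/local/bin/pdftotext ' . escapeshellarg($filename) . ' -');

To extract Chinese text, add the encoding option.

$content = shell_exec('/usr/local/bin/pdftotext -layout -enc GBK ' . escapeshellarg($filename) . ' -');

Adding these options does not affect the conversion of English text, so they are safe to use. Note that the output is GBK-encoded. Many sites today use UTF-8, so convert the text once more to avoid garbled characters.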

$content = mb_convert_encoding($content, 'UTF-8','GBK');

The extracted content can then be processed further in your own code.
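The same two steps (extract as GBK, then re-encode as UTF-8) can be tried from the shell before wiring them into PHP. This is a minimal sketch: `pdf_to_utf8` is a hypothetical helper name, the `PDFTOTEXT` override variable is an assumption added for convenience, and it assumes `iconv` is installed with GBK support.

```shell
# Hypothetical helper: extract text from a PDF as GBK and re-encode it as UTF-8.
# PDFTOTEXT may be set to another binary; it defaults to the path used in this article.
pdf_to_utf8() {
  "${PDFTOTEXT:-/usr/local/bin/pdftotext}" -layout -enc GBK "$1" - | iconv -f GBK -t UTF-8
}

# Example call, left commented out so that sourcing this file has no side effects:
# pdf_to_utf8 /path/to/file.pdf > file.txt
```

If the shell output looks right but the PHP output does not, the problem is more likely in the PHP call or the page encoding than in the Xpdf setup.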

Main pdftotext options:

OPTIONS

Many of the following options can be set with configuration file commands. These are listed in square brackets with the description of the corresponding command line option.

-f number
    Specifies the first page to convert.

-l number
    Specifies the last page to convert.

-layout
    Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order.

-fixed number
    Assume fixed-pitch (or tabular) text, with the specified character width (in points). This forces physical layout mode.

-raw
    Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.

-htmlmeta
    Generate a simple HTML file, including the meta information. This simply wraps the text in <pre> and </pre> and prepends the meta headers.

-enc encoding-name
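As an illustration of combining the options listed above, the sketch below extracts a page range with the layout preserved. `pdf_page_range` is a hypothetical helper name, and the `PDFTOTEXT` override variable is an assumption added for convenience; only the `-f`, `-l` and `-layout` options come from the list above.

```shell
# Hypothetical helper: print the text of pages FIRST..LAST of a PDF to stdout,
# keeping the physical layout. Usage: pdf_page_range FIRST LAST FILE
pdf_page_range() {
  first=$1
  last=$2
  file=$3
  "${PDFTOTEXT:-/usr/local/bin/pdftotext}" -f "$first" -l "$last" -layout "$file" -
}

# Example call, left commented out so that sourcing this file has no side effects:
# pdf_page_range 2 5 /path/to/file.pdf
```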
