Converting PDF to HTML with pdfminer (PDFMiner)

weixin_39819576

Published 2021-06-02 18:40:37

Tags: converting PDF to HTML with pdfminer

2014/03/24: Bugfixes and improvements for faulty PDFs.

API changes:

PDFDocument.initialize() method is removed and no longer needed.

A password is given as an argument of a PDFDocument constructor.
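
A minimal sketch of what the two API changes above mean in practice, assuming the `pdfminer` package (or its fork `pdfminer.six`) is installed. The helper name `open_pdf_document` and the file name in the usage note are hypothetical, chosen only for illustration.

```python
def open_pdf_document(path, password=""):
    """Open a PDF and return (file handle, PDFDocument).

    Hypothetical helper; the caller is responsible for closing the handle.
    """
    # Imported inside the function so this module still loads where
    # pdfminer is not installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser

    fp = open(path, "rb")
    try:
        parser = PDFParser(fp)
        # The password is passed to the constructor; there is no separate
        # PDFDocument.initialize(password) call any more.
        document = PDFDocument(parser, password)
    except Exception:
        fp.close()
        raise
    return fp, document
```

Usage would look like `fp, doc = open_pdf_document("sample.pdf", password="secret")`, followed by `fp.close()` once the document is no longer needed. The expected type of the password (text or bytes) depends on the PDFMiner version and Python version in use.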

2013/11/13: Bugfixes and minor improvements.

As of November 2013, the PDFMiner API differs in a few ways from the API prior to October 2013. This is the result of code restructuring. Here is a list of the changes:

PDFDocument class is moved to pdfdocument.py.

PDFDocument class now takes a PDFParser object as an argument.

PDFDocument.set_parser() and PDFParser.set_document() are removed.

PDFPage class is moved to pdfpage.py.

process_pdf function is implemented as PDFPage.get_pages.
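
The restructured API listed above can be sketched as a PDF-to-HTML conversion built on `PDFPage.get_pages`. This is an illustrative sketch, not the project's own tool: it assumes `pdfminer` (or `pdfminer.six`) is installed, and the function name `pdf_to_html` is hypothetical.

```python
import io


def pdf_to_html(path, password=""):
    """Convert the PDF at `path` to an HTML string (hypothetical helper)."""
    # Imported inside the function so this module still loads where
    # pdfminer is not installed.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    outfp = io.BytesIO()  # the converter encodes its output with `codec`
    device = HTMLConverter(rsrcmgr, outfp, codec="utf-8", laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(path, "rb") as fp:
        # PDFPage.get_pages builds the parser and document internally and
        # yields one PDFPage per page, replacing the old process_pdf loop.
        for page in PDFPage.get_pages(fp, password=password):
            interpreter.process_page(page)

    device.close()  # writes the closing HTML markup
    return outfp.getvalue().decode("utf-8")
```

Calling `pdf_to_html("sample.pdf")` would return the whole document as one HTML string. Collecting the output in memory keeps the sketch short; for large files, passing an open binary file as the output stream may be preferable.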

2013/10/22: Sudden resurgence of interest. API changes.

Incorporated a lot of patches and robust handling of broken PDFs.

2011/05/15: Speed improvements for layout analysis.

2011/05/15: API changes. LTText.get_text() is added.

2011/04/20: API changes. LTPolygon class was renamed as LTCurve.

2011/04/20: LTLine now represents horizontal/vertical lines only. Thanks to Koji Nakagawa.

2011/03/07: Documentation improvements by Jakub Wilk. Memory usage patch by Jonathan Hunt.

2011/02/27: Bugfixes and layout analysis improvements. Thanks to fujimoto.report.

2010/12/26: A couple of bugfixes and minor improvements. Thanks to Kevin Brubeck Unhammer and Daniel Gerber.

2010/10/17: A couple of bugfixes and minor improvements. Thanks to standardabweichung and Alastair Irving.

2010/09/07: A minor bugfix. Thanks to Alexander Garden.

2010/08/29: A couple of bugfixes. Thanks to Sahan Malagi, pk, and Humberto Pereira.

2010/07/06: Minor bugfixes. Thanks to Federico Brega.

2010/06/13: Bugfixes and improvements on CMap data compression. Thanks to Jakub Wilk.

2010/04/24: Bugfixes and improvements on TOC extraction. Thanks to Jose Maria.

2010/03/26: Bugfixes. Thanks to Brian Berry and Lubos Pintes.

2010/03/22: Improved layout analysis. Added regression tests.

2010/03/12: A couple of bugfixes. Thanks to Sean Manefield.

2010/02/27: Changed the way of internal layout handling. (LTTextItem -> LTChar)

2010/02/15: Several bugfixes. Thanks to Sean.

2010/02/13: Bugfix and enhancement. Thanks to André Auzi.

2010/02/07: Several bugfixes. Thanks to Hiroshi Manabe.

2010/01/31: JPEG image extraction supported. Page rotation bug fixed.

2010/01/04: Python 2.6 warning removal. More doctest conversion.

2010/01/01: CMap bug fix. Thanks to Winfried Plappert.

2009/12/24: RunLengthDecode filter added. Thanks to Troy Bollinger.

2009/12/20: Experimental polygon shape extraction added. Thanks to Yusuf Dewaswala for reporting.

2009/12/19: CMap resources are now part of the package. Thanks to Adobe for open-sourcing them.

2009/11/29: Password encryption bug fixed. Thanks to Yannick Gingras.

2009/10/31: SGML output format is changed and renamed as XML.

2009/10/24: Charspace bug fixed. Adjusted for 4-space indentation.

2009/10/04: Another matrix operation bug fixed. Thanks to Vitaly Sedelnik.

2009/09/12: Fixed rectangle handling. Able to extract image boundaries.

2009/08/30: Fixed page rotation handling.

2009/08/26: Fixed zlib decoding bug. Thanks to Shon Urbas.

2009/08/24: Fixed a bug in character placing. Thanks to Pawan Jain.

2009/07/21: Improvement in layout analysis.

2009/07/11: Improvement in layout analysis. Thanks to Lubos Pintes.

2009/05/17: Bugfixes, massive code restructuring, and simple graphic element support added. setup.py is supported.

2009/03/30: Text output mode added.

2009/03/25: Encoding problems fixed. Word splitting option added.

2009/02/28: Robust handling of corrupted PDFs. Thanks to Troy Bollinger.

2009/02/01: Various bugfixes. Thanks to Hiroshi Manabe.

2009/01/17: Correctly handling a trailer that contains both /XrefStm and /Prev entries.

2009/01/10: Handling Type3 font metrics correctly.

2008/12/28: Better handling of word spacing. Thanks to Christian Nentwich.

2008/09/06: A sample pdf2html webapp added.

2008/08/30: ASCII85 encoding filter support.

2008/07/27: Tagged contents extraction support.

2008/07/10: Outline (TOC) extraction support.

2008/06/29: HTML output added. Reorganized the directory structure.

2008/04/29: Bugfix for Win32. Thanks to Chris Clark.

2008/04/27: Basic encryption and LZW decoding support added.

2008/01/07: Several bugfixes. Thanks to Nick Fabry for his vast contribution.

2007/12/31: Initial release.

2004/12/24: Start writing the code out of boredom...
