Python: parsing files (docx, pdf, and odt) and converting the content into my data model

I'm writing an import/export tool for docx, pdf, and odt files, each of which contains a book.

We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.

So far I've looked at PDFMiner. I've also found that docx is based on the Office Open XML format: the file is a zip archive, word/document.xml inside it essentially contains the whole text, and I can parse that with lxml.

The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:

Book --> o2m --> Chapter --> o2m --> Image
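To make the target concrete, here is a minimal sketch of that one-to-many model as plain dataclasses. The field names (title, text, filename, data) are assumptions for illustration; the real model on the site is not shown in the question.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    filename: str        # hypothetical field: name of the image in the source file
    data: bytes = b""    # hypothetical field: raw image bytes


@dataclass
class Chapter:
    title: str
    text: str = ""
    images: List[Image] = field(default_factory=list)      # one chapter, many images


@dataclass
class Book:
    title: str
    chapters: List[Chapter] = field(default_factory=list)  # one book, many chapters


# Build a tiny book by hand, the way an importer would after parsing a file.
book = Book(title="Example")
book.chapters.append(Chapter(title="Chapter 1", text="It was a dark night."))
book.chapters[0].images.append(Image(filename="cover.png"))
```

Whatever parser is used for each file type only has to produce a list of Chapter objects, so the per-format code stays separate from the model.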

Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.

Even more problematic are docx and odt: those are just paragraph elements (w:p in docx, text:p in odt), with attributes and child elements.

I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
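One possible convention, sketched here for docx only, is to ask users to mark chapter titles with a heading style and then start a new chapter at every paragraph that carries it. This assumes the style ID is "Heading1", which is what English-language Word typically writes; other locales and custom templates may use different IDs. The function name split_chapters and its parameters are made up for this example.

```python
from xml.etree.ElementTree import XML

# Namespace of WordprocessingML elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_chapters(document_xml, heading_style="Heading1"):
    """Group paragraph text under the most recent heading paragraph.

    Returns a list of (title, [paragraph, ...]) tuples. Text that appears
    before the first heading is collected under the title None.
    """
    chapters = []
    current = (None, [])
    for para in XML(document_xml).iter(W + "p"):
        # A paragraph's text is spread over one or more w:t nodes.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        # The paragraph style, if any, lives in w:pPr/w:pStyle/@w:val.
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val") if style is not None else None
        if style_id == heading_style:
            if current[0] is not None or current[1]:
                chapters.append(current)
            current = (text, [])      # heading text becomes the chapter title
        elif text:
            current[1].append(text)
    if current[0] is not None or current[1]:
        chapters.append(current)
    return chapters
```

The input is the content of word/document.xml as bytes or a string, so the function can be tried on a small hand-written XML snippet before wiring it to real files.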

Solution

(Python 3 answer)

When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/

What it does is simply get the text from a .docx file and return it as a string; separate paragraphs remain clearly separate, as there are newlines between them, but all other formatting is lost. I think this may include the loss of endnotes and footnotes, but if you want the body of a text, it works great.

I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:

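import zipfile

try:
    # Separate C-accelerated module; it was removed in Python 3.9
    from xml.etree.cElementTree import XML
    print("cElementTree")
except ImportError:
    # ElementTree uses the C accelerator automatically since Python 3.3
    from xml.etree.ElementTree import XML
    print("ElementTree")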
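The function itself is not reproduced here. A minimal sketch of the same idea, written for this answer rather than copied from the linked post, could look like the following. The name get_docx_text and the part parameter are assumptions for illustration.

```python
import zipfile
from xml.etree.ElementTree import XML

# Namespace of WordprocessingML elements in the .docx XML parts
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_docx_text(path, part="word/document.xml"):
    """Return the text of one XML part of a .docx as a single string.

    `path` may be a file name or a binary file-like object. Paragraphs are
    separated by blank lines; all other formatting is dropped.
    """
    with zipfile.ZipFile(path) as archive:
        tree = XML(archive.read(part))
    paragraphs = []
    for para in tree.iter(W + "p"):
        # Join the w:t text nodes that make up this paragraph.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

The part parameter is there so the same function can read other XML parts of the archive. archive.read() raises KeyError if the requested part does not exist in the file.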

EDIT:

If, in the body of the function, you replace 'word/document.xml' with 'word/footnotes.xml' or 'word/endnotes.xml', you can get the footnotes and endnotes, respectively.

The markers for where they were in the text are lost, however.
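If the positions matter, one way to recover them is to keep the IDs instead of flattening everything to text: in WordprocessingML, each w:footnoteReference in the body carries a w:id that matches the w:id of a w:footnote in word/footnotes.xml. The sketch below assumes that layout; the function names are made up for this example, and endnotes would need the analogous w:endnote and w:endnoteReference elements.

```python
from xml.etree.ElementTree import XML

# Namespace of WordprocessingML elements in the .docx XML parts
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Footnote types that are layout placeholders rather than real notes
PLACEHOLDER_TYPES = {"separator", "continuationSeparator", "continuationNotice"}


def footnotes_by_id(footnotes_xml):
    """Map each real footnote's w:id to its text."""
    notes = {}
    for note in XML(footnotes_xml).iter(W + "footnote"):
        if note.get(W + "type") in PLACEHOLDER_TYPES:
            continue
        text = "".join(node.text or "" for node in note.iter(W + "t"))
        notes[note.get(W + "id")] = text
    return notes


def referenced_ids(document_xml):
    """List the w:id of every footnote reference, in document order."""
    return [ref.get(W + "id")
            for ref in XML(document_xml).iter(W + "footnoteReference")]
```

Both functions take the XML content of the relevant part (word/footnotes.xml and word/document.xml) as bytes or a string, so the two results can be joined on the ID.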
