Python: parsing files (docx, pdf, and odt) and converting the content into my data model

weixin_39603622

I'm writing an import/export tool for importing docx, pdf, and odt files in which a book has been written.

We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.

So far I've looked at PDFMiner, and I've also found out that docx is based on the Office Open XML format: a .docx file is a zip archive in which word/document.xml is essentially the file containing the whole text, and I can parse it with lxml.

The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:

Book --> o2m --> Chapter --> o2m --> Image
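For illustration, the one-to-many chain above could be sketched with plain dataclasses. The field names (title, text, filename, data) are assumptions made for this example and are not taken from the actual site's models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    filename: str      # hypothetical field: name of the image in the source file
    data: bytes = b""  # hypothetical field: raw image bytes


@dataclass
class Chapter:
    title: str
    text: str = ""
    images: List[Image] = field(default_factory=list)  # Chapter -> o2m -> Image


@dataclass
class Book:
    title: str
    chapters: List[Chapter] = field(default_factory=list)  # Book -> o2m -> Chapter


# Build a tiny book by hand to show how the pieces nest.
book = Book(title="Example Book")
chapter = Chapter(title="Chapter 1", text="It was a dark and stormy night.")
chapter.images.append(Image(filename="cover.png"))
book.chapters.append(chapter)
```

Whatever parser is used for each file type would then only need to produce Chapter and Image objects in order.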

PDFMiner has a .get_outlines() function that will return the TOC for me, but I can't see how to link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.

Even more problematic are docx and odt: their content is just a sequence of paragraph elements (<w:p> in docx), with attributes and child elements.

I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
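One possible heuristic for docx, offered only as a sketch: if authors are asked to mark chapter titles with Word's built-in "Heading 1" style, each such paragraph carries a w:pStyle element whose w:val is typically "Heading1". The function name split_chapters and the default style id are assumptions for this example; localized or custom templates may use a different id. It uses the standard library's ElementTree rather than lxml so that it has no dependencies.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_chapters(docx_file, heading_style="Heading1"):
    """Return a list of (title, [paragraph text, ...]) tuples.

    docx_file may be a path or a binary file-like object.
    heading_style is an assumed style id; adjust it to the template in use.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = XML(archive.read("word/document.xml"))

    chapters = []
    for para in root.iter(W + "p"):
        # A paragraph's visible text is spread over its <w:t> descendants.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val") if style is not None else None
        if style_id == heading_style:
            chapters.append((text, []))       # a heading starts a new chapter
        elif text and chapters:
            chapters[-1][1].append(text)      # body text joins the current chapter
        # Text that appears before the first heading is ignored in this sketch.
    return chapters
```

The same idea should carry over to odt, where headings are separate text:h elements inside content.xml, but that variant is not shown here.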

Solution

(Python 3 answer)

When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/

What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are new lines between them, but all other formatting is lost. I think this may include the loss of endnotes and footnotes, but if you want the body of a text, it works great.

I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:

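import zipfile

try:
    # cElementTree was removed in Python 3.9, so this import only succeeds on older versions.
    from xml.etree.cElementTree import XML
    print("cElementTree")
except ImportError:
    # Fallback; since Python 3.3, ElementTree uses the C accelerator automatically when available.
    from xml.etree.ElementTree import XML
    print("ElementTree")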
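The body of the function is not reproduced in this answer. As a rough sketch of the approach described above (open the archive, read 'word/document.xml', and join the text of each paragraph), it could look something like the following. The name get_docx_text and the constant WORD_NS are assumptions for this example, and the code at the linked post may differ in its details.

```python
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_docx_text(path):
    """Return the body text of a .docx file as one string.

    path may be a filename or a binary file-like object.
    Paragraphs are separated by blank lines; all other formatting is dropped.
    """
    with zipfile.ZipFile(path) as archive:
        xml_content = archive.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(WORD_NS + "p"):
        # Collect the text runs (<w:t>) that make up this paragraph.
        texts = [node.text for node in paragraph.iter(WORD_NS + "t") if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)
```

Calling get_docx_text("book.docx") on a hypothetical file would return the whole body as a single string, which can then be split on blank lines.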

EDIT:

If, in the body of the function, you replace

'word/document.xml'

with

'word/footnotes.xml'

or

'word/endnotes.xml'

you can get the footnotes and endnotes, respectively.

The markers for where they were in the text are lost, however.
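Instead of editing the function by hand for each case, the part name could be passed in as a parameter. The following is a minimal sketch under that assumption; get_part_text is a hypothetical name, and the function returns an empty string when the archive has no such part, since a document without notes may not contain those files.

```python
import zipfile
from xml.etree.ElementTree import XML

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
WORD_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_part_text(path, part="word/document.xml"):
    """Return the paragraph text of one XML part inside a .docx archive.

    part can be 'word/document.xml', 'word/footnotes.xml' or
    'word/endnotes.xml'. A missing part yields an empty string.
    """
    with zipfile.ZipFile(path) as archive:
        if part not in archive.namelist():
            return ""
        tree = XML(archive.read(part))

    paragraphs = []
    for paragraph in tree.iter(WORD_NS + "p"):
        texts = [node.text for node in paragraph.iter(WORD_NS + "t") if node.text]
        if texts:
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)
```

For example, get_part_text("book.docx", "word/footnotes.xml") would return the footnote text of a hypothetical book.docx, still without any indication of where each note was referenced in the body.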
