Reading specific content from Word with Python: reading word.docx with python-docx

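This is my first blog post and I wasn't sure what to write about, so I'm recording the problems I run into while learning Python so that I can look them up later. I'm a beginner, so if anything here is wrong, please point it out!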

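Chinese encoding problems are always a headache. When I wanted to read the contents of a Word file with Python, open() kept raising errors. After searching online, I found that Python has a dedicated module for reading .docx files, python-docx (it can only read .docx files, not .doc files), and it is very convenient to use.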
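Part of the reason open() struggles is that a .docx file is not plain text: it is a ZIP package whose main text lives in an XML part named word/document.xml. The sketch below uses only the standard library to show that layout. make_minimal_package is a hypothetical helper that builds a tiny in-memory stand-in (not a file Word would open), and the reader ignores everything python-docx handles properly, such as styles, headers, and text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_texts(path_or_file):
    """Return the text of each <w:p> paragraph found in a .docx package."""
    with zipfile.ZipFile(path_or_file) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is spread over its <w:t> elements.
        texts.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return texts


def make_minimal_package(paragraphs):
    """Hypothetical demo helper: zip up a bare word/document.xml in memory."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml)  # str is stored as UTF-8
    buffer.seek(0)
    return buffer


demo = make_minimal_package(["你好", "world"])
print(docx_paragraph_texts(demo))
```

For real documents python-docx is still the better tool; this only illustrates why reading the file as text does not work.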

Install python-docx:

pip install python-docx

(Note: not pip install docx! The docx package also installs, but importing it always fails with an error about a missing exceptions module.)

Now python-docx can be used to read the text of a Word file.

The code is as follows:

from docx import Document

# Path to the .docx file to read
path = "C:\\Users\\Administrator\\Desktop\\word.docx"
document = Document(path)
# Print the text of every paragraph in the document body
for paragraph in document.paragraphs:
    print(paragraph.text)

Running it prints the document's text.
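To pick out only specific content instead of printing everything, the paragraphs can be filtered by their text or style name. This is a minimal sketch: paragraphs_containing, heading_texts, and load_paragraphs are hypothetical helper names, and the heading filter assumes the document uses the built-in "Title" and "Heading N" styles. The sample objects are stand-ins that only mimic the .text and .style.name attributes of python-docx paragraphs.

```python
from types import SimpleNamespace


def paragraphs_containing(paragraphs, keyword):
    """Return the text of every paragraph whose text contains `keyword`."""
    return [p.text for p in paragraphs if keyword in p.text]


def heading_texts(paragraphs):
    """Return the text of paragraphs styled 'Title' or 'Heading N'."""
    return [
        p.text
        for p in paragraphs
        if p.style.name == "Title" or p.style.name.startswith("Heading")
    ]


def load_paragraphs(path):
    """Open a .docx with python-docx and return its paragraph objects."""
    from docx import Document  # imported here so the helpers above work without it

    return Document(path).paragraphs


# Stand-in objects with the same attributes as python-docx paragraphs.
sample = [
    SimpleNamespace(text="Overview", style=SimpleNamespace(name="Heading 1")),
    SimpleNamespace(text="Python reads Word files.", style=SimpleNamespace(name="Normal")),
]
print(paragraphs_containing(sample, "Word"))
print(heading_texts(sample))
```

With a real file, the same helpers would be called as paragraphs_containing(load_paragraphs(path), "keyword"). Note that document.paragraphs covers body paragraphs only; text inside tables is reached through document.tables.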

I also tried reading a .doc file with docx.

The code is as follows:

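import os
import docx

# Look for .doc files in the current directory
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        print(filename[:-4])
        # Try to open the same name with a .docx extension
        doc = docx.Document(filename[:-4] + ".docx")
        for para in doc.paragraphs:
            print(para.text)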

The result is an error: docx.opc.exceptions.PackageNotFoundError: Package not found. The .doc file still cannot be read.

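Quoting the first commenter: "Changing the file extension does not change how the file is encoded, so the text still cannot be read. The .doc file has to be saved as a .docx file first, and then python-docx can read its contents."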
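The commenter's point can be checked by looking at the first bytes of the file rather than its extension. The sketch below is an assumption-laden heuristic, not part of python-docx: sniff_word_format is a hypothetical helper, the OLE2 signature is shared by other legacy Office formats (.xls, .ppt), and the .docx check only looks for a word/document.xml entry in the ZIP.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy compound file (.doc and others)
ZIP_MAGIC = b"PK\x03\x04"                        # start of a non-empty ZIP archive


def sniff_word_format(stream):
    """Guess 'doc', 'docx', or 'unknown' from the content of a binary stream."""
    header = stream.read(8)
    stream.seek(0)
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(stream) as package:
                is_word = "word/document.xml" in package.namelist()
        except zipfile.BadZipFile:
            is_word = False
        stream.seek(0)
        return "docx" if is_word else "unknown"
    return "unknown"


# A renamed .doc still starts with the OLE2 signature, whatever its extension says.
fake_doc = io.BytesIO(OLE2_MAGIC + b"\x00" * 64)
print(sniff_word_format(fake_doc))
```

For a file on disk, open it in binary mode first, for example with open(path, "rb") as f: sniff_word_format(f). A result of "doc" means the file still needs to be saved as .docx (for instance from Word itself) before python-docx can open it.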

# Document also has methods for adding headings, page breaks, paragraphs, pictures, sections, and more, described below

add_heading(self, text='', level=1)
    Return a heading paragraph newly added to the end of the document, containing *text* and having its paragraph style determined by *level*. If *level* is 0, the style is set to `Title`. If *level* is 1 (or omitted), `Heading 1` is used. Otherwise the style is set to `Heading {level}`. Raises |ValueError| if *level* is outside the range 0-9.

add_page_break(self)
    Return a paragraph newly added to the end of the document and containing only a page break.

add_paragraph(self, text='', style=None)
    Return a paragraph newly added to the end of the document, populated with *text* and having paragraph style *style*. *text* can contain tab (``\t``) characters, which are converted to the appropriate XML form for a tab. *text* can also include newline (``\n``) or carriage return (``\r``) characters, each of which is converted to a line break.

add_picture(self, image_path_or_stream, width=None, height=None)
    Return a new picture shape added in its own paragraph at the end of the document. The picture contains the image at *image_path_or_stream*, scaled based on *width* and *height*. If neither width nor height is specified, the picture appears at its native size. If only one is specified, it is used to compute a scaling factor that is then applied to the unspecified dimension, preserving the aspect ratio of the image. The native size of the picture is calculated using the dots-per-inch (dpi) value specified in the image file, defaulting to 72 dpi if no value is specified, as is often the case.

add_section(self, start_type=2)
    Return a |Section| object representing a new section added at the end of the document. The optional *start_type* argument must be a member of the :ref:`WdSectionStart` enumeration, and defaults to ``WD_SECTION.NEW_PAGE`` if not provided.

add_table(self, rows, cols, style=None)
    Add a table having row and column counts of *rows* and *cols* respectively and table style of *style*. *style* may be a paragraph style object or a paragraph style name. If *style* is |None|, the table inherits the default table style of the document.

save(self, path_or_stream)
    Save this document to *path_or_stream*, which can be either a path to a filesystem location (a string) or a file-like object.

docx has many other features that I'm still learning. See the official documentation for details: https://python-docx.readthedocs.io/en/latest/user/quickstart.html
