Using LibreOffice + Python on Linux to convert doc/docx/wps documents to html/txt/docx and other formats

Word document format conversion tools on Linux

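I recently got a requirement to convert documents in several different formats (.doc, .docx and .wps) into one uniform format: either everything to .docx, or directly to .html or .txt. After some research, I found the following tools: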

  • win32com
  • python-docx
  • pydocx

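There may be others, but I won't list them all here. After looking into them, I found a common problem: the Python tools either cannot handle .doc (they only read .docx) or only run on Windows (win32com, for example). Most production environments today run Linux, and switching to a Windows server causes a chain of follow-on problems, such as whether the other libraries remain compatible, which is a lot of trouble. So finding a way to process .doc/.wps files on Linux really matters. Fortunately, I found one in the end: LibreOffice.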

How to use LibreOffice

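  1. First, run libreoffice --version on the command line to check whether the tool is already installed. If it is not, install LibreOffice before continuing.
  2. Once it is installed, use the following command to convert a document. For example:
    To convert a .doc document to txt: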
libreoffice --headless --convert-to txt path-to-your-doc.doc

You can also specify the output directory for the converted files, and pass a whole batch of doc/docx/wps files to LibreOffice in a single call:

libreoffice --headless --convert-to html --outdir /your/output/dir /your/doc_docx_wps/files/*.{docx,doc,wps}
  3. Running the conversion from a Python script
    There is nothing fancy about this; Python simply runs the command line:
import os
os.system("libreoffice --headless --convert-to txt path-to-your-doc.doc")
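
As a hedged alternative to os.system, the same command can be run through subprocess.run with an argument list. This avoids shell quoting problems when a file name contains spaces, and it lets the script notice a non-zero exit status. The helper names build_convert_command and convert_document and the 120-second timeout are hypothetical choices for illustration, and the sketch assumes the libreoffice binary is on PATH.

```python
import shutil
import subprocess


def build_convert_command(src, target_format, outdir=None, binary="libreoffice"):
    """Build the argument list for a headless conversion (hypothetical helper)."""
    cmd = [binary, "--headless", "--convert-to", target_format]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd.append(src)
    return cmd


def convert_document(src, target_format, outdir=None, timeout=120):
    """Run the conversion; raise if LibreOffice is missing or exits with an error."""
    if shutil.which("libreoffice") is None:
        raise FileNotFoundError("libreoffice was not found on PATH")
    cmd = build_convert_command(src, target_format, outdir)
    # check=True raises CalledProcessError on a non-zero exit status;
    # timeout raises TimeoutExpired if the conversion hangs.
    subprocess.run(cmd, check=True, timeout=timeout)
```

Calling convert_document("path-to-your-doc.doc", "txt") would then run the same command as the os.system line above. Since the exit status alone may not tell the whole story, it can be worth checking afterwards that the output file actually exists.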

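Of course, if converting files one at a time through this interface is too slow for you, you can also use Python to launch several conversions in parallel: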

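import glob
import os
import subprocess
from multiprocessing.dummy import Pool  # thread-based Pool; each task still starts its own libreoffice process

def worker(fname, dstdir=os.path.expanduser("~")):
    # No --outdir is given, so the converted PDF is written to the working directory (dstdir).
    subprocess.call(["libreoffice", "--headless", "--convert-to", "pdf", fname], cwd=dstdir)

pool = Pool()  # defaults to one worker thread per CPU core
# Convert every .doc file in the home directory.
pool.map(worker, glob.iglob(os.path.join(os.path.expanduser("~"), "*.doc")))
pool.close()
pool.join()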
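
One caveat with parallel runs: LibreOffice instances that share the same user profile can interfere with one another. A commonly used workaround is to give each process its own profile directory through the -env:UserInstallation bootstrap variable. The sketch below assumes that workaround. It also collects .doc, .docx and .wps files explicitly, because the shell's {docx,doc,wps} brace expansion is not available when the command is started without a shell. The function names, the temporary profile location and the worker count are illustrative assumptions, not part of the script above.

```python
import glob
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def collect_sources(srcdir, extensions=("doc", "docx", "wps")):
    """Return a sorted list of files in srcdir that have one of the extensions."""
    found = []
    for ext in extensions:
        found.extend(glob.glob(os.path.join(srcdir, "*." + ext)))
    return sorted(found)


def build_command(src, target_format, outdir, profile_dir):
    """Build a conversion command that uses an isolated profile (assumed workaround)."""
    profile_url = Path(profile_dir).resolve().as_uri()  # file:// URL of the profile directory
    return [
        "libreoffice",
        "-env:UserInstallation=" + profile_url,
        "--headless",
        "--convert-to", target_format,
        "--outdir", outdir,
        src,
    ]


def convert_one(src, target_format, outdir):
    """Convert one file inside a throwaway profile; return the exit status."""
    with tempfile.TemporaryDirectory(prefix="lo_profile_") as profile_dir:
        return subprocess.call(build_command(src, target_format, outdir, profile_dir))


def convert_all(srcdir, target_format, outdir, workers=4):
    """Convert all matching files in parallel; return {source path: exit status}."""
    sources = collect_sources(srcdir)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(lambda s: convert_one(s, target_format, outdir), sources)
        return dict(zip(sources, codes))
```

Creating a fresh profile for every file adds startup cost, so for large batches it may be better to reuse one profile directory per worker instead; that trade-off is left open here.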

Other conversion features of LibreOffice

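LibreOffice is actually very capable: it can also convert xhtml, pdf, jpeg, png and many other formats. The supported formats are listed below. (The listing has the layout of unoconv --show output, so some of the short names may be unoconv aliases rather than values accepted by --convert-to.)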

The following list of document formats are currently available:

  bib      - BibTeX [.bib]
  doc      - Microsoft Word 97/2000/XP [.doc]
  doc6     - Microsoft Word 6.0 [.doc]
  doc95    - Microsoft Word 95 [.doc]
  docbook  - DocBook [.xml]
  docx     - Microsoft Office Open XML [.docx]
  docx7    - Microsoft Office Open XML [.docx]
  fodt     - OpenDocument Text (Flat XML) [.fodt]
  html     - HTML Document (OpenOffice.org Writer) [.html]
  latex    - LaTeX 2e [.ltx]
  mediawiki - MediaWiki [.txt]
  odt      - ODF Text Document [.odt]
  ooxml    - Microsoft Office Open XML [.xml]
  ott      - Open Document Text [.ott]
  pdb      - AportisDoc (Palm) [.pdb]
  pdf      - Portable Document Format [.pdf]
  psw      - Pocket Word [.psw]
  rtf      - Rich Text Format [.rtf]
  sdw      - StarWriter 5.0 [.sdw]
  sdw4     - StarWriter 4.0 [.sdw]
  sdw3     - StarWriter 3.0 [.sdw]
  stw      - Open Office.org 1.0 Text Document Template [.stw]
  sxw      - Open Office.org 1.0 Text Document [.sxw]
  text     - Text Encoded [.txt]
  txt      - Text [.txt]
  uot      - Unified Office Format text [.uot]
  vor      - StarWriter 5.0 Template [.vor]
  vor4     - StarWriter 4.0 Template [.vor]
  vor3     - StarWriter 3.0 Template [.vor]
  wps      - Microsoft Works [.wps]
  xhtml    - XHTML Document [.html]

The following list of graphics formats are currently available:

  bmp      - Windows Bitmap [.bmp]
  emf      - Enhanced Metafile [.emf]
  eps      - Encapsulated PostScript [.eps]
  fodg     - OpenDocument Drawing (Flat XML) [.fodg]
  gif      - Graphics Interchange Format [.gif]
  html     - HTML Document (OpenOffice.org Draw) [.html]
  jpg      - Joint Photographic Experts Group [.jpg]
  met      - OS/2 Metafile [.met]
  odd      - OpenDocument Drawing [.odd]
  otg      - OpenDocument Drawing Template [.otg]
  pbm      - Portable Bitmap [.pbm]
  pct      - Mac Pict [.pct]
  pdf      - Portable Document Format [.pdf]
  pgm      - Portable Graymap [.pgm]
  png      - Portable Network Graphic [.png]
  ppm      - Portable Pixelmap [.ppm]
  ras      - Sun Raster Image [.ras]
  std      - OpenOffice.org 1.0 Drawing Template [.std]
  svg      - Scalable Vector Graphics [.svg]
  svm      - StarView Metafile [.svm]
  swf      - Macromedia Flash (SWF) [.swf]
  sxd      - OpenOffice.org 1.0 Drawing [.sxd]
  sxd3     - StarDraw 3.0 [.sxd]
  sxd5     - StarDraw 5.0 [.sxd]
  sxw      - StarOffice XML (Draw) [.sxw]
  tiff     - Tagged Image File Format [.tiff]
  vor      - StarDraw 5.0 Template [.vor]
  vor3     - StarDraw 3.0 Template [.vor]
  wmf      - Windows Metafile [.wmf]
  xhtml    - XHTML [.xhtml]
  xpm      - X PixMap [.xpm]

The following list of presentation formats are currently available:

  bmp      - Windows Bitmap [.bmp]
  emf      - Enhanced Metafile [.emf]
  eps      - Encapsulated PostScript [.eps]
  fodp     - OpenDocument Presentation (Flat XML) [.fodp]
  gif      - Graphics Interchange Format [.gif]
  html     - HTML Document (OpenOffice.org Impress) [.html]
  jpg      - Joint Photographic Experts Group [.jpg]
  met      - OS/2 Metafile [.met]
  odg      - ODF Drawing (Impress) [.odg]
  odp      - ODF Presentation [.odp]
  otp      - ODF Presentation Template [.otp]
  pbm      - Portable Bitmap [.pbm]
  pct      - Mac Pict [.pct]
  pdf      - Portable Document Format [.pdf]
  pgm      - Portable Graymap [.pgm]
  png      - Portable Network Graphic [.png]
  potm     - Microsoft PowerPoint 2007/2010 XML Template [.potm]
  pot      - Microsoft PowerPoint 97/2000/XP Template [.pot]
  ppm      - Portable Pixelmap [.ppm]
  pptx     - Microsoft PowerPoint 2007/2010 XML [.pptx]
  pps      - Microsoft PowerPoint 97/2000/XP (Autoplay) [.pps]
  ppt      - Microsoft PowerPoint 97/2000/XP [.ppt]
  pwp      - PlaceWare [.pwp]
  ras      - Sun Raster Image [.ras]
  sda      - StarDraw 5.0 (OpenOffice.org Impress) [.sda]
  sdd      - StarImpress 5.0 [.sdd]
  sdd3     - StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
  sdd4     - StarImpress 4.0 [.sdd]
  sxd      - OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
  sti      - OpenOffice.org 1.0 Presentation Template [.sti]
  svg      - Scalable Vector Graphics [.svg]
  svm      - StarView Metafile [.svm]
  swf      - Macromedia Flash (SWF) [.swf]
  sxi      - OpenOffice.org 1.0 Presentation [.sxi]
  tiff     - Tagged Image File Format [.tiff]
  uop      - Unified Office Format presentation [.uop]
  vor      - StarImpress 5.0 Template [.vor]
  vor3     - StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
  vor4     - StarImpress 4.0 Template [.vor]
  vor5     - StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
  wmf      - Windows Metafile [.wmf]
  xhtml    - XHTML [.xml]
  xpm      - X PixMap [.xpm]

The following list of spreadsheet formats are currently available:

  csv      - Text CSV [.csv]
  dbf      - dBASE [.dbf]
  dif      - Data Interchange Format [.dif]
  fods     - OpenDocument Spreadsheet (Flat XML) [.fods]
  html     - HTML Document (OpenOffice.org Calc) [.html]
  ods      - ODF Spreadsheet [.ods]
  ooxml    - Microsoft Excel 2003 XML [.xml]
  ots      - ODF Spreadsheet Template [.ots]
  pdf      - Portable Document Format [.pdf]
  pxl      - Pocket Excel [.pxl]
  sdc      - StarCalc 5.0 [.sdc]
  sdc4     - StarCalc 4.0 [.sdc]
  sdc3     - StarCalc 3.0 [.sdc]
  slk      - SYLK [.slk]
  stc      - OpenOffice.org 1.0 Spreadsheet Template [.stc]
  sxc      - OpenOffice.org 1.0 Spreadsheet [.sxc]
  uos      - Unified Office Format spreadsheet [.uos]
  vor3     - StarCalc 3.0 Template [.vor]
  vor4     - StarCalc 4.0 Template [.vor]
  vor      - StarCalc 5.0 Template [.vor]
  xhtml    - XHTML [.xhtml]
  xls      - Microsoft Excel 97/2000/XP [.xls]
  xls5     - Microsoft Excel 5.0 [.xls]
  xls95    - Microsoft Excel 95 [.xls]
  xlt      - Microsoft Excel 97/2000/XP Template [.xlt]
  xlt5     - Microsoft Excel 5.0 Template [.xlt]
  xlt95    - Microsoft Excel 95 Template [.xlt]
  xlsx     - Microsoft Excel 2007/2010 XML [.xlsx]
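
When a script needs to pick up the converted file afterwards, it helps to know where the file lands. As far as I can tell, LibreOffice writes the result into the output directory using the source file's base name plus the target extension, and --convert-to also accepts an extension:filter form such as txt:Text. The helper below is a small sketch under those assumptions; expected_output_path is a hypothetical name, not a LibreOffice API.

```python
import os


def expected_output_path(src, target_format, outdir="."):
    """Guess where the converted file will be written (assumed naming rule)."""
    # "txt:Text" means extension "txt" with export filter "Text"; keep only the extension.
    extension = target_format.split(":", 1)[0]
    stem = os.path.splitext(os.path.basename(src))[0]
    return os.path.join(outdir, stem + "." + extension)


print(expected_output_path("/data/report.doc", "txt", "/out"))
print(expected_output_path("notes.v2.wps", "txt:Text"))
```

A caller can test the returned path with os.path.exists after the conversion to confirm that the output was really produced.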

To convert an Excel file to PDF with Python, you can also drive LibreOffice through its UNO API. Here is a simple code example:

```python
import os
import uno
from com.sun.star.beans import PropertyValue

def convert_to_pdf(input_file, output_file):
    # Connect to a LibreOffice instance that is already listening on port 2002
    local_context = uno.getComponentContext()
    resolver = local_context.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_context)
    context = resolver.resolve(
        "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
    desktop = context.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", context)

    # Open the Excel file
    url = uno.systemPathToFileUrl(os.path.abspath(input_file))
    doc = desktop.loadComponentFromURL(url, "_blank", 0, ())

    # Save the file as PDF; spreadsheets use the Calc export filter
    output_url = uno.systemPathToFileUrl(os.path.abspath(output_file))
    properties = (PropertyValue("FilterName", 0, "calc_pdf_Export", 0),)
    doc.storeToURL(output_url, properties)

    # Close the document, then ask the LibreOffice instance to exit
    doc.close(True)
    desktop.terminate()

# Example usage
convert_to_pdf("input.xlsx", "output.pdf")
```

In the code above, we first use the `uno` module to connect to a running LibreOffice instance and open the specified Excel file. We then save the file in PDF format to the given output path. Finally, we close the document and shut down the LibreOffice instance.

Note that for this code to work, LibreOffice must be installed on the local machine and already started as a service that accepts socket connections. The example uses local port `2002`; if your LibreOffice instance listens on a different port, change the code accordingly.