A quick trial of full-text search and preview for Chinese Word and Chinese PDF files with Plone 5.1.2 on debian-7.11.0-amd64

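A word of warning first: if you have never touched Plone, you may want to stop reading here, haha. Two posts by Lao Pan of 易度 explain thoroughly how frustrating Zope/Plone can be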

(on Zhihu) https://www.zhihu.com/question/19649024
(on Douban) https://www.douban.com/group/topic/11400495/

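My own experience with Zope/Plone: few people in China use it, Chinese documentation is extremely scarce, what you learn about Zope/Plone hardly transfers to other projects, even small problems take great effort to solve, and third-party add-ons often fail to keep up with core releases. In short, Zope/Plone creates more problems than it solves.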

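Back in 2007 I took a wrong turn and, in a moment of confusion, used Plone-2.1.3 on Windows together with

Lao Pan's (http://old.zope.org/Members/panjunyong) CJKSplitter and ZopeChinaPak
ingeniweb's (http://ingeniweb.sourceforge.net/) PloneExFile, AttachmentField and FileSystemStorage

to build an electronic document management system for my own use. Binary files were stored on the file system rather than in the ZODB, files were uploaded in bulk over FTP, Chinese full-text search and preview worked for Word 2003 and text-based PDF files, and folder listings and search results showed the first several dozen to a hundred or so characters under each entry.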

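Over the years there were two major upgrades, first to 2.5.5 and then to 3.3.5 (ZODB 3.8 supports blobs, and wc.pageturner added preview of image-based PDFs). The system is really a PloneExFile application more than a Plone one. Since the PloneExFile project has stopped (no Plone 4 support) and the ODBCDA I rely on also supports only Plone 3 (Python 2.4), I had no desire to keep upgrading.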

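The system has served me well, but I cannot share it with colleagues in my department because PloneExFile does not support Office 2007 or later. Products.OpenXml can add Chinese full-text search for Office 2007 and later, but preview and text excerpts cannot be made to work. wc.pageturner also has a glitch: PDFs uploaded in bulk over FTP often crash the ZODB during SWF conversion, which then has to be repaired with fsrecover.py, while the same files uploaded through the web page convert fine.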

A while ago, on this page

https://stackoverflow.com/questions/12420334/how-to-use-wc-pageturner-in-plone-4-1

I saw that vangheem, the author of wc.pageturner, had replied:

If you want a pdf viewer, use collective.documentviewer. 
I am no longer updating wc.pageturner--
collective.documentviewer is a much better viewer and implementation. 
– vangheem 2012-09-14

On his own site he also says: "It is recommended that you do not use this method anymore. Please use collective.documentviewer now which should cover all the use cases."

https://www.nathanvangheem.com/posts/2011/04/14/using-plone-as-a-document-repository.html

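I also wanted to see how far Plone has come and how convenient it now is for building a document management system, so I tried collective.documentviewer on the latest Plone release. I chose Debian because collective.documentviewer does not support Windows, and debian-7.11.0 because I happened to have the complete set of 10 DVDs already downloaded; for a prototype test that is good enough.

Note that collective.documentviewer depends on docsplit, and docsplit in turn relies on LibreOffice or OpenOffice, so when installing Debian be sure to select the desktop environment ("桌面支持"), the development tools ("开发支持") and Chinese language support. LibreOffice apparently also has a headless mode that needs no GUI and offers conversion as a service on a port (Alfresco uses it), but I did not try it.

A note on install media: a network install of debian-7.11.0 from an internal APT repository served over FTP from the debian-7.11.0-amd64 DVDs only uses DVD1, whereas installing directly from DVD1 asks you to swap through three discs. That suggests the DVD install is more complete; in my case only the DVD install came with a Chinese input method.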

Main references

http://documentcloud.github.io/docsplit
https://www.documentcloud.org/opensource
https://www.nathanvangheem.com/posts/2012/04/29/document-viewer-integration-in-plone.html
https://www.dangtrinh.com/2013/07/plone-review-documents-in-plone-with.html
http://tunmer.me/how-tos/installing-plone-on-ubuntu.html

1. Install the prerequisites first; some are needed to install Plone, others to run Plone and its add-ons

apt-get -y --force-yes install build-essential
apt-get -y --force-yes install gcc g++ sudo git
apt-get -y --force-yes install libxml2 libxml2-dev libxslt1-dev
apt-get -y --force-yes install zlibc zlib1g-dev libbz2-dev libssl-dev p7zip-full unzip
apt-get -y --force-yes install unace unp bzip2 gzip patch
apt-get -y --force-yes install python-dev libjpeg-dev
apt-get -y --force-yes install libsqlite3-dev
apt-get -y --force-yes install libreadline-dev
apt-get -y --force-yes install rubygems
apt-get -y --force-yes install graphicsmagick
apt-get -y --force-yes install poppler-utils poppler-data
apt-get -y --force-yes install ghostscript
apt-get -y --force-yes install tesseract-ocr
apt-get -y --force-yes install pdftk
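
After the packages are installed, it can help to confirm that the command-line tools used later are actually on the PATH. The following is a minimal sketch and not part of the original setup. It is written for Python 3 (for `shutil.which`), and the mapping from Debian package to executable name (for example graphicsmagick to `gm`, poppler-utils to `pdftotext`) is my assumption about what those packages provide.

```python
import shutil

# Assumed mapping: Debian package -> one executable it is expected to provide.
EXPECTED_TOOLS = {
    "gcc": "gcc",
    "git": "git",
    "rubygems": "gem",
    "graphicsmagick": "gm",
    "poppler-utils": "pdftotext",
    "ghostscript": "gs",
    "tesseract-ocr": "tesseract",
    "pdftk": "pdftk",
}


def missing_tools(expected=EXPECTED_TOOLS, which=shutil.which):
    """Return the sorted package names whose executable is not on PATH."""
    return sorted(pkg for pkg, exe in expected.items() if which(exe) is None)


if __name__ == "__main__":
    for pkg in missing_tools():
        print("not found on PATH, consider: apt-get install %s" % pkg)
```

The `which` parameter exists only so the lookup can be replaced in a test; in normal use call `missing_tools()` with no arguments.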

2. Download Plone 5.1.2, unpack it, and run the base install

https://launchpad.net/plone/5.1/5.1.2/+download/Plone-5.1.2-UnifiedInstaller.tgz

Unpack and check the install options

tar zxvf Plone-5.1.2-UnifiedInstaller.tgz
cd Plone-5.1.2-UnifiedInstaller
./install.sh --help

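debian-7.11.0-amd64 ships Python 2.7.3 (for comparison, at the time of writing stretch has python 2.7.13-2 and sid has python 2.7.14-8). That does not meet the Plone 5.1.2 requirement "Python version must be 2.7.9+", so --build-python must be specified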

./install.sh --build-python --target=/opt/plone zeo
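
The version requirement is a numeric comparison per component, which is easy to get wrong when done on strings ("2.7.14" sorts before "2.7.9" as text). A minimal sketch of the lower-bound check follows; the helper names are hypothetical and this is not the installer's own code, which may apply further constraints.

```python
def parse_version(text):
    """Turn '2.7.3' into (2, 7, 3) so that tuples compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(found, minimum="2.7.9"):
    """Return True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


print(meets_minimum("2.7.3"))   # False: the Python shipped with wheezy
print(meets_minimum("2.7.14"))  # True: the Python the installer builds
```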

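The installer downloads Python-2.7.14.tgz into Plone-5.1.2-UnifiedInstaller/packages and uses it to compile and build the virtualenv environment. (If libreadline-dev was not installed beforehand with apt-get, the following message appears, saying the compiled Python lacks readline support; the install still completes.)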

Warning: This Python does not have readline support.
It may still be usable for Zope, but interacting directly with Python will be painful.

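If the install fails, check the install log. Even a successful install logs lines such as "chmod: 更改“***----***”的权限:不允许的操作" (chmod: changing permissions of "***----***": Operation not permitted).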

Plone-5.1.2-UnifiedInstaller/install.log

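Then wait for the long install to finish; the quality of the network connection decides both whether it completes and how fast. The second half of the install runs buildout, and the files buildout downloads (plus some components unpacked from the installer itself) are stored in the following directory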

/opt/plone/buildout-cache/downloads/dist/

When the install ends it prints the admin user name and password. If you did not write them down, they are also recorded in a file

cat /opt/plone/zeocluster/adminPassword.txt

admin
jkMq3sadkxJm

The output also shows that the install created a group, plone_group, and two users. This matters and will be used later

ZEO & Client Daemons       :plone_daemon
Code Resources & buildout  :plone_buildout
Setting /opt/plone ownership to plone_buildout:plone_group

3. Create the first site

Start the service on Debian as root

cd /opt/plone/zeocluster
bin/plonectl start

From a browser on a client machine, connect to port 8080

http://10.16.97.205:8080

Choose to create a site, which requires logging in as admin. If you keep the default site name Plone, the site URL from then on is

http://10.16.97.205:8080/Plone

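4. Install docsplit with rubygems

collective.documentviewer 5.0.1 depends on docsplit. collective.documentviewer is the Plone binding for the NY Times' Document Viewer, one of the DocumentCloud projects, and docsplit belongs to the same family of projects. docsplit's conversion relies on LibreOffice, which is why the Debian server must be installed with the desktop environment and development tools. The docsplit gem page is: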

https://rubygems.org/gems/docsplit/versions/0.7.6

collective.documentviewer supports only Plone 4 and Plone 5 and does not run on Windows (docsplit does not seem to support Windows); see

https://stackoverflow.com/questions/14543419/can-collective-documentviewer-work-on-windows-2003-server-plone4-2

Install docsplit with gem

gem install docsplit --version=0.7.6

If network problems prevent the install, download the gem file (https://rubygems.org/downloads/docsplit-0.7.6.gem) and install it by hand:

gem install docsplit-0.7.6.gem

On Debian, the original gem file of an installed gem can be found in the following location

/var/lib/gems/1.8/cache/docsplit-0.7.6.gem

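5. Download the tika.cfg that matches tika 1.11

Tika is an Apache Java project that started as a subproject of Apache Lucene. It detects the format and encoding of binary files and extracts their text content (as well as metadata and similar information), and it reportedly runs on Windows too. ftw.tika is the Plone binding for Tika, and full-text search depends on it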

The post where I first saw ftw.tika recommended:
https://stackoverflow.com/questions/23151319/plone-full-text-indexing-excel-files
Project page:
https://github.com/4teamwork/ftw.tika

Download the zip of the master branch, which currently targets tika 1.11. Unpack it and copy only tika.cfg into /opt/plone/zeocluster/, the same directory as buildout.cfg

6. Install Oracle's official JDK 1.8

tika 1.11 requires at least Java 1.7. Installing Ruby and then Java as well, just to run a Python project: will pure Pythonistas look down on this?

Installation steps are omitted here. Check the result with:

java -version

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

7. Edit buildout.cfg

Back the file up before editing. This is a habit worth forming and will not be repeated below

cp /opt/plone/zeocluster/buildout.cfg /opt/plone/zeocluster/buildout.cfg.bk

This part follows the page below. The page may not open directly from where you are, but it does exist; the reason is best left unsaid

http://blog.abdullahsolutions.com/2016/08/installing-ftwtika-in-plone.html

The main content is quoted below, with the collective.documentviewer entries added

I love being able to search in all the documents uploaded into plone. I keep on forgetting that this was an add-on and not natively provided. The latest add-on I tried to enable that feature was ftw.tika.
To install it, first download the tika.cfg file from their github page at https://github.com/4teamwork/ftw.tika. Once that has been downloaded, modify your buildout.cfg with:
############
[buildout]
extends =
       ... 
       tika.cfg

eggs =
       ...
       ftw.tika
       collective.documentviewer

zcml =
       ...
       ftw.tika
       ftw.tika-meta

parts =
       ...
       tika-server-download
       tika-server

[client1]
...
zcml-additional += ${tika:zcml}
eggs += ftw.tika

[client2]
...
zcml-additional += ${tika:zcml}
eggs += ftw.tika

[versions]
collective.documentviewer = 5.0.1

############
Once that is done, run buildout. Then you can start the tika server with "bin/tika-server". Then you can start your plone instance. After that make sure you login and enable the tika add-on in your "site-setup", "add-ons" page.

To generate a patch: diff -uN buildout.cfg buildout.cfg.ok >buildout512.cfg.diff. On a later reinstall, run patch -p0 <buildout512.cfg.diff in the directory that contains buildout.cfg

--- buildout.cfg	2018-08-11 17:16:24.934227831 +0800
+++ buildout.cfg.ok	2018-08-11 17:16:09.310229089 +0800
@@ -38,6 +38,7 @@
 extends =
     base.cfg
     versions.cfg
+    tika.cfg
 #    http://dist.plone.org/release/5.1.2/versions.cfg
 
 # If you change your Plone version, you'll also need to update
@@ -71,6 +72,8 @@
 eggs =
     Plone
     Pillow
+    ftw.tika
+    collective.documentviewer
 
 ############################################
 # ZCML Slugs
@@ -79,7 +82,8 @@
 # use them. This is increasingly rare.
 zcml =
 #    plone.reload
-
+    ftw.tika
+    ftw.tika-meta
 ############################################
 # Development Eggs
 # ----------------
@@ -149,7 +153,8 @@
     unifiedinstaller
     precompiler
     setpermissions
-
+    tika-server-download
+    tika-server
 ############################################
 # Major Parts
 # ----------------------
@@ -167,12 +172,17 @@
 recipe = plone.recipe.zope2instance
 zeo-address = ${zeoserver:zeo-address}
 http-address = 8080
+ftp-address = 8021
+zcml-additional += ${tika:zcml}
+eggs += ftw.tika
 
 [client2]
 <= client_base
 recipe = plone.recipe.zope2instance
 zeo-address = ${zeoserver:zeo-address}
 http-address = 8081
+zcml-additional += ${tika:zcml}
+eggs += ftw.tika
 
 ############################################
 # Versions Specification
@@ -197,3 +207,14 @@
 plone.recipe.unifiedinstaller = 4.3.2
 plone.recipe.command = 1.1
 plone.recipe.precompiler = 0.6
+
+certifi = 2017.11.5
+chardet = 3.0.4
+collective.recipe.scriptgen = 0.2
+ftw.tika = 2.9.0
+hexagonit.recipe.download = 1.7.1
+idna = 2.6
+requests = 2.18.4
+urllib3 = 1.22
+
+collective.documentviewer = 5.0.1
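
Because a forgotten line in buildout.cfg only shows up much later as a missing add-on, a quick check of the edited file can save a buildout run. The sketch below uses Python 3's `configparser`, which is not buildout's own parser and may reject unusual files, so treat a parse error as inconclusive. The function name and the `REQUIRED` table are hypothetical; the entries are the ones added in the diff above.

```python
import configparser

# Entries this walkthrough adds to the [buildout] section.
REQUIRED = {
    ("buildout", "extends"): ["tika.cfg"],
    ("buildout", "eggs"): ["ftw.tika", "collective.documentviewer"],
    ("buildout", "zcml"): ["ftw.tika", "ftw.tika-meta"],
    ("buildout", "parts"): ["tika-server-download", "tika-server"],
}


def missing_entries(cfg_text, required=REQUIRED):
    """Return (section, option, entry) triples that cfg_text lacks."""
    # RawConfigParser leaves ${...} references alone; strict=False tolerates
    # repeated sections or options.
    parser = configparser.RawConfigParser(strict=False)
    parser.read_string(cfg_text)
    missing = []
    for (section, option), entries in required.items():
        present = []
        if parser.has_option(section, option):
            present = parser.get(section, option).split()
        for entry in entries:
            if entry not in present:
                missing.append((section, option, entry))
    return missing
```

A possible use is `missing_entries(open("/opt/plone/zeocluster/buildout.cfg").read())`, where an empty list means every expected entry was found.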

8. Run buildout

Always stop the service before running buildout

cd /opt/plone/zeocluster
bin/plonectl stop

Plone does not allow buildout to run as root; it has to be run from a regular account, via sudo, as the plone_buildout user.

Buildout should not be run while superuser. Doing so allows
untrusted code to be run as root.
Instead, you probably wish to do something like:
    sudo -u plone_buildout bin/buildout

If you have a good reason to bypass this restriction,
remove the buildout.sanitycheck extension from your buildout.

On a freshly installed wheezy, regular users may not yet be allowed to use sudo. Create a configuration file that grants it, assuming the regular account is hero

joe /etc/sudoers.d/hero

It contains a single line

hero    ALL=(ALL:ALL) ALL

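Ideally, first point buildout at a PyPI mirror or even a local PyPI index; with buildouts all over the world hitting the official site, it is never going to be fast. Just add one line, index=http://mirrors.163.com/pypi/simple/, to the [buildout] section of base.cfg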

joe /opt/plone/zeocluster/base.cfg

[buildout]
...
...
index=http://mirrors.163.com/pypi/simple/

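If you want to reinstall from scratch, back up Plone-5.1.2-UnifiedInstaller/packages/Python-2.7.14.tgz used by the earlier install, the modified buildout.cfg, and the contents of /opt/plone/buildout-cache/downloads/dist/. Copy them back into the same locations of the new install at the matching points in the process to save time (mind file ownership and permissions, discussed below). Start buildout with the following command; this is where the nightmare begins...

hero@mydebian205:/opt/plone/zeocluster$ sudo -u plone_buildout bin/buildout -vvv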

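buildout can fail in all sorts of unpredictable ways: it aborts unexpectedly; it hangs, neither downloading nor compiling; it appears to finish but the service will not start; the service starts but the site URL cannot be reached; the site is reachable but the add-on does not appear in the control panel or has no effect.

The most common problem is a stalled download. The fix is to press Ctrl+C to interrupt the process, note the file and version it needs, download that file directly from https://pypi.org into the following directory, and set its ownership and permissions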

ls /opt/plone/buildout-cache/downloads/dist/
chown -R plone_buildout:plone_group /opt/plone/buildout-cache/downloads/dist/
chmod 664 /opt/plone/buildout-cache/downloads/dist/*
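
The permission step can also be scripted so that only regular files are touched; a directory needs its execute bit to be entered, so applying 664 to the directory itself would lock plone_buildout out of it. A minimal sketch, assuming Python 3 and a hypothetical helper name; ownership still has to be set separately with chown as root.

```python
import os
import stat


def set_file_modes(directory, mode=0o664):
    """chmod regular files under `directory` to `mode`.

    Directories are left alone so they keep the execute bit that is
    needed to list and enter them. Returns the paths that were changed.
    """
    changed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # do not follow or alter symlinks
            if stat.S_IMODE(os.stat(path).st_mode) != mode:
                os.chmod(path, mode)
                changed.append(path)
    return changed
```

For example, `set_file_modes("/opt/plone/buildout-cache/downloads/dist")` run as root after copying downloaded archives into place.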

Then run buildout again

hero@mydebian205:/opt/plone/zeocluster$ sudo -u plone_buildout bin/buildout -vvv

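Again and again, over and over, until you find yourself pressing your palms together and praying before every Enter; only then have you truly got into character. Fortunately this article adds just two add-ons, ftw.tika and collective.documentviewer, so the pit is not that deep and there is still hope of success before you lose your mind. When buildout completes successfully, it prints the names and versions of the picked packages.

To speed things up I edited tika.cfg: I downloaded the two largest and slowest files on Windows with Xunlei (迅雷), uploaded them to /opt/share, and changed http:// to file:///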

#url = http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.11/tika-app-1.11.jar
url = file:///opt/share/tika-app-1.11.jar

#url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.11/tika-server-1.11.jar
url = file:///opt/share/tika-server-1.11.jar
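
The three slashes in file:/// are an empty host followed by an absolute path. If in doubt, the URL for a local file can be derived rather than typed by hand; a small sketch using Python 3's `pathlib` on a POSIX system (the helper name is hypothetical):

```python
from pathlib import Path


def local_file_url(path):
    """Return the file:// URL for an absolute path (raises for relative ones)."""
    return Path(path).as_uri()


print(local_file_url("/opt/share/tika-app-1.11.jar"))
# file:///opt/share/tika-app-1.11.jar
```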

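9. Install and configure the add-ons in the site

Start the service with root privileges, open the site in a browser on the client machine, and log in as admin (nothing in the site can be changed without logging in; this will not be repeated below)

admin > Site Setup (网站设置) > Add-ons (附加组件)

Available add-ons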
Document Viewer
Installs the collective.documentviewer package – (collective.documentviewer 5.0.1)
Warning: this add-on cannot be uninstalled!
ftw.tika
Apache Tika integration for Plone – (ftw.tika 2.9.0)

Install only ftw.tika for now, then stop the service

cd /opt/plone/zeocluster
bin/plonectl stop

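Open two terminals on Debian and, with root privileges, run tika-server in one and the Plone service in the other. The tika-server terminal keeps scrolling output (on the server itself, http://localhost:9998 shows a Tika page), while the Plone command returns to the prompt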

cd /opt/plone/zeocluster
bin/tika-server
cd /opt/plone/zeocluster
bin/plonectl start
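
Before testing in the browser, it is worth confirming that both services are actually listening. The sketch below only checks that a TCP connection succeeds; it says nothing about whether the service behind the port is healthy. The function name is hypothetical, and ports 9998 (tika-server) and 8080 (client1) are the ones used in this walkthrough.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example, run on the server:
#   port_is_open("localhost", 9998)   # tika-server
#   port_is_open("localhost", 8080)   # Plone client1
```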

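Add new > File

Upload a few doc, docx and pdf files (text-based PDFs, not image-based) with Chinese file names and Chinese content, and test whether full-text search works. Note that this means searching the file content: searching by file name is built into Plone and does not need tika. In fact, if wv is installed on Debian (apt-get install wv), Plone without tika already supports Chinese full-text search of doc files, but full-text search of docx is contributed by tika. (tika is not the only add-on that provides full-text search; it just seems to have the best prospects.) Full-text search only locates the file; it does not preview the actual content.

Once full-text search works, move on to document preview. Without an add-on, Plone can only offer files for download, not preview their content.

admin > Site Setup (网站设置) > Add-ons (附加组件)

Install Document Viewer

Unlike tika, this add-on needs its own configuration

admin > Site Setup (网站设置) > Add-on Configuration (附加组件配置) > Document Viewer settings (文档管理系统设置) > auto layout by file type (按文件类型自动布局)

Only PDF is selected by default; additionally tick Word Document and save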

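10. Work around the collective.documentviewer bug

After uploading a docx file with Chinese content, collective.documentviewer had no effect although full-text search worked. The page showed an error message, and clicking the Show Document Viewer Conversion Error link did nothing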

Info There was an error trying to convert the document. Maybe the document is encrypted, corrupt or malformed? Check log for details.
测试.docx
Show Document Viewer Conversion Error

Look at the end of /opt/plone/zeocluster/var/client1/event.log in a text editor

Traceback (most recent call last):
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 598, in __call__
    pages = self.run_conversion()
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 428, in run_conversion
    return docsplit.convert(self.storage_dir, **args)
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 324, in convert
    self.convert_to_pdf(path, filename, output_dir)
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 280, in convert_to_pdf
    self._run_command(cmd)
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 126, in _run_command
    raise Exception(error)
Exception: Command
/usr/local/bin/docsplit pdf /tmp/tmpdfnKDQ/dump.docx --output /tmp/tmpdfnKDQ
finished with return code
1
and output:

terminate called after throwing an instance of 'com::sun::star::uno::RuntimeException'
Aborted
/var/lib/gems/1.8/gems/docsplit-0.7.6/lib/docsplit/pdf_extractor.rb:33:in `libre_office?': undefined method `match' for nil:NilClass (NoMethodError)
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/lib/docsplit/pdf_extractor.rb:128:in `extract'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/lib/docsplit/pdf_extractor.rb:120:in `each'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/lib/docsplit/pdf_extractor.rb:120:in `extract'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/lib/docsplit.rb:65:in `extract_pdf'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/bin/../lib/docsplit/command_line.rb:47:in `run'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/bin/../lib/docsplit/command_line.rb:37:in `initialize'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/bin/docsplit:5:in `new'
        from /var/lib/gems/1.8/gems/docsplit-0.7.6/bin/docsplit:5
        from /usr/local/bin/docsplit:23:in `load'
        from /usr/local/bin/docsplit:23

------
2018-08-10T01:42:44 INFO ftw.tika Converting document with tika JAXRS server: 测试.docx

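A web search turned up two pages that seemed to offer a fix.

The first,

https://github.com/collective/collective.documentviewer/issues/11

suggests modifying a docsplit file, /var/lib/gems/1.8/gems/docsplit-0.7.6/lib/docsplit/pdf_extractor.rb

The second,

https://pypi.org/project/collective.documentviewer/

suggests changing the permissions of /tmp and /var/tmp to add the sticky bit.

In my tests neither solved the problem. Looking further at the error log, the command that failed is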

/usr/local/bin/docsplit pdf /tmp/tmpdfnKDQ/dump.docx --output /tmp/tmpdfnKDQ
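
When event.log is long, the failing commands can be pulled out mechanically. The sketch below relies only on the message layout visible in the traceback above ("Command", the command line, "finished with return code", the number); the function name is hypothetical.

```python
import re

# Matches the message raised by _run_command in convert.py, as seen in the log.
FAILED_COMMAND = re.compile(
    r"Command\n(?P<cmd>.+)\nfinished with return code\n(?P<code>\d+)")


def failed_commands(log_text):
    """Return (command line, return code) pairs found in log_text."""
    return [(m.group("cmd"), int(m.group("code")))
            for m in FAILED_COMMAND.finditer(log_text)]
```

Re-running the extracted command by hand, as done next, reproduces the failure outside Plone.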

Check the corresponding directory

# ls -l /tmp/tmpdfnKDQ
-rw------- 1 plone_daemon plone_group  41139  8月 10 01:42 dump.docx

Run the command from the error log as root

# /usr/local/bin/docsplit pdf /tmp/tmpdfnKDQ/dump.docx --output /tmp/tmpdfnKDQ

Surprisingly, no error this time

Check the directory again as root

# ls -l /tmp/tmpdfnKDQ
-rw------- 1 plone_daemon plone_group  41139  8月 10 01:42 dump.docx
-rw-r--r-- 1 root         root        143352  8月 10 01:51 dump.pdf
drwxr-xr-x 3 root         root          4096  8月 10 01:51 libreoffice

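The PDF was converted successfully and opens fine in a PDF reader. Since root can do it and plone_daemon cannot, it has to be a permission problem. That rules out pdf_extractor.rb: that issue is about LibreOffice not being detected, whose symptom would be a failure for root and every other user alike. Adding the sticky bit does address permissions, but it is easy to show that it does not solve the problem either.

My thinking was: since it works as root, let the code call sudo docsplit instead of docsplit. The catch is that sudo normally prompts for a password, which only works interactively, so the code would need some extra workaround. I then found that sudoers can be configured to skip the password. This certainly has security implications; since this is only a prototype test, the goal is to get the feature working first and look for a better solution later. Create a new configuration file

joe /etc/sudoers.d/plone_daemon

It contains a single line; why /bin/rm is included is explained further down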

plone_daemon ALL = NOPASSWD:/usr/local/bin/docsplit,/bin/rm

Open convert.py (joe /opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py) and find line 271

    def convert_to_pdf(self, filepath, filename, output_dir):
        # get ext from filename
        ext = os.path.splitext(os.path.normcase(filename))[1][1:]
        inputfilepath = os.path.join(output_dir, 'dump.%s' % ext)
        shutil.move(filepath, inputfilepath)
        orig_files = set(os.listdir(output_dir))
        cmd = [
            self.binary, 'pdf', inputfilepath,
            '--output', output_dir]
        self._run_command(cmd)

Insert '/usr/bin/sudo', in front of self.binary

    def convert_to_pdf(self, filepath, filename, output_dir):
        # get ext from filename
        ext = os.path.splitext(os.path.normcase(filename))[1][1:]
        inputfilepath = os.path.join(output_dir, 'dump.%s' % ext)
        shutil.move(filepath, inputfilepath)
        orig_files = set(os.listdir(output_dir))
        cmd = [
            '/usr/bin/sudo', self.binary, 'pdf', inputfilepath,
            '--output', output_dir]
        self._run_command(cmd)

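Restart the service and test again. It still fails, and the error shown in the UI carries no useful information, so back to the log: the end of /opt/plone/zeocluster/var/client1/event.log now reads differently, a good sign. The log shows that the "sudo docsplit pdf" step completes and that the failure is in a later step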

------
2018-08-10T22:02:54 INFO collective.documentviewer Running command /usr/bin/sudo /usr/local/bin/docsplit pdf /tmp/tmpBsifva/dump.docx --output /tmp/tmpBsifva
------
2018-08-10T22:03:07 INFO collective.documentviewer Finished Running Command /usr/bin/sudo /usr/local/bin/docsplit pdf /tmp/tmpBsifva/dump.docx --output /tmp/tmpBsifva
------
2018-08-10T22:03:07 ERROR collective.documentviewer Error converting PDF:

Traceback (most recent call last):
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 598, in __call__
    pages = self.run_conversion()
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 428, in run_conversion
    return docsplit.convert(self.storage_dir, **args)
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 324, in convert
    self.convert_to_pdf(path, filename, output_dir)
  File "/opt/plone/buildout-cache/eggs/collective.documentviewer-5.0.1-py2.7.egg/collective/documentviewer/convert.py", line 289, in convert_to_pdf
    shutil.rmtree(libreOfficePath)
  File "/opt/plone/Python-2.7/lib/python2.7/shutil.py", line 261, in rmtree
    rmtree(fullname, ignore_errors, onerror)
  File "/opt/plone/Python-2.7/lib/python2.7/shutil.py", line 253, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/opt/plone/Python-2.7/lib/python2.7/shutil.py", line 251, in rmtree
    names = os.listdir(path)
OSError: [Errno 13] Permission denied: '/tmp/tmpBsifva/libreoffice/3'
------
2018-08-10T22:03:07 INFO ftw.tika Converting document with tika JAXRS server: 测试.docx

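The failing call is shutil.rmtree(libreOfficePath), with a permission denied error. Besides the newly generated PDF, the temporary directory /tmp/tmpBsifva contains a libreoffice directory. Because docsplit now runs through sudo, that directory is created by root, and the plone_daemon user is not allowed to delete it, hence the error. The fix is to delete the directory through sudo as well, calling /bin/rm instead of shutil.rmtree; /etc/sudoers.d/plone_daemon already lets plone_daemon run /bin/rm through sudo without a password. Following the same principle, add '/usr/bin/sudo' in front of every docsplit call, that is, on the lines for the four subcommands "images", "text", "length" and 'pdf'. Then change both shutil.rmtree(libreOfficePath) and shutil.rmtree(storage_dir) into system calls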

os.system('/usr/bin/sudo /bin/rm -fr %s' % (libreOfficePath,))
os.system('/usr/bin/sudo /bin/rm -fr %s' % (storage_dir,))
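
The os.system form passes the path through a shell, so a path with spaces or shell metacharacters would be misread. The temporary directories here have safe names, so this is hardening rather than a required fix. A sketch of the same removal with an argument list and no shell follows; the helper names are hypothetical and this is not what the patch in this article uses.

```python
import subprocess


def sudo_rm_argv(path):
    """Build the argument list for removing `path` via sudo, without a shell."""
    # "--" ends option parsing, so a path starting with "-" is not an option.
    return ["/usr/bin/sudo", "/bin/rm", "-rf", "--", path]


def sudo_rm(path, call=subprocess.call):
    """Run the removal and return the exit status (0 on success)."""
    return call(sudo_rm_argv(path))
```

Inside convert.py the call would then read `sudo_rm(libreOfficePath)`; the `call` parameter exists only so the command can be replaced in a test.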

During testing, the text after "finished with return code ...... and output:" also triggered an encoding error at one point. Find line 106

    def _run_command(self, cmd):
        if isinstance(cmd, basestring):
            cmd = cmd.split()
        cmdformatted = ' '.join(cmd)
        logger.info("Running command %s" % cmdformatted)
        process = subprocess.Popen(
            cmd, stdout=subprocess.PIPE,
            stderr=subprocess.PIPE, close_fds=self.close_fds)
        output, error = process.communicate()
        process.stdout.close()
        process.stderr.close()
        if process.returncode != 0:
            error = """Command
%s
finished with return code
%i
and output:
%s
%s""" % (cmdformatted, process.returncode, output, error)
            logger.info(error)
            raise Exception(error)
        logger.info("Finished Running Command %s" % cmdformatted)
        return output

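An error while reporting an error: how bizarre. I had no time to chase it, and fixing it would not have solved the original failure anyway, so I simply deleted output, error and the two corresponding %s.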
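
I did not track down the exact cause. One plausible explanation, stated here as an assumption, is that Python 2 fails when byte strings containing non-ASCII output are mixed with unicode text in one format operation. Under that assumption the diagnostics could be kept by decoding the captured output leniently before formatting. A sketch with hypothetical helper names:

```python
def to_text(data, encoding="utf-8"):
    """Return data as text, replacing undecodable bytes instead of raising."""
    if isinstance(data, bytes):
        return data.decode(encoding, "replace")
    return data


def format_failure(cmd, returncode, output, error):
    """Build the same message as _run_command, but from decoded text only."""
    return u"Command\n%s\nfinished with return code\n%i\nand output:\n%s\n%s" % (
        to_text(cmd), returncode, to_text(output), to_text(error))
```

With this in place the two `%s` placeholders would not need to be removed, so the log would still show what docsplit printed.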

To generate a patch: diff -uN convert.py convert.py.ok >convert501.py.diff. On a later reinstall, run patch -p0 <convert501.py.diff in the directory that contains convert.py

--- convert.py	2018-08-10 23:14:20.068147869 +0800
+++ convert.py.ok	2018-08-10 23:02:37.864147759 +0800
@@ -120,8 +120,7 @@
 finished with return code
 %i
 and output:
-%s
-%s""" % (cmdformatted, process.returncode, output, error)
+""" % (cmdformatted, process.returncode,)
             logger.info(error)
             raise Exception(error)
         logger.info("Finished Running Command %s" % cmdformatted)
@@ -224,7 +223,7 @@
         # docsplit images pdf.pdf --size 700x,300x,50x
         # --format gif --output
         cmd = [
-            self.binary, "images", filepath,
+            '/usr/bin/sudo', self.binary, "images", filepath,
             '--language', lang,
             '--size', ','.join([str(s[1]) + 'x' for s in sizes]),
             '--format', format,
@@ -251,7 +250,7 @@
         output_dir = os.path.join(output_dir, TEXT_REL_PATHNAME)
         ocr = not ocr and 'no-' or ''
         cmd = [
-            self.binary, "text", filepath,
+            '/usr/bin/sudo', self.binary, "text", filepath,
             '--language', lang,
             '--%socr' % ocr,
             '--pages', 'all',
@@ -265,7 +264,7 @@
         self._run_command(cmd)
 
     def get_num_pages(self, filepath):
-        cmd = [self.binary, "length", filepath]
+        cmd = ['/usr/bin/sudo', self.binary, "length", filepath]
         return int(self._run_command(cmd).strip())
 
     def convert_to_pdf(self, filepath, filename, output_dir):
@@ -275,7 +274,7 @@
         shutil.move(filepath, inputfilepath)
         orig_files = set(os.listdir(output_dir))
         cmd = [
-            self.binary, 'pdf', inputfilepath,
+            '/usr/bin/sudo', self.binary, 'pdf', inputfilepath,
             '--output', output_dir]
         self._run_command(cmd)
 
@@ -286,7 +285,9 @@
         # folder next to the generated PDF, removes it!
         libreOfficePath = os.path.join(output_dir, 'libreoffice')
         if os.path.exists(libreOfficePath):
-            shutil.rmtree(libreOfficePath)
+            os.system('/usr/bin/sudo /bin/rm -fr %s' % (libreOfficePath,))
+            #shutil.rmtree(libreOfficePath)
+            pass
 
         # move the file to the right location now
         files = set(os.listdir(output_dir))
@@ -481,7 +482,8 @@
                     files[filename] = saveFileToBlob(filepath)
 
             settings.blob_files = files
-            shutil.rmtree(storage_dir)
+            os.system('/usr/bin/sudo /bin/rm -fr %s' % (storage_dir,))
+            #shutil.rmtree(storage_dir)
 
             # check for old storage to remove... Just in case.
             old_storage_dir = os.path.join(gsettings.storage_location,

After restarting Plone so the changes take effect, an uploaded docx previews successfully. End of the prototype test.

=================================

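Other features and requirements worth exploring

1. Asynchronous conversion

Converting a file for preview at upload time is slow. collective.documentviewer supports asynchronous conversion with either plone.app.async or collective.celery, so the upload returns quickly while the actual conversion runs in the background, and progress can be monitored. The articles online are all about Plone 4, though; I found only one about using collective.celery with Plone 5.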

https://www.codesyntax.com/en/blog/collective-documentviewer-with-redis-backed-celery-tasks-on-plone-4-and-5

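The two links in that article pointing to https://gist.github.com may not open directly, but the pages do exist; the reason is again best left unsaid

2. IDs for Chinese folder (path) names and Chinese file names

The ID of an uploaded file is reduced to ASCII letters and digits, so the original file name is lost on download. For a document management system the file name is important information too. Ideally one could upload an entire directory tree with its files to Plone over FTP, have Plone provide full-text search and preview for the formats it recognizes without damaging the original information, and, when needed, download everything back from Plone over FTP unchanged.

3. Context excerpts in search results and folder listings

Every search engine has this feature: each entry in search results and folder listings should show an excerpt of its text. Ideally, search results show the text surrounding the keyword, with the keyword highlighted. This matters a great deal for helping users find the entry they need quickly, without opening the full preview of each one.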

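4. Permissions and workflow

Anonymous users should be barred from viewing any content, and some content should be kept confidential even from logged-in users and excluded from their search results.

5. Integration with apache or nginx

When creating a new Plone site there is an option that seems related to Plone serving file system resources to users directly, but the more common setup is integration with apache or nginx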

Static resource storage
A folder for storing and serving static resource files

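First, very large files should be stored on the file system, or are already there, and integrating apache or nginx has a performance advantage. Integration also covers unifying login authentication

6. Embedded online video player, image thumbnails, code highlighting

Video, images and code are important document types too.

Video should support local files as well as resources served by remote video servers.

There is a project online, plumi, a video sharing project based on Plone 4.2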

https://plumi.org
https://github.com/plumi/plumi.app/

The player this project uses is

https://pypi.org/project/collective.flowplayer/

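The add-on's home page says it supports Plone 4, but embedding an online video player should not be hard to do.

7. Pin all package versions

If versions are not all pinned, a future buildout run may fetch the newest releases, which may no longer support Plone 5. To prevent that, pin every package version in the [versions] section of buildout.cfg.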
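
Since buildout prints the picked package names and versions when it finishes, those can be collected once and written back as pins. A small sketch that renders a mapping as a [versions] section; the function name is hypothetical and the example versions are the ones used earlier in this article.

```python
def versions_section(pins):
    """Render a {package: version} mapping as a buildout [versions] section."""
    lines = ["[versions]"]
    for name in sorted(pins, key=str.lower):
        lines.append("%s = %s" % (name, pins[name]))
    return "\n".join(lines) + "\n"


print(versions_section({
    "ftw.tika": "2.9.0",
    "collective.documentviewer": "5.0.1",
    "requests": "2.18.4",
}))
```

The output can be merged into the existing [versions] section of buildout.cfg; if a package is already pinned there, keep a single line for it.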
