Date: October 13, 2016
Title: pylucene, word segmentation, and language encoding issues
No.: 3
I. pylucene
- Installing pylucene and its prerequisites
-
- Install Java (JDK)
-
- sudo apt-get install default-jdk
- Run javac to verify the installation
- Install python-dev
-
- sudo apt-get install python-dev
- Install ant
-
- sudo apt-get install ant
- Install jcc (first check that g++ and gcc are installed)
-
- Method 1: sudo easy_install jcc (unreliable)
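- Method 2:
svn co http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc jcc
(change into the jcc source directory)
/* Edit setup.py in the jcc directory and change the value of the JDK variable to the JDK path on your system.
The defaults look like this: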
JDK = {
'darwin': '/System/Library/Frameworks/JavaVM.framework/Versions/Current',
'ipod': '/usr/include/gcc',
'linux2': '/usr/lib/jvm/java-6-openjdk',
'sunos5': '/usr/jdk/instances/jdk1.6.0',
'win32': 'o:/Java/jdk1.6.0_02',
}*/
python setup.py build
sudo python setup.py install
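One way to find a value for the JDK entry in setup.py is to resolve where javac actually lives. The following is a minimal Python sketch, not part of jcc: the helper name guess_jdk_home is hypothetical, and it assumes the usual layout in which javac sits at <JDK>/bin/javac.

```python
import os
import sys

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7


def guess_jdk_home(javac_path):
    """Return the directory two levels above the resolved javac binary.

    Hypothetical helper: assumes the layout <JDK>/bin/javac.
    """
    real = os.path.realpath(javac_path)  # follow symlinks such as /usr/bin/javac
    return os.path.dirname(os.path.dirname(real))


if __name__ == "__main__":
    javac = which("javac")
    if javac is None:
        print("javac not found on PATH; install a JDK first")
    else:
        # sys.platform is the key used in the JDK dict ('linux2' on Python 2.7)
        print("platform key: %s" % sys.platform)
        print("candidate JDK path: %s" % guess_jdk_home(javac))
```

Check that the printed directory really contains an include/ folder before copying it into setup.py, since some distributions lay out the JDK differently.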
-
- Install pylucene:
-
- First download the package from the official site
- Then extract it
- pushd jcc
- <edit setup.py to match your environment>
- python setup.py build
- sudo python setup.py install
- popd
- <edit Makefile to match your environment>
- make
- make test (look for failures)
- sudo make install
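After the install step, a quick import check can confirm that the build is usable. This is a minimal sketch assuming only lucene.initVM() and lucene.VERSION from pylucene; the function name check_pylucene is hypothetical.

```python
def check_pylucene():
    """Return the pylucene version string, or None if the import fails."""
    try:
        import lucene
    except ImportError as exc:
        # A shared library that cannot be found (for example libjvm.so)
        # typically surfaces here as an ImportError as well.
        print("pylucene is not importable: %s" % exc)
        return None
    lucene.initVM()  # start the embedded JVM before using any Lucene class
    return lucene.VERSION


if __name__ == "__main__":
    version = check_pylucene()
    if version is not None:
        print("pylucene %s is installed" % version)
```

If the import fails with a missing-library message, the library search path is the first thing to check.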
-
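- Alternatively, use the Anaconda platform:
- First install Anaconda on your virtual machine. Download link: https://www.continuum.io/downloads#linux (be sure to download the py27 version)
- cd to the download directory and run in a terminal:
- bash Anaconda2-4.2.0-Linux-x86_64.sh
to install Anaconda.
After installation, run which python to check whether the python path now points to Anaconda's python; if it does not, add it to PATH manually.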
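- Then, in a terminal, run:
conda install -c kalefranz pylucene=4.9.0
Anaconda then installs lucene and its prerequisites (including the JDK libraries and jcc) automatically, so there is no need to install or configure Java and jcc separately.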
- At this point, however, import jcc in Python fails with a missing-library error, so the library paths must be specified in the terminal:
export PREFIX=/path/to/your/anaconda/
export LD_LIBRARY_PATH=$PREFIX/lib:$PREFIX/jre/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$PREFIX/jre/lib/amd64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$PREFIX/jre/lib/amd64/server:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$PREFIX/lib/python2.7/site-packages:$LD_LIBRARY_PATH
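The export lines above prepend five directories to LD_LIBRARY_PATH in sequence. As a minimal sketch of the resulting search order, the Python below builds the same value in one step. The function name build_ld_library_path and the placeholder prefix are hypothetical; the sub-directories are copied from the export lines.

```python
import os

# Sub-directories from the export lines, in their final search order
# (the last export ends up first because each one prepends).
SUBDIRS = [
    "lib/python2.7/site-packages",
    "jre/lib/amd64/server",
    "jre/lib/amd64",
    "lib",
    "jre/lib",
]


def build_ld_library_path(prefix, current=""):
    """Join the Anaconda library directories in front of the current value."""
    dirs = [os.path.join(prefix, sub) for sub in SUBDIRS]
    if current:
        dirs.append(current)  # keep existing entries last, skip if empty
    return ":".join(dirs)


if __name__ == "__main__":
    # Placeholder path: replace with the real Anaconda install directory.
    prefix = os.environ.get("PREFIX", "/path/to/your/anaconda")
    value = build_ld_library_path(prefix, os.environ.get("LD_LIBRARY_PATH", ""))
    print("export LD_LIBRARY_PATH=%s" % value)
```

The sketch prints an export line instead of modifying os.environ because the dynamic loader reads LD_LIBRARY_PATH when a process starts, so the variable has to be set in the shell before Python is launched. Unlike the shell version, it leaves no trailing colon when the variable was previously empty.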
-
- NOTE: mind the syntax differences between pylucene3 and pylucene4
- How lucene works: see http://www.cnblogs.com/forfuture1978/archive/2009/12/14/1623594.html
- pylucene code walkthrough:
#!/usr/bin/env python
#IndexFiles.py
INDEX_DIR = "IndexFiles.index"
import sys, os, lucene, threading, time
from datetime import datetime
# pylucene 4.x requires imports in this form (full Java package paths)
from java.io import File
from org.apache.lucene.analysis.miscellaneous import LimitTokenCountAnalyzer
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import FieldInfo, IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.analysis.core import SimpleAnalyzer
from org.apache.lucene.util import Version
"""
This class is loosely based on the Lucene (java implementation) demo class