Search engines: building an index and searching with pyes, the Python client for elasticsearch

iteye_12675

Tags: big data, python, json

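Host environment: Ubuntu 13.04

Python version: 2.7.4

When reposting, please credit: http://blog.yanming8.cn/archives/118

Official site: http://www.elasticsearch.com/

Chinese site: http://es-cn.medcl.net/

The following introduction is quoted from the Chinese site:

So, you have built a website or an application, and you will probably need to add search to it (because it is that essential). In practice, getting search to work is hard. We want search that is fast and also easy to install (ideally painless), with a schema that is completely free (schema free), that can index data as JSON over HTTP, with a server that is always available (HA, high availability, which is non-negotiable), that scales from one machine to thousands, that searches in real time, that is simple to use and supports multi-tenancy. We need a complete solution, and one that is built for the cloud.
"Search made easy" is our manifesto, "and cool, bonsai cool".
elasticsearch aims to solve all of the problems above and more. It is an open source (Apache 2 license), distributed, RESTful search engine built on top of Apache Lucene.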

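1. Installing the distributed server:

First download a package from http://www.elasticsearch.org/download/ and pick a suitable version to install. Here I downloaded the DEB package for Ubuntu and installed it directly with the dpkg command. Once it is installed, run

    sudo service elasticsearch start

to start the service.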
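
Once the service is started, it can be useful to confirm that the node actually answers on port 9200 before going further. The following is a minimal sketch, not part of the original article: it assumes the node listens on `127.0.0.1:9200` (the address used later in this post) and that the root URL returns a JSON document containing `name` and `version.number`. The helper names `parse_node_info` and `fetch_node_info` are made up for illustration.

```python
import json

try:  # Python 3
    from urllib.request import urlopen
except ImportError:  # Python 2, the version used in this article
    from urllib2 import urlopen


def parse_node_info(body):
    """Return (node_name, version_number) from the JSON served at the root URL."""
    info = json.loads(body)
    return info.get("name"), info.get("version", {}).get("number")


def fetch_node_info(base_url="http://127.0.0.1:9200", timeout=3.5):
    """GET the root endpoint of a running node and parse the reply.

    base_url is an assumption: it matches the address used in this article.
    """
    response = urlopen(base_url + "/", timeout=timeout)
    try:
        return parse_node_info(response.read().decode("utf-8"))
    finally:
        response.close()
```

Calling `fetch_node_info()` while the service is running should return the node name and version; a connection error suggests the service did not start.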

2. Installing the pyes client

Use the command

`pip install pyes`

to install the Python client library for elasticsearch.

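3. Installing the Chinese word segmentation component for elasticsearch

Download the Chinese word segmentation component directly from https://github.com/medcl/elasticsearch-rtf/blob/master/elasticsearch/plugins/analysis-ik/elasticsearch-analysis-ik-1.2.2.jar

Then move it into the elasticsearch installation directory /usr/share/elasticsearch/analysis-ik/,

edit the configuration file /etc/elasticsearch/elasticsearch.yml

and set the plugin path

    path.plugins: /usr/share/elasticsearch/plugins

then add the configuration for the word segmentation component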

 
    index:
      analysis:
        analyzer:
          ik:
            alias: [ik_analyzer]
            type: org.elasticsearch.index.analysis.IkAnalyzerProvider

Finally, download the dictionary used by the IK analyzer

    cd /etc/elasticsearch
    wget http://github.com/downloads/medcl/elasticsearch-analysis-ik/ik.zip --no-check-certificate
    unzip ik.zip
    rm ik.zip

Restart the elasticsearch service and the setup is complete.
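
After the restart, one way to check that the `ik` analyzer is loaded is to send a piece of text to the `_analyze` REST endpoint and look at the tokens that come back. The sketch below is an illustration only and rests on several assumptions: the node is at `127.0.0.1:9200`, the index named in the URL already exists (`txtfiles` is the index created in the next step), and the node accepts the raw text as the request body, as older elasticsearch releases did. The helper names are hypothetical.

```python
import json

try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen
except ImportError:  # Python 2, the version used in this article
    from urllib import urlencode
    from urllib2 import Request, urlopen


def analyze_url(base_url, index, analyzer):
    """Build the URL of the _analyze endpoint for one index and analyzer."""
    query = urlencode({"analyzer": analyzer})
    return "%s/%s/_analyze?%s" % (base_url.rstrip("/"), index, query)


def tokens_from_response(body):
    """Extract the token strings from the JSON reply of _analyze."""
    return [item["token"] for item in json.loads(body).get("tokens", [])]


def analyze(text, base_url="http://127.0.0.1:9200", index="txtfiles",
            analyzer="ik", timeout=3.5):
    """Send unicode text to the node and return the tokens it produced.

    Assumption: the raw text is accepted as the request body.
    """
    request = Request(analyze_url(base_url, index, analyzer),
                      data=text.encode("utf-8"))
    response = urlopen(request, timeout=timeout)
    try:
        return tokens_from_response(response.read().decode("utf-8"))
    finally:
        response.close()
```

If the analyzer is configured correctly, `analyze(u'世界末日')` should return the text split into words rather than into single characters.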

4. Building the index

 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import io
    import os

    from pyes import *

    INDEX_NAME = 'txtfiles'


    class IndexFiles(object):
        def __init__(self, root):
            conn = ES('127.0.0.1:9200', timeout=3.5)  # connect to ES
            try:
                conn.delete_index(INDEX_NAME)  # drop the index left by an earlier run
            except Exception:
                pass  # the index did not exist yet
            conn.create_index(INDEX_NAME)  # create a new index

            # Define how each field is indexed and stored
            mapping = {
                u'content': {'boost': 1.0,
                             'index': 'analyzed',
                             'store': 'yes',
                             'type': u'string',
                             "indexAnalyzer": "ik",
                             "searchAnalyzer": "ik",
                             "term_vector": "with_positions_offsets"},
                u'name': {'boost': 1.0,
                          'index': 'analyzed',
                          'store': 'yes',
                          'type': u'string',
                          "indexAnalyzer": "ik",
                          "searchAnalyzer": "ik",
                          "term_vector": "with_positions_offsets"},
                u'dirpath': {'boost': 1.0,
                             'index': 'analyzed',
                             'store': 'yes',
                             'type': u'string',
                             "indexAnalyzer": "ik",
                             "searchAnalyzer": "ik",
                             "term_vector": "with_positions_offsets"},
            }

            # Register the mapping for the document type test-type
            conn.put_mapping("test-type", {'properties': mapping}, [INDEX_NAME])

            self.addIndex(conn, root)

            conn.default_indices = [INDEX_NAME]  # set the default index
            conn.refresh()  # refresh so the newly inserted documents are searchable

        def addIndex(self, conn, root):
            print(root)
            # Walk the tree and index every non-empty .txt file
            for dirpath, dirnames, filenames in os.walk(root):
                for filename in filenames:
                    if not filename.endswith('.txt'):
                        continue
                    print("Indexing file %s" % filename)
                    try:
                        path = os.path.join(dirpath, filename)
                        # io.open decodes the file as UTF-8 on read
                        with io.open(path, encoding='utf-8') as f:
                            contents = f.read()
                        if len(contents) > 0:
                            conn.index({'name': filename, 'dirpath': dirpath, 'content': contents},
                                       INDEX_NAME, 'test-type')
                        else:
                            print('no contents in file %s' % path)
                    except Exception as e:
                        print(e)


    if __name__ == '__main__':
        IndexFiles('./txtfiles')
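
The three field definitions in the mapping are identical, and the file walking can be separated from the calls to elasticsearch. The sketch below shows one way to factor both out using only the standard library, so it can be tried without a running server. It is not the author's code: `text_field`, `build_mapping` and `iter_txt_documents` are hypothetical helper names, and the field settings are copied from the listing above.

```python
import io
import os


def text_field(analyzer="ik"):
    """Return the mapping of one analyzed, stored string field."""
    return {
        "boost": 1.0,
        "index": "analyzed",
        "store": "yes",
        "type": "string",
        "indexAnalyzer": analyzer,
        "searchAnalyzer": analyzer,
        "term_vector": "with_positions_offsets",
    }


def build_mapping(field_names=("content", "name", "dirpath")):
    """Build the 'properties' dictionary with one separate entry per field."""
    return dict((name, text_field()) for name in field_names)


def iter_txt_documents(root):
    """Yield one document dict per non-empty UTF-8 .txt file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(dirpath, filename)
            with io.open(path, encoding="utf-8") as handle:
                contents = handle.read()
            if contents:  # skip empty files, as the original script does
                yield {"name": filename, "dirpath": dirpath, "content": contents}
```

With these helpers, the indexing step reduces to `conn.put_mapping("test-type", {'properties': build_mapping()}, [INDEX_NAME])` followed by a loop that calls `conn.index(doc, INDEX_NAME, 'test-type')` for each `doc` in `iter_txt_documents(root)`.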

5. Searching with highlighted results

 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    from pyes import *

    conn = ES('127.0.0.1:9200', timeout=3.5)  # connect to ES
    sq = StringQuery(u'世界末日', 'content')  # query string, matched against the content field
    h = HighLighter(['<b>'], ['</b>'], fragment_size=20)  # wrap matches in <b>...</b>

    s = Search(sq, highlight=h)
    s.add_highlight("content")
    results = conn.search(s, indices='txtfiles', doc_types='test-type')

    hits = []
    for r in results:
        # Replace the full text with the first highlighted fragment, if there is one
        if "content" in r._meta.highlight:
            r['content'] = r._meta.highlight[u"content"][0]
        hits.append(r)
        print(r['content'])
    print(len(hits))
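
For readers who want to see what such a search looks like at the REST level, the sketch below builds a JSON request body with a `query_string` query and a `highlight` section, and pulls the first highlighted fragment out of each hit of a response. It is an approximation written for illustration, not a dump of what pyes actually sends; `build_search_body` and `highlighted_snippets` are hypothetical names, and the response layout (`hits.hits[].highlight`) is the standard shape of an elasticsearch search reply.

```python
def build_search_body(text, field="content", fragment_size=20):
    """Build a query_string search with <b>...</b> highlighting on one field."""
    return {
        "query": {"query_string": {"query": text, "default_field": field}},
        "highlight": {
            "pre_tags": ["<b>"],
            "post_tags": ["</b>"],
            "fragment_size": fragment_size,
            "fields": {field: {}},
        },
    }


def highlighted_snippets(response, field="content"):
    """Return the first highlighted fragment of every hit that has one."""
    snippets = []
    for hit in response.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get(field)
        if fragments:
            snippets.append(fragments[0])
    return snippets
```

The dictionary returned by `build_search_body(u'世界末日')` can be serialized with `json.dumps` and posted to the `_search` endpoint of the `txtfiles` index; the decoded reply can then be passed to `highlighted_snippets`.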