Elasticsearch version: 6.5.4
Python elasticsearch package version: 7.9.1
Installing Elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.4.tar.gz.sha512
shasum -a 512 -c elasticsearch-6.5.4.tar.gz.sha512
tar -xzf elasticsearch-6.5.4.tar.gz
cd elasticsearch-6.5.4
./bin/elasticsearch
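If everything works, Elasticsearch runs on the default port 9200. Open another terminal window and request that port (for example, with curl localhost:9200); the response describes the node.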
{
"name" : "4i-Vnu0",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "-GKnecjfT46y3n-CMs44dg",
"version" : {
"number" : "6.5.4",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "d2ef93d",
"build_date" : "2018-12-17T21:17:40.758843Z",
"build_snapshot" : false,
"lucene_version" : "7.5.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
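In response to a request on port 9200, Elasticsearch returns the JSON object above, which contains information about the current node, cluster, and version. Press Ctrl + C to stop Elasticsearch.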
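The same check can be scripted. The sketch below is an illustration rather than part of the original setup: it uses only the Python standard library, and the helper names fetch_node_info and summarize_node_info are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_node_info(base_url="http://localhost:9200", timeout=5):
    """Request the root endpoint and return the decoded JSON object."""
    with urlopen(base_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_node_info(info):
    """Pick out the node name, cluster name, and version number."""
    return {
        "node": info["name"],
        "cluster": info["cluster_name"],
        "version": info["version"]["number"],
    }
```

With a node running locally, summarize_node_info(fetch_node_info()) should report version 6.5.4 for the installation described here.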
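By default, Elasticsearch accepts connections only from the local machine. To allow remote access, edit config/elasticsearch.yml in the Elasticsearch installation directory: uncomment network.host, set its value to 0.0.0.0, and restart Elasticsearch.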
network.host: 0.0.0.0
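If the edit has to be repeated on several machines, it can be automated. The following is a minimal sketch, assuming the setting appears in the file as a single network.host line, commented out or not; set_network_host is a hypothetical helper, not something shipped with Elasticsearch.

```python
import re


def set_network_host(config_text, host="0.0.0.0"):
    """Return config_text with network.host set to host.

    An existing network.host line, commented out or not, is replaced;
    if there is none, the setting is appended at the end.
    """
    pattern = re.compile(r"^[ \t]*#?[ \t]*network\.host[ \t]*:.*$", re.MULTILINE)
    new_line = f"network.host: {host}"
    if pattern.search(config_text):
        # A function replacement avoids backslash interpretation in new_line.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    separator = "" if config_text.endswith("\n") or not config_text else "\n"
    return f"{config_text}{separator}{new_line}\n"
```

To apply it, read config/elasticsearch.yml into a string, pass it through set_network_host, and write the result back before restarting Elasticsearch.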
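Interacting with Elasticsearch from Python
First install the elasticsearch Python package; version 7.9.1 was used here with Elasticsearch 6.5.4. (The client's documentation recommends a package whose major version matches the server's, so a 6.x release is the safer choice if problems arise.) Note that once network.host is set to 0.0.0.0, ip in the code below should be replaced with the machine's actual IP address.
Create the Elasticsearch client and query basic information about Elasticsearch.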
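import requests
from elasticsearch import Elasticsearch
# Replace 'ip' with the address of the Elasticsearch host.
es = Elasticsearch([{'host': 'ip', 'port': 9200}])
# Query information about the current node, cluster, and version.
result = requests.get('http://ip:9200')
print(result.content.decode())
# List all indices on the current node.
result = requests.get('http://ip:9200/_cat/indices?v')
print(result.content.decode())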
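The two HTTP requests above can also go through the client object itself, which avoids repeating the host and port. This is a hedged sketch: describe_cluster is a hypothetical wrapper, and it assumes es is an elasticsearch.Elasticsearch instance exposing info() and cat.indices().

```python
def describe_cluster(es):
    """Return (version number, index listing) using the client's own methods.

    `es` is expected to be an elasticsearch.Elasticsearch instance.
    """
    info = es.info()                  # same JSON object as GET /
    indices = es.cat.indices(v=True)  # same table as GET /_cat/indices?v
    return info["version"]["number"], indices
```

Calling version, table = describe_cluster(es) with the es client created earlier and then printing table should show the same index listing as the requests call.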
Save the data as a CSV file beforehand, then upload it in bulk with helpers.
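import csv
from elasticsearch import helpers
# file_name is the path to the CSV file; each row holds surface_form, score, mid.
with open(file_name, 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f, fieldnames=['surface_form', 'score', 'mid'])
    # Upload inside the with block: the reader pulls rows from the open file lazily.
    helpers.bulk(es, reader, index='freebase_surface_map', doc_type="doc")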
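One detail worth knowing: csv.DictReader yields every value as a string, so score reaches Elasticsearch as text unless it is converted first. The sketch below shows one way to convert it before upload. It assumes a comma-separated file with no header row, and iter_surface_docs is a hypothetical helper.

```python
import csv


def iter_surface_docs(csv_file, fieldnames=("surface_form", "score", "mid")):
    """Yield one document per CSV row, with score parsed as a float."""
    for row in csv.DictReader(csv_file, fieldnames=list(fieldnames)):
        yield {
            "surface_form": row["surface_form"],
            "score": float(row["score"]),
            "mid": row["mid"],
        }
```

Inside the with block, helpers.bulk(es, iter_surface_docs(f), index='freebase_surface_map', doc_type="doc") would then upload the converted documents in place of the raw reader.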
Query Elasticsearch and post-process the results.
candidate = 'country'
# surface_form is the field to match against.
query = {'query': {'match': {'surface_form': candidate}}}
es_results = es.search(index='freebase_surface_map', doc_type="doc", size=50, body=query)
max_score = es_results['hits']['max_score']
es_results = es_results['hits']['hits']
# Keep only the candidates with the highest relevance score.
es_results = [es_result for es_result in es_results if es_result['_score'] == max_score]
es_results = [es_result['_source'] for es_result in es_results]
es_results = sorted(es_results, key=lambda x: x['score'], reverse=True)
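The post-processing steps can be wrapped in a function so they are easy to try on a canned response without a running node. This is a sketch under assumptions: top_candidates is a hypothetical name, and float() is applied to the stored score so that values uploaded from CSV as strings sort numerically rather than alphabetically.

```python
def top_candidates(response):
    """Return the _source of every hit tied for the best relevance score,
    ordered by the stored 'score' field, highest first."""
    hits = response["hits"]["hits"]
    max_score = response["hits"]["max_score"]
    best = [hit["_source"] for hit in hits if hit["_score"] == max_score]
    # float() handles scores stored either as numbers or as numeric strings.
    return sorted(best, key=lambda doc: float(doc["score"]), reverse=True)
```

With the client from earlier, top_candidates(es.search(index='freebase_surface_map', doc_type="doc", size=50, body=query)) would return the same kind of list as the step-by-step code. An empty result set yields an empty list.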