1. Bulk Insert and Update
This note covers the efficient Elasticsearch bulk insert and update patterns we commonly use from Python.
1.1 Bulk Insert
import pandas as pd
from elasticsearch import helpers

actions = list()
count = 0
for index, item in merged_df.iterrows():
    # Drop NaN values so they are not written as null fields
    filtered_item = dict(filter(lambda x: pd.notna(x[1]), item.items()))
    action = {
        "_op_type": "index",            # index or update
        "_index": "community_summary",  # index name
        "_id": item['id'],              # document ID
        "_source": filtered_item        # document body
    }
    actions.append(action)
    if len(actions) == 1000:
        # Flush a batch of 1000 actions
        helpers.bulk(es12_client.elastic_client, actions)
        count += len(actions)
        print(count)
        actions.clear()
# Flush whatever is left over
if len(actions) > 0:
    helpers.bulk(es12_client.elastic_client, actions)
    count += len(actions)
    print(count)
    actions.clear()
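As a side note, the NaN filter above can also be written as a dict comprehension. A minimal standalone sketch (the one-row DataFrame here is made up for illustration):

```python
import pandas as pd

# A made-up row with one missing value, standing in for a row of merged_df
df = pd.DataFrame([{"id": 1, "name": "a", "score": None}])
row = df.iloc[0]

# Equivalent to the filter/lambda above: keep only non-NaN fields
filtered = {k: v for k, v in row.items() if pd.notna(v)}
# filtered now contains only "id" and "name"
```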
1.2 Bulk Update
For bulk updates, only the following parts of the action need to change:
action = {
    '_op_type': 'update',  # changed to update
    '_index': item['index'],
    '_id': item_['_id'],
    'doc': {'estate_type': item['映射物业类型']}  # partial document goes under 'doc'
}
The bulk helper also accepts a ready-made request body under '_source', in which case the partial document must be nested as '_source': {'doc': {...}}; both forms produce the same bulk request.
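A minimal, self-contained sketch of building such update actions (the `rows` list and the `build_update_actions` helper are hypothetical; the real code iterates a DataFrame as in section 1.1):

```python
def build_update_actions(rows, index_name):
    # One partial-document update per row; only 'estate_type' is touched
    for row in rows:
        yield {
            "_op_type": "update",
            "_index": index_name,
            "_id": row["_id"],
            "doc": {"estate_type": row["estate_type"]},
        }

rows = [{"_id": "1", "estate_type": "apartment"}]
actions = list(build_update_actions(rows, "community_summary"))
# actions[0]["_op_type"] == "update"
```

The generator can be passed straight to helpers.bulk without materializing the list.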
2. Snapshot Backup and Restore
2.1 SSHFS Mount
- Upload Elasticsearch\fuse-2.9.2-11.el7.x86_64.rpm to each of the three ES machines
- Upload Elasticsearch\fuse-libs-2.9.2-11.el7.x86_64.rpm to each of the three ES machines
- Upload Elasticsearch\fuse-sshfs-2.10-1.el7.x86_64.rpm to each of the three ES machines
rpm -ivh fuse-libs-2.9.2-11.el7.x86_64.rpm fuse-2.9.2-11.el7.x86_64.rpm fuse-sshfs-2.10-1.el7.x86_64.rpm
- On the separate backup machine, run:
mkdir /home/elasticsearch
mkdir /home/elasticsearch/backup
useradd elasticsearch
passwd elasticsearch (password set to: xxxxx)
chown -R elasticsearch:elasticsearch /home/elasticsearch/
- On each of the three ES machines, run the same mount command:
sshfs -o allow_other -o nonempty root@192.168.0.102:/home/elasticsearch/backup /home/elasticsearch/backup
- Create a test file to check that the mount works. Switch to the elasticsearch user, then:
mkdir /home/elasticsearch/backup/test
- The test directory now also appears under the mounted path on the other machines, confirming the mount is correct.
2.2 Snapshot Restore (Data Refresh)
- Create the snapshot repository: open Kibana, as shown in the screenshot.
- Name the repository es_backup and choose "Shared file system".
- Enter the file system location /home/elasticsearch/backup.
- Click "Verify repository"; if it reports a successful connection, the repository is working.
- Snapshot restore: log in over FTP as the elasticsearch user and upload the files under Elasticsearch/backup to 86.1.72.102:/home/elasticsearch/backup.
- Wait patiently; once the upload succeeds, the snapshot shows up in Kibana under Snapshots.
- Click restore on the latest snapshot, step through the wizard with the default options, and run the restore; when it finishes, every item should show as completed.
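The Kibana steps above can also be issued through the snapshot API directly. A sketch, assuming the repository name es_backup and the path from this section (the snapshot name is a placeholder; note that for the fs repository type to validate, /home/elasticsearch/backup must also be listed under path.repo in elasticsearch.yml on every node):

```
PUT _snapshot/es_backup
{
  "type": "fs",
  "settings": {
    "location": "/home/elasticsearch/backup"
  }
}

POST _snapshot/es_backup/<snapshot_name>/_restore
```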
3. Deduplicating by Content with top_hits
We covered cardinality earlier, which deduplicates for counting purposes. When we need to deduplicate by content and return the documents themselves, top_hits is the tool.
Example requirement: gather information about the residential communities that customer addresses belong to. Several customer addresses may sit in the same community, so the results must be deduplicated by content.
# _source lists the community fields to return; size=1 keeps only the first hit per bucket
POST customer/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": {
              "value": "上海"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "seaweed_id": {
      "terms": {
        "field": "seaweed_id",
        "size": 2000
      },
      "aggs": {
        "top_hits": {
          "top_hits": {
            "_source": {
              "includes": [
                "city",
                "region",
                "name",
                "location"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}
Sample response:
{
  ...
  "aggregations" : {
    "seaweed_id" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "51ff8167d5fd56fe5ac14",
          "doc_count" : 18,
          "top_hits" : {
            "hits" : {
              "total" : {
                "value" : 18,
                "relation" : "eq"
              },
              "max_score" : 0.001962709,
              "hits" : [
                {
                  "_index" : "customer",
                  "_type" : "_doc",
                  "_id" : "a5cf9e2cf55fb1f788d2fcdfe4d",
                  "_score" : 0.001962709,
                  "_source" : {
                    "city" : "上海",
                    "name" : "育秀东区",
                    "location" : {
                      "lon" : 121.469335,
                      "lat" : 30.907972
                    },
                    "region" : "奉贤"
                  }
                }
              ]
            }
          }
        }
  ...
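On the Python side, pulling one representative community per bucket out of such a response is a small dict comprehension. A sketch against a trimmed-down copy of the response above:

```python
# Trimmed copy of the aggregation response shown above
response = {
    "aggregations": {
        "seaweed_id": {
            "buckets": [
                {
                    "key": "51ff8167d5fd56fe5ac14",
                    "doc_count": 18,
                    "top_hits": {
                        "hits": {
                            "hits": [
                                {"_source": {"city": "上海", "name": "育秀东区", "region": "奉贤"}}
                            ]
                        }
                    },
                }
            ]
        }
    }
}

# One representative document per seaweed_id bucket
communities = {
    b["key"]: b["top_hits"]["hits"]["hits"][0]["_source"]
    for b in response["aggregations"]["seaweed_id"]["buckets"]
}
# communities["51ff8167d5fd56fe5ac14"]["name"] == "育秀东区"
```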
Sample Java code:
private void buildAggParam(Map<String, String> aggMap, AggregationBuilder aggregationBuilder) {
    // Each entry maps an aggregation name to its type
    for (String key : aggMap.keySet()) {
        if ("count".equals(aggMap.get(key))) {
            aggregationBuilder.subAggregation(AggregationBuilders.count(key).field(key));
        } else if ("avg".equals(aggMap.get(key))) {
            aggregationBuilder.subAggregation(AggregationBuilders.avg(key).field(key));
        } else if ("distinct".equals(aggMap.get(key))) {
            aggregationBuilder.subAggregation(AggregationBuilders.cardinality(key).field(key));
        } else if ("sum".equals(aggMap.get(key))) {
            aggregationBuilder.subAggregation(AggregationBuilders.sum(key).field(key));
        } else if ("top_hits".equals(aggMap.get(key))) {
            // key is a comma-separated list of the _source fields to return
            aggregationBuilder.subAggregation(AggregationBuilders.topHits(key).fetchSource(key.split(","), null).size(1));
        }
    }
}
private void parseAggResult(Map<String, Map<String, String>> resultMap, Object key2, Aggregations aggregations) {
    Map<String, String> subMap = new HashMap<>();
    String key = String.valueOf(key2);
    resultMap.put(key, subMap);
    Map<String, Aggregation> aggregationMap = aggregations.getAsMap();
    for (String subKey : aggregationMap.keySet()) {
        Aggregation aggregation = aggregationMap.get(subKey);
        String subVal = "-";
        if ("avg".equals(aggregation.getType())) {
            double value = ((ParsedAvg) aggregation).getValue();
            // Render whole numbers without a trailing ".0"
            if ((int) value != value) {
                subVal = String.valueOf(value);
            } else {
                subVal = String.valueOf((int) value);
            }
        } else if ("value_count".equals(aggregation.getType())) {
            subVal = String.valueOf((int) ((ParsedValueCount) aggregation).getValue());
        } else if ("cardinality".equals(aggregation.getType())) {
            subVal = String.valueOf((int) ((ParsedCardinality) aggregation).getValue());
        } else if ("sum".equals(aggregation.getType())) {
            subVal = String.valueOf((int) ((ParsedSum) aggregation).getValue());
        } else if ("top_hits".equals(aggregation.getType())) {
            // Take the single representative document kept per bucket (size=1)
            SearchHit searchHit = ((ParsedTopHits) aggregation).getHits().getHits()[0];
            subVal = searchHit.getSourceAsString();
        }
        subMap.put(subKey, subVal);
    }
}