2020.12 - Elasticsearch study notes - Importing data from a local SQL database into Elasticsearch with Logstash
Contents
Run a minimal Logstash pipeline to test the Logstash installation
Create your own .conf file and store it somewhere on the filesystem
Import data from MySQL-like services into Elasticsearch via Logstash
Import MySQL data into ES, using a custom ik + pinyin analyzer during the import
Open question 1: what is the effect of the search_analyzer parameter configured in the template?
Loading a custom template fails with: [logstash.outputs.elasticsearch] Failed to install template.
References
Installing Logstash from the Elastic stack: https://blog.csdn.net/UbuntuTouch/article/details/99655350
Importing data from MariaDB into Elasticsearch: https://www.cnblogs.com/mikeluwen/p/7686663.html
A well-written, working walkthrough; most of the solution below follows this post: https://blog.csdn.net/andy511823558/article/details/106312759
Explanation of template import failures across versions: https://blog.csdn.net/u014646662/article/details/94718834
Fixing the macOS "untrusted developer" startup issue for Elasticsearch 6.x: https://blog.csdn.net/qq_42797418/article/details/107838127
Installing Logstash from the Elastic stack
macOS & Linux
curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.3.0.tar.gz
tar -xzvf logstash-7.3.0.tar.gz
Note that curl downloads the file into whatever directory it is run from; in practice this download was quite slow.
Huawei Cloud mirror:
https://mirrors.huaweicloud.com/logstash/7.3.0/logstash-7.3.0.tar.gz
Run a minimal Logstash pipeline to test the Logstash installation
cd logstash-7.3.0
bin/logstash -e 'input { stdin { } } output { stdout {} }'
This attempt produced an error:
Logstash stopped processing because of an error: (ArgumentError) invalid byte sequence in US-ASCII
The error was caused by Chinese characters in the installation path; after renaming the offending directories, Logstash started successfully.
Type a few strings into stdin to test:
Create your own .conf file and store it somewhere on the filesystem
heartbeat.conf (it can live in any directory, e.g. /Desktop/heartbeat.conf)
input {
heartbeat {
interval => 10
type => "heartbeat"
}
}
output {
stdout {
codec => rubydebug
}
}
Run the conf file (a heartbeat pipeline) with the following command:
./bin/logstash -f /Desktop/heartbeat.conf
logstash-7.3.0 $ ./bin/logstash -f /Users/curious/Desktop/heartbeat.conf
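Each interval the heartbeat input emits one event, and the rubydebug codec prints it as a Ruby-style hash. A rough Python illustration of the event shape (the field values, including the default "ok" message, are assumptions about the plugin's defaults rather than captured output):

```python
import datetime

# Illustrative shape of one heartbeat event; "message" => "ok" is an
# assumption about the logstash-input-heartbeat default, not captured output.
event = {
    "@version": "1",
    "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "type": "heartbeat",
    "message": "ok",
}

def rubydebug(e):
    """Render a dict roughly the way the rubydebug codec prints an event."""
    lines = ['    %r => %r,' % (k, v) for k, v in e.items()]
    return "{\n" + "\n".join(lines) + "\n}"

print(rubydebug(event))
```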
Import data from MySQL-like services into Elasticsearch via Logstash
Install the plugin
logstash-7.3.0 $ bin/logstash-plugin install logstash-input-jdbc
Validating logstash-input-jdbc
Installing logstash-input-jdbc
Installation successful
Download the connector
https://dev.mysql.com/downloads/connector/j/ — choose the platform-independent package
After extracting, the contents look like the screenshot below:
Copy the MySQL driver jar to:
/logstash-7.3.0/logstash-core/lib/jars/mysql-connector-java-8.0.21.jar
Import MySQL data into ES, using a custom ik + pinyin analyzer during the import
This step mostly consists of configuring the .conf file plus the mapping template used during the import. The core .conf file is shown below.
Run from the Logstash directory:
logstash-7.3.0 $ ./bin/logstash -f ./csdn.conf    # the .conf path may also be absolute
The overall file layout is as follows:
csdn.conf
input{
    stdin{
    }
    jdbc{
        jdbc_connection_string => "jdbc:mysql://localhost:3306/estest" # JDBC connection string
        jdbc_user => "root"
        jdbc_password => "your_database_password"
        jdbc_driver_library => "" # driver jar path; left empty because the jar was copied into logstash-core/lib/jars above
        jdbc_driver_class => "com.mysql.jdbc.Driver" # JDBC driver class
        jdbc_paging_enabled => "true"
        jdbc_page_size => "50000" # page size for the paged sync
        lowercase_column_names => "false" # whether to lowercase column names; defaults to true (lowercases), false keeps the original case
        #use_column_value => false # track by another column instead of time
        #tracking_column => "UPDATETIME"
        #statement_filepath => "/root/opt/logstash/bin/logstash_jdbc_test/jdbc.sql" # path to a SQL file holding the sync statement
        statement => ' select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ' # aliases gid and companyName to DZZYID and DZZYMC_PINYIN
        schedule => "* * * * *" # cron fields (left to right): minute, hour, day of month, month, day of week; all * means run every minute
        type => "postgis_test"
    }
}
filter {
    mutate {
        # add_field => {"DATATYPE"=>"PT"}
        # add_field => {"LAYER"=>"postgis_test"}
        # add_field => {"LOCATION" => "%{Y},%{X}"}
        remove_field => "@version"
        remove_field => "@timestamp"
    }
}
output {
    #stdout{
    #codec => json_lines
    #}
    if [type] == "postgis_test"{
        elasticsearch{
            hosts => ["http://localhost:9200"] # ES connection
            index => "postgis_test" # ES index name
            document_id => "%{DZZYID}" # use DZZYID as the _id so documents can be fetched by ID; _id is a built-in ES field and must not appear in the SELECT
            #document_type => "doc"
            manage_template => true
            template_overwrite => true
            template_name => "postgis_test"
            template => "./postgis_test.json"
            codec => json_lines
        }
    }
}
postgis_test.json
{
"template": "postgis_test",
"mappings": {
"doc": {
"properties": {
"DZZYID": {
"type": "keyword"
},
"DZZYMC_PINYIN": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer":"ik_pinyin_analyzer",
"search_analyzer": "ik_pinyin_analyzer"
}
}
}
},
"settings" : {
"analysis":{
"analyzer":{
"ik_pinyin_analyzer":{
"type":"custom",
"tokenizer":"ik_max_word",
"filter": "pinyin_filter"
}
},
"filter":{
"pinyin_filter":{
"type" : "pinyin",
"keep_separate_first_letter" : "false",
"keep_full_pinyin" : "true",
"keep_original" : "true",
"limit_first_letter_length" : "16",
"lowercase" : "true",
"remove_duplicated_term" : "true"
}
}
}
}
}
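In the conf above, document_id => "%{DZZYID}" uses Logstash's sprintf field-reference syntax: each event's DZZYID value becomes the document _id. A toy Python sketch of that substitution (the regex resolver below is my own illustration, not Logstash internals):

```python
import re

def sprintf(template, event):
    """Toy version of Logstash's %{field} sprintf field substitution."""
    return re.sub(r"%\{(\w+)\}", lambda m: str(event.get(m.group(1), "")), template)

# One row from the SELECT, after the gid/companyName aliasing.
row = {"DZZYID": 1114, "DZZYMC_PINYIN": "阿里巴巴国际站公司"}
doc_id = sprintf("%{DZZYID}", row)
print(doc_id)  # -> "1114"
```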
Once the import has started, you can inspect the mapping in use via a pretty GET on the index; ik_pinyin_analyzer is applied correctly:
import json
import requests

# pretty() is a static method of the ElasticsearchController helper class
@staticmethod
def pretty(indexName):
    # GET /<index>?pretty returns the index's mappings and settings
    url = "http://localhost:9200/" + str(indexName) + '?pretty'
    payload = {}
    headers = {
        'Content-Type': 'application/json'
    }
    response = requests.request("GET", url, headers=headers, data=json.dumps(payload))
    print("do pretty func: \n", json.loads(response.text))

ElasticsearchController.pretty(indexName)
do pretty func:
{'postgis_test': {'aliases': {}, 'mappings': {'doc': {'properties': {'DZZYID': {'type': 'keyword'}, 'DZZYMC_PINYIN': {'type': 'text', 'term_vector': 'with_positions_offsets', 'analyzer': 'ik_pinyin_analyzer'}, 'type': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}, 'settings': {'index': {'number_of_shards': '5', 'provided_name': 'postgis_test', 'creation_date': '1608016087623', 'analysis': {'filter': {'pinyin_filter': {'lowercase': 'true', 'keep_original': 'true', 'remove_duplicated_term': 'true', 'keep_separate_first_letter': 'false', 'type': 'pinyin', 'limit_first_letter_length': '16', 'keep_full_pinyin': 'true'}}, 'analyzer': {'ik_pinyin_analyzer': {'filter': 'pinyin_filter', 'type': 'custom', 'tokenizer': 'ik_max_word'}}}, 'number_of_replicas': '1', 'uuid': 'y1CqPflbTYWl45BDSXQ8Mw', 'version': {'created': '6030099'}}}}}
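The same check can be scripted by walking the response; a small sketch against a trimmed copy of the metadata shown above:

```python
# Trimmed copy of the GET /<index>?pretty response shown above.
resp = {
    "postgis_test": {
        "mappings": {
            "doc": {
                "properties": {
                    "DZZYMC_PINYIN": {
                        "type": "text",
                        "analyzer": "ik_pinyin_analyzer",
                    }
                }
            }
        }
    }
}

def field_analyzer(resp, index, doc_type, field):
    """Pull the analyzer configured for one field out of the index metadata."""
    props = resp[index]["mappings"][doc_type]["properties"]
    return props[field].get("analyzer")

print(field_analyzer(resp, "postgis_test", "doc", "DZZYMC_PINYIN"))
```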
Importing SQL data into Elasticsearch this way reaches over 1,000 docs/s for short texts; a sample of the source table is shown below:
While the import runs, the _count API reports how many documents have been indexed:
import json
import requests

# countIndex() is also a static method of ElasticsearchController
def countIndex(indexName):
    # GET /<index>/_count returns the number of documents indexed so far
    url = "http://localhost:9200/" + str(indexName) + '/_count'
    payload = {}
    headers = {
        'Content-Type': 'application/json'
    }
    response = requests.request("GET", url, headers=headers, data=json.dumps(payload))
    print("do countIndex func: \n", json.loads(response.text))

ElasticsearchController.countIndex(indexName='postgis_test')
>>>do countIndex func:
{'count': 423250, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}}
Import speed log:
[2020-12-15T16:22:03,627][INFO ][logstash.inputs.jdbc ] (0.124891s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 0
[2020-12-15T16:22:34,442][INFO ][logstash.inputs.jdbc ] (0.125908s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 50000
[2020-12-15T16:23:04,448][INFO ][logstash.inputs.jdbc ] (0.106033s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 100000
[2020-12-15T16:23:32,037][INFO ][logstash.inputs.jdbc ] (0.140638s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 150000
[2020-12-15T16:24:01,441][INFO ][logstash.inputs.jdbc ] (0.110575s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 200000
[2020-12-15T16:24:34,641][INFO ][logstash.inputs.jdbc ] (0.345005s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 250000
[2020-12-15T16:25:04,252][INFO ][logstash.inputs.jdbc ] (0.122336s) SELECT * FROM ( select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 300000
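The >1,000 docs/s figure can be sanity-checked from these timestamps: each page is 50,000 rows, and roughly six full pages were fetched between the first and last log lines shown:

```python
from datetime import datetime

# Timestamps of the first and last paged queries in the log above.
fmt = "%Y-%m-%dT%H:%M:%S,%f"
start = datetime.strptime("2020-12-15T16:22:03,627", fmt)
end = datetime.strptime("2020-12-15T16:25:04,252", fmt)

docs = 6 * 50000  # roughly six full pages between these two log lines
rate = docs / (end - start).total_seconds()
print("approx. %.0f docs/s" % rate)  # approx. 1661 docs/s
```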
Analyzing, trying queries, and planning next steps
Open question 1: what is the effect of the search_analyzer parameter configured in the template?
Some approaches set a search_analyzer for the query phase; does this parameter actually take effect, and what does it do?
Open question 2: why are the search results so poor?
Under the current pinyin + ik configuration, the Elasticsearch query process needs further study: learning the different query modes and configuration options, e.g. the difference between search and searchByPhrase.
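The difference shows up directly in the request bodies. A sketch of the two query DSL bodies that a match and a match_phrase search would POST to /<index>/_search (the field name comes from the template above):

```python
import json

field, text = "DZZYMC_PINYIN", "阿里巴巴"

# match: the query text is analyzed and terms may hit in any order/position.
match_query = {"query": {"match": {field: text}}}

# match_phrase: analyzed terms must appear in order at adjacent positions,
# which typically trims the loose hits a bare match returns.
phrase_query = {"query": {"match_phrase": {field: text}}}

print(json.dumps(match_query, ensure_ascii=False))
print(json.dumps(phrase_query, ensure_ascii=False))
```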
Basic method:
import datetime
import json
import time

import requests

@staticmethod
def search(indexName, searchPayload):
    # URI search: /<index>/_search?q=<field>:<text>
    url = "http://localhost:9200/" + str(indexName) + '/_search?' + "q=companyname:" + searchPayload
    payload = {}
    headers = {
        'Content-Type': 'application/json'
    }
    response = requests.request("GET", url, headers=headers, data=json.dumps(payload))
    print("do search func: \n", json.loads(response.text))

searchPayload = '阿里' # ik + pinyin
tupleStartTime = datetime.datetime.now().microsecond # sub-second part of the start time
timestampStartTime = time.mktime(datetime.datetime.now().timetuple()) # whole-second part of the start time
ElasticsearchController.search(indexName, searchPayload)
tupleEndTime = datetime.datetime.now().microsecond
timestampEndTime = time.mktime(datetime.datetime.now().timetuple())
strTime = 'do function time use:%dms' % ((timestampEndTime - timestampStartTime) * 1000 + (tupleEndTime - tupleStartTime) / 1000)
print(strTime)
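As an aside, the two-variable timing above (whole seconds via mktime plus a separate microsecond field) works but is convoluted; time.perf_counter measures the elapsed interval in one monotonic call. A minimal sketch (timed is a hypothetical helper, not part of the original code):

```python
import time

def timed(func, *args, **kwargs):
    """Run func and report its wall-clock time in milliseconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("do function time use:%dms" % elapsed_ms)
    return result

# e.g. timed(ElasticsearchController.search, indexName, searchPayload)
timed(sum, [1, 2, 3])
```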
Querying "阿里" with the basic method matches nothing, which is not the expected behavior:
do search func:
{'took': 33, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
do function time use:98ms
Querying "albb" also matches nothing, again unexpectedly:
do search func:
{'took': 26, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
do function time use:66ms
Querying "阿里 巴巴" (with a space in the middle):
do search func:
{'took': 44, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 8117, 'max_score': 8.404073, 'hits': [{'_index': 'postgis_test', '_type': 'doc', '_id': '1114', '_score': 8.404073, '_source': {'DZZYID': 1114, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴国际站公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '388946', '_score': 8.376436, '_source': {'DZZYID': 388946, 'type': 'postgis_test', 'DZZYMC_PINYIN': '成都阿里巴巴游戏公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1100', '_score': 8.339975, '_source': {'DZZYID': 1100, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1116', '_score': 8.338591, '_source': {'DZZYID': 1116, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴国际站商贸公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1126', '_score': 8.322807, '_source': {'DZZYID': 1126, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴合伙公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1225', '_score': 8.322807, '_source': {'DZZYID': 1225, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴装饰公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1211', '_score': 8.312388, '_source': {'DZZYID': 1211, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴运输公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1106', '_score': 8.312388, '_source': {'DZZYID': 1106, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴广州公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1110', '_score': 8.295631, '_source': {'DZZYID': 1110, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴国际店装修公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1102', '_score': 8.280003, '_source': {'DZZYID': 1102, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴管理公司'}}]}}
do function time use:72ms
Querying "al bb" does match, which confirms the ik + pinyin analyzer is working, but the precision is very low; this needs further work:
do search func:
{'took': 39, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 46084, 'max_score': 4.3271093, 'hits': [{'_index': 'postgis_test', '_type': 'doc', '_id': '100907', '_score': 4.3271093, '_source': {'DZZYID': 100907, 'type': 'postgis_test', 'DZZYMC_PINYIN': '百宝公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '108108', '_score': 4.1653986, '_source': {'DZZYID': 108108, 'type': 'postgis_test', 'DZZYMC_PINYIN': '帮别公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '136254', '_score': 4.1653986, '_source': {'DZZYID': 136254, 'type': 'postgis_test', 'DZZYMC_PINYIN': '北碚公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '99371', '_score': 4.150325, '_source': {'DZZYID': 99371, 'type': 'postgis_test', 'DZZYMC_PINYIN': '白百北食品商贸公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '120584', '_score': 4.147024, '_source': {'DZZYID': 120584, 'type': 'postgis_test', 'DZZYMC_PINYIN': '保柏公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '100938', '_score': 4.1310782, '_source': {'DZZYID': 100938, 'type': 'postgis_test', 'DZZYMC_PINYIN': '百部湾航空公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '97385', '_score': 4.1031322, '_source': {'DZZYID': 97385, 'type': 'postgis_test', 'DZZYMC_PINYIN': '巴宝莉公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '136247', '_score': 4.098734, '_source': {'DZZYID': 136247, 'type': 'postgis_test', 'DZZYMC_PINYIN': '北碚保洁公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '136300', '_score': 4.098734, '_source': {'DZZYID': 136300, 'type': 'postgis_test', 'DZZYMC_PINYIN': '北碚装饰公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '308280', '_score': 4.098734, '_source': {'DZZYID': 308280, 'type': 'postgis_test', 'DZZYMC_PINYIN': '毕博咨询公司'}}]}}
do function time use:101ms
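Why "al bb" hits at all: besides full pinyin, the pinyin filter also indexes the first letters of each syllable (cf. limit_first_letter_length in the template), so "阿里巴巴" yields "albb". A toy sketch with a hand-made pinyin table (the real filter uses a full dictionary; this mapping only covers the example characters):

```python
# Hand-rolled toy pinyin table -- only covers the example characters.
PINYIN = {"阿": "a", "里": "li", "巴": "ba", "国": "guo", "际": "ji"}

def first_letters(text):
    """First letter of each character's pinyin, roughly how the pinyin
    filter's first-letter indexing treats a token."""
    return "".join(PINYIN[ch][0] for ch in text if ch in PINYIN)

print(first_letters("阿里巴巴"))  # -> "albb"
```

This also hints at the low precision: "bb" alone matches any name whose syllables start with b-b (百宝, 北碚, ...), which is exactly the kind of loose hit seen in the results above.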
Update 2020.12.16: note that the .conf and template.json configuration may have renamed the original companyname column to something else; queries must use the renamed field.
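A sketch of building the query URL against the renamed field; urllib's quote percent-encodes the Chinese term (host and index are the ones used throughout these notes):

```python
from urllib.parse import quote

def search_url(index, field, text, host="http://localhost:9200"):
    """Build a URI-search URL; quote() percent-encodes non-ASCII terms."""
    return "%s/%s/_search?q=%s:%s" % (host, index, field, quote(text))

url = search_url("postgis_test", "DZZYMC_PINYIN", "阿里")
print(url)
```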
Problem resolution
Loading a custom template fails with: [logstash.outputs.elasticsearch] Failed to install template.
Try to reproduce the problem by PUTting the template directly:
curl -H "Content-Type: application/json" -XPUT localhost:9200/_template/template_1 -d '
{
"template": "postgis_test",
"mappings": {
"doc": {
"properties": {
"DZZYID": {
"type": "keyword"
},
"DZZYMC_PINYIN": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "pinyin"
}
}
}
}
}
'
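Since "Failed to install template" can also be caused by malformed JSON, a quick pre-check is to parse the template before pointing Logstash or curl at it. A minimal sketch (the body is a trimmed copy of the template above; in practice you would read ./postgis_test.json from disk):

```python
import json

# A trimmed copy of the template body used above.
template_body = """
{
  "template": "postgis_test",
  "mappings": {
    "doc": {
      "properties": {
        "DZZYID": {"type": "keyword"}
      }
    }
  }
}
"""

tpl = json.loads(template_body)  # raises ValueError on malformed JSON
assert "mappings" in tpl
print("template parses; index pattern:", tpl["template"])
```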
After working on the error further, the PUT succeeds via curl, but running the .conf file still fails; no further fix was found at that point.
After much trial and error, this post (https://blog.csdn.net/u014646662/article/details/94718834) explains the cause: Elasticsearch 7.x changed the syntax of several configuration files, while most online tutorials target the 5.x and 6.x versions, so for real-world use the more stable 6.x line may be the better choice.
After switching to 6.x, the custom template loads successfully.
Elasticsearch 6.3.0 "untrusted developer" issue on macOS
Run the following in the terminal, where XXX.app is the absolute path of the untrusted bundle (you can drag it into the terminal window):
sudo xattr -r -d com.apple.quarantine /XXX/XXX.app/