2020.12-elasticsearch学习-使用logstash从本地SQL类数据库中向elasticsearch导入数据_logstash将mssql中的数据同步到elasticsearch-CSDN博客

本文链接：https://blog.csdn.net/CuriousLiu/article/details/111166766

2020.12-elasticsearch学习-使用logstash从本地SQL类数据库中向elasticsearch导入数据

Reference

如何安装elasticsearch栈中的logstash

mac&linux

运行最基本的 Logstash 管道来测试 Logstash 安装

创建一个自己的 .conf 文件，并存于文件系统的一个目录下

从mysql等服务中，利用logstash向elasticsearch导入数据

安装插件

下载安装包

将mysql中的数据导入到es中，并且要求在导入的过程中可以使用ik+pinyin自定义的analyzer

进行analyze，查询尝试，并对下一步内容进行规划

遗留问题1：查询过程中在template中配置的“search_analyzer”参数的作用是什么

遗留问题2：为什么查询结果非常差

问题解决

加载个人设置的template不成功，报错：[logstash.outputs.elasticsearch] Failed to install template.

elasticsearch6.3.0版本不受信任的开发者

Reference

如何安装elasticsearch栈中的logstash：https://blog.csdn.net/UbuntuTouch/article/details/99655350

从mariadb向elasticsearch中导入数据：https://www.cnblogs.com/mikeluwen/p/7686663.html

找到写的较好的可用的博客，目前解决办法大部分参照的是这篇博客：https://blog.csdn.net/andy511823558/article/details/106312759

template导入失败问题说明博客，在不同版本：https://blog.csdn.net/u014646662/article/details/94718834

解决elasticsearch6.x版本macos启动权限问题（不受信任的开发者）：https://blog.csdn.net/qq_42797418/article/details/107838127?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2.control

如何安装elasticsearch栈中的logstash

mac&linux

curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.3.0.tar.gz
tar -xzvf logstash-7.3.0.tar.gz

注意curl命令在哪个界面运行，就会把文件下载到哪个界面，在实际使用过程中发现该方法下载过慢

华为云镜像下载网站：

https://mirrors.huaweicloud.com/logstash/7.3.0/logstash-7.3.0.tar.gz

运行最基本的 Logstash 管道来测试 Logstash 安装

cd logstash-7.3.0
bin/logstash -e 'input { stdin { } } output { stdout {} }'

在尝试过程中报错

Logstash stopped processing because of an error: (ArgumentError) invalid byte sequence in US-ASCII

该错误是由于使用路径中包括中文字符导致的，修改中文路径名称问题后，成功启动elasticsearch

在stdin中输入一些字符串进行测试：

创建一个自己的 .conf 文件，并存于文件系统的一个目录下

heartbeat.conf（可以放置在其他目录下，例如/Desktop/heartbeat.conf）

input {
  heartbeat {
    interval => 10
    type => "heartbeat"
  }
}
 
output {
  stdout {
    codec => rubydebug
  }
}

通过执行如下命令，可以“运行”该conf文件（heartbeat心跳）

./bin/logstash -f /Desktop/heartbeat.conf

logstash-7.3.0 $ ./bin/logstash -f /Users/curious/Desktop/heartbeat.conf

从mysql等服务中，利用logstash向elasticsearch导入数据

安装插件

logstash-7.3.0 $ bin/logstash-plugin install logstash-input-jdbc
Validating logstash-input-jdbc
Installing logstash-input-jdbc
Installation successful

下载安装包

https://dev.mysql.com/downloads/connector/j/，选择不依赖于平台性的安装版本

解压缩后如下图所示：

拷贝mysql驱动jar包到如下目录：

/logstash-7.3.0/logstash-core/lib/jars/mysql-connector-java-8.0.21.jar

将mysql中的数据导入到es中，并且要求在导入的过程中可以使用ik+pinyin自定义的analyzer

该步骤主要是配置.conf文件，以及在导入过程中使用的mapping模板，核心 .conf文件如下所示：

在logstash目录下执行命令：

logstash-7.3.0 $ ./bin/logstash -f ./csdn.conf 其中.conf文件可以使用绝对路径

整体文件结构格式如下：

csdn.conf

input{  
    stdin{
       
    }
    jdbc{  
        jdbc_connection_string => "jdbc:mysql://localhost:3306/estest"  # jdbc数据库连接
        jdbc_user => "root"  
        jdbc_password => "数据库密码数据库密码数据库密码数据库密码"  
        jdbc_driver_library => ""  # postgresql jar包文件路径
        jdbc_driver_class => "com.mysql.jdbc.Driver"  # jdbc数据库驱动
        jdbc_paging_enabled => "true"  
        jdbc_page_size => "50000"  # 同步数据分页设置
        lowercase_column_names => "false"  #是否将 column 名称转小写，如果不写，默认将字段转成小写，false则不转小写
        #use_column_value => false  #使用其它字段追踪，而不是用时间
        #tracking_column => "UPDATETIME"
        #statement_filepath => "/root/opt/logstash/bin/logstash_jdbc_test/jdbc.sql"  # 需同步的数据执行的SQL文件路径
        statement => '  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ' # companyName起别名name
        schedule => "* * * * *"   # 各字段含义（由左至右）分、时、天、月、年，全部为*默认含义为每分钟都更新
        type => "postgis_test"
    }
}

filter {
    mutate {
       # add_field => {"DATATYPE"=>"PT"}  
       # add_field => {"LAYER"=>"postgis_test"}  
       # add_field => {"LOCATION" => "%{Y},%{X}"}
       remove_field => "@version"
       remove_field => "@timestamp"   
    }    
}

output {
    #stdout{
        #codec => json_lines
    #}
    if [type] == "postgis_test"{
        elasticsearch{  
            hosts => ["http://localhost:9200"]  # ES连接
            index => "postgis_test"      # ES索引名称
            document_id => "%{DZZYID}"     #_id字段与 DZZYID 字段值一样，以便按照ID查询，_id是ES自带的字段，不能在input.statement中输入
            #document_type => "doc"
            manage_template => true
            template_overwrite => true
            template_name => "postgis_test"
            template => "./postgis_test.json"
            codec => json_lines
        } 
    }
}

postgis_test.json

{
    "template": "postgis_test",
    "mappings": {
        "doc": {
            "properties": {
                "DZZYID": {
                    "type": "keyword"
                },
                "DZZYMC_PINYIN": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "analyzer":"ik_pinyin_analyzer",
                    "search_analyzer": "ik_pinyin_analyzer" 
                }
            }
        }
    },
    "settings" : {
        "analysis":{
            "analyzer":{
                "ik_pinyin_analyzer":{
                    "type":"custom",
                    "tokenizer":"ik_max_word",
                    "filter": "pinyin_filter"
                }
            },
            "filter":{
                "pinyin_filter":{
                    "type" : "pinyin",
                    "keep_separate_first_letter" : "false",
                    "keep_full_pinyin" : "true",
                    "keep_original" : "true",
                    "limit_first_letter_length" : "16",
                    "lowercase" : "true",
                    "remove_duplicated_term" : "true"
                }
            }
        }
    }
}

这种方法开始导入后，可通过pretty查看当前使用的mapping策略，可以看到ik_pinyin_analyzer被正确使用：

    @staticmethod
    def pretty(indexName):
        url = "http://localhost:9200/" + str(indexName)  + '?pretty'
        payload={}
        headers = {
            'Content-Type': 'application/json'
        }
        response = requests.request("GET", url, headers=headers, data=json.dumps(payload))        

        print("do pretty func: \n", json.loads(response.text))

ElasticsearchController.pretty(indexName)
do pretty func: 
 {'postgis_test': {'aliases': {}, 'mappings': {'doc': {'properties': {'DZZYID': {'type': 'keyword'}, 'DZZYMC_PINYIN': {'type': 'text', 'term_vector': 'with_positions_offsets', 'analyzer': 'ik_pinyin_analyzer'}, 'type': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}, 'settings': {'index': {'number_of_shards': '5', 'provided_name': 'postgis_test', 'creation_date': '1608016087623', 'analysis': {'filter': {'pinyin_filter': {'lowercase': 'true', 'keep_original': 'true', 'remove_duplicated_term': 'true', 'keep_separate_first_letter': 'false', 'type': 'pinyin', 'limit_first_letter_length': '16', 'keep_full_pinyin': 'true'}}, 'analyzer': {'ik_pinyin_analyzer': {'filter': 'pinyin_filter', 'type': 'custom', 'tokenizer': 'ik_max_word'}}}, 'number_of_replicas': '1', 'uuid': 'y1CqPflbTYWl45BDSXQ8Mw', 'version': {'created': '6030099'}}}}}

使用该种方法把sql数据导入到elasticsearch中，导入速度可超过1000条/s（短文本），数据库示例如下：

导入过程中，可以通过_count函数查看导入的数据计数count

    def countIndex(indexName):
        url = "http://localhost:9200/" + str(indexName) + '/_count'

        payload={}
        headers = {
            'Content-Type': 'application/json'
        }
        response = requests.request("GET", url, headers=headers, data=json.dumps(payload))

        print("do countIndex func: \n", json.loads(response.text))

ElasticsearchController.countIndex(indexName='postgis_test')

>>>do countIndex func: 
 {'count': 423250, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}}

导入速度log：

[2020-12-15T16:22:03,627][INFO ][logstash.inputs.jdbc     ] (0.124891s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 0
[2020-12-15T16:22:34,442][INFO ][logstash.inputs.jdbc     ] (0.125908s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 50000
[2020-12-15T16:23:04,448][INFO ][logstash.inputs.jdbc     ] (0.106033s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 100000
[2020-12-15T16:23:32,037][INFO ][logstash.inputs.jdbc     ] (0.140638s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 150000
[2020-12-15T16:24:01,441][INFO ][logstash.inputs.jdbc     ] (0.110575s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 200000
[2020-12-15T16:24:34,641][INFO ][logstash.inputs.jdbc     ] (0.345005s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 250000
[2020-12-15T16:25:04,252][INFO ][logstash.inputs.jdbc     ] (0.122336s) SELECT * FROM (  select gid as DZZYID, companyName as DZZYMC_PINYIN from CompanyName ) AS `t1` LIMIT 50000 OFFSET 300000

进行analyze，查询尝试，并对下一步内容进行规划

遗留问题1：查询过程中在template中配置的“search_analyzer”参数的作用是什么

在一些方法中看到在查询过程中设置了search_analyzer，该参数是否有作用，作用是什么？

遗留问题2：为什么查询结果非常差

在目前的pinyin+ik配置下，需要进一步了解elasticsearch的查询过程，通过学习不同查询模式，不同配置方式，进行查询过程的了解，例如search和searchByPhrase的区别

普通方法：

    @staticmethod
    def search(indexName, searchPayload):
        url = "http://localhost:9200/" + str(indexName) + '/_search?' + "q=companyname:" + searchPayload
        # 查询模式
        payload={}
        headers = {
            'Content-Type': 'application/json'
        }
        response = requests.request("GET", url, headers=headers, data=json.dumps(payload))

        print("do search func: \n", json.loads(response.text))


    searchPayload = '阿里' # ik + pinyin
    tupleStartTime = datetime.datetime.now().microsecond # 时间元组
    timestampStartTime = time.mktime(datetime.datetime.now().timetuple()) # 时间元组转时间戳
    
    ElasticsearchController.search(indexName, searchPayload)

    tupleEndTime = datetime.datetime.now().microsecond
    timestampEndTime = time.mktime(datetime.datetime.now().timetuple())
    strTime = 'do function time use:%dms' % ((timestampEndTime - timestampStartTime) * 1000 + (tupleEndTime - tupleStartTime) / 1000)
    print(strTime)

通过普通方法查询“阿里”，无法匹配到结果，不符合预期

do search func:
{'took': 33, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
do function time use:98ms

查询“albb”，无法匹配到结果，不符合预期

do search func:
{'took': 26, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
do function time use:66ms

查询“阿里巴巴”（中间有个空格）：

do search func:
{'took': 44, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 8117, 'max_score': 8.404073, 'hits': [{'_index': 'postgis_test', '_type': 'doc', '_id': '1114', '_score': 8.404073, '_source': {'DZZYID': 1114, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴国际站公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '388946', '_score': 8.376436, '_source': {'DZZYID': 388946, 'type': 'postgis_test', 'DZZYMC_PINYIN': '成都阿里巴巴游戏公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1100', '_score': 8.339975, '_source': {'DZZYID': 1100, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1116', '_score': 8.338591, '_source': {'DZZYID': 1116, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴国际站商贸公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1126', '_score': 8.322807, '_source': {'DZZYID': 1126, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴合伙公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1225', '_score': 8.322807, '_source': {'DZZYID': 1225, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴装饰公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1211', '_score': 8.312388, '_source': {'DZZYID': 1211, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴运输公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1106', '_score': 8.312388, '_source': {'DZZYID': 1106, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴广州公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1110', '_score': 8.295631, '_source': {'DZZYID': 1110, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴国际店装修公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '1102', '_score': 8.280003, '_source': {'DZZYID': 1102, 'type': 'postgis_test', 'DZZYMC_PINYIN': '阿里巴巴管理公司'}}]}}
do function time use:72ms

查询“al bb”，虽然能匹配到，证明了ik pinyin分词器的成功作用，但是在匹配过程中发现准确率非常低，在这里需要进一步扩展：

do search func:
{'took': 39, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 46084, 'max_score': 4.3271093, 'hits': [{'_index': 'postgis_test', '_type': 'doc', '_id': '100907', '_score': 4.3271093, '_source': {'DZZYID': 100907, 'type': 'postgis_test', 'DZZYMC_PINYIN': '百宝公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '108108', '_score': 4.1653986, '_source': {'DZZYID': 108108, 'type': 'postgis_test', 'DZZYMC_PINYIN': '帮别公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '136254', '_score': 4.1653986, '_source': {'DZZYID': 136254, 'type': 'postgis_test', 'DZZYMC_PINYIN': '北碚公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '99371', '_score': 4.150325, '_source': {'DZZYID': 99371, 'type': 'postgis_test', 'DZZYMC_PINYIN': '白百北食品商贸公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '120584', '_score': 4.147024, '_source': {'DZZYID': 120584, 'type': 'postgis_test', 'DZZYMC_PINYIN': '保柏公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '100938', '_score': 4.1310782, '_source': {'DZZYID': 100938, 'type': 'postgis_test', 'DZZYMC_PINYIN': '百部湾航空公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '97385', '_score': 4.1031322, '_source': {'DZZYID': 97385, 'type': 'postgis_test', 'DZZYMC_PINYIN': '巴宝莉公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '136247', '_score': 4.098734, '_source': {'DZZYID': 136247, 'type': 'postgis_test', 'DZZYMC_PINYIN': '北碚保洁公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '136300', '_score': 4.098734, '_source': {'DZZYID': 136300, 'type': 'postgis_test', 'DZZYMC_PINYIN': '北碚装饰公司'}}, {'_index': 'postgis_test', '_type': 'doc', '_id': '308280', '_score': 4.098734, '_source': {'DZZYID': 308280, 'type': 'postgis_test', 'DZZYMC_PINYIN': '毕博咨询公司'}}]}}
do function time use:101ms

2020.12.16更新：注意在conf文件和template.json文件的配置过程中可能将原有的companyname字段已经替换为其它字段，需要使用替换后的字段进行查询

问题解决

加载个人设置的template不成功，报错：[logstash.outputs.elasticsearch] Failed to install template.

直接插入模板尝试复现该问题

curl -H "Content-Type: application/json" -XPUT localhost:9200/_template/template_1 -d '
{
    "template": "postgis_test",
    "mappings": {
        "doc": {
            "properties": {
                "DZZYID": {
                    "type": "keyword"
                },
                "DZZYMC_PINYIN": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "pinyin"
                }
            }
        }
    }
}
'

再进一步解决该报错问题后，在curl方法下可以PUT成功，但是当执行文件时，仍然报错，暂未找到进一步的解决办法。

在各种尝试下，找到这篇博客https://blog.csdn.net/u014646662/article/details/94718834中的一些说法，主要是elasticsearch7.x版本一些配置文件写法已经进行了修改，而目前网上教程大部分是针对5.x版本和6.x版本的，所以在真正的使用场景上可能需要使用更加稳定的6.x版本。

在使用6.x版本后，解决了加载个人设置template不成功的问题。

elasticsearch6.3.0版本不受信任的开发者

使用该篇博客的方法解决了这个问题：https://blog.csdn.net/qq_42797418/article/details/107838127?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2.control

terminal输入代码，其中xxx.app可以是不受信任文件的绝对路径（拖拽进来）：

sudo xattr -r -d com.apple.quarantine /XXX/XXX.app/

在这里插入图片描述