Building a Back-End Chinese Word-Segmentation Search Engine with Elasticsearch 2.3.4 + Elasticsearch-jdbc 2.3.4.1 + analysis-ik 1.9.4

Article link: https://blog.csdn.net/stonesola/article/details/68489172


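Preface

Elasticsearch (ES) is a search engine built on Lucene. It excels at full-text search, stores data as documents, and is designed for distributed deployment. It has no built-in support for searching data that lives in a MySQL database, however, so a separate tool is needed to bridge Elasticsearch and MySQL.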

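The other problem to solve is getting incremental database changes into ES promptly. The main options at present are:

elasticsearch-jdbc: actively maintained; the final choice
logstash-input-jdbc: based on JRuby; officially recommended
elasticsearch-river-jdbc: no longer updated
go-mysql-elastic: relatively new; worth a try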

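I tried all of these. elasticsearch-jdbc and logstash-input-jdbc are the two I would recommend, but for some reason I could not get logstash-input-jdbc to work, which is a pity. After hitting quite a few pitfalls, I settled on the relatively mature elasticsearch-jdbc.

Chinese word segmentation uses ik. The highest Elasticsearch version that elasticsearch-jdbc supports is 2.3.4, so ik 1.9.4 was chosen to match.

Getting the versions right is important!

The examples use an Ubuntu 14.04 server. The install directory is $HOME.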

Setting up ES

Download the 2.3.4 zip file from the official site: elasticsearch-2.3.4.zip.

Unzip it: $ unzip elasticsearch-2.3.4.zip

cd into the extracted ES directory and edit config/elasticsearch.yml to set the cluster name and host:

$ cd elasticsearch-2.3.4

$ nano ./config/elasticsearch.yml

cluster.name: your-cluster-name
network.host: 0.0.0.0

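Setting network.host to 0.0.0.0 binds ES to all interfaces so that it can be reached from outside the server.

Start elasticsearch and verify that the installation succeeded.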

./bin/elasticsearch
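
One way to confirm the node is up is to request the root endpoint on port 9200 and check the cluster name and version it reports. The sketch below is a minimal Python illustration, not part of the original setup. The helper names fetch_root and check_root are invented for this example, and the sample payload only mimics the shape of the root response with placeholder values.

```python
import json
from urllib.request import urlopen


def fetch_root(base_url="http://localhost:9200", timeout=5):
    """Fetch and decode the JSON document served at the ES root endpoint."""
    with urlopen(base_url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_root(payload, expected_cluster, expected_version="2.3.4"):
    """Return a list of problems; an empty list means the node looks right."""
    problems = []
    if payload.get("cluster_name") != expected_cluster:
        problems.append("unexpected cluster_name: %r" % payload.get("cluster_name"))
    number = payload.get("version", {}).get("number")
    if number != expected_version:
        problems.append("unexpected version: %r" % number)
    return problems


# Placeholder payload shaped like the root response (values are illustrative).
sample = {"cluster_name": "your-cluster-name", "version": {"number": "2.3.4"}}
problems = check_root(sample, "your-cluster-name")
```

Against a running node, check_root(fetch_root(), "your-cluster-name") would be expected to return an empty list if the cluster name from elasticsearch.yml took effect.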

Stopping the service and starting it in the background (restart):

$ ps -ef | grep elasticsearch
master_+  1547     1  0 01:18 ?        00:01:23 /usr/lib/jvm/java-8-oracle/bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/home/master_user/elasticsearch-2.3.4 -cp /home/master_user/elasticsearch-2.3.4/lib/elasticsearch-2.3.4.jar:/home/master_user/elasticsearch-2.3.4/lib/* org.elasticsearch.bootstrap.Elasticsearch start 
$ sudo kill -9 1547
$ elasticsearch-2.3.4/bin/elasticsearch -d

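On ES 1.x a node could also be stopped through the REST API. That _shutdown endpoint was removed in ES 2.0, so with 2.3.4 stop the process with kill as shown above. The 1.x call, for reference:

$ curl -XPOST http://your-host-ip:9200/_cluster/nodes/_shutdown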

Installing ik

Download the prebuilt package directly:

$ wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.4/elasticsearch-analysis-ik-1.9.4.zip

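Unzip the downloaded zip into an ik folder and copy that folder into the ES plugins directory. The zip sits in the parent directory, so unzip has to reference it as ../:

$ mkdir -p elasticsearch-2.3.4/plugins
$ mkdir ik && cd ik
$ unzip ../elasticsearch-analysis-ik-1.9.4.zip
$ cd ../
$ cp -R ik elasticsearch-2.3.4/plugins/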

Edit elasticsearch.yml and add the ik setting:

$ nano elasticsearch-2.3.4/config/elasticsearch.yml
index.analysis.analyzer.ik.type: "ik"

Save the file, then restart elasticsearch.

Verify:

Open http://your-host-ip:9200/index/_analyze?analyzer=ik&pretty=true&text=我的魔法会把你撕成碎片
{
  "tokens" : [ {
    "token" : "我",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_CHAR",
    "position" : 0
  }, {
    "token" : "魔法",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "法会",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "会把",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "你",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 4
  }, {
    "token" : "撕成",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "碎片",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "碎",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "片",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "CN_CHAR",
    "position" : 8
  } ]
}
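
When the same request is sent from a script instead of a browser, the Chinese text in the query string should be percent-encoded as UTF-8. The following is a small Python sketch of building such a URL. The function name analyze_url and the host localhost are assumptions for illustration; the path and parameters are the ones used in the request above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def analyze_url(host, index, analyzer, text, port=9200):
    """Build an _analyze URL with the query string percent-encoded as UTF-8."""
    query = urlencode({"analyzer": analyzer, "pretty": "true", "text": text})
    return "http://%s:%d/%s/_analyze?%s" % (host, port, index, query)


url = analyze_url("localhost", "index", "ik", "我的魔法会把你撕成碎片")
# Decoding the query string again recovers the original text.
decoded = parse_qs(urlsplit(url).query)["text"][0]
```

The resulting URL can be passed to curl or urllib as-is, which avoids depending on the terminal's encoding.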

With ik_smart the result looks like this:

Open in a browser: http://your-hostname:9200/index/_analyze?analyzer=ik_smart&text=我的魔法会把你撕成碎片
{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "魔法",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "会把",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "你",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "撕成",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "碎片",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}
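
To see exactly what the two analyzers do differently on this sentence, the two token lists can be compared. The Python sketch below uses the token strings from the two responses above, trimmed to the "token" key only. The helper names token_texts and dropped_by_smart are invented for this illustration.

```python
def token_texts(response):
    """Return the token strings of an _analyze response, in order."""
    return [tok["token"] for tok in response["tokens"]]


def dropped_by_smart(ik_response, smart_response):
    """List tokens the ik analyzer emits that ik_smart leaves out."""
    kept = set(token_texts(smart_response))
    return [t for t in token_texts(ik_response) if t not in kept]


# Token strings copied from the two responses above (other keys omitted).
ik_response = {"tokens": [{"token": t} for t in
               ["我", "魔法", "法会", "会把", "你", "撕成", "碎片", "碎", "片"]]}
smart_response = {"tokens": [{"token": t} for t in
                  ["我", "魔法", "会把", "你", "撕成", "碎片"]]}

dropped = dropped_by_smart(ik_response, smart_response)
# dropped == ["法会", "碎", "片"]
```

For this input, ik emits overlapping candidates such as 法会 and the single characters 碎 and 片, while ik_smart keeps one non-overlapping segmentation.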

Installing the jdbc plugin and returning search data

Download plugin v2.3.4.1 and unzip it:

$ wget http://xbib.org/repository/org/xbib/elasticsearch/importer/elasticsearch-jdbc/2.3.4.1/elasticsearch-jdbc-2.3.4.1-dist.zip
$ unzip elasticsearch-jdbc-2.3.4.1-dist.zip
$ mv elasticsearch-jdbc-2.3.4.1-dist jdbc

For convenience, set the ES and jdbc paths as environment variables:

$ sudo nano /etc/profile
export JDBC_HOME=/home/master_user/jdbc
export ES_HOME=/home/master_user/elasticsearch-2.3.4

$ source /etc/profile

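Installing jdbc is fairly simple: download the package and it is ready to use. Create a shell folder to hold the script that connects jdbc to ES and MySQL, along with the files it generates.

The shell folder lives under your home directory (~). Once jdbc has connected successfully, it looks like this: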

$ tree shell/
shell/
├── logs
│   └── jdbc.log
├── statefile-tag.json
└── up-tag.sh

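Next, create the directories and the script file inside shell:

$ mkdir -p shell/logs
$ touch shell/up-tag.sh
$ nano shell/up-tag.sh

Write the connection script up-tag.sh (based on the official documentation; P.S. do not trust it blindly):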

#!/bin/sh
bin=$JDBC_HOME/bin
lib=$JDBC_HOME/lib
echo '{
    "type":"jdbc",
    "jdbc":{
        "url":"jdbc:mysql://localhost:3306/your_database",
        "user":"user",
        "password":"password",
        "cluster": "your-cluster-name",
        "statefile" : "statefile-tag.json",
        "schedule" : "0/10 0-59 0-23 ? * *",
        "sql" : [
            {
                "statement" : "select *,id as _id from club_tag where update_time > ?",
                "parameter" : [ "$metrics.lastexecutionstart" ]
            }
        ],
        "elasticsearch" : {
            "cluster" : "your-cluster-name",
            "host" : "0.0.0.0",
            "port" : 9300
        },
        "index":"club",
        "type":"tag",
        "index_settings" : {
            "analysis" : {
                "analyzer" : {
                    "ik" : {
                        "tokenizer" : "ik"
                    }
                }
            }
        },
        "type_mapping" :{
            "tag": {
                "properties": {
                    "id":{
                        "type":"integer",
                        "index":"not_analyzed"
                    },
                    "tag_name":{
                        "type":"string",
                        "analyzer" : "ik"
                    },
                    "update_time" : {
                        "type" : "date"
                    }
                }
            }
        }
    }
}' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter
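
Hand-editing a large JSON document inside a shell echo is easy to get wrong. As an alternative, the definition can be assembled as a data structure and serialized. The Python sketch below is only an illustration: build_definition is a hypothetical helper, it covers a subset of the keys from the script above (index_settings and type_mapping are left out for brevity), and the values are the placeholders used in that script.

```python
import json


def build_definition(db_url, user, password, cluster, index, doc_type, statement):
    """Assemble the importer definition as a dict so json.dumps handles quoting."""
    return {
        "type": "jdbc",
        "jdbc": {
            "url": db_url,
            "user": user,
            "password": password,
            "statefile": "statefile-%s.json" % doc_type,
            "schedule": "0/10 0-59 0-23 ? * *",
            "sql": [{"statement": statement,
                     "parameter": ["$metrics.lastexecutionstart"]}],
            "elasticsearch": {"cluster": cluster, "host": "0.0.0.0", "port": 9300},
            "index": index,
            "type": doc_type,
        },
    }


definition = build_definition(
    "jdbc:mysql://localhost:3306/your_database", "user", "password",
    "your-cluster-name", "club", "tag",
    "select *,id as _id from club_tag where update_time > ?",
)
definition_json = json.dumps(definition, indent=4)
```

The string in definition_json could then be piped to the same java command in place of the echo output.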

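When up-tag.sh runs successfully, statefile-tag.json and jdbc.log are generated automatically. The setting "statefile" : "statefile-tag.json" controls the name of the generated state file.

The connection process works like this: up-tag.sh starts, the generated statefile records the database connection settings and the field mapping, and ES creates the index from the configuration in the statefile as the import runs.

To keep incremental database changes syncing automatically, the queried table needs a timestamp column, such as "update_time" in the example.

"parameter" : [ "$metrics.lastexecutionstart" ] supplies the timestamp at which the SQL statement last started executing.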
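
The timestamp-based incremental query can be illustrated with a self-contained simulation. This Python sketch is not how elasticsearch-jdbc is implemented; it stands in an in-memory SQLite table for MySQL, a plain dict for the ES index, and integer timestamps for real dates, all of which are assumptions made to keep the example runnable. The function name sync_once is invented.

```python
import sqlite3


def sync_once(conn, index, last_start, now):
    """Copy rows changed after last_start into index.

    Returns (rows_copied, new_watermark); the watermark plays the role of
    the last execution start time passed to the next run.
    """
    rows = conn.execute(
        "SELECT id, tag_name, update_time FROM club_tag WHERE update_time > ?",
        (last_start,),
    ).fetchall()
    for row_id, tag_name, update_time in rows:
        # Using the row id as the document id makes re-imports overwrite.
        index[row_id] = {"id": row_id, "tag_name": tag_name,
                         "update_time": update_time}
    return len(rows), now


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE club_tag "
             "(id INTEGER PRIMARY KEY, tag_name TEXT, update_time INTEGER)")
conn.execute("INSERT INTO club_tag VALUES (1, 'first', 100)")

index = {}
copied_1, watermark = sync_once(conn, index, last_start=0, now=150)

# A new row and an edit to an existing row both get a fresh update_time.
conn.execute("INSERT INTO club_tag VALUES (2, 'second', 200)")
conn.execute("UPDATE club_tag SET tag_name = 'first-edited', "
             "update_time = 210 WHERE id = 1")
copied_2, watermark = sync_once(conn, index, last_start=watermark, now=250)

# Nothing changed since the last run, so nothing is copied.
copied_3, watermark = sync_once(conn, index, last_start=watermark, now=300)
```

The sketch also shows a limit of this pattern: a row deleted from the table never matches the query, so the corresponding document would stay in the index.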

schedule: the scheduled-task timetable, i.e. the interval between updates.

0 0-59 0-23 ? * *: runs once a minute

0/10 0-59 0-23 ? * *: runs every 10 seconds

Field Name	Allowed Values	Allowed Special Characters
Seconds	0-59	, - * /
Minutes	0-59	, - * /
Hours	0-23	, - * /
Day-of-month	1-31	, - * ? / L W
Month	1-12 or JAN-DEC	, - * /
Day-of-Week	1-7 or SUN-SAT	, - * ? / L #
Year (Optional)	empty, 1970-2199	, - * /

This is the standard given in the official documentation; see it for details.
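
To make the two schedule examples concrete, the sketch below expands a single numeric cron field into the values at which it fires. It is a simplified Python illustration written for this article, not the scheduler's own parser: expand_field is a hypothetical helper that handles only numbers, ranges, lists, * and / steps, and ignores ?, L, W, # and month or weekday names.

```python
def expand_field(expr, low, high):
    """Expand a numeric cron field such as '0/10' or '0-59' into sorted values."""
    values = set()
    for part in expr.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = int(base)
            # 'a/b' means "from a, every b, up to the field maximum".
            end = high if "/" in part else start
        values.update(range(start, end + 1, step))
    return sorted(values)


every_ten_seconds = expand_field("0/10", 0, 59)  # [0, 10, 20, 30, 40, 50]
once_a_minute = expand_field("0", 0, 59)         # [0]
all_minutes = expand_field("0-59", 0, 59)        # 0 through 59
```

With the seconds field set to 0/10 and the minutes and hours fields covering their full ranges, the job fires six times per minute, which is the 10-second interval described above.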

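Once the script is written, run shell/up-tag.sh. To run it in the background, make it executable and start it with nohup:

$ cd shell
$ chmod +x up-tag.sh
$ nohup ./up-tag.sh &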

Final verification:

$  curl -XGET 'http://localhost:9200/club/tag/_search?pretty' -d '
> {
>     "query" : { "match" :  { "tag_name": "我的魔法" } },
>     "highlight" : {
>         "pre_tags" : ["<tag1>", "<tag2>"],
>         "post_tags" : ["</tag1>", "</tag2>"],
>         "fields" : {
>             "tag_name" : {}
>         }
>     }
> }
> '
{
  "took" : 714,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.44194174,
    "hits" : [ {
      "_index" : "club",
      "_type" : "tag",
      "_id" : "1",
      "_score" : 0.44194174,
      "_source" : {
        "id" : 1,
        "tag_name" : "我的魔法会把你撕成碎片",
        "update_time" : "2017-03-30T02:40:49.000+08:00"
      },
      "highlight" : {
        "tag_name" : [ "<tag1>我</tag1>的<tag1>魔法</tag1>会把你撕成碎片" ]
      }
    }, {
      "_index" : "club",
      "_type" : "tag",
      "_id" : "4",
      "_score" : 0.16692422,
      "_source" : {
        "id" : 4,
        "tag_name" : "嘿嘿嘿，我马上照办",
        "update_time" : "2017-03-30T02:40:38.000+08:00"
      },
      "highlight" : {
        "tag_name" : [ "嘿嘿嘿，<tag1>我</tag1>马上照办" ]
      }
    }, {
      "_index" : "club",
      "_type" : "tag",
      "_id" : "2",
      "_score" : 0.13353938,
      "_source" : {
        "id" : 2,
        "tag_name" : "让我们来一场魔法盛宴吧",
        "update_time" : "2017-03-30T02:40:45.000+08:00"
      },
      "highlight" : {
        "tag_name" : [ "让我们来一场<tag1>魔法</tag1>盛宴吧" ]
      }
    }, {
      "_index" : "club",
      "_type" : "tag",
      "_id" : "3",
      "_score" : 0.016878016,
      "_source" : {
        "id" : 3,
        "tag_name" : "你的魔法也救不了你",
        "update_time" : "2017-03-30T02:40:41.000+08:00"
      },
      "highlight" : {
        "tag_name" : [ "你的<tag1>魔法</tag1>也救不了你" ]
      }
    } ]
  }
}
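
When the response is consumed by back-end code, the highlight fragments show which analyzed terms matched in each document. The Python sketch below pulls them out of a response of the shape shown above. The helper names matched_terms and summarize_hits are invented for illustration, and the sample is an abbreviated copy of two hits from that output.

```python
import re


def matched_terms(fragment, pre="<tag1>", post="</tag1>"):
    """Return the substrings wrapped in the highlight tags."""
    pattern = re.escape(pre) + "(.*?)" + re.escape(post)
    return re.findall(pattern, fragment)


def summarize_hits(response, field):
    """Return (document id, matched terms) for every hit in the response."""
    summary = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        terms = [t for frag in fragments for t in matched_terms(frag)]
        summary.append((hit["_id"], terms))
    return summary


# Abbreviated copy of two hits from the response above.
sample = {"hits": {"hits": [
    {"_id": "1", "highlight": {"tag_name": [
        "<tag1>我</tag1>的<tag1>魔法</tag1>会把你撕成碎片"]}},
    {"_id": "3", "highlight": {"tag_name": [
        "你的<tag1>魔法</tag1>也救不了你"]}},
]}}

summary = summarize_hits(sample, "tag_name")
# summary == [("1", ["我", "魔法"]), ("3", ["魔法"])]
```

Document 1 matches both 我 and 魔法 from the query, which is consistent with it receiving the highest score in the output above.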