Distributed Search Engine Elasticsearch: Kafka Data Synchronization Plugin

jackchen10

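A river represents a data source for Elasticsearch (ES), and is also a way to synchronize data from other storage systems (such as databases) into ES. It is an ES service that runs as a plugin, reading data from the river's source and indexing it into ES. Official rivers exist for CouchDB, RabbitMQ, Twitter, and Wikipedia. For an introduction to Kafka, see the previous article.

1. Open-source plugin: elasticsearch-river-kafka

Installation and usage of the plugin are described in detail on GitHub (https://github.com/endgameinc/elasticsearch-river-kafka). The point worth noting here is that the plugin imposes a strict format on the data stored in Kafka: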

 
    {
      "index": "example_index",
      "type": "example_type",
      "id": "asdkljflkasjdfasdfasdf",
      "source": { ..... }
    }

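Here, index is the index name, type is the index type, id is the ID of the record, and source is the data itself. Our news data in Kafka, however, has the following format: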

 
    {
      "id": "asdkljflkasjdfasdfasdf",
      "site_id": 100,
      "title": "hello word！",
      "media_type": 1
    }

The plugin source therefore has to be modified to fit our data.
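To make the mismatch concrete, here is a minimal standalone sketch of the conversion the modified plugin has to perform: wrapping a news message in the index/type/id/source structure the plugin expects. It uses only java.util maps. The class name NewsEnvelope and the method toEnvelope are illustrative assumptions and are not part of elasticsearch-river-kafka.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of elasticsearch-river-kafka. */
class NewsEnvelope {

    /**
     * Wraps a news message in the index/type/id/source structure.
     * The "id" field becomes the document id; all other fields go into "source".
     */
    static Map<String, Object> toEnvelope(Map<String, Object> news, String index, String type) {
        // Copy first so the caller's map is left untouched.
        Map<String, Object> source = new LinkedHashMap<String, Object>(news);
        Object id = source.remove("id");

        Map<String, Object> envelope = new LinkedHashMap<String, Object>();
        envelope.put("index", index);
        envelope.put("type", type);
        envelope.put("id", id);
        envelope.put("source", source);
        return envelope;
    }

    public static void main(String[] args) {
        Map<String, Object> news = new LinkedHashMap<String, Object>();
        news.put("id", "asdkljflkasjdfasdfasdf");
        news.put("site_id", 100);
        news.put("title", "hello word！");
        news.put("media_type", 1);

        // The index and type values here are placeholders chosen for the demo.
        System.out.println(toEnvelope(news, "test", "news"));
    }
}
```

In the plugin itself this logic is split across the handler's getIndex, getType, and getSource methods rather than living in one helper.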

2. Setting up the project

git clone https://github.com/endgameinc/elasticsearch-river-kafka.git

Open the project in Eclipse.

3. Custom MessageHandler

Implement the getSource method:

 
   
 
 
  
    protected Map<String, Object> getSource() {
        Map<String, Object> src = new HashMap<String, Object>();
        try {
            // Copy only the fields that belong in the indexed document body.
            src.put("site_id", messageMap.get("site_id"));
            src.put("title", messageMap.get("title"));
            src.put("media_type", messageMap.get("media_type"));
        } catch (Exception e) {
            logger.warn("failed to parse source, msg=" + messageMap.toString(), e);
        }
        return src;
    }
       
 
 

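Implement the getIndex and getType methods:

Our data contains neither index nor type, so both have to be loaded from configuration. Add a module to the configuration file (the "news" block that appears in the river definition in section 5).

MessageHandlerFactory now needs access to the configuration, so change the MessageHandler constructor and the MessageHandlerFactory interface to take a settings parameter. For example, in MessageHandlerFactory:

    public MessageHandler createMessageHandler(Client client, RiverSettings settings) throws Exception;

With this change, NewsJsonMessageHandler can read the configuration parameters: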

 
   
 
 
  
    private Map<String, Object> newsSettings;

    public NewsJsonMessageHandler(Client client, RiverSettings settings) {
        this.client = client;
        newsSettings = (Map<String, Object>) settings.settings().get("news");
        logger.info("news settings: " + newsSettings.toString());
    }
       
 
 

The getIndex and getType methods are then:

 
    protected String getIndex() {
        return (String) newsSettings.get("index");
    }
    protected String getType() {
        return (String) newsSettings.get("type");
    }
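With the code above, a river definition that leaves "index" or "type" out of the "news" block makes getIndex or getType return null, and a definition with no "news" block at all throws a NullPointerException in the constructor. As a hedged sketch of failing early with a clearer message, the standalone class below validates the block up front. It takes a plain Map in place of the map returned by settings.settings(); the class NewsSettings and its method names are hypothetical and not part of the plugin.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that validates the "news" block of a river definition. */
class NewsSettings {
    private final String index;
    private final String type;

    NewsSettings(Map<String, Object> riverSettings) {
        Object block = riverSettings.get("news");
        if (!(block instanceof Map)) {
            throw new IllegalArgumentException("river settings must contain a \"news\" object");
        }
        Map<?, ?> news = (Map<?, ?>) block;
        this.index = requireString(news, "index");
        this.type = requireString(news, "type");
    }

    /** Returns the value under key, or throws if it is missing, empty, or not a string. */
    private static String requireString(Map<?, ?> news, String key) {
        Object value = news.get(key);
        if (!(value instanceof String) || ((String) value).isEmpty()) {
            throw new IllegalArgumentException("news." + key + " must be a non-empty string");
        }
        return (String) value;
    }

    String getIndex() { return index; }

    String getType() { return type; }

    public static void main(String[] args) {
        Map<String, Object> news = new HashMap<String, Object>();
        news.put("index", "test");
        news.put("type", "news");
        Map<String, Object> river = new HashMap<String, Object>();
        river.put("news", news);

        NewsSettings settings = new NewsSettings(river);
        System.out.println(settings.getIndex() + "/" + settings.getType());
    }
}
```

A handler could build such an object once in its constructor and delegate getIndex and getType to it, so a misconfigured river is rejected when it is created rather than when the first message is indexed.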

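4. Kafka build version issue

elasticsearch-river-kafka is compiled with Java 1.7 and has to be changed to 1.6. The kafka-0.7.2.jar that mvn pulls in by default is also compiled with Java 1.7, so it must be replaced with a build we compiled ourselves with Java 1.6.

5. The final command for adding the sync task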
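One way to confirm which Java version a class was built for is to read the class-file header: after the magic number 0xCAFEBABE come a two-byte minor version and a two-byte major version, where major 50 corresponds to Java 6 and 51 to Java 7. The sketch below reads that header from any class-file stream. The class name ClassVersion is an illustrative assumption, and the bytes in main are a hand-built header rather than a real class.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical utility for checking which Java version a class was compiled for. */
class ClassVersion {

    /** Returns the class-file major version (50 = Java 6, 51 = Java 7). */
    static int majorVersion(InputStream classBytes) throws IOException {
        DataInputStream in = new DataInputStream(classBytes);
        if (in.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file: bad magic number");
        }
        in.readUnsignedShort();        // minor version, not needed here
        return in.readUnsignedShort(); // major version
    }

    public static void main(String[] args) throws IOException {
        // Hand-built header: magic number, minor 0, major 51 (0x33).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x33};
        System.out.println(majorVersion(new ByteArrayInputStream(header))); // prints 51
    }
}
```

In practice the stream would come from a class entry inside the jar being checked, for example one opened through java.util.jar.JarFile.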

 
    curl -XPUT 'local:9200/_river/news_kafka_river_0/_meta' -d '{
        "type" : "kafka",
        "kafka" : {
            "broker_host" : "mota32",
            "message_handler_factory_class" : "com.weidou.elasticsearch.river.NewsJsonMessageHandlerFactory",
            "zookeeper" : "mota32",
            "topic" : "es-test1",
            "partition" : "0",
            "broker_port" : 9092
        },
        "index" : {
            "bulk_size_bytes" : 10000000,
            "bulk_timeout" : "1000ms"
        },
        "statsd" : {
            "prefix" : "es-kafka-river",
            "host" : "mota33",
            "port" : "8125"
        },
        "news" : {
            "index" : "test",
            "type" : "news"
        }
    }'

6. Removing and installing elasticsearch-river-kafka

 
    ./plugin -remove elasticsearch-river-kafka
    ./plugin -url http://www.xxoo.com/static/elasticsearch-river-kafka-1.0.2-SNAPSHOT.zip -install elasticsearch-river-kafka

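Placing elasticsearch-river-kafka-1.0.2-SNAPSHOT.zip among the static files of an nginx server makes it easy to share the plugin within the team. You can also use the local-file installation method described on the plugin's GitHub page.

References: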

elasticsearch-river-kafka https://github.com/endgameinc/elasticsearch-river-kafka
