搜索引擎选型调研之Flume1.6+Elasticsearch2.3.1

最新推荐文章于 2024-08-20 18:45:19 发布

杰仕人生1

最新推荐文章于 2024-08-20 18:45:19 发布

阅读量2k

点赞数

本文链接：https://blog.csdn.net/nuoyahadili8/article/details/51405424

版权

搜索引擎选型调研之Elasticsearch

最近的一个项目中，由于为了满足实时搜索的功能，一直在致力于选择合适的搜索引擎。起初的设计选型是Hbase+solr，用solr做hbase的二级索引，用coporcessor做索引同步。当单纯的对已有数据进行搜索时，solr表现还不错。但由于场景是实时建立索引时，solr会产生io阻塞，查询性能较差。再随着数据量的增加，solr的搜索效率会变得更低。为了满足实时搜索功能，随后发现了elasticsearch分布式搜索框架，期间也通过了多次反复验证讨论后，决定用elasticsearch做搜索引擎。它基本上所有我想要的特性都包含了：分布式搜索，分布式索引，自动分片，索引自动负载，自动发现，restful风格接口。于是开始使用部署了2台机子，即把索引分成5片，5个复本。

通过elasticsearch的一些管理工具插件可以很清晰的看到索引分片及分布情况：哪块分布在哪里，占用多少空间等都可以看到，并且可以管理索引。还发现当一台机挂了时，整个系统会对挂机里的内容重新分配到其它机器上，当挂掉的机重新加入集群时，又会重新把索引分配给它。当然，这些规则都是可以根据参数进行设置的，非常灵活。对它的搜索效率进行测试，查询时间基本上维持在200毫秒左右，第二次搜索的话因为有缓存，所以和solr差不多。但经过详细对比测试后发现，solr在建索引时的查询性能非常之差，因为solr在建索引时会产生io的阻塞，造成搜索性能的下降，但elasticsearch不会，因为它是先把索引的内容保存到内存之中，当内存不够时再把索引持久化到硬盘中，同时它还有一个队列，是在系统空闲时自动把索引写到硬盘中。

在浏览器输入：http://ip:9200/_plugin/bigdesk

Elasticsearch的存储方式有四种，1.像普通的lucene索引，存储在本地文件系统中。2.存储在分布式文件系统中，如freeds。3.存储在Hadoop的hdfs中。4.存储在亚马逊的s3云平台中。它支持插件机制，有丰富的插件。

Elasticsearch与Solr

性能对比：http://i.zhcy.tk/blog/elasticsearchyu-solr/

功能对比：http://solr-vs-elasticsearch.com/

elasticsearch官网：http://www.elasticsearch.org/

下面进行实战操作，使用Flume1.6+Elasticsearch2.3从数据采集到索引查询的操作。虽然官网推荐使用Logstash做数据采集，为了降低学习成本，另一方面对Flume比较熟悉。

一、Flume数据采集

(1) 编写自定义elasticsearch-sink的flume组件（默认flume与Elasticsearch版本冲突无法使用）

ElasticSearchCoreSink核心脚本：

public class ElasticSearchCoreSink extends AbstractSink implements Configurable{

    private static Logger log = Logger.getLogger(ElasticSearchCoreSink.class);
    private String clusterName;    //集群名称
    private String indexName;  //索引名称
    private String indexType;  //文档名称
    private String hostName;   //主机IP
    private long ttlMs=-1;    //Elasticsearch的字段TTL失效时间
    private String[] fields;   //字段
    private String splitStr;   //扫描文本文件字段分隔符
    private int batchSize;    //缓冲提交数

    private final Pattern pattern = Pattern.compile(TTL_REGEX,Pattern.CASE_INSENSITIVE);
    private Matcher matcher = pattern.matcher("");
    private Client client=null;    //Elasticsearch客户端

    @Override
    public Status process() throws EventDeliveryException {
        Status status = null;

        Channel ch = getChannel();
        Transaction txn = ch.getTransaction();
        txn.begin();
        try {
            Event event = ch.take();
            String out = new String(event.getBody(),"UTF-8");
            // Send the Event to the external repository.
            log.info("文本行信息>>>>"+out);
            int sz=out.split(splitStr).length;
            if(sz==fields.length){
                String json=ElasticSearchClientFactory.generateJson(fields,out,splitStr);
                IndexRequestBuilder irb=client.prepareIndex(indexName, indexType).setSource(json);
                irb.setTTL(ttlMs);
                //存入缓冲
                ElasticsearchWriter.addDocToCache(irb,batchSize);
            }

            txn.commit();
            status = Status.READY;
        } catch (Throwable t) {
            txn.rollback();
            status = Status.BACKOFF;
            if (t instanceof Error) {
                throw (Error)t;
            }
        } finally {
            txn.close();
        }
        return status;
    }

    @Override
    public void configure(Context context) {
        String cn=context.getString("clusterName", "ClusterName");
        String in=context.getString("indexName", "IndexName");
        String it=context.getString("indexType", "IndexType");
        String hn=context.getString("hostName", "127.0.0.1");
        String tt=context.getString("ttl", "7*24*60*60*1000");
        String fs=context.getString("fields", "content");
        String ss=context.getString("splitStr", "\\|");
        String bs=context.getString("batchSize","10");
        this.clusterName=cn;
        this.indexName=in;
        this.indexType=it;
        this.hostName=hn;
        this.ttlMs=parseTTL(tt);
        this.splitStr=ss;
        this.fields=fs.trim().split(",");
        this.batchSize=Integer.parseInt(bs);
        for(String f:fields){
            System.out.println("field@"+f);
        }
        if(client==null){
            try {
                this.client=ElasticSearchClientFactory.getClient(hostName,clusterName);
            } catch (NoSuchClientTypeException e) {
                log.info("配置文件中：集群名字与主机有误，请检查！");
            }
        }
    }

    private long parseTTL(String ttl) {
        matcher = matcher.reset(ttl);
        while (matcher.find()) {
            if (matcher.group(2).equals("ms")) {
                return Long.parseLong(matcher.group(1));
            } else if (matcher.group(2).equals("s")) {
                return TimeUnit.SECONDS.toMillis(Integer.parseInt(matcher.group(1)));
            } else if (matcher.group(2).equals("m")) {
                return TimeUnit.MINUTES.toMillis(Integer.parseInt(matcher.group(1)));
            } else if (matcher.group(2).equals("h")) {
                return TimeUnit.HOURS.toMillis(Integer.parseInt(matcher.group(1)));
            } else if (matcher.group(2).equals("d")) {
                return TimeUnit.DAYS.toMillis(Integer.parseInt(matcher.group(1)));
            } else if (matcher.group(2).equals("w")) {
                return TimeUnit.DAYS.toMillis(7 * Integer.parseInt(matcher.group(1)));
            } else if (matcher.group(2).equals("")) {
                log.info("TTL qualifier is empty. Defaulting to day qualifier.");
                return TimeUnit.DAYS.toMillis(Integer.parseInt(matcher.group(1)));
            } else {
                log.debug("Unknown TTL qualifier provided. Setting TTL to 0.");
                return 0;
            }
        }
        log.info("TTL not provided. Skipping the TTL config by returning 0.");
        return 0;
    }
}

ElasticsearchWriter缓冲提交组

public class ElasticsearchWriter {
    private static Logger log = Logger.getLogger(ElasticsearchWriter.class);
    private static int maxCacheCount = 10; // 缓存大小，当达到该上限时提交
    private static Vector<IndexRequestBuilder> cache = null; // 缓存
    public static Lock commitLock = new ReentrantLock(); // 在添加缓存或进行提交时加锁
    private static int maxCommitTime = 60; // 最大提交时间，s
    public static Client client;

    static {
        log.info("elasticsearch init param");
        try {
            client=ElasticSearchClientFactory.getClient();
            cache = new Vector<IndexRequestBuilder>(maxCacheCount);
            // 启动定时任务，第一次延迟10执行,之后每隔指定时间执行一次
            Timer timer = new Timer();
            timer.schedule(new CommitTimer(), 10 * 1000, maxCommitTime * 1000);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    // 批量索引数据
    public static boolean blukIndex(List<IndexRequestBuilder> indexRequestBuilders) {
        Boolean isSucceed = true;
        BulkRequestBuilder bulkBuilder = client.prepareBulk();
        for (IndexRequestBuilder indexRequestBuilder : indexRequestBuilders) {
            bulkBuilder.add(indexRequestBuilder);
        }
        BulkResponse reponse = bulkBuilder.execute().actionGet();
        if (reponse.hasFailures()) {
            isSucceed = false;
        }
        return isSucceed;
    }
    /**
     * 添加记录到cache，如果cache达到maxCacheCount，则提交
     */
    public static void addDocToCache(IndexRequestBuilder irb,int bachSize) {
        maxCacheCount=bachSize;
        commitLock.lock();
        try {
            cache.add(irb);
            log.info("cache commit maxCacheCount:"+maxCacheCount);
            if (cache.size() >= maxCacheCount) {
                log.info("cache commit count:"+cache.size());
                blukIndex(cache);
                cache.clear();
            }
        } catch (Exception ex) {
            log.info(ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }
    /**
     * 提交定时器
     */
    static class CommitTimer extends TimerTask {
        @Override
        public void run() {
            commitLock.lock();
            try {
                if (cache.size() > 0) { //大于0则提交
                    log.info("timer commit count:"+cache.size());
                    blukIndex(cache);
                    cache.clear();
                }
            } catch (Exception ex) {
                log.info(ex.getMessage());
            } finally {
                commitLock.unlock();
            }
        }
    }
}

ElasticsearchClientFactory提供Client创建和bulk json功能

public class ElasticSearchClientFactory {

    private static Client client;

    public static Client getClient(){
        return client;
    }

    public static Client getClient(String hostName, String clusterName)
            throws NoSuchClientTypeException {
        if (client == null) {
            // 集群模式 设置Settings
            System.out.println(">>>>>>>>>>>>>>>" + clusterName + ">>>>>" + hostName);
            Settings settings = Settings.settingsBuilder()
                    .put("cluster.name", clusterName).build();
            try {
                client = TransportClient
                        .builder()
                        .settings(settings)
                        .build()
                        .addTransportAddress(
                                new InetSocketTransportAddress(InetAddress
                                        .getByName(hostName), 9300));
            } catch (UnknownHostException e) {
                e.printStackTrace();
            }
        }
        return client;
    }

    public static String generateJson(String[] fields,String text,String splitStr) throws ContentNoCaseFieldsException{
        String[] texts=null;
        if(!StringUtils.isEmpty(text)){
            texts=text.split(splitStr);
            if(texts.length==fields.length){
                String json = "";
                try {
                    XContentBuilder contentBuilder = XContentFactory.jsonBuilder()
                            .startObject();
                    for(int i=0;i<fields.length;i++){
                        contentBuilder.field(fields[i],texts[i]);
                    }
                    json = contentBuilder.endObject().string();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                return json;
            }
        }
        throw new ContentNoCaseFieldsException();
    }
}

(2) 配置Flume文件elasticsearch.conf

a1.sources =r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type =spooldir
a1.sources.r1.channels =c1
a1.sources.r1.spoolDir =/home/an/log
a1.sources.r1.fileHeader =true

# Describe the sink
a1.sinks.k1.type =org.plugins.sink.ElasticSearchCoreSink
a1.sinks.k1.hostName =192.168.226.120
a1.sinks.k1.indexName =dxmessages
a1.sinks.k1.indexType =dxmessage
a1.sinks.k1.clusterName =es_cloud
a1.sinks.k1.fields =rowkey,content,phone_no,publish_date
a1.sinks.k1.ttl =5m
a1.sinks.k1.splitStr =\\|
a1.sinks.k1.bachSize =100

# Use a channel which buffers events inmemory
a1.channels.c1.type =memory
a1.channels.c1.capacity =1000
a1.channels.c1.transactionCapacity =100

# Bind the source and sink to the channel
a1.sources.r1.channels =c1
a1.sinks.k1.channel =c1

二、建立Elasticsearch文档信息

Jcseg 作为国内知名的开源的中文分词器，对于中文分词有其独有的特点，对于 elasticsearch 这一不错的文档检索引擎来说 Elasticsearch + Jcseg 这个组合，在处理中文检索上，可以说是黄金搭档啊！！最重要的是：方便，简单，易用，功能强大！Elasticsearch与jcseg整合，参考https://github.com/lionsoul2014/jcseg。配置jcseg.properties文件内容如下（不待翻译了，多看几遍就看懂了）：

# jcseg properties file.
# Jcseg function
#maximum match length. (5-7)
jcseg.maxlen=5

#recognized the chinese name.(1 to open and 0 to close it)
jcseg.icnname=1

#maximum chinese word number of english chinese mixed word. 
jcseg.mixcnlen=3

#maximum length for pair punctuation text.
jcseg.pptmaxlen=15

#maximum length for chinese last name andron.
jcseg.cnmaxlnadron=1

#Wether to clear the stopwords.(set 1 to clear stopwords and 0 to close it)
jcseg.clearstopword=0

#Wether to convert the chinese numeric to arabic number. (set to 1 open it and 0 to close it)
# like '\u4E09\u4E07' to 30000.
jcseg.cnnumtoarabic=1

#Wether to convert the chinese fraction to arabic fraction.
jcseg.cnfratoarabic=1

#Wether to keep the unrecognized word. (set 1 to keep unrecognized word and 0 to clear it)
jcseg.keepunregword=1

#Wether to start the secondary segmentation for the complex english words.
jcseg.ensencondseg = 1

#min length of the secondary simple token. (better larger than 1)
jcseg.stokenminlen = 2

#thrshold for chinese name recognize.
# better not change it before you know what you are doing.
jcseg.nsthreshold=1000000

#The punctuations that will be keep in an token.(Not the end of the token).
jcseg.keeppunctuations=@#%.&+




####about the lexicon
#prefix of lexicon file.
lexicon.prefix=lex

#suffix of lexicon file.
lexicon.suffix=lex

#abusolte path of the lexicon file.
#Multiple path support from jcseg 1.9.2, use ';' to split different path.
#example: lexicon.path = /home/chenxin/lex1;/home/chenxin/lex2 (Linux)
#     : lexicon.path = D:/jcseg/lexicon/1;D:/jcseg/lexicon/2 (WinNT)
#lexicon.path=/java/JavaSE/jcseg/lexicon
lexicon.path={jar.dir}/lexicon

#Wether to load the modified lexicon file auto.
lexicon.autoload=0

#Poll time for auto load. (seconds)
lexicon.polltime=120

####lexicon load
#Wether to load the part of speech of the entry.
jcseg.loadpos=1

#Wether to load the pinyin of the entry.
jcseg.loadpinyin=0

#Wether to load the synoyms words of the entry.
jcseg.loadsyn=0

在elasticsearch中建立对应索引模型：

curl -XPOST 'http://192.168.226.120:9200/dxmessages?pretty' -d '{
   "mappings": {
    "dxmessage": {
      "_ttl": {
        "enabled": true
      },
      "properties" : {
            "rowkey" : {"type" : "string","index" : "not_analyzed"},
            "content" :{"type" : "string","analyzer" : "jcseg_simple","searchAnalyzer": "jcseg"},
            "phone_no" : {"type" : "string","index" : "not_analyzed"},
            "publish_date" :{"type" : "string","index" : "not_analyzed"}
      }
    }
  }
}'