Getting Started with Elasticsearch

weixin_34109408

Original article: https://my.oschina.net/u/2474041/blog/706485

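Preface

When building a search feature, we used to rely on SQL LIKE queries. That works fine while the data set is small, but as the data keeps growing it becomes very inefficient and hurts the user experience.

Why are LIKE queries slow?

A query such as like '%a%' generally cannot use an ordinary index. The index is just a sorted copy of the column values, and sort order says nothing about what appears in the middle of a string, so the database has to scan every row. A prefix query such as like 'a%' can use the index, because all values starting with 'a' sit next to each other in sorted order. Prefix matching alone rarely meets real search requirements, though, which is why Elasticsearch becomes necessary.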
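To make the difference concrete, here is a minimal Java sketch. It models an index as a sorted array of strings, which is a simplifying assumption (real database indexes are usually B-trees), and the class and method names are hypothetical. It counts how many entries each kind of lookup has to inspect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LikeIndexSketch {
    // Hypothetical stand-in for a database index: the column values in sorted order.
    private final String[] sorted;
    // Number of entries the most recent lookup had to inspect.
    int inspected;

    LikeIndexSketch(String... values) {
        sorted = values.clone();
        Arrays.sort(sorted);
    }

    // Binary search for the first entry that is >= key.
    private int lowerBound(String key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(key) < 0) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // like 'prefix%': jump to the first candidate, then read while the prefix still matches.
    List<String> prefixSearch(String prefix) {
        inspected = 0;
        List<String> out = new ArrayList<>();
        for (int i = lowerBound(prefix); i < sorted.length; i++) {
            inspected++;
            if (!sorted[i].startsWith(prefix)) break;
            out.add(sorted[i]);
        }
        return out;
    }

    // like '%part%': sort order does not help, so every entry must be checked.
    List<String> containsSearch(String part) {
        inspected = 0;
        List<String> out = new ArrayList<>();
        for (String value : sorted) {
            inspected++;
            if (value.contains(part)) out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        LikeIndexSketch idx = new LikeIndexSketch("mango", "cherry", "apple", "grape", "banana", "avocado");
        System.out.println(idx.prefixSearch("a") + " inspected " + idx.inspected);   // [apple, avocado] inspected 3
        System.out.println(idx.containsSearch("a") + " inspected " + idx.inspected); // five matches, inspected 6
    }
}
```

With six values the gap is small, but the prefix lookup's cost depends mostly on the number of matches, while the substring lookup always grows with the size of the table.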

Introduction

Chinese-language study guide: http://learnes.net/getting_started/installing_es.html

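1. What is Elasticsearch?

Elasticsearch is a search engine built on top of Apache Lucene(TM), a full-text search engine library. Lucene is arguably the most advanced and efficient full-featured open-source search framework available today.

But Lucene is only a framework. To take full advantage of it you have to work in Java and integrate Lucene into your own program. Worse, you need to learn a great deal before you understand how it works, because Lucene really is very complex.

Elasticsearch uses Lucene as its internal engine, but for full-text search you only need its uniform, ready-made API and do not have to understand the complex Lucene internals behind it.

Of course, Elasticsearch is more than just Lucene. Besides full-text search, it also provides:

Distributed real-time document storage, with every field indexed and searchable.
A distributed search engine with real-time analytics.
The ability to scale to hundreds of servers and handle petabytes of structured or unstructured data.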
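Indexing a field so that it is searchable means, in Lucene's case, building an inverted index: a map from each term to the documents that contain it. The following is a toy sketch of that idea, not Lucene's actual implementation. Splitting on whitespace and lowercasing are simplifying assumptions standing in for a real analyzer, and the class name is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize the text (assumption: whitespace split + lowercase) and record each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Looking up a term is a single map access; no document text is scanned.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "trying out Elasticsearch");
        index.add(2, "Elasticsearch is built on Lucene");
        index.add(3, "SQL like queries scan every row");
        System.out.println(index.search("elasticsearch")); // [1, 2]
        System.out.println(index.search("lucene"));        // [2]
    }
}
```

This is the structural reason a term lookup stays fast as the data grows, in contrast to a like '%a%' scan.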

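Elasticsearch is very easy to get started with. It ships with sensible defaults that spare beginners from having to face complex theory up front. It works as soon as it is installed, and you can become productive with very little learning effort.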

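Performance compared with a database

Test environment: 4 million+ rows

Oracle: [screenshot of the query timing]

Elasticsearch: [screenshot of the query timing]

As the timings show, the database query took 9.3 seconds while Elasticsearch needed only 264 milliseconds, roughly a 35x difference. Of course, this is only the result of my own test, and the actual numbers depend on the environment.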

Installation

Download Elasticsearch from elasticsearch.org/download.

Unpack the downloaded archive and change into the bin directory.

On Linux, run: ./elasticsearch

On Windows, run: elasticsearch.bat

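Data

Documents are indexed (stored and made searchable) through the index API. First, though, we need to decide where a document lives. A document is uniquely identified by its index, type, and id. We can supply our own _id or let the index API generate one for us.

index: comparable to a database

type: comparable to a table

id: the primary key

shard: a low-level worker unit that holds only a small slice of all the data in an index. An index can point to one or more shards.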
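As a rough illustration of how these pieces fit together, the sketch below builds a document address from index, type, and id, and picks a shard by hashing the id modulo the number of primary shards. This is an assumption-laden sketch: String.hashCode() is only a stand-in for the hash function Elasticsearch really uses, the shard count of 5 is an assumed value, and the class and method names are hypothetical.

```java
final class ShardRoutingSketch {
    // Assumption: String.hashCode() stands in for Elasticsearch's real routing hash.
    // The principle is the same: the same id always lands on the same shard.
    static int shardFor(String id, int numPrimaryShards) {
        return Math.floorMod(id.hashCode(), numPrimaryShards);
    }

    // index + type + id together identify exactly one document.
    static String documentPath(String index, String type, String id) {
        return "/" + index + "/" + type + "/" + id;
    }

    public static void main(String[] args) {
        int shards = 5; // assumed number of primary shards
        for (String id : new String[] {"1", "2", "3", "42"}) {
            System.out.println(documentPath("twitter", "tweet", id) + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Because the shard is derived from the id and the shard count, changing the number of primary shards would move existing documents to different shards, which is one reason that count is normally fixed when an index is created.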

Using the API

package com.sunsharing.idream.elasticsearch;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.count.CountResponse;
import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.get.MultiGetItemResponse;
import org.elasticsearch.action.get.MultiGetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.MultiSearchResponse;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.index.query.*;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.search.sort.SortParseElement;

import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Date;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import static org.elasticsearch.index.query.QueryBuilders.termQuery;

public class Elasticsearch {
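    /**
     * Example: obtain a client.
     *
     * @return a transport client connected to the cluster node
     */
    public static Client getClient() {
        try {
            return TransportClient.builder().build()
                    .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("192.168.2.112"), 9300));
        } catch (UnknownHostException e) {
            // Fail fast instead of returning null, which would only surface later as a NullPointerException
            throw new IllegalStateException("Cannot resolve the Elasticsearch host", e);
        }
    }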

    /**
     * Example: build a JSON document.
     *
     * @throws IOException if the JSON cannot be generated
     */
    public static void buildJson() throws IOException {
        XContentBuilder builder = jsonBuilder()
                .startObject()
                .field("user", "kimchy")
                .field("postDate", new Date())
                .field("message", "trying out Elasticsearch")
                .endObject();
        System.out.println(builder.string());
    }

    /**
     * Example: index (add) a document.
     *
     * @param index  index name
     * @param type   type name
     * @param source document body as a JSON string
     */
    public static void add(String index, String type, String source) {
        Client client = getClient();
        // No document id is passed, so Elasticsearch generates one (by default a UUID-style id)
        IndexResponse response = client.prepareIndex(index, type).setSource(source).get();
        // Index name
        System.out.println(response.getIndex());
        // Document id
        System.out.println(response.getId());
        // Type name
        System.out.println(response.getType());
        // Version number; it increases each time the document is overwritten
        System.out.println(response.getVersion());
        // Whether the document was newly created; false if an existing document was overwritten
        System.out.println(response.isCreated());
        // Close the client
        client.close();
    }

    /**
     * Example: get a document.
     *
     * @param index index name
     * @param type  type name
     * @param id    document id
     */
    public static void get(String index, String type, String id) {
        Client client = getClient();
        GetResponse response = client.prepareGet(index, type, id).get();
        // The document body is available in several formats
        Map<String, Object> sourceMap = response.getSource();
        String sourceString = response.getSourceAsString();
        byte[] sourceByte = response.getSourceAsBytes();
        // Whether the document exists
        boolean isExists = response.isExists();
        client.close();
    }

    /**
     * Example: delete a document.
     *
     * @param index index name
     * @param type  type name
     * @param id    document id
     */
    public static void delete(String index, String type, String id) {
        Client client = getClient();
        DeleteResponse response = client.prepareDelete(index, type, id).get();
        // Whether the document was found
        System.out.println(response.isFound());
        client.close();
    }

    /**
     * Example: update a document.
     *
     * @param index index name
     * @param type  type name
     * @param id    document id
     */
    public static void update(String index, String type, String id) {
        Client client = getClient();
        UpdateRequest updateRequest = new UpdateRequest();
        updateRequest.index(index);
        updateRequest.type(type);
        updateRequest.id(id);

        try {
            updateRequest.doc(jsonBuilder()
                    .startObject()
                            // Field to modify
                    .field("message", "aaa")
                    .endObject());
            client.update(updateRequest).get();
            // Alternative approach
//            client.prepareUpdate(index, type, id)
//                    .setDoc(jsonBuilder()
//                            .startObject()
//                            .field("gender", "male")
//                            .endObject())
//                    .get();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        } finally {
            client.close();
        }
    }


    /**
     * Use the multi get API to fetch a group of documents.
     */
    public static void multiGet() {
        Client client = getClient();
        MultiGetResponse multiGetItemResponses = client.prepareMultiGet()
                .add("testindex", "tweet", "1")
                        // Several documents from the same index/type can be fetched at once
                .add("testindex", "tweet", "2", "3", "4")
                        // Documents from other indices/types can be fetched as well
                .add("cisp", "type", "foo")
                .get();

        for (MultiGetItemResponse itemResponse : multiGetItemResponses) {
            // A failed item (for example, a missing index) has no GetResponse;
            // skip it to avoid a NullPointerException
            if (itemResponse.isFailed()) {
                continue;
            }
            GetResponse response = itemResponse.getResponse();
            if (response.isExists()) {
                String json = response.getSourceAsString();
                System.out.println(json);
            }
        }
        client.close();
    }

    /**
     * Bulk API: perform multiple operations in a single request.
     */
    public static void bulkApi() {
        Client client = getClient();
        BulkRequestBuilder bulkRequest = client.prepareBulk();

        try {
            bulkRequest.add(client.prepareIndex("twitter", "tweet", "1")
                            .setSource(jsonBuilder()
                                            .startObject()
                                            .field("user", "kimchy")
                                            .field("postDate", new Date())
                                            .field("message", "trying out Elasticsearch")
                                            .endObject()
                            )
            );

            bulkRequest.add(client.prepareDelete("twitter", "tweet", "2"));

            BulkResponse bulkResponse = bulkRequest.get();
            if (bulkResponse.hasFailures()) {
                // Handle the failures
                System.out.println(bulkResponse.buildFailureMessage());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            client.close();
        }
    }

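    /**
     * Example: search.
     */
    public static void search() {
        Client client = getClient();
        // Match every document
        MatchAllQueryBuilder qb = QueryBuilders.matchAllQuery();
        // Full-text search across multiple fields
        MultiMatchQueryBuilder qb1 = QueryBuilders.multiMatchQuery("同", "worksNum", "picAddr", "userInfo.name");
        // Term-level query: exact-value matching, typically used on structured data
        // such as codes, enums, dates, ages, etc.
        TermQueryBuilder qb2 = QueryBuilders.termQuery("userInfo.sex", "1");
        // Match any of several values
        TermsQueryBuilder qb3 = QueryBuilders.termsQuery("tags", "blue", "pill");
        // Numeric range
        RangeQueryBuilder qb4 = QueryBuilders.rangeQuery("price").from(5).to(10);
        SearchResponse response = client.prepareSearch("story")
                .setTypes("picstory")
                        // QUERY_AND_FETCH: the query is sent to every shard of the index, and each shard returns
                        // the documents together with their computed ranking information. This is the fastest
                        // mode because each shard is queried only once, but the total number of results returned
                        // by the shards can be n times the size the user asked for.
                        // QUERY_THEN_FETCH: the default when no search type is specified. It works in two steps.
                        // First, the query is sent to all shards and each shard returns only sorting and ranking
                        // information (not the documents themselves); the results are re-sorted by score and the
                        // top `size` entries are selected. Second, the documents are fetched from the relevant
                        // shards. The number of documents returned equals the requested size.
                        // DFS_QUERY_AND_FETCH: same as QUERY_AND_FETCH, plus an initial scatter phase that
                        // collects distributed term frequencies for more accurate scoring.
                        // DFS_QUERY_THEN_FETCH: same as QUERY_THEN_FETCH, plus an initial scatter phase that
                        // collects distributed term frequencies for more accurate scoring.
                .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                        // The query object
                .setQuery(qb2)                 // Query
                        // Filter: 12 <= age <= 18 (from/to are inclusive by default)
                //.setPostFilter(QueryBuilders.rangeQuery("age").from(12).to(18))     // Filter
                        // Return up to 60 hits starting at offset 0, sorted by relevance
                .setFrom(0).setSize(60).setExplain(true)
                .execute()
                .actionGet();

        // A search across everything is also possible
        //SearchResponse response1 = client.prepareSearch().execute().actionGet();
        System.out.println(response.getHits().totalHits());
        client.close();
    }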

    /**
     * A search request returns a single "page" of results, whereas the scroll API can be used to retrieve
     * large numbers of results (even all of them), much like a cursor in a traditional database. Scrolling
     * is not meant for real-time user requests but for processing large amounts of data, similar to the way
     * we often write stored procedures to process data (that is my understanding, at least).
     */
    public static void scroll() {
        Client client = getClient();
        SearchResponse scrollResp = client.prepareSearch("testindex")
                .addSort(SortParseElement.DOC_FIELD_NAME, SortOrder.ASC)
                        // Tells Elasticsearch how long to keep the search context alive
                .setScroll(new TimeValue(60000)) // in milliseconds, i.e. 60 seconds
                .setQuery(termQuery("gender", "male"))
                .setSize(1).execute().actionGet();
        // Scroll until no more data is returned
        while (true) {
            for (SearchHit hit : scrollResp.getHits().getHits()) {
                // Process each hit
                System.out.println(hit.getSourceAsString());
            }
            // The previous response contains a scroll_id, which is passed to the scroll API to retrieve the next batch.
            scrollResp = client.prepareSearchScroll(scrollResp.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
            // Leave the loop once there is no more data
            if (scrollResp.getHits().getHits().length == 0) {
                break;
            }
        }
        client.close();
    }

    /**
     * multiSearch: run several searches in one request; all results are returned together.
     */
    public static void multiSearch() {
        Client client = getClient();
        SearchRequestBuilder srb1 = client
                .prepareSearch().setQuery(QueryBuilders.queryStringQuery("elasticsearch")).setSize(1);
        // matchQuery
        SearchRequestBuilder srb2 = client
                .prepareSearch().setQuery(QueryBuilders.matchQuery("name", "kimchy")).setSize(1);

        MultiSearchResponse sr = client.prepareMultiSearch()
                .add(srb1)
                .add(srb2)
                .execute().actionGet();

        // One response is returned for each individual request
        long nbHits = 0;
        for (MultiSearchResponse.Item item : sr.getResponses()) {
            // A failed sub-request carries no response object
            if (item.isFailure()) {
                System.out.println(item.getFailureMessage());
                continue;
            }
            SearchResponse response = item.getResponse();
            nbHits += response.getHits().getTotalHits();
            System.out.println(response.getHits().getTotalHits());
        }
        System.out.println(nbHits);
        client.close();
    }

    /**
     * Aggregation query, comparable to GROUP BY in a traditional database.
     */
    public static void aggregation() {
        Client client = getClient();
        SearchResponse sr = client.prepareSearch()
                .setQuery(QueryBuilders.matchAllQuery())
                .addAggregation(
                        AggregationBuilders.terms("colors").field("color")
                ).execute().actionGet();

        // Read the aggregation results
        org.elasticsearch.search.aggregations.bucket.terms.Terms colors = sr.getAggregations().get("colors");
        for (org.elasticsearch.search.aggregations.bucket.terms.Terms.Bucket bucket : colors.getBuckets()) {
            System.out.println("Key: " + bucket.getKey() + "  doc count: " + bucket.getDocCount() + "  ");
        }
        client.close();
    }

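    /**
     * Stop searching once a fixed number of documents has been collected.
     *
     * @param docsNum maximum number of documents to collect on each shard before terminating
     */
    public static void terminateAfter(int docsNum) {
        Client client = getClient();
        SearchResponse sr = client.prepareSearch("testindex")
                // Stop searching after docsNum documents have been collected
                .setTerminateAfter(docsNum)
                .get();

        // isTerminatedEarly() may be unset, so compare against TRUE explicitly
        if (Boolean.TRUE.equals(sr.isTerminatedEarly())) {
            System.out.println(sr.getHits().totalHits());
        }
        client.close();
    }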

    /**
     * Get the document count (deprecated as of the 2.3 API).
     */
    public static void count() {
        Client client = getClient();
        CountResponse response = client.prepareCount("testindex")
                .setQuery(termQuery("user", "panda"))
                .execute()
                .actionGet();
        System.out.println(response.getCount());
        client.close();
    }


}
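The QUERY_THEN_FETCH flow described in the comments of search() can be sketched without Elasticsearch at all. The following is a simplified, in-memory simulation under stated assumptions: the shards, scores, and class names are all hypothetical, and real shards compute scores from the query rather than storing them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueryThenFetchSketch {
    // One scored hit: the shard that holds it, its id, and its relevance score.
    static final class Hit {
        final int shard; final String id; final double score;
        Hit(int shard, String id, double score) { this.shard = shard; this.id = id; this.score = score; }
    }

    // Hypothetical shard holding made-up scores and document bodies.
    static final class Shard {
        final int number;
        final Map<String, Double> scores = new HashMap<>();
        final Map<String, String> sources = new HashMap<>();
        Shard(int number) { this.number = number; }

        Shard put(String id, double score, String source) {
            scores.put(id, score);
            sources.put(id, source);
            return this;
        }

        // Query phase: return only ids and scores of this shard's top `size` hits.
        List<Hit> query(int size) {
            List<Hit> hits = new ArrayList<>();
            scores.forEach((id, score) -> hits.add(new Hit(number, id, score)));
            hits.sort(BY_SCORE_DESC);
            return new ArrayList<>(hits.subList(0, Math.min(size, hits.size())));
        }

        // Fetch phase: return the document body for one id.
        String fetch(String id) { return sources.get(id); }
    }

    static final Comparator<Hit> BY_SCORE_DESC = Comparator.comparingDouble((Hit h) -> h.score).reversed();

    // Coordinator: merge the per-shard hits, keep the global top `size`, then fetch only those documents.
    // Assumes each shard's number equals its position in the list.
    static List<String> search(List<Shard> shards, int size) {
        List<Hit> merged = new ArrayList<>();
        for (Shard shard : shards) merged.addAll(shard.query(size));
        merged.sort(BY_SCORE_DESC);
        List<String> docs = new ArrayList<>();
        for (Hit hit : merged.subList(0, Math.min(size, merged.size()))) {
            docs.add(shards.get(hit.shard).fetch(hit.id));
        }
        return docs;
    }

    static List<Shard> demoShards() {
        return Arrays.asList(
                new Shard(0).put("a", 0.9, "doc a").put("b", 0.2, "doc b"),
                new Shard(1).put("c", 0.7, "doc c").put("d", 0.5, "doc d"));
    }

    public static void main(String[] args) {
        System.out.println(search(demoShards(), 2)); // [doc a, doc c]
    }
}
```

Only two document bodies are fetched even though both shards answered the query, which is why this mode returns exactly the requested size. QUERY_AND_FETCH would instead return bodies from every shard in the first step.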

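Points to note

1. JDK 1.7 or later is required, and the client and server must run the same JDK version; otherwise they cannot recognize each other.

2. Searching for meaningless terms, such as a single letter, is not supported.

3. elasticsearch-jdbc 2.0 and later does not support Windows, so don't bother trying it there.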

Reposted from: https://my.oschina.net/u/2474041/blog/706485