Getting termvectors with the Java client

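Elasticsearch termvectors contain information such as each term's positions and frequency. This information can be used for statistics or as the basis for other features. This article explains how to use termvectors and how to retrieve them through the Java client.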

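To use termvectors, first set the "term_vector" attribute on the field in the mapping. ES does not enable term vectors by default, because storing them increases the size of the index: quite a lot of extra metadata has to be kept.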

PUT qa_test_02
{
  "mappings": {
    "qa_test": {
      "dynamic": "strict",
      "_all": {
        "enabled": false
      },
      "properties": {
        "question": {
          "properties": {
            "cate": {
              "type": "keyword"
            },
            "desc": {
              "type": "text",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "ik_smart"
            },
            "time": {
              "type": "date",
              "store": true,
              "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
            },
            "title": {
              "type": "text",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "ik_smart"
            }
          }
        },
        "updatetime": {
          "type": "date",
          "store": true,
          "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "requests": {
        "cache": {
          "enable": "true"
        }
      },
      "number_of_replicas": "1"
    }
  }
}

Note the "term_vector" attribute on "title" in the example.

Next, index a document:

PUT qa_test_02/qa_test/1
{
  "question": {
    "cate": [
      "装修流程",
      "其它"
    ],
    "desc": "筒灯,大洋和索正这两个牌子,哪个好?希望内行的朋友告知一下,谢谢!",
    "time": "2016-07-02 19:59:00",
    "title": "筒灯大洋和索正这两个牌子哪个好"
  },
  "updatetime": 1467503940000
}

Now let's look at the termvector information for the question.title field of this document:

GET qa_test_02/qa_test/1/_termvectors
{
  "fields": [
    "question.title"
  ],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

The result looks roughly like this:

{
  "_index": "qa_test_02",
  "_type": "qa_test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "question.title": {
      "field_statistics": {
        "sum_doc_freq": 9,
        "doc_count": 1,
        "sum_ttf": 9
      },
      "terms": {
        "和": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 4,
              "end_offset": 5
            }
          ]
        },
        "哪个": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 7,
              "start_offset": 12,
              "end_offset": 14
            }
          ]
        },
        "大洋": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 2,
              "end_offset": 4
            }
          ]
        },
        "好": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 8,
              "start_offset": 14,
              "end_offset": 15
            }
          ]
        },
        "正": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 6,
              "end_offset": 7
            }
          ]
        },
        "牌子": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 6,
              "start_offset": 10,
              "end_offset": 12
            }
          ]
        },
        "筒灯": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "索": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 5,
              "end_offset": 6
            }
          ]
        },
        "这两个": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 5,
              "start_offset": 7,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}
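To make the "position", "start_offset" and "end_offset" values in this response more concrete, here is a small sketch in plain Java with no Elasticsearch dependency. The token values are copied by hand from the response above; the class name TermVectorOffsets and its helper methods are made up for illustration and are not part of any ES API. It checks that each pair of offsets cuts exactly its term out of the original title, and that sorting the tokens by position gives back the sequence the analyzer produced.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class, for illustration only.
class TermVectorOffsets {
    // One entry of a "tokens" array, together with the term it belongs to.
    static final class Token {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // The value of question.title in the sample document.
    static final String TITLE = "筒灯大洋和索正这两个牌子哪个好";

    // Values copied by hand from the _termvectors response, in response order.
    static List<Token> sampleTokens() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("和", 2, 4, 5));
        tokens.add(new Token("哪个", 7, 12, 14));
        tokens.add(new Token("大洋", 1, 2, 4));
        tokens.add(new Token("好", 8, 14, 15));
        tokens.add(new Token("正", 4, 6, 7));
        tokens.add(new Token("牌子", 6, 10, 12));
        tokens.add(new Token("筒灯", 0, 0, 2));
        tokens.add(new Token("索", 3, 5, 6));
        tokens.add(new Token("这两个", 5, 7, 10));
        return tokens;
    }

    // True when [startOffset, endOffset) cuts exactly the term out of the text.
    static boolean offsetsMatch(String text, Token token) {
        return text.substring(token.startOffset, token.endOffset).equals(token.term);
    }

    // Rebuilds the analyzer's token sequence by ordering the tokens by position.
    static String tokenSequence(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Token t) -> t.position));
        StringBuilder sb = new StringBuilder();
        for (Token t : sorted) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(t.term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (Token t : sampleTokens()) {
            System.out.println(t.term + " offsets match: " + offsetsMatch(TITLE, t));
        }
        System.out.println(tokenSequence(sampleTokens()));
    }
}
```

For this sample every offset check passes and the rebuilt sequence is 筒灯/大洋/和/索/正/这两个/牌子/哪个/好. In other words, the offsets are character ranges in the original text, while the position is the token's index in the analyzed stream. The nine tokens also account for the sum_ttf of 9 in field_statistics, since this is the only document in the index.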

Next, let's see how to get termvectors from Java code. Without further ado, here is the code:

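// Imports as of ES 5.x; package names may differ in other versions.
import java.util.Iterator;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.termvectors.TermVectorsResponse;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentType;

// Request the term vectors of one document, including term statistics.
TermVectorsResponse termVectorResponse = client.prepareTermVectors()
        .setIndex(sourceindexname).setType(sourceindextype)
        .setId(id).setSelectedFields(fieldname).setTermStatistics(true)
        .execute().actionGet();

// Option 1: render the response as JSON.
// Depending on the ES version, toXContent may write only the inner fields;
// if so, wrap the call in builder.startObject() / builder.endObject().
XContentBuilder builder = XContentFactory.contentBuilder(XContentType.JSON);
termVectorResponse.toXContent(builder, ToXContent.EMPTY_PARAMS);
System.out.println(builder.string());

// Option 2: walk the Lucene Fields / Terms / TermsEnum structures.
Fields fields = termVectorResponse.getFields();
Iterator<String> iterator = fields.iterator();
while (iterator.hasNext()) {
    String field = iterator.next();
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // Print the term followed by its total term frequency.
        System.out.println(term.utf8ToString() + " " + termsEnum.totalTermFreq());
    }
}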

The code that obtains the TermVectorsResponse is easy to follow: it sets the index name, the type, the document id, and the attributes to be returned.

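Next comes reading the termvector of individual terms, for which there are two options. The first is to call toXContent on the TermVectorsResponse to fill an XContentBuilder; this yields the same JSON as the DSL query above. The second is to iterate over the Fields through its iterator and obtain a TermsEnum for each field; readers familiar with Lucene will probably find the second approach more natural.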
