Handling ID card numbers in Elasticsearch (ngram tokenization or an ingest pipeline)

Latest recommended article published on 2024-05-15 06:39:40

树叶要走风怎么挽留

Article link: https://blog.csdn.net/weixin_44993313/article/details/107243273

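In this project, phone numbers and ID card numbers were treated as purely numeric values (an ID card number may end in X), so both were mapped as keyword fields.

A new requirement then came up: the phone numbers and ID card numbers in the more than one hundred million records already stored must be processed so that they can be tokenized, that is, searched by a part of the number.

The project had been using the IK analyzer, which does not break digit strings into smaller tokens.

Below are the solutions we came up with afterwards. Each has its pros and cons.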

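Option 1:

Use an ngram tokenizer to tokenize the phone numbers and ID card numbers.

This requires re-importing the data or running a reindex.

Pros: any 5- or 6-character fragment of a number can be matched, so queries return more results.

Cons: every value is indexed as many overlapping grams, so a lot of redundant data is stored.

Demo implementation: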

DELETE user_index
GET user_index
PUT user_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "custom_tokenizer"
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "ngram",
          "token_chars": [
            "letter",
            "digit"
          ],
          "min_gram": "5",
          "max_gram": "6"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "card": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

PUT user_index/_doc/1
{
  "name":"张三",
  "card":"362101198108287617"
}

PUT user_index/_doc/2
{
   "name":"李四",
  "card":"610624199705973918"
}
PUT user_index/_doc/3
{
   "name":"王五",
  "card":"610624199703100002"
}

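# Query the keyword sub-field: it stores only the complete value, so the 5-digit fragment is not expected to match
GET user_index/_search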
{
  "query": {
    "match": {
      "card.keyword": {
        "query": "36210",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}
# Query the ngram-analyzed field: the fragment matches the gram "36210" indexed for document 1
GET user_index/_search
{
  "query": {
    "match": {
      "card": {
        "query": "36210",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}
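
To make the trade-off of option 1 more concrete, the sketch below imitates in Python what the ngram tokenizer configured above (min_gram 5, max_gram 6, token_chars letter and digit) would emit for one ID card number. It is only an approximation and not Elasticsearch code: the function name `ngrams` is hypothetical, and the character classes are simplified to ASCII letters and digits.

```python
import re


def ngrams(text, min_gram=5, max_gram=6):
    """Approximate the grams an Elasticsearch ngram tokenizer would emit.

    Assumption: token_chars [letter, digit] is simplified to ASCII
    letters and digits; any other character splits the input.
    """
    grams = []
    for chunk in re.findall(r"[A-Za-z0-9]+", text):
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    grams.append(chunk[start:start + size])
    return grams


card = "362101198108287617"
grams = ngrams(card)
print(len(grams))        # number of grams indexed for one 18-character value
print("36210" in grams)  # the fragment used in the search above
```

For an 18-character value this yields 27 overlapping grams (14 of length 5 and 13 of length 6). That is the redundancy listed under the cons, and it is also why any 5- or 6-character fragment becomes directly searchable.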

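Option 2:

Process the stored phone numbers and ID card numbers with an ingest pipeline that splits each value into segments and stores the segments in separate fields.

Pros: queries are fast (the segments are stored separately), and the data does not need to be re-imported.

Cons: duplicated data is stored (the segments are kept alongside the original value) and three new fields are added; fewer results can be matched than with ngram.

Demo implementation: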

DELETE user_index
PUT user_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "card": {
        "type": "keyword"
      }
    }
  }
}

PUT user_index/_doc/1
{
  "name":"张三",
  "card":"362101198238280617"
}

PUT user_index/_doc/2
{
   "name":"李四",
  "card":"610624199705973918"
}
PUT user_index/_doc/3
{
   "name":"王五",
  "card":"610624199703100002"
}

GET user_index/_search
# Create a pipeline that processes the ID card number
PUT _ingest/pipeline/card_pipeline
{
  "description": "split id zjhm",
  "processors": [
      {
        "script": {
          "source": """
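          // Only split well-formed 18-character ID card numbers
          if (ctx.card != null && ctx.card.length() == 18) {
            ctx.card_start = ctx.card.substring(0, 6);    // region code
            ctx.card_centre = ctx.card.substring(6, 14);  // date of birth (yyyyMMdd)
            ctx.card_end = ctx.card.substring(14, 18);    // sequence code and check character
          }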
"""
        }
      }
    ]
}
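
The segment boundaries used by the pipeline script can be checked outside Elasticsearch. The following is a minimal Python mirror of that script, assuming the same three target field names; the function name `split_card` is hypothetical and exists only for this illustration.

```python
def split_card(card):
    """Mirror the pipeline script: split an 18-character ID card number.

    Returns an empty dict when the value is missing or not exactly
    18 characters long, matching the script's guard condition.
    """
    if card is None or len(card) != 18:
        return {}
    return {
        "card_start": card[0:6],    # region code
        "card_centre": card[6:14],  # date of birth, yyyyMMdd
        "card_end": card[14:18],    # sequence code and check character
    }


print(split_card("610624199703100002"))
# {'card_start': '610624', 'card_centre': '19970310', 'card_end': '0002'}
```

Values that are not exactly 18 characters long pass through without the three extra fields, just as they would in the pipeline.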

GET _ingest/pipeline/card_pipeline

POST _ingest/pipeline/card_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "name": "王五",
        "card": "610624199703100002"
      }
    }
  ]
}

# Run every existing document through the pipeline in place, without re-importing the data
POST user_index/_update_by_query?pipeline=card_pipeline

GET user_index/_search
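
Once the three segment fields are populated, a search can target a single segment. The sketch below only builds the request body in Python; the helper name `segment_query` is hypothetical. It also assumes the new fields were mapped dynamically, since the mapping above does not declare them, so the actual mapping is worth checking before relying on it.

```python
import json


def segment_query(field, value):
    """Build a search body that matches one split segment.

    Assumption: `field` is one of card_start, card_centre or card_end,
    created by the pipeline and mapped dynamically by Elasticsearch.
    """
    return {"query": {"match": {field: value}}}


# Find documents whose ID card number carries the birth date 1997-03-10.
body = segment_query("card_centre", "19970310")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of GET user_index/_search. Only whole segments can be matched this way, which is why this option finds fewer results than the ngram approach.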
