Using Monstache to sync MongoDB to Elasticsearch

Requirements

A large amount of product data is stored in MongoDB, but querying it directly is too inefficient for page display and search, and much of the data in the db is irrelevant to search. The data therefore needs to be synced to Elasticsearch; having developers sync it by hand, however, would add a lot of development work. After some research we chose Monstache to listen for changes and sync the data in real time.

What is Monstache

Monstache is a data migration tool. It tails the oplog or a change stream (depending on the MongoDB version) to move data from MongoDB to Elasticsearch, and it also supports custom data-transformation middleware. (Official documentation)
Monstache watches MongoDB in real time: whenever a create, update, or delete happens, it picks the change up from the oplog or change stream (again depending on the MongoDB version) and syncs it, so the data stays in sync automatically without a pile of business code. The data flow is as follows:

(Data flow diagram: MongoDB → oplog / change stream → Monstache → Elasticsearch)
Please note:

  1. Monstache only handles moving the data; it does not care about data structure or types. If you have strict type requirements, define the schemas yourself before migrating, on both the MongoDB and the Elasticsearch side (see the mapping sketch after this list).
  2. If MongoDB is sharded, mongo-config-url must be configured.
  3. MongoDB must be running as a replica set.
  4. Monstache initializes field types during the sync; fields with the same name/path must have the same type everywhere, otherwise the sync will fail.
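
For point 1, the index mapping can be created in Elasticsearch before the first sync. A minimal sketch follows; the index name and the field types shown are illustrative assumptions, not taken from the project:

PUT ovms_bt_complex_product
{
  "mappings": {
    "properties": {
      "companyId": { "type": "keyword" },
      "code":      { "type": "keyword" },
      "skuList":   { "type": "nested" }
    }
  }
}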

Use case

The versions in our environment are as follows:

Environment    MongoDB Version    ES Version    Monstache Version
Production     3.4.6              7.7.1         6
Install Monstache

Reference: https://thoughts.teambition.com/workspaces/5e32ab5e9fe52f001c42de03/docs/619cc1d64e91bf000113a796?scroll-to-block=619e20b53bf5830012bddad2
There are usually two options:

  1. Download the source from GitHub and install it with go, which may fail because of network issues.
  2. Download the pre-built .zip, upload it to the server, and run it (see the sketch after this list).
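
A minimal sketch of the two options, assuming Go is installed for the first one; the zip file name is illustrative and depends on the release you download:

# Option 1: build from source (may fail behind a restrictive network)
git clone https://github.com/rwynn/monstache.git
cd monstache
go install

# Option 2: upload a pre-built release zip and unpack it on the server
unzip monstache-build.zip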
Configure config.toml

The file needs to be configured according to your requirements. The configuration below is for reference only:

# connection settings

# connect to MongoDB using the following URL
mongo-url = "mongodb://root:8duVIl_2uA@10.0.0.97:37018/ovms?authSource=admin"
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://10.0.0.141:9200"]

# frequently required settings

# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
direct-read-namespaces = ["ovms.ovms_bt_complex_product"]

# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name.  For a deployment use an empty string.
# change-stream-namespaces = ["ovms.ovms_bt_complex_product"]
# we are below MongoDB 3.6, so use the oplog
enable-oplog = true
# additional settings

# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
namespace-regex = '^ovms\.ovms_bt_complex_product$'
# compress requests to Elasticsearch
# gzip = true
# generate indexing statistics
# stats = true
# index statistics into Elasticsearch
# index-stats = true
# use the following PEM file for connections to MongoDB
# mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
# mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
elasticsearch-user = ""
# use the following password for Elasticsearch basic auth
elasticsearch-password = ""
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file to connections to Elasticsearch
# elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
# elastic-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
# resume = true
# do not validate that progress timestamps have been saved
# resume-write-unsafe = false
# override the name under which resume state is saved
# resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
# resume-strategy = 0
# exclude documents whose namespace matches the following pattern
# namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
# index-files = true
# turn on search result highlighting of GridFS content
# file-highlighting = true
# index GridFS files inserted into the following collections
# file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
# cluster-name = 'es-cn-mp91kzb8m00******'
# do not exit after full-sync, rather continue tailing the oplog
# exit-after-direct-reads = false
# log configuration; probably because of the test environment setup, the files must live under /data/logs
[logs]
info = "/data/logs/_monstache/build/log/info.log"
warn = "/data/logs/_monstache/build/log/warn.log" 
error = "/data/logs/_monstache/build/log/error.log" 
trace = "/data/logs/_monstache/build/log/trace.log" 

[log-rotate]
max-size = 10
max-backups = 10
max-age = 60
localtime = true
compress = true


[[script]]
namespace = "ovms.ovms_bt_complex_product"
path = "/usr/web/_monstache/build/transfer/ovms_bt_complex_product.js"
# routing = true

# [[pipeline]]
# namespace = "ovms.ovms_bt_complex_product"
# path = "/usr/web/_monstache/build/pipeline/ovms_bt_complex_product.js"
# routing = true
 

Pay attention to these configuration items:

  1. elasticsearch-urls = ["http://10.0.0.141:9200"]
  2. mongo-url = "mongodb://root:8duVIl_2uA@10.0.0.97:37018/ovms?authSource=admin"
  3. direct-read-namespaces = ["ovms.ovms_bt_complex_product"] - the collections that are read and synced directly
  4. enable-oplog = true - we are below MongoDB 3.6, so the oplog is used

If you enable [logs] and [log-rotate] to see the log output, create the log directories before starting (see the sketch after the snippet below). The transformation scripts are usually kept under a dedicated directory so they are easier to manage:

[[script]]
namespace = "ovms.ovms_bt_complex_product"
path = "/usr/web/_monstache/build/transfer/ovms_bt_complex_product.js"

direct-read-namespaces: direct-read-namespaces does not guarantee that only the specified collections are synced to Elasticsearch (possibly a bug), so namespace-regex = '^mydb\.mycol$' is also needed, but a regex like this only matches a single database or collection. Alternatively you can use monstache's plugin mechanism, which can be written in Go, JavaScript, etc.; the config below uses a JavaScript filter to drop namespaces that are not in direct-read-namespaces. This is less efficient than namespace-regex. For reference:


[[filter]]
script = """
module.exports = function(doc, ns) {
    switch(ns) {
      case "mydb.mycol":
        return true;
      default:
        return false;
    }
}
"""
 

Since no filter was actually used, it is not managed in a separate directory here.

Run

Run it from the shell:

monstache -f path/to/config.toml
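
To keep monstache running after the session ends, it can be started in the background; a sketch, with the log path taken from the [logs] section above:

nohup monstache -f path/to/config.toml > /dev/null 2>&1 &
tail -f /data/logs/_monstache/build/log/info.log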
Writing the transformation script in JavaScript

Usually there is no need to migrate every MongoDB field into Elasticsearch, so we write our own middleware to transform the documents. Middleware reference: https://rwynn.github.io/monstache-site/advanced/#middleware
There are generally two choices of language for the script:

  1. Go
  2. JavaScript
    A comparison:
        Go                                              JavaScript
Pros    Faster; same language monstache is written in   Simpler; lower learning curve
Cons    Higher development cost                         Lower runtime efficiency

Given the manpower cost and the schedule, I went with JavaScript. Reference case (ovms_bt_complex_product.js):

module.exports = function (doc, ns) {
    // keep only the top-level fields needed for search
    var _doc = _.pick(doc, "companyId", "sellerId", "code");
    if (doc.skuList != null && doc.skuList.length !== 0) {
        // keep only the required fields of each sku entry
        var _skuList = [];
        var length = doc.skuList.length;
        for (var i = 0; i < length; i++) {
            var _sku = _.pick(doc.skuList[i], "sku", "outerSku", "qty", "usableQty", "created", "modified", "creater");
            _skuList.push(_sku);
        }
        _doc.skuList = _skuList;
    }

    if (doc.customAttributes) {
        _doc.customAttributes = _.pick(doc.customAttributes, "attr0", "complex_attribute");
    }

    if (doc.platforms) {
        // only platform p0 and two of its fields are relevant to search
        var _platforms = {};
        if (doc.platforms.p0) {
            _platforms.p0 = _.pick(doc.platforms.p0, "field1", "field5");
        }
        _doc.platforms = _platforms;
    }

    return _doc;
}
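
A quick way to sanity-check the transform locally, assuming Node.js with lodash installed; this is only a test harness (monstache's embedded engine already exposes `_`, as used above), and the sample document fields are made up for illustration:

// local test only: shim the `_` global that monstache normally provides
global._ = require('lodash');
var transform = require('./ovms_bt_complex_product.js');

// illustrative sample document; the extra fields should be dropped by the transform
var sample = {
    companyId: 1,
    sellerId: 2,
    code: "P001",
    internalOnly: "will be dropped",
    skuList: [{ sku: "S1", qty: 3, warehouse: "also dropped" }]
};
console.log(JSON.stringify(transform(sample, "ovms.ovms_bt_complex_product"), null, 2));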

Things to note about the script:

  1. Declare variables with var; let is not accepted (const untested).
  2. If a field has no value in the source document, it will not be synced to ES, even if it is handled in transfer.js.
  3. Arrow function expressions (=>) are not supported.
Q&A

Along the way you may run into the following issues:

  1. The sync fails with a "too many fields" error. What to do?
    Raise the index field limit in Elasticsearch:

    PUT es_cms_bt_product_c928/_settings
    {
      "index.mapping.total_fields.limit": 2000
    }

  2. If the ES structure is not predefined, are attributes for the same node accumulated?
    Yes, a union is taken: with A1{f1,f2} and A2{f2,f3,f4}, the dynamic mapping in ES ends up as A{f1,f2,f3,f4} (see the sketch after this list).
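
A quick way to observe this union behaviour against a throwaway index (the index name and fields are illustrative):

PUT test_union/_doc/1
{ "f1": 1, "f2": 2 }

PUT test_union/_doc/2
{ "f2": 2, "f3": 3, "f4": 4 }

GET test_union/_mapping

The mapping returned for test_union should contain f1, f2, f3, and f4.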