Requirements
A large volume of product data lives in MongoDB, but querying it directly for page display and search is too slow, and much of the data in the DB is irrelevant to search. The data therefore needs to be synced to Elasticsearch. Having developers sync it by hand would add a large amount of business-code work, so after some research we chose Monstache to listen for changes and sync the data in real time.
What is Monstache
Monstache is a data migration tool that syncs data from MongoDB to Elasticsearch by tailing the oplog or listening to a change stream (depending on the MongoDB version). It also supports user-defined transformation middleware. (See the official documentation.)
Monstache listens to MongoDB in real time: on every create, update, or delete it propagates the change via the oplog or a change stream (depending on the MongoDB version), so data is synced automatically without a large amount of business code. The data flow is as follows:
Note:
- Monstache only migrates the data; it does not care about data structure or types. If you have strict type requirements, define the schema yourself before migrating, on both the MongoDB and the Elasticsearch side.
- If MongoDB is sharded, mongo-config-url must be configured.
- MongoDB must run as a replica set.
- Monstache initializes field types during the sync; fields at the same path must have the same type across documents, otherwise the sync fails.
Usage
Our versions are as follows:

Environment | MongoDB Version | ES Version | Monstache Version |
---|---|---|---|
Production | 3.4.6 | 7.7.1 | 6 |
Installing Monstache
Reference: https://thoughts.teambition.com/workspaces/5e32ab5e9fe52f001c42de03/docs/619cc1d64e91bf000113a796?scroll-to-block=619e20b53bf5830012bddad2
There are usually two options:
- Download the source from GitHub and install it with go, but this may fail due to network issues.
- Download the prebuilt .zip, upload it to the server, and run it.
Configure config.toml
Configure the file according to your needs; the following is for reference only:
# connection settings
# connect to MongoDB using the following URL
mongo-url = "mongodb://root:8duVIl_2uA@10.0.0.97:37018/ovms?authSource=admin"
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://10.0.0.141:9200"]
# frequently required settings
# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
direct-read-namespaces = ["ovms.ovms_bt_complex_product"]
# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name. For a deployment use an empty string.
# change-stream-namespaces = ["ovms.ovms_bt_complex_product"]
# below MongoDB 3.6, use the legacy oplog
enable-oplog = true
# additional settings
# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
namespace-regex = '^ovms\.ovms_bt_complex_product$'
# compress requests to Elasticsearch
# gzip = true
# generate indexing statistics
# stats = true
# index statistics into Elasticsearch
# index-stats = true
# use the following PEM file for connections to MongoDB
# mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
# mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
elasticsearch-user = ""
# use the following password for Elasticsearch basic auth
elasticsearch-password = ""
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file to connections to Elasticsearch
# elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
# elastic-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
# resume = true
# do not validate that progress timestamps have been saved
# resume-write-unsafe = false
# override the name under which resume state is saved
# resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
# resume-strategy = 0
# exclude documents whose namespace matches the following pattern
# namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
# index-files = true
# turn on search result highlighting of GridFS content
# file-highlighting = true
# index GridFS files inserted into the following collections
# file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
# cluster-name = 'es-cn-mp91kzb8m00******'
# do not exit after full-sync, rather continue tailing the oplog
# exit-after-direct-reads = false
# log configuration; possibly due to the test-environment setup, the files must live under /data/logs
[logs]
info = "/data/logs/_monstache/build/log/info.log"
warn = "/data/logs/_monstache/build/log/warn.log"
error = "/data/logs/_monstache/build/log/error.log"
trace = "/data/logs/_monstache/build/log/trace.log"
[log-rotate]
max-size = 10
max-backups = 10
max-age = 60
localtime = true
compress = true
[[script]]
namespace = "ovms.ovms_bt_complex_product"
path = "/usr/web/_monstache/build/transfer/ovms_bt_complex_product.js"
# routing = true
# [[pipeline]]
# namespace = "ovms.ovms_bt_complex_product"
# path = "/usr/web/_monstache/build/pipeline/ovms_bt_complex_product.js"
# routing = true
Key configuration items:
- elasticsearch-urls = ["http://10.0.0.141:9200"]
- mongo-url = "mongodb://root:8duVIl_2uA@10.0.0.97:37018/ovms?authSource=admin"
- direct-read-namespaces = ["ovms.ovms_bt_complex_product"] - the collections to copy directly
- below MongoDB 3.6, use the oplog: enable-oplog = true
If you want the [logs] and [log-rotate] output, create the log directories first. Transformation scripts are usually kept in a dedicated directory for easier management:
[[script]]
namespace = "ovms.ovms_bt_complex_product"
path = "/usr/web/_monstache/build/transfer/ovms_bt_complex_product.js"
About direct-read-namespaces: on its own, direct-read-namespaces does not guarantee that only the specified collections are synced to Elasticsearch (possibly a bug), so namespace-regex = '^mydb\.mycol$' is also needed, but a single regex can only match one database or collection pattern. Alternatively, Monstache's plugin mechanism (Go, JavaScript, etc.) can be used: a JavaScript filter in the config file drops every namespace that is not in direct-read-namespaces. This is less efficient than namespace-regex, though. For reference:
[[filter]]
script = """
module.exports = function(doc, ns) {
switch(ns) {
case "mydb.mycol":
return true;
default:
return false;
}
}
"""
Since we did not end up using a filter, it is not organized into its own directory here.
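If you do need a filter, it is plain JavaScript and easy to test outside Monstache. Below is a Node-runnable sketch of the same idea, restricted to the namespace used in this document's config (adjust the list to your own collections):

```javascript
// Namespaces allowed through to Elasticsearch; taken from the
// direct-read-namespaces setting in the config above.
var ALLOWED = ["ovms.ovms_bt_complex_product"];

function filter(doc, ns) {
  // Monstache calls the filter with the document and its namespace;
  // returning false drops the event before it reaches Elasticsearch.
  return ALLOWED.indexOf(ns) !== -1;
}

console.log(filter({}, "ovms.ovms_bt_complex_product")); // true
console.log(filter({}, "ovms.some_other_collection"));   // false
```

A list lookup like this scales more naturally than a switch when several collections are whitelisted.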
Run
Execute the shell command:
monstache -f path/to/config.toml
Writing the transformation script in JavaScript
Usually not every MongoDB field needs to be migrated to Elasticsearch, so we write our own middleware to transform the documents. See the middleware docs: https://rwynn.github.io/monstache-site/advanced/#middleware
There are usually two choices for the script language:
- Go
- JavaScript
Compared below:
 | Go | JavaScript |
---|---|---|
Pros | Faster; matches the language Monstache itself is written in | Simpler; lower learning curve |
Cons | Higher development cost | Lower performance |
For reasons of effort and schedule, I chose JavaScript. Example (ovms_bt_complex_product.js):
module.exports = function (doc, ns) {
var _doc = _.pick(doc, "companyId", "sellerId", "code");
if (doc.skuList != null && doc.skuList.length !== 0) {
var _skuList = [];
var length = doc.skuList.length;
for (var i = 0; i < length; i++) {
var _sku = _.pick(doc.skuList[i], "sku", "outerSku", "qty", "usableQty", "created", "modified", "creater");
_skuList.push(_sku);
}
_doc.skuList = _skuList;
}
if (doc.customAttributes) {
_doc.customAttributes = _.pick(doc.customAttributes, "attr0", "complex_attribute");
}
if (doc.platforms) {
var _platforms = {};
if (doc.platforms.p0) {
_platforms.p0 = _.pick(doc.platforms.p0, "field1", "field5")
}
_doc.platforms = _platforms;
}
return _doc;
}
Notes on the script:
- Declare variables with var; let is rejected, and const is untested.
- A field with no value in the source document is not synced to ES, even if it is referenced in the transform script.
- Arrow function expressions (=>) are not supported.
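To see what the transform above actually produces, here is a standalone sketch runnable in plain Node. Monstache's script environment provides underscore's `_`; here a minimal pick helper stands in for `_.pick`, and the sample document is invented for illustration:

```javascript
// Minimal stand-in for underscore's _.pick (assumption: only the
// own-property, key-list form used by the transform script is needed).
function pick(obj, keys) {
  var out = {};
  for (var i = 0; i < keys.length; i++) {
    if (obj != null && Object.prototype.hasOwnProperty.call(obj, keys[i])) {
      out[keys[i]] = obj[keys[i]];
    }
  }
  return out;
}

// Invented sample document; only the whitelisted fields survive.
var doc = {
  companyId: 1,
  sellerId: 2,
  code: "P-001",
  internalOnly: "not needed in ES",
  skuList: [{ sku: "S1", qty: 3, warehouseNote: "dropped" }]
};

var _doc = pick(doc, ["companyId", "sellerId", "code"]);
_doc.skuList = doc.skuList.map(function (s) {
  return pick(s, ["sku", "outerSku", "qty", "usableQty"]);
});

console.log(JSON.stringify(_doc));
// {"companyId":1,"sellerId":2,"code":"P-001","skuList":[{"sku":"S1","qty":3}]}
```

Note that absent fields (outerSku, usableQty here) simply disappear from the output, matching the behavior described in the second note above.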
Q&A
Along the way, you may run into the following problems:
- The sync fails with a "too many fields" error. What to do?
Raise the index's field limit in Elasticsearch:
PUT es_cms_bt_product_c928/_settings
{ "index.mapping.total_fields.limit": 2000 }
- Without a predefined ES schema, are fields of the same node accumulated?
Yes, the union is taken: given A1{f1,f2} and A2{f2,f3,f4}, the resulting ES dynamic mapping is A{f1,f2,f3,f4}.
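The union behavior can be illustrated with a small sketch. This only mimics the top-level key union that ES dynamic mapping performs; it is not the real mapping logic:

```javascript
// Collect the union of top-level field names across document shapes,
// in first-seen order, as ES dynamic mapping accumulates them.
function mappingUnion(docs) {
  var fields = {};
  docs.forEach(function (d) {
    Object.keys(d).forEach(function (k) { fields[k] = true; });
  });
  return Object.keys(fields);
}

var A1 = { f1: 1, f2: 2 };
var A2 = { f2: 2, f3: 3, f4: 4 };

console.log(mappingUnion([A1, A2])); // [ 'f1', 'f2', 'f3', 'f4' ]
```

This is why it pays to keep the transform script's output shape consistent: every new field any document ever introduces stays in the mapping and counts toward the total_fields limit mentioned above.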