Requirements
A large volume of product data lives in MongoDB, but querying it directly for page display and search is too slow, and much of the data in the DB is irrelevant to search. The data therefore needs to be synced to Elasticsearch. Having developers sync it by hand would add a large amount of business-code work, so after some research we chose Monstache to listen for changes and sync the data in real time.
What is Monstache
Monstache is a data migration tool that syncs data from MongoDB to Elasticsearch by tailing the oplog or listening to a change stream (depending on the MongoDB version). It also supports user-defined transformation middleware. (See the official documentation.)
Monstache listens to MongoDB in real time: on every create, update, or delete it propagates the change via the oplog or a change stream (depending on the MongoDB version), so data is synced automatically without a large amount of business code. The data flow is as follows:
Note:
- Monstache only migrates the data; it does not care about data structure or types. If you have strict type requirements, define the schema yourself before migrating, on both the MongoDB and the Elasticsearch side.
- If MongoDB is sharded, mongo-config-url must be configured.
- MongoDB must run as a replica set.
- Monstache initializes field types during the sync; fields at the same path must have the same type across documents, otherwise the sync fails.
Usage
Our versions are as follows:

Environment | MongoDB Version | ES Version | Monstache Version |
---|---|---|---|
Production | 3.4.6 | 7.7.1 | 6 |
Installing Monstache
Reference: https://thoughts.teambition.com/workspaces/5e32ab5e9fe52f001c42de03/docs/619cc1d64e91bf000113a796?scroll-to-block=619e20b53bf5830012bddad2
There are usually two options:
- Download the source from GitHub and install it with go, but this may fail due to network issues.
- Download the prebuilt .zip, upload it to the server, and run it.
Configure config.toml
Configure the file according to your needs; the following is for reference only:
# connection settings
# connect to MongoDB using the following URL
mongo-url = "mongodb://root:8duVIl_2uA@10.0.0.97:37018/ovms?authSource=admin"
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://10.0.0.141:9200"]
# frequently required settings
# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
direct-read-namespaces = ["ovms.ovms_bt_complex_product"]
# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name. For a deployment use an empty string.
# change-stream-namespaces = ["ovms.ovms_bt_complex_product"]
# below MongoDB 3.6, use the legacy oplog
enable-oplog = true
# additional settings
# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
namespace-regex = '^ovms\.ovms_bt_complex_product$'
# compress requests to Elasticsearch
# gzip = true
# generate indexing statistics
# stats = true
# index statistics into Elasticsearch
# index-stats = true
# use the following PEM file for connections to MongoDB
# mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
# mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
elasticsearch-user = ""
# use the following password for Elasticsearch basic auth
elasticsearch-password = ""
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file to connections to Elasticsearch
# elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
# elastic-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
# resume = true
# do not validate that progress timestamps have been saved
# resume-write-unsafe = false
# override the name under which resume state is saved
# resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
# resume-strategy = 0
# exclude documents whose namespace matches the following pattern
# namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
# index-files = true
# turn on search result highlighting of GridFS content
# file-highlighting = true
# index GridFS files inserted into the following collections
# file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
# cluster-name = 'es-cn-mp91kzb8m00******'
# do not exit after full-sync, rather continue tailing the oplog
# exit-after-direct-reads = false
# log configuration; possibly due to the test-environment setup, the files must live under /data/logs
[logs]
info = "/data/logs/_monstache/build/log/info.log"
warn = "/data/logs/_monstache/build/log/warn.log"
error = "/data/logs/_monstache/build/log/error.log"
trace = "/data/logs/_monstache/build/log/trace.log"
[log-rotate]
max-size = 10
max-backups = 10
max-age = 60
localtime = true
compress = true
[[script]]
namespace = "ovms.ovms_bt_complex_product"
path = "/usr/web/_monstache/build/transfer/ovms_bt_complex_product.js"
# routing = true
# [[pipeline]]
# namespace = "ovms.ovms_bt_complex_product"
# path = "/usr/web/_monstache/build/pipeline/ovms_bt_complex_product.js"
# routing = true
Key configuration items:
- elasticsearch-urls = ["http://10.0.0.141:9200"]
- mongo-url = "mongodb://root:8duVIl_2uA@10.0.0.97:37018/ovms?authSource=admin"
- direct-read-namespaces = ["ovms.ovms_bt_complex_product"] - the collections to copy directly
- below MongoDB 3.6, use the oplog: enable-oplog = true
If you want the [logs] and [log-rotate] output, create the log directories first. Transformation scripts are usually kept in a dedicated directory for easier management:
[[script]]
namespace = "ovms.ovms_bt_complex_product"
path = "/usr/web/_monstache/build/transfer/ovms_bt_complex_product.js"
About direct-read-namespaces: on its own, direct-read-namespaces does not guarantee that only the specified collections are synced to Elasticsearch (possibly a bug), so namespace-regex = '^mydb\.mycol$' is also needed, but a single regex can only match one database or collection pattern. Alternatively, Monstache's plugin mechanism (Go, JavaScript, etc.) can be used: a JavaScript filter in the config file drops every namespace that is not in direct-read-namespaces. This is less efficient than namespace-regex, though. For reference:
[[filter]]
script = """
module.exports = function(doc, ns) {
switch(ns) {
case "mydb.mycol":
return true;
default:
return false;
}
}
"""
Since we did not end up using a filter, it is not organized into its own directory here.
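If you do need a filter, it is plain JavaScript and easy to test outside Monstache. Below is a Node-runnable sketch of the same idea, restricted to the namespace used in this document's config (adjust the list to your own collections):

```javascript
// Namespaces allowed through to Elasticsearch; taken from the
// direct-read-namespaces setting in the config above.
var ALLOWED = ["ovms.ovms_bt_complex_product"];

function filter(doc, ns) {
  // Monstache calls the filter with the document and its namespace;
  // returning false drops the event before it reaches Elasticsearch.
  return ALLOWED.indexOf(ns) !== -1;
}

console.log(filter({}, "ovms.ovms_bt_complex_product")); // true
console.log(filter({}, "ovms.some_other_collection"));   // false
```

A list lookup like this scales more naturally than a switch when several collections are whitelisted.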
Run
Execute the shell command:
monstache -f path/to/config.toml
Writing the transformation script in JavaScript
Usually not every MongoDB field needs to be migrated to Elasticsearch, so we write our own middleware to transform the documents. See the middleware docs: https://rwynn.github.io/monstache-site/advanced/#middleware
There are usually two choices for the script language:
- Go
- JavaScript
Compared below:
 | Go | JavaScript |
---|---|---|
Pros | Faster; matches the language Monstache itself is written in | Simpler; lower learning curve |
Cons | Higher development cost | Lower performance |
For reasons of effort and schedule, I chose JavaScript. Example (ovms_bt_complex_product.js):
module.exports = function (doc, ns) {
var _doc = _.pick(doc, "companyId", "sellerId", "code");
if (doc.skuList != null && doc.skuList.length !== 0) {
var _skuList = [];
var length = doc.skuList.length;
for (var i = 0; i < length; i++) {
var _sku = _.pick(doc.skuList[i], "sku", "outerSku", "qty", "usableQty", "created", "modified", "creater");
_skuList.push(_sku);
}
_doc.skuList = _skuList;
}
if (doc.customAttributes) {
_doc.customAttributes = _.pick(doc.customAttributes, "attr0", "complex_attribute");
}
if (doc.platforms) {
var _platforms = {};
if (doc.platforms.p0) {
_platforms.p0 = _.pick(doc.platforms.p0, "field1", "field5")
}
_doc.platforms = _platforms;
}
return _doc;
}
Notes on the script:
- Declare variables with var; let is rejected, and const is untested.
- A field with no value in the source document is not synced to ES, even if it is referenced in the transform script.
- Arrow function expressions (=>) are not supported.
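To see what the transform above actually produces, here is a standalone sketch runnable in plain Node. Monstache's script environment provides underscore's `_`; here a minimal pick helper stands in for `_.pick`, and the sample document is invented for illustration:

```javascript
// Minimal stand-in for underscore's _.pick (assumption: only the
// own-property, key-list form used by the transform script is needed).
function pick(obj, keys) {
  var out = {};
  for (var i = 0; i < keys.length; i++) {
    if (obj != null && Object.prototype.hasOwnProperty.call(obj, keys[i])) {
      out[keys[i]] = obj[keys[i]];
    }
  }
  return out;
}

// Invented sample document; only the whitelisted fields survive.
var doc = {
  companyId: 1,
  sellerId: 2,
  code: "P-001",
  internalOnly: "not needed in ES",
  skuList: [{ sku: "S1", qty: 3, warehouseNote: "dropped" }]
};

var _doc = pick(doc, ["companyId", "sellerId", "code"]);
_doc.skuList = doc.skuList.map(function (s) {
  return pick(s, ["sku", "outerSku", "qty", "usableQty"]);
});

console.log(JSON.stringify(_doc));
// {"companyId":1,"sellerId":2,"code":"P-001","skuList":[{"sku":"S1","qty":3}]}
```

Note that absent fields (outerSku, usableQty here) simply disappear from the output, matching the behavior described in the second note above.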
Q&A
Along the way, you may run into the following problems:
- The sync fails with a "too many fields" error. What to do?
Raise the index's field limit in Elasticsearch:
PUT es_cms_bt_product_c928/_settings
{ "index.mapping.total_fields.limit": 2000 }
- Without a predefined ES schema, are fields of the same node accumulated?
Yes, the union is taken: given A1{f1,f2} and A2{f2,f3,f4}, the resulting ES dynamic mapping is A{f1,f2,f3,f4}.
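The union behavior can be illustrated with a small sketch. This only mimics the top-level key union that ES dynamic mapping performs; it is not the real mapping logic:

```javascript
// Collect the union of top-level field names across document shapes,
// in first-seen order, as ES dynamic mapping accumulates them.
function mappingUnion(docs) {
  var fields = {};
  docs.forEach(function (d) {
    Object.keys(d).forEach(function (k) { fields[k] = true; });
  });
  return Object.keys(fields);
}

var A1 = { f1: 1, f2: 2 };
var A2 = { f2: 2, f3: 3, f4: 4 };

console.log(mappingUnion([A1, A2])); // [ 'f1', 'f2', 'f3', 'f4' ]
```

This is why it pays to keep the transform script's output shape consistent: every new field any document ever introduces stays in the mapping and counts toward the total_fields limit mentioned above.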