【clickhouse随笔】

MergeTree表引擎使用的一些心得
1.除了ODS层的表,其它层的表都不要有Nullable()的字段。因为之后如果要是按照Nullable()的字段做分组,然后保存数据到相应的表,汇总表中的GROUP BY子句中就会有Nullable()字段,就需要对Nullable()字段做处理。如果有多个汇总表,就会对Nullable()字段做多次处理,很麻烦。
2.删除了ReplicatedMergeTree家族表引擎的表之后,ZooKeeper上对应路径的节点并不会马上被删除。要等一段时间后ZooKeeper上的节点才会消失,之后才能创建相同路径的表。一定不要手动删除ZooKeeper上的节点,不然新创建的相同路径可能莫名其妙的就消失了。
3.二级AggregatingMergeTree表引擎DEMO(AggregatingMergeTree的底表也是AggregatingMergeTree)好像不行,第一级的聚合数据会更新,第二级的聚合数据很长时间也不更新。AggregatingMergeTree表引擎可以放到DWS层或ADS层使用。
4.由于二级AggregatingMergeTree表引擎的设想没有通过DEMO实现,因此DWD层还是用Datax或者Spark来处理数据。

二级AggregatingMergeTree表引擎DEMO
-------------------------------------------------- 底表 --------------------------------------------------
CREATE TABLE ods.zxj_base_local ON CLUSTER cluster_01
(
   trade_date Date COMMENT ‘交易日期’,
   trade_year UInt16 DEFAULT 0 COMMENT ‘交易年度’,
   trade_month UInt8 DEFAULT 0 COMMENT ‘交易月份’,
   trade_year_month Int32 DEFAULT 0 COMMENT ‘交易年月’,
   trade_amount Int32 DEFAULT 0 COMMENT ‘交易金额’
)
ENGINE = ReplicatedMergeTree(‘clickhouse/tables/{shard}/cluster_01/zxj_base_local’, ‘{host}’)
PARTITION BY trade_date
ORDER BY trade_date
SETTINGS index_granularity = 8192


CREATE TABLE ods.zxj_base_all ON CLUSTER cluster_01
(
   trade_date Date COMMENT ‘交易日期’,
   trade_year UInt16 DEFAULT 0 COMMENT ‘交易年度’,
   trade_month UInt8 DEFAULT 0 COMMENT ‘交易月份’,
   trade_year_month Int32 DEFAULT 0 COMMENT ‘交易年月’,
   trade_amount Int32 DEFAULT 0 COMMENT ‘交易金额’
)
ENGINE = Distributed(‘cluster_01’, ‘ods’, ‘zxj_base_local’, rand())


-------------------------------------------------- 聚合表引擎的物化视图(第一级) --------------------------------------------------
CREATE MATERIALIZED VIEW ods.zxj_month_local ON CLUSTER cluster_01
ENGINE = ReplicatedAggregatingMergeTree(‘clickhouse/tables/{shard}/cluster_01/zxj_month_local’, ‘{host}’)
PARTITION BY trade_year_month
ORDER BY trade_year_month
POPULATE AS
SELECT
  trade_year_month,
  anyState(trade_year) AS trade_year,
  anyState(trade_month) AS trade_month,
  sumState(trade_amount) AS trade_amount
FROM ods.zxj_base_local
GROUP BY trade_year_month


CREATE TABLE ods.zxj_month_all ON CLUSTER cluster_01
(
   trade_year_month Int32,
   trade_year AggregateFunction(any, UInt16),
   trade_month AggregateFunction(any, UInt8),
   trade_amount AggregateFunction(sum, Int32)
)
ENGINE = Distributed(‘cluster_01’, ‘ods’, ‘zxj_month_local’, rand())


-------------------------------------------------- 聚合表引擎的物化视图(第二级) --------------------------------------------------
CREATE MATERIALIZED VIEW ods.zxj_year_local ON CLUSTER cluster_01
ENGINE = ReplicatedAggregatingMergeTree(‘clickhouse/tables/{shard}/cluster_01/zxj_year_local’, ‘{host}’)
PARTITION BY trade_year
ORDER BY trade_year
SETTINGS index_granularity = 8192 AS
SELECT
  trade_year,
  sumState(trade_amount) AS trade_amount
FROM
(
   SELECT
      anyMerge(trade_year) AS trade_year,
      sumMerge(trade_amount) AS trade_amount
   FROM ods.zxj_month_local
   GROUP BY trade_year_month
)
GROUP BY trade_year

CREATE TABLE ods.zxj_year_all
(
   trade_year UInt16,
   trade_amount AggregateFunction(sum, Int64)
)
ENGINE = Distributed(‘cluster_01’, ‘ods’, ‘zxj_year_local’, rand())

说到数据库中的排序问题,看似简单,实则里面大有文章。今天我们就来聊聊 ClickHouse 的排序。ClickHouse 在建表的时候,会使用主键作为默认的排序键,也是可以在创建表之后修改排序键,也是下面要说的主要话题。

我们先建个 dc_dw_cust_trade_detail_p2 表,我这用的是 CL_TEST 库,建表语句如下:
CREATE TABLE CL_TEST.dc_dw_cust_trade_detail_p2
(
id Int64,
till_num Nullable(Int64),
posted Nullable(String),
trade_date Int32
)
ENGINE = MergeTree
PARTITION BY intDiv(trade_date, 100)
PRIMARY KEY id

修改排序
官方给出的修改排序键的说明:
MODIFY ORDER BY new_expression
该操作仅支持 MergeTree 系列表 (含 replicated 表)。它会将表的 排序键变成 new_expression (元组表达式)。主键仍保持不变。该操作是轻量级的,仅会改变元数据。

看看上面说的多简单,但实际操作确实有几个要注意的地方,下面我总结了个“七言绝句”:

排序决
已有字段不能用,
新列改键写一起。
主键必须排第一,
字段不可值为空。
优化分区可重建,
排序键值不能改。
已有字段不能用
不能使用已经存在的列
ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2
MODIFY ORDER BY ( id , trade_date )
异常信息:ClickHouse exception, code: 36, host: 192.168.17.61, port: 8123; Code: 36, e.displayText() = DB::Exception: Existing column trade_date is used in the expression that was added to the sorting key. You can add expressions that use only the newly added columns (version 21.4.6.55 (official build))
报错了,往下看。
新列改键写一起
因为已经存在的字段不能使用,所以要修改排序键的同时还必须增加个新的字段,那么两个事需要写到一个语句中。下面为增加新字段 posted2 并把 posted2 加入到 排序键的元组总,如下:
ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2
ADD COLUMN posted2 String,
MODIFY ORDER BY ( posted2)
报错了,往下看。
主键必须排第一
排序键表达式中不使用主键行不行?不行
排序键表达式中主键不放在首位行不行?不行
SQL如下:
– 假设把排序键修改为 posted2 字段
ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2
ADD COLUMN posted2 String,
MODIFY ORDER BY ( posted2)
– 假设排序键元组中增加 posted2 字段,并放到 id 前面
ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2
ADD COLUMN posted2 String,
MODIFY ORDER BY ( posted2, id)
异常信息:Primary key must be a prefix of the sorting key, but in position 0 its column is id, not posted2 (version 21.4.6.55 (official build)) 主键必须在排序键的前面,在排序键位置0处应该id,而不是posted2。
报错了,往下看。
字段不可值为空
不能是用可为null的字段,在 Nullable  的描述中也说过这个类型不能包含在表索引中。
ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2
ADD COLUMN posted2 Nullable(String),
MODIFY ORDER BY (id, posted2)
异常信息:ClickHouse exception, code: 44, host: 192.168.17.61, port: 8123; Code: 44, e.displayText() = DB::Exception: Sorting key cannot contain nullable columns (version 21.4.6.55 (official build))
报错了,往下看。

好了,已经折腾的差不都了,来个可以把 ​posted2​ 加到排序键元组中的正确语句:
ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2
ADD COLUMN posted2 String,
MODIFY ORDER BY (id, posted2)

优化分区可重建
排序键我们现在是可以修改完了,但这是个轻量级操作,仅会改变元数据。意思就是索引文件还是原来的,新设置的排序规则对优化查询还起不到任何作用。那么如何让索引文件重建呢?经过尝试发现使用 OPTIMIZE 优化表的语句,可以强制对比数据进行合并和重建索引文件(此为操作后发现所有数据目录内的文件操作时间都变化了,并未对排序实际情况进行查询指标方面的验证)。
OPTIMIZE TABLE CL_TEST.dc_dw_cust_trade_detail_p2 partition 202008;

排序键值不能改
在设计排序键的时候还必须要注意一个事情,那就是包含在排序键中的字段值不让更新的,提示如下:
异常信息:ClickHouse exception, code: 420, host: 192.168.17.61, port: 8123; Code: 420, e.displayText() = DB::Exception: Cannot UPDATE key column posted2 (version 21.4.6.55 (official build))

提出问题:

1.为什么修改排序键时不能使用已经存在的字段?
2.为什么表已经存在主键就可以作为排序键使用?难道因为它本身一直就是一级索引?
3.为什么使用的字段不可以为 null?
4.想把已经存在的列加入到排序键元组中怎么办?

已经在排序键中的字段,在修改排序键的时候能删除掉,或修改顺序嘛??? 我猜是不能。
todo 停机了,回头试试。
ClickHouse 修改排序键后索引是如何更新的?
ClickHouse 是可以通过下面语句修改排序键(这应该也是一级索引吧),但,按官方文档说:“该操作是轻量级的,仅会改变元数据。” 那么,我更改完排序键后索引文件应该没有任何变化,这时我们设置的新排序规则应该对查询没有任何作用? 那么,我就有下面几个问题了,希望大神可以帮解答下,谢谢:
1.更改完排序键后索引文件应该没有任何变化,这时我们设置的新排序规则应该对查询没有任何作用?
2.如何才能根据新设置的排序键进行重建索引?
3.ClickHouse 在什么情况下会自动重建索引?
修改排序键 sql 如下: ALTER TABLE CL_TEST.dc_dw_cust_trade_detail_p2 ADD COLUMN posted2 String, MODIFY ORDER BY (id, posted2)

参考资料:
1.https://clickhouse.tech/docs/en/sql-reference/statements/alter/order-by/ (官方)
2.ALTER MODIFY ORDER BY does not work #13835

一些也许不是坑的坑
ReplacingMergeTree
中文文档截图:

英文文档截图:

文档中描述ReplacingMergeTree表引擎“合并会在后台一个不确定的时间进行,因此你无法预先作出计划”、”尽管你可以调用OPTIMIZE语句发起计划外的合并,但请不要依靠它,因为OPTIMIZE语句会引发数据的大量读写“。
在做联系的过程中发现:
●两次向ReplacingMergeTree表引擎的表插入相同的数据(几十条),在第二次插入数据后的10~15分钟左右,数据会合并。
●使用Datax从MySQL(单表数据在两千万左右,此处用N表示记录数)同步数据到ReplacingMergeTree表引擎的表。在第二次插入数据后,有的时候COUNT(*)得到的结果是2N;有的时候得到的结果会小于2N,但是远远大于N。几个小时之后表中的记录数也不会减少(最长观察时间超过12小时)。
如果没有找到其它的处理方式可以让合并控制在一个有限的时间范围之内,还是要调用OPTIMIZE语句。否则你就需要修改查询的SQL才能得到预期的结果数据,SQL会变得复杂。

CollapsingMergeTree
中文文档截图:

英文文档截图:

文档中描述CollapsingMergeTree会“异步的删除(折叠)”数据,同样会面对和ReplacingMergeTree一样的问题,不知道数据在什么时候合并结束。所以我们还是得调用OPTIMIZE语句。
中文文档中的一个大坑
看ClickHouse官网的中文文档,会发现有些章节的文档是直接用翻译软件翻译的,读起来都不是人话。有些文档是人工翻译的,但是也有读起来不通顺和不好理解的地方。
CollapsingMergeTree章节就遇到的一个大坑,按照中文文档的描述设计了一个拉链表,但是合并后的效果和预期的效果总是不一致。先插入一条记录,然后插入修改此数据的三条记录。合并前有3条“sign=1”和1条“sign=-1”的数据,合并后应该剩下2条“sign=1“的记录。但是经过多次反复测试,合并后只剩下1条”sign=1”的记录,且此记录是生命周期为当前时间可用的那条记录。
对比中文文档和英文文档,会发现对“成对的行”的描述有很大的区别。注意因为文档中的“(ORDER BY)”,这是关键性的描述。折叠的是“建表语句中ORDER BY子句中字段的值相等”的行。中文文档的描述是错误的,会对表的设计产生误导。
按照英文文档的描述,重新设计了拉链表的ORDER BY子句中的字段。经过多次测试,结果与预期一致。(请参照文档中拉链表的设计)

MySQL中各表的数据量
select count() from car_order    
– 总记录数:20843924


select count(
) from car_order_item
– 总记录数:22995000


select count() from pay_order
– 总记录数:23608902


select count(
) from dc_dw_cust_trade_detail
– 总记录数:26584670


SELECT COUNT() FROM dc_dw_cust_info
– 总记录数:2339380


SELECT COUNT(
) FROM dc_dt_tag
– 总记录数:17169508

ReplacingMergeTree表引擎
car_order
因为car_order、car_order_item、pay_order三张表数据只增不减,字段值会有更新,所以采用ReplacingMergeTree表引擎。
目前生产环境中每日站级订单数量在三十万左右,每月订单数量在一千万,我们按月做分区 PARTITION BY replaceOne(substring(trade_date, 1, 7), ‘-’, ‘’)。
ClickHouse表结构:
CREATE TABLE edc181.car_order
(
  id Int64 COMMENT ‘主键’,
  till_num Int64 COMMENT ‘站级订单id’,
  station_code String COMMENT ‘油站编码’,
  car_number Nullable(String) COMMENT ‘车牌号’,
  member_code Nullable(String) COMMENT ‘自建会员码’,
  cust_magkey Nullable(String) COMMENT ‘统建会员码’,
  pos_batch_id Nullable(Int32) COMMENT ‘收银班次号’,
  day_batch_id Int32 COMMENT ‘营业日批次’,
  day_batch_date Nullable(String) COMMENT ‘营业日批次’,
  time_close Nullable(String) COMMENT ‘交易上送时间’,
  posted Nullable(String) COMMENT ‘实际交易时间’,
  pos_id Nullable(Int32) COMMENT ‘POS编号’,
  app_id Nullable(Int64) COMMENT ‘渠道id’,
  trade_date String COMMENT ‘分片键日期’,
  create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
  user_id Nullable(String) COMMENT ‘统建id’,
  update_time DateTime DEFAULT now() COMMENT ‘修改时间’
)
ENGINE = ReplacingMergeTree(update_time)
PARTITION BY replaceOne(substring(trade_date, 1, 7), ‘-’, ‘’)
ORDER BY (trade_date, station_code, day_batch_id, till_num, id)
SETTINGS index_granularity = 8192
Datax配置文件
{
“job”: {
“content”: [{
“reader”: {
“name”: “mysqlreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:mysql://192.168.17.62:3306/zxj_edc181”],
“table”: [“car_order”]
}],
“username”: “edcadmin”,
“password”: “hhyEDC2020”,
“column”: [
“id”,
“till_num”,
“station_code”,
“car_number”,
“member_code”,
“cust_magkey”,
“pos_batch_id”,
“day_batch_id”,
“day_batch_date”,
“time_close”,
“posted”,
“pos_id”,
“app_id”,
“trade_date”,
“create_time”,
“user_id”
]
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“car_order”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“till_num”,
“station_code”,
“car_number”,
“member_code”,
“cust_magkey”,
“pos_batch_id”,
“day_batch_id”,
“day_batch_date”,
“time_close”,
“posted”,
“pos_id”,
“app_id”,
“trade_date”,
“create_time”,
“user_id”
],
“postSql”: [“OPTIMIZE TABLE car_order FINAL”],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

car_order_item
ClickHouse表结构:
CREATE TABLE edc181.car_order_item
(
  id Int64 COMMENT ‘主键’,
  till_num Int64 COMMENT ‘站级订单id’,
  station_code String COMMENT ‘油站编码’,
  item_name Nullable(String) COMMENT ‘商品名称’,
  qty Nullable(Int32) COMMENT ‘购买数量’,
  weight Nullable(String) COMMENT ‘升数或重量’,
  grade Nullable(Int32) COMMENT ‘商品类型 大于0为油品;等于0为非油’,
  ori_price Nullable(Int32) COMMENT ‘单价’,
  bar_code Nullable(Int32) COMMENT ‘商品编码’,
  superdept_name Nullable(String) COMMENT ‘商品一级分类’,
  dept_name Nullable(String) COMMENT ‘商品二级分类’,
  sub_dept_name Nullable(String) COMMENT ‘商品三级分类’,
  full_barcode Nullable(String) COMMENT ‘商品条码’,
  pump_open_counter Nullable(String) COMMENT ‘起始泵码’,
  pump_close_counter Nullable(String) COMMENT ‘终止泵码’,
  pump_id Nullable(Int32) COMMENT ‘油枪号’,
  total_amount Nullable(Int32) COMMENT ‘订单总金额(没用)’,
  discount_amount Nullable(Int32) COMMENT ‘商品的折扣金额’,
  pay_amount Nullable(Int32) COMMENT ‘实付金额(对应接口文档中的TOTAL)’,
  day_batch_id Int32 COMMENT ‘营业日批次’,
  std_total Nullable(Int32) COMMENT ‘标准金额(应付金额)’,
  adj_amt String DEFAULT ‘’ COMMENT ‘零管计算的促销折扣金额 (与STD_TOTAL-TOTAL的差存在一定误差,优惠已STD_TOTAL-TOTAL为主,单独存储)’,
  promo_ref Nullable(Int32) COMMENT ‘是否促销 大于0时为促销活动’,
  promo_name Nullable(String) COMMENT ‘促销活动名称’,
  promo_id Nullable(Int32) COMMENT ‘促销活动编号’,
  trade_date String COMMENT ‘分片键日期’,
  create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
  update_time DateTime DEFAULT now() COMMENT ‘修改时间’
)
ENGINE = ReplacingMergeTree(update_time)
PARTITION BY replaceOne(substring(trade_date, 1, 7), ‘-’, ‘’)
ORDER BY (trade_date, station_code, day_batch_id, till_num, id)
SETTINGS index_granularity = 8192
Datax配置文件
{
“job”: {
“content”: [{
“reader”: {
“name”: “mysqlreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:mysql://192.168.17.62:3306/zxj_edc181”],
“table”: [“car_order_item”]
}],
“username”: “edcadmin”,
“password”: “hhyEDC2020”,
“column”: [
“id”,
“till_num”,
“station_code”,
“item_name”,
“qty”,
“weight”,
“grade”,
“ori_price”,
“bar_code”,
“superdept_name”,
“dept_name”,
“sub_dept_name”,
“full_barcode”,
“pump_open_counter”,
“pump_close_counter”,
“pump_id”,
“total_amount”,
“discount_amount”,
“pay_amount”,
“day_batch_id”,
“std_total”,
“adj_amt”,
“promo_ref”,
“promo_name”,
“promo_id”,
“trade_date”,
“create_time”
]
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“car_order_item”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“till_num”,
“station_code”,
“item_name”,
“qty”,
“weight”,
“grade”,
“ori_price”,
“bar_code”,
“superdept_name”,
“dept_name”,
“sub_dept_name”,
“full_barcode”,
“pump_open_counter”,
“pump_close_counter”,
“pump_id”,
“total_amount”,
“discount_amount”,
“pay_amount”,
“day_batch_id”,
“std_total”,
“adj_amt”,
“promo_ref”,
“promo_name”,
“promo_id”,
“trade_date”,
“create_time”
],
“postSql”: [“OPTIMIZE TABLE car_order_item FINAL”],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

pay_order
ClickHouse表结构:
CREATE TABLE edc181.pay_order
(
  id Int64 COMMENT ‘主键’,
  ser_num Nullable(Int64) COMMENT ‘支付订单号’,
  till_num Int64 COMMENT ‘站级订单id’,
  station_code String COMMENT ‘油站编码’,
  pm_subcode Nullable(Int32) COMMENT ‘支付类型’,
  pmnt_name Nullable(String) COMMENT ‘支付名称’,
  ssum_amount Nullable(Int32) COMMENT ‘支付金额’,
  card_num Nullable(String) COMMENT ‘支付卡号’,
  day_batch_id Int32 COMMENT ‘营业日批次’,
  discount_val Nullable(Int32) COMMENT ‘折扣值 对应支付方式的优惠金额’,
  approval Nullable(String) COMMENT ‘Eps流水 零管中的流水号’,
  ext_authorization_number Nullable(String) COMMENT ‘外部授权号 APP对应的EPS流水号’,
  ext_reference Nullable(String) COMMENT ‘外部参照’,
  trade_date String COMMENT ‘分片键日期’,
  create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
  update_time DateTime DEFAULT now() COMMENT ‘修改时间’
)
ENGINE = ReplacingMergeTree(update_time)
PARTITION BY replaceOne(substring(trade_date, 1, 7), ‘-’, ‘’)
ORDER BY (trade_date, station_code, day_batch_id, till_num, id)
SETTINGS index_granularity = 8192
Datax配置文件
{
   “job”: {
       “content”: [{
           “reader”: {
               “name”: “mysqlreader”,
               “parameter”: {
                   “connection”: [{
                       “jdbcUrl”: [“jdbc:mysql://192.168.17.62:3306/zxj_edc181”],
                       “table”: [“pay_order”]
                  }],
                   “username”: “edcadmin”,
                   “password”: “hhyEDC2020”,
                   “column”: [
                       “id”,
                       “ser_num”,
                       “till_num”,
                       “station_code”,
                       “pm_subcode”,
                       “pmnt_name”,
                       “ssum_amount”,
                       “card_num”,
                       “day_batch_id”,
                       “discount_val”,
                       “approval”,
                       “ext_authorization_number”,
                       “ext_reference”,
                       “trade_date”,
                       “create_time”
                  ]
              }
          },
           “writer”: {
               “name”: “clickhousewriter”,
               “parameter”: {
                   “connection”: [{
                       “jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
                       “table”: [“pay_order”]
                  }],
                   “username”: “licl”,
                   “password”: “123456”,
                   “column”: [
                       “id”,
                       “ser_num”,
                       “till_num”,
                       “station_code”,
                       “pm_subcode”,
                       “pmnt_name”,
                       “ssum_amount”,
                       “card_num”,
                       “day_batch_id”,
                       “discount_val”,
                       “approval”,
                       “ext_authorization_number”,
                       “ext_reference”,
                       “trade_date”,
                       “create_time”
                  ],
                   “postSql”: [“OPTIMIZE TABLE pay_order FINAL”],
                   “batchSize”: 65536,
                   “batchByteSize”: 134217728,
                   “dryRun”: false,
                   “writeMode”: “insert”
              }
          }
      }],
       “setting”: {
           “speed”: {
               “channel”: 5
          }
      }
  }
}

dc_dw_cust_info
因为dc_dw_cust_info表数据只增不减,字段值会有更新,所以采用ReplacingMergeTree表引擎。
(正常业务数据应该是只增不减,但是在之前的维护工作中会有删除数据的情况,比如和会员、订单中心的用户数据对不上的时候,需要手动删除数据。)
目前生产环境中有二百三十多万数据,预估用户量不会超过一千万,所以不设置 PARTITION BY。
ClickHouse表结构:
CREATE TABLE edc181.dc_dw_cust_info
(
  id Int64 COMMENT ‘主键’,
  user_id Nullable(Int64) COMMENT ‘会员id(会员营销系统会员表中的id)’,
  cust_magkey Nullable(String) COMMENT ‘中油统建用户标识(会员营销系统会员表中user_id)’,
  phone Nullable(String) COMMENT ‘手机号’,
  number_plate Nullable(String) COMMENT ‘车牌号’,
  sex Nullable(String) COMMENT ‘性别(0-男;1-女;2-未知)’,
  belong_org String DEFAULT ‘900002’ COMMENT ‘会员所属机构(机构编码) 默认为省归属地’,
  first_trade_station Nullable(String) DEFAULT ‘’ COMMENT ‘首次消费站点(零管编码)’,
  last_trade_oil_station Nullable(String) COMMENT ‘最后加油油站(零管编码)’,
  last_trade_oil_time Nullable(DateTime) COMMENT ‘最后加油时间’,
  registration_time Nullable(DateTime) COMMENT ‘用户注册时间’,
  cust_tag Nullable(Int8) DEFAULT 0 COMMENT ‘客户标志 0:正常 1:历史车牌数据’,
  registration_type Nullable(Int8) COMMENT ‘注册类型(1:线上,2:线下)’,
  registration_source String DEFAULT ‘’ COMMENT ‘好客e站公众号、好客e站app、第三方、官方微信、加油站 (0-好客e站app,1-app,2-省官微)’,
  registration_site String DEFAULT ‘’ COMMENT ‘注册地点。当 registration_source 字段的值为 加油站 时,这里应该是加油站机构编码。’,
  create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
  update_time DateTime DEFAULT now() COMMENT ‘更新时间’,
  cust_profession Nullable(Int64) COMMENT ‘会员职业’,
  birthday Nullable(String) COMMENT ‘生日(格式:yyyy-MM-dd)’,
  open_id Nullable(String) COMMENT 'openID ',
  cust_level Nullable(Int8) DEFAULT 1 COMMENT ‘会员等级(1:普通,2:黄金,3:铂金,4:砖石)’,
  registration_ip Nullable(String) COMMENT ‘注册IP’,
  user_type Int16 DEFAULT 0 COMMENT ‘用户类型(0-普通用户;1-中石油员工;2-第三方员工)’,
  user_update_time Nullable(Int64) COMMENT ‘会员系统中会员表数据最后更新时间,此字段用于会员数据同步时间条件’,
  bus_special_type_id Nullable(Int64) COMMENT ‘专车类型’,
  subscription_org_id Nullable(Int64) COMMENT ‘订阅地市’,
  deleted Nullable(Int8) DEFAULT 0 COMMENT ‘删除标记’,
  identity_name Nullable(String) COMMENT ‘身份证姓名’
)
ENGINE = ReplacingMergeTree(update_time)
ORDER BY id;
Datax配置文件:
{
“job”: {
“content”: [{
“reader”: {
“name”: “mysqlreader”,
“parameter”: {
“username”: “edcadmin”,
“password”: “hhyEDC2020”,
“column”: [
“id”,
“user_id”,
“cust_magkey”,
“phone”,
“number_plate”,
“sex”,
“belong_org”,
“first_trade_station”,
“last_trade_oil_station”,
“last_trade_oil_time”,
“registration_time”,
“cust_tag”,
“registration_type”,
“registration_source”,
“registration_site”,
“create_time”,
“update_time”,
“cust_profession”,
“birthday”,
“open_id”,
“cust_level”,
“registration_ip”,
“user_type”,
“user_update_time”,
“bus_special_type_id”,
“subscription_org_id”,
“deleted”,
“identity_name”
],
“connection”: [{
“table”: [
“dc_dw_cust_info”
],
“jdbcUrl”: [
“jdbc:mysql://192.168.17.62:3306/zxj_edc181”
]
}]
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“dc_dw_cust_info”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“user_id”,
“cust_magkey”,
“phone”,
“number_plate”,
“sex”,
“belong_org”,
“first_trade_station”,
“last_trade_oil_station”,
“last_trade_oil_time”,
“registration_time”,
“cust_tag”,
“registration_type”,
“registration_source”,
“registration_site”,
“create_time”,
“update_time”,
“cust_profession”,
“birthday”,
“open_id”,
“cust_level”,
“registration_ip”,
“user_type”,
“user_update_time”,
“bus_special_type_id”,
“subscription_org_id”,
“deleted”,
“identity_name”
],
“postSql”: [
“OPTIMIZE TABLE dc_dw_cust_info FINAL”
],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

reader中也可以根据MySQL中dc_dw_cust_info表的update_time字段查询增量数据。

CollapsingMergeTree表引擎(手动维护的基础数据表)
系统的基础数据表(业务字典、配置等)数据量小,所以不设置 PARTITION BY。
数据会有修改和删除,并且业务数据中会使用到修改、删除之前的数据,所以设计成拉链表,采用ReplacingMergeTree表引擎
dc_base_comm_category_code
ClickHouse表结构:
CREATE TABLE edc181.dc_base_comm_category_code
(
id Int64 COMMENT ‘主键’,
code Int32 COMMENT ‘商品类别编码(数据中心侧)’,
name String COMMENT ‘商品类别名称(数据中心侧)’,
sub_code Int32 COMMENT ‘商品子类别编码(数据中心侧)’,
sub_name String COMMENT ‘商品子类别名称(数据中心侧)’,
biz_code Nullable(String) COMMENT ‘业务编码(实际业务侧)’,
biz_name Nullable(String) COMMENT ‘业务名称(实际业务侧)’,
grade Nullable(Int16) DEFAULT 0 COMMENT ‘商品类型,大于0为油品;等于0为非油’,
create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
update_time DateTime DEFAULT now() COMMENT ‘更新时间’,
dt_begin DateTime DEFAULT now() COMMENT ‘生命周期起始时间’,
dt_end DateTime DEFAULT toDateTime(‘2100-01-01 00:00:00’) COMMENT ‘生命周期截止时间’,
sign Int8 DEFAULT 1 COMMENT ‘折叠表类型列:1-状态行;-1-取消行’
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (id, dt_end)
SETTINGS index_granularity = 8192

维护数据的SQL示例:
– 添加记录操作
INSERT INTO dc_base_comm_category_code ( id, code, name, sub_code, sub_name, biz_code, biz_name, grade, create_time, update_time, dt_begin, dt_end, sign )
VALUES ( 100, 80000, ‘测试name’, 80001, ‘测试sub_name’, ‘888888’, ‘测试biz_name’, 0, now( ), now( ), now( ), toDateTime ( ‘2100-01-01 00:00:00’ ), 1 );

– 也可以省略 create_time, update_time, dt_begin, dt_end, sign 字段,这些字段有默认值。
INSERT INTO dc_base_comm_category_code ( id, code, name, sub_code, sub_name, biz_code, biz_name, grade )
VALUES ( 100, 80000, ‘测试name’, 80001, ‘测试sub_name’, ‘888888’, ‘测试biz_name’, 0);

– 修改记录操作
– 1.插入一条除dt_end(生命周期截止时间)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO dc_base_comm_category_code ( id, code, name, sub_code, sub_name, biz_code, biz_name, grade, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, code, name, sub_code, sub_name, biz_code, biz_name, grade, create_time, update_time, dt_begin, now( ), sign
FROM dc_base_comm_category_code
WHERE id = 100 AND sign = 1 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 2.插入一条除sign(折叠表类型列)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO dc_base_comm_category_code ( id, code, name, sub_code, sub_name, biz_code, biz_name, grade, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, code, name, sub_code, sub_name, biz_code, biz_name, grade, create_time, update_time, dt_begin, dt_end, -1
FROM dc_base_comm_category_code
WHERE id = 100 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 3.插入修改后数据(演示修改name字段的值)。
INSERT INTO dc_base_comm_category_code ( id, code, name, sub_code, sub_name, biz_code, biz_name, grade, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, code, ‘测试name修改’, sub_code, sub_name, biz_code, biz_name, grade, create_time, dt_end, dt_end, toDateTime ( ‘2100-01-01 00:00:00’ ), 1
FROM dc_base_comm_category_code
WHERE id = 100 AND sign = 1 AND dt_end != toDateTime ( ‘2100-01-01 00:00:00’ )

– 4.合并数据。
OPTIMIZE TABLE dc_base_comm_category_code FINAL

– 删除记录操作(执行“修改记录操作”中的步骤1、2、4。不执行步骤3 !!!)

dc_base_pay_code
ClickHouse表结构:
CREATE TABLE edc181.dc_base_pay_code
(
id Int64 COMMENT ‘主键’,
code Int32 COMMENT ‘支付方式编码(数据中心侧)’,
name String COMMENT ‘支付方式名称(数据中心侧)’,
biz_code String COMMENT ‘支付方式编码(实际业务侧)’,
biz_name Nullable(String) COMMENT ‘支付方式编码(实际业务侧)’,
create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
update_time DateTime DEFAULT now() COMMENT ‘更新时间’,
dt_begin DateTime DEFAULT now() COMMENT ‘生命周期起始时间’,
dt_end DateTime DEFAULT toDateTime(‘2100-01-01 00:00:00’) COMMENT ‘生命周期截止时间’,
sign Int8 DEFAULT 1 COMMENT ‘折叠表类型列:1-状态行;-1-取消行’
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (id, dt_end)
SETTINGS index_granularity = 8192

维护数据的SQL示例:
– 添加记录操作
INSERT INTO dc_base_pay_code ( id, code, name, biz_code, biz_name, create_time, update_time, dt_begin, dt_end, sign )
VALUES ( 100, 8888, ‘测试name’, ‘测试biz_code’, ‘测试biz_name’, now( ), now( ), now( ), toDateTime ( ‘2100-01-01 00:00:00’ ), 1 )

– 也可以省略 create_time, update_time, dt_begin, dt_end, sign 字段,这些字段有默认值。
INSERT INTO dc_base_pay_code ( id, code, name, biz_code, biz_name )
VALUES ( 100, 8888, ‘测试name’, ‘测试biz_code’, ‘测试biz_name’ )

– 修改记录操作
– 1.插入一条除dt_end(生命周期截止时间)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO dc_base_pay_code ( id, code, name, biz_code, biz_name, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, code, name, biz_code, biz_name, create_time, update_time, dt_begin, now( ), sign
FROM dc_base_pay_code
WHERE id = 100 AND sign = 1 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 2.插入一条除sign(折叠表类型列)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO dc_base_pay_code ( id, code, name, biz_code, biz_name, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, code, name, biz_code, biz_name, create_time, update_time, dt_begin, dt_end, -1
FROM dc_base_pay_code
WHERE id = 100 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 3.插入修改后数据(演示修改name字段的值)。
INSERT INTO dc_base_pay_code ( id, code, name, biz_code, biz_name, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, code, ‘测试name修改’, biz_code, biz_name, create_time, dt_end, dt_end, toDateTime ( ‘2100-01-01 00:00:00’ ), 1
FROM dc_base_pay_code
WHERE id = 100 AND sign = 1 AND dt_end != toDateTime ( ‘2100-01-01 00:00:00’ )

– 4.合并数据。
OPTIMIZE TABLE dc_base_pay_code FINAL

– 删除记录操作(执行“修改记录操作”中的步骤1、2、4。不执行步骤3 !!!)

dc_base_price_band
ClickHouse表结构:
CREATE TABLE edc181.dc_base_price_band
(
id UInt64 COMMENT ‘主键’,
price_band_category String COMMENT ‘价格带分类’,
price_band String COMMENT ‘价格带’,
min_value Nullable(Int32) COMMENT ‘下限值(分)’,
max_value Nullable(Int32) COMMENT ‘上限值(分)’,
create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
update_time DateTime DEFAULT now() COMMENT ‘最后修改时间’,
dt_begin DateTime DEFAULT now() COMMENT ‘生命周期起始时间’,
dt_end DateTime DEFAULT toDateTime(‘2100-01-01 00:00:00’) COMMENT ‘生命周期截止时间’,
sign Int8 DEFAULT 1 COMMENT ‘折叠表类型列:1-状态行;-1-取消行’
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (id, dt_end)
SETTINGS index_granularity = 8192

维护数据的SQL示例:
– 添加记录操作
INSERT INTO dc_base_price_band ( id, price_band_category, price_band, min_value, max_value, create_time, update_time, dt_begin, dt_end, sign )
VALUES ( 100, ‘测试price_band_category’, ‘测试price_band’, 1, 9999, now( ), now( ), now( ), toDateTime ( ‘2100-01-01 00:00:00’ ), 1 )

– 也可以省略 create_time, update_time, dt_begin, dt_end, sign 字段,这些字段有默认值。
INSERT INTO dc_base_price_band ( id, price_band_category, price_band, min_value, max_value )
VALUES ( 100, ‘测试price_band_category’, ‘测试price_band’, 1, 9999 )

– 修改记录操作
– 1.插入一条除dt_end(生命周期截止时间)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO dc_base_price_band ( id, price_band_category, price_band, min_value, max_value, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, price_band_category, price_band, min_value, max_value, create_time, update_time, dt_begin, now( ), sign
FROM dc_base_price_band
WHERE id = 100 AND sign = 1 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 2.插入一条除sign(折叠表类型列)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO dc_base_price_band ( id, price_band_category, price_band, min_value, max_value, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, price_band_category, price_band, min_value, max_value, create_time, update_time, dt_begin, dt_end, -1
FROM dc_base_price_band
WHERE id = 100 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 3.插入修改后数据(演示修改name字段的值)。
INSERT INTO dc_base_price_band ( id, price_band_category, price_band, min_value, max_value, create_time, update_time, dt_begin, dt_end, sign )
SELECT id, ‘测试price_band_category修改’, ‘测试price_band修改’, min_value, max_value, create_time, dt_end, dt_end, toDateTime ( ‘2100-01-01 00:00:00’ ), 1
FROM dc_base_price_band
WHERE id = 100 AND sign = 1 AND dt_end != toDateTime ( ‘2100-01-01 00:00:00’ )

– 4.合并数据。
OPTIMIZE TABLE dc_base_price_band FINAL

– 删除记录操作(执行“修改记录操作”中的步骤1、2、4。不执行步骤3 !!!)

CollapsingMergeTree表引擎(程序处理的业务数据表)
dc_dw_cust_trade_detail
因为dc_dw_cust_trade_detail表会记录删除、字段值会有更新,所以采用CollapsingMergeTree表引擎。
目前生产环境中每日站级订单数量在三十万左右,每月订单数量在一千万,我们按月做分区PARTITION BY substring(toString(trade_date), 1, 6)。
ClickHouse表结构:
CREATE TABLE edc181.dc_dw_cust_trade_detail
(
id Int64 COMMENT ‘主键’,
till_num Int64 COMMENT ‘站级订单id’,
day_batch_id Int32 COMMENT ‘营业日批次’,
pos_batch_id Nullable(Int32) COMMENT ‘收银班次号’,
cust_id Nullable(Int64) COMMENT ‘客户id’,
belong_city_code Nullable(String) COMMENT ‘归属分公司编码’,
category_code Nullable(Int32) COMMENT ‘商品分类’,
category_sub_code Nullable(Int32) COMMENT ‘商品子类’,
pay_code Nullable(Int32) COMMENT ‘支付方式编码’,
day_batch_date Nullable(String) COMMENT ‘营业日期’,
cust_magkey Nullable(String) DEFAULT ‘’ COMMENT ‘统建会员码’,
member_code Nullable(String) COMMENT ‘自建会员码’,
car_number Nullable(String) COMMENT ‘车牌号’,
org_code Nullable(String) COMMENT ‘机构编码’,
station_code String COMMENT ‘油站编码’,
pos_id Nullable(Int32) COMMENT ‘POS编号’,
app_id Nullable(Int64) COMMENT ‘渠道id’,
posted Nullable(String) COMMENT ‘实际交易时间’,
time_close Nullable(String) COMMENT ‘交易时间’,
bar_code Nullable(Int32) COMMENT ‘商品编码’,
item_name Nullable(String) COMMENT ‘商品名称’,
superdept_name Nullable(String) COMMENT ‘商品一级分类’,
dept_name Nullable(String) COMMENT ‘商品二级分类’,
sub_dept_name Nullable(String) COMMENT ‘商品三级分类’,
qty Nullable(Int32) COMMENT ‘购买数量’,
weight Nullable(String) COMMENT ‘购买升数’,
grade Nullable(Int32) COMMENT ‘是否非油’,
ori_price Nullable(Int32) COMMENT ‘单价’,
full_barcode Nullable(String) COMMENT ‘商品条码’,
pump_open_counter Nullable(String) COMMENT ‘起始泵码’,
pump_close_counter Nullable(String) COMMENT ‘终止泵码’,
pump_id Nullable(Int32) COMMENT ‘油枪号’,
total_amount Nullable(Int32) COMMENT ‘订单总金额 单位:分’,
discount_amount Int32 DEFAULT 0 COMMENT ‘折扣金额’,
pay_amount Nullable(Int32) COMMENT ‘实付金额’,
order_cat_amt Nullable(Int32) COMMENT ‘支付方式实付金额’,
ser_num Nullable(Int64) COMMENT ‘支付订单号’,
pm_subcode Nullable(Int32) COMMENT ‘支付类型’,
pmnt_name Nullable(String) COMMENT ‘支付名称’,
ssum_amount Nullable(Int32) COMMENT ‘支付金额’,
card_num Nullable(String) COMMENT ‘分片键日期’,
trade_date Int32 COMMENT ‘创建时间’,
create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
update_time DateTime DEFAULT now() COMMENT ‘修改时间’,
sign Int8 DEFAULT 1 COMMENT ‘折叠表类型列:1-状态行;-1-取消行’
)
ENGINE = CollapsingMergeTree(sign)
PARTITION BY substring(toString(trade_date), 1, 6)
ORDER BY (trade_date, station_code, day_batch_id, till_num, id, update_time)
SETTINGS index_granularity = 8192

Datax配置文件:
删除ClickHouse中已有的数据
{
“job”: {
“content”: [{
“reader”: {
“name”: “rdbmsreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:clickhouse://192.168.17.61:8123/edc181”],
“querySql”: [“SELECT id, till_num, day_batch_id, pos_batch_id, cust_id, belong_city_code, category_code, category_sub_code, pay_code, day_batch_date, cust_magkey, member_code, car_number, org_code, station_code, pos_id, app_id, posted, time_close, bar_code, item_name, superdept_name, dept_name, sub_dept_name, qty, weight, grade, ori_price, full_barcode, pump_open_counter, pump_close_counter, pump_id, total_amount, discount_amount, pay_amount, order_cat_amt, ser_num, pm_subcode, pmnt_name, ssum_amount, card_num, trade_date, create_time, update_time, - 1 AS SIGN FROM dc_dw_cust_trade_detail”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“till_num”,
“day_batch_id”,
“pos_batch_id”,
“cust_id”,
“belong_city_code”,
“category_code”,
“category_sub_code”,
“pay_code”,
“day_batch_date”,
“cust_magkey”,
“member_code”,
“car_number”,
“org_code”,
“station_code”,
“pos_id”,
“app_id”,
“posted”,
“time_close”,
“bar_code”,
“item_name”,
“superdept_name”,
“dept_name”,
“sub_dept_name”,
“qty”,
“weight”,
“grade”,
“ori_price”,
“full_barcode”,
“pump_open_counter”,
“pump_close_counter”,
“pump_id”,
“total_amount”,
“discount_amount”,
“pay_amount”,
“order_cat_amt”,
“ser_num”,
“pm_subcode”,
“pmnt_name”,
“ssum_amount”,
“card_num”,
“trade_date”,
“create_time”,
“update_time”,
“sign”
],
“fetchSize”: 1024
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“dc_dw_cust_trade_detail”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“till_num”,
“day_batch_id”,
“pos_batch_id”,
“cust_id”,
“belong_city_code”,
“category_code”,
“category_sub_code”,
“pay_code”,
“day_batch_date”,
“cust_magkey”,
“member_code”,
“car_number”,
“org_code”,
“station_code”,
“pos_id”,
“app_id”,
“posted”,
“time_close”,
“bar_code”,
“item_name”,
“superdept_name”,
“dept_name”,
“sub_dept_name”,
“qty”,
“weight”,
“grade”,
“ori_price”,
“full_barcode”,
“pump_open_counter”,
“pump_close_counter”,
“pump_id”,
“total_amount”,
“discount_amount”,
“pay_amount”,
“order_cat_amt”,
“ser_num”,
“pm_subcode”,
“pmnt_name”,
“ssum_amount”,
“card_num”,
“trade_date”,
“create_time”,
“update_time”,
“sign”
],
“postSql”: [“OPTIMIZE TABLE dc_dw_cust_trade_detail FINAL”],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

从MySQL向ClickHouse复制数据:
{
“job”: {
“content”: [{
“reader”: {
“name”: “mysqlreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:mysql://192.168.17.62:3306/zxj_edc181”],
“table”: [“dc_dw_cust_trade_detail”]
}],
“username”: “edcadmin”,
“password”: “hhyEDC2020”,
“column”: [
“id”,
“till_num”,
“day_batch_id”,
“pos_batch_id”,
“cust_id”,
“belong_city_code”,
“category_code”,
“category_sub_code”,
“pay_code”,
“day_batch_date”,
“cust_magkey”,
“member_code”,
“car_number”,
“org_code”,
“station_code”,
“pos_id”,
“app_id”,
“posted”,
“time_close”,
“bar_code”,
“item_name”,
“superdept_name”,
“dept_name”,
“sub_dept_name”,
“qty”,
“weight”,
“grade”,
“ori_price”,
“full_barcode”,
“pump_open_counter”,
“pump_close_counter”,
“pump_id”,
“total_amount”,
“discount_amount”,
“pay_amount”,
“order_cat_amt”,
“ser_num”,
“pm_subcode”,
“pmnt_name”,
“ssum_amount”,
“card_num”,
“trade_date”,
“create_time”
]
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“till_num”,
“day_batch_id”,
“pos_batch_id”,
“cust_id”,
“belong_city_code”,
“category_code”,
“category_sub_code”,
“pay_code”,
“day_batch_date”,
“cust_magkey”,
“member_code”,
“car_number”,
“org_code”,
“station_code”,
“pos_id”,
“app_id”,
“posted”,
“time_close”,
“bar_code”,
“item_name”,
“superdept_name”,
“dept_name”,
“sub_dept_name”,
“qty”,
“weight”,
“grade”,
“ori_price”,
“full_barcode”,
“pump_open_counter”,
“pump_close_counter”,
“pump_id”,
“total_amount”,
“discount_amount”,
“pay_amount”,
“order_cat_amt”,
“ser_num”,
“pm_subcode”,
“pmnt_name”,
“ssum_amount”,
“card_num”,
“trade_date”,
“create_time”
],
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181”,
“table”: [“dc_dw_cust_trade_detail”]
}],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

dc_dt_tag
因为dc_dt_tag表会记录删除、字段值会有更新,所以采用CollapsingMergeTree表引擎。
我们按标签分类做分区PARTITION BY tag_category。
ClickHouse表结构:
CREATE TABLE edc181.dc_dt_tag
(
cust_id UInt64 COMMENT ‘客户ID(联合主键)’,
tag_category String COMMENT ‘标签分类(联合主键)’,
tag_code String COMMENT ‘标签代码(联合主键)’,
tag_name Nullable(String) COMMENT ‘标签名称’,
tag_value Nullable(String) COMMENT ‘标签值’,
belong_city_code String COMMENT ‘归属分公司编码’,
original_date Nullable(Date) COMMENT ‘原始数据的日期’,
create_time DateTime DEFAULT now() COMMENT ‘创建时间’,
update_time DateTime DEFAULT now() COMMENT ‘最后更新时间’,
sign Int8 DEFAULT 1 COMMENT ‘折叠表类型列:1-状态行;-1-取消行’
)
ENGINE = CollapsingMergeTree(sign)
PARTITION BY tag_category
ORDER BY (belong_city_code, cust_id, tag_category, tag_code, update_time)
SETTINGS index_granularity = 8192

Datax配置文件:
删除ClickHouse中已有的数据
{
“job”: {
“content”: [{
“reader”: {
“name”: “rdbmsreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:clickhouse://192.168.17.61:8123/edc181”],
“querySql”: [“SELECT cust_id, tag_category, tag_code, tag_name, tag_value, belong_city_code, original_date, create_time, update_time, - 1 AS SIGN FROM dc_dt_tag”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“cust_id”,
“tag_category”,
“tag_code”,
“tag_name”,
“tag_value”,
“belong_city_code”,
“original_date”,
“create_time”,
“update_time”,
“sign”
],
“fetchSize”: 1024
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“dc_dt_tag”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“cust_id”,
“tag_category”,
“tag_code”,
“tag_name”,
“tag_value”,
“belong_city_code”,
“original_date”,
“create_time”,
“update_time”,
“sign”
],
“postSql”: [“OPTIMIZE TABLE dc_dt_tag FINAL”],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

从MySQL向ClickHouse复制数据:
{
“job”: {
“content”: [{
“reader”: {
“name”: “mysqlreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:mysql://192.168.17.62:3306/zxj_edc181”],
“table”: [“dc_dt_tag”]
}],
“username”: “edcadmin”,
“password”: “hhyEDC2020”,
“column”: [
“cust_id”,
“tag_category”,
“tag_code”,
“tag_name”,
“tag_value”,
“belong_city_code”,
“original_date”,
“create_time”,
“update_time”
]
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“dc_dt_tag”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“cust_id”,
“tag_category”,
“tag_code”,
“tag_name”,
“tag_value”,
“belong_city_code”,
“original_date”,
“create_time”,
“update_time”
],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

base_org
base_org这个表很恶心!!!
说这个表恶心的原因有两点:
1.tyhy_base库中的base_org表中虽然有create_time和update_time字段,但是表中的会有“create_time IS NULL AND update_time IS NOT NULL”和“create_time IS NOT NULL AND update_time IS NULL”的数据。这种情况create_time和update_time都是不可信的,因为可能改变的其它字段值,update_time为空或者update_time没更新。这样就没法做增量的数据同步。
2.系统设计阶段,code字段是不会轻易改变的。实际应用过程中大量数据记录的code字段值被修改过。且组织机构数据维护不及时,对数据中心的统计分析产生影响。
针对“问题2”,base_org表做拉链表设计。
针对“问题1”,程序还是每日做增量数据同步,如果发现edc181库和tyhy_base库中base_org在增量同步后仍有不一致的数据,则人工手动处理不一致的数据。

留个作业:
●手动处理参考dc_base_comm_category_code。
●Datax程序处理增量插入、更新的数据。
●有没有其它更好的方式?

ClickHouse表结构:
CREATE TABLE edc181.base_org
(
id UInt64 COMMENT ‘主键’,
parent_code Nullable(String) COMMENT ‘父机构编码’,
code Nullable(String) COMMENT ‘机构编码’,
name Nullable(String) COMMENT ‘机构名称’,
short_name Nullable(String) COMMENT ‘机构简称’,
level Nullable(String) COMMENT ‘机构层级’,
hos_code Nullable(String) COMMENT ‘零管编码’,
type Nullable(String) COMMENT ‘机构类型(1:油站;3:市公司;4:片区,参照规划院数据而来),\r\n可选择值取字典表中的机构类型。’,
invoice_mode Nullable(String) COMMENT ‘发票模式’,
invoice_type Nullable(String) COMMENT ‘发票类型’,
invoice_title Nullable(String) COMMENT ‘发票台头’,
address Nullable(String) COMMENT ‘详细地址’,
telephone Nullable(String) COMMENT ‘电话’,
business_hours Nullable(String) COMMENT ‘营业时间’,
longitude Nullable(String) COMMENT ‘经度’,
latitude Nullable(String) COMMENT ‘纬度’,
manager Nullable(String) COMMENT ‘站经理’,
ext_fld1 Nullable(String) COMMENT ‘扩展1’,
ext_fld2 Nullable(String) COMMENT ‘扩展2’,
ext_fld3 Nullable(String) COMMENT ‘零管编码对应的真实组织机构名称’,
create_by Nullable(Int64) COMMENT ‘创建人’,
create_time Nullable(Int64) COMMENT ‘创建时间’,
update_by Nullable(Int64) COMMENT ‘修改人’,
update_time Nullable(Int64) COMMENT ‘修改时间’,
order_num Nullable(Int32) COMMENT ‘排序’,
state Nullable(Int16) COMMENT ‘状态’,
deleted Nullable(Int16) COMMENT ‘是否删除’,
remark Nullable(String) COMMENT ‘备注是否删除’,
pic String DEFAULT ‘’ COMMENT ‘油站图片地址’,
dt_begin DateTime DEFAULT now() COMMENT ‘生命周期起始时间’,
dt_end DateTime DEFAULT toDateTime(‘2100-01-01 00:00:00’) COMMENT ‘生命周期截止时间’,
sign Int8 DEFAULT 1 COMMENT ‘折叠表类型列:1-状态行;-1-取消行’
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (id, dt_end)
SETTINGS index_granularity = 8192

CREATE TABLE edc181.base_org_tmp
(
id UInt64 COMMENT ‘主键’,
parent_code Nullable(String) COMMENT ‘父机构编码’,
code Nullable(String) COMMENT ‘机构编码’,
name Nullable(String) COMMENT ‘机构名称’,
short_name Nullable(String) COMMENT ‘机构简称’,
level Nullable(String) COMMENT ‘机构层级’,
hos_code Nullable(String) COMMENT ‘零管编码’,
type Nullable(String) COMMENT ‘机构类型(1:油站;3:市公司;4:片区,参照规划院数据而来),\r\n可选择值取字典表中的机构类型。’,
invoice_mode Nullable(String) COMMENT ‘发票模式’,
invoice_type Nullable(String) COMMENT ‘发票类型’,
invoice_title Nullable(String) COMMENT ‘发票台头’,
address Nullable(String) COMMENT ‘详细地址’,
telephone Nullable(String) COMMENT ‘电话’,
business_hours Nullable(String) COMMENT ‘营业时间’,
longitude Nullable(String) COMMENT ‘经度’,
latitude Nullable(String) COMMENT ‘纬度’,
manager Nullable(String) COMMENT ‘站经理’,
ext_fld1 Nullable(String) COMMENT ‘扩展1’,
ext_fld2 Nullable(String) COMMENT ‘扩展2’,
ext_fld3 Nullable(String) COMMENT ‘零管编码对应的真实组织机构名称’,
create_by Nullable(Int64) COMMENT ‘创建人’,
create_time Nullable(Int64) COMMENT ‘创建时间’,
update_by Nullable(Int64) COMMENT ‘修改人’,
update_time Nullable(Int64) COMMENT ‘修改时间’,
order_num Nullable(Int32) COMMENT ‘排序’,
state Nullable(Int16) COMMENT ‘状态’,
deleted Nullable(Int16) COMMENT ‘是否删除’,
remark Nullable(String) COMMENT ‘备注是否删除’,
pic String DEFAULT ‘’ COMMENT ‘油站图片地址’
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192

维护数据的SQL示例:
– 添加记录操作
INSERT INTO base_org ( id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_begin, dt_end, sign )
VALUES ( 100, ‘900002’, ‘900002100’, ‘测试机构名称’, ‘测试机构简称’, ‘3’, ‘TEST’, ‘3’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, 1, toUnixTimestamp ( now( ) ) * 1000, 1, toUnixTimestamp ( now( ) ) * 1000, NULL, NULL, 0, ‘’, ‘’, now( ), toDateTime ( ‘2100-01-01 00:00:00’ ), 1 )

– 也可以省略 create_time, update_time, dt_begin, dt_end, sign 字段,这些字段有默认值。
INSERT INTO base_org ( id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic )
VALUES ( 100, ‘900002’, ‘900002100’, ‘测试机构名称’, ‘测试机构简称’, ‘3’, ‘TEST’, ‘3’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, ‘’, 1, toUnixTimestamp ( now( ) ) * 1000, 1, toUnixTimestamp ( now( ) ) * 1000, NULL, NULL, 0, ‘’, ‘’ )

– 修改记录操作
– 1.插入一条除dt_end(生命周期截止时间)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO base_org ( id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_begin, dt_end, sign )
SELECT id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_begin, now( ), sign
FROM base_org
WHERE id = 100 AND sign = 1 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 2.插入一条除sign(折叠表类型列)字段被修改,其它字段和原有记录相同的数据。
INSERT INTO base_org ( id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_begin, dt_end, sign )
SELECT id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_begin, dt_end, -1
FROM base_org
WHERE id = 100 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ )

– 3.插入修改后数据(演示修改name字段的值)。
INSERT INTO base_org ( id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_begin, dt_end, sign )
SELECT id, parent_code, code, ‘测试机构名称修改’, ‘测试机构简称修改’, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, dt_end, toDateTime ( ‘2100-01-01 00:00:00’ ), 1
FROM base_org
WHERE id = 100 AND sign = 1 AND dt_end != toDateTime ( ‘2100-01-01 00:00:00’ )

– 4.合并数据。
OPTIMIZE TABLE base_org FINAL

– 删除记录操作(执行“修改记录操作”中的步骤1、2、4。不执行步骤3 !!!)

Datax配置文件:
从MySQL向ClickHouse的tmp表复制增量数据
{
“job”: {
“content”: [{
“reader”: {
“name”: “mysqlreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:mysql://192.168.17.62:3306/zxj_edc181”],
“querySql”: [“SELECT * FROM base_org WHERE create_time >= (UNIX_TIMESTAMP(‘ p r o c e s s d a t e ′ ) ∗ 1000 ) O R u p d a t e t i m e > = ( U N I X T I M E S T A M P ( ′ {process_date}') * 1000) OR update_time >= (UNIX_TIMESTAMP(' processdate)1000)ORupdatetime>=(UNIXTIMESTAMP({process_date}’) * 1000)”]
}],
“username”: “edcadmin”,
“password”: “hhyEDC2020”,
“column”: [
“id”,
“parent_code”,
“code”,
“name”,
“short_name”,
“level”,
“hos_code”,
“type”,
“invoice_mode”,
“invoice_type”,
“invoice_title”,
“address”,
“telephone”,
“business_hours”,
“longitude”,
“latitude”,
“manager”,
“ext_fld1”,
“ext_fld2”,
“ext_fld3”,
“create_by”,
“create_time”,
“update_by”,
“update_time”,
“order_num”,
“state”,
“deleted”,
“remark”,
“pic”
]
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“base_org_tmp”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“parent_code”,
“code”,
“name”,
“short_name”,
“level”,
“hos_code”,
“type”,
“invoice_mode”,
“invoice_type”,
“invoice_title”,
“address”,
“telephone”,
“business_hours”,
“longitude”,
“latitude”,
“manager”,
“ext_fld1”,
“ext_fld2”,
“ext_fld3”,
“create_by”,
“create_time”,
“update_by”,
“update_time”,
“order_num”,
“state”,
“deleted”,
“remark”,
“pic”
],
“preSql”: [“TRUNCATE TABLE base_org_tmp”],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

修改原有数据的声明周期字段
{
“job”: {
“content”: [{
“reader”: {
“name”: “rdbmsreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:clickhouse://192.168.17.61:8123/edc181”],
“querySql”: [“SELECT bo.id, bo.parent_code, bo.code, bo.name, bo.short_name, bo.level, bo.hos_code, bo.type, bo.invoice_mode, bo.invoice_type, bo.invoice_title, bo.address, bo.telephone, bo.business_hours, bo.longitude, bo.latitude, bo.manager, bo.ext_fld1, bo.ext_fld2, bo.ext_fld3, bo.create_by, bo.create_time, bo.update_by, bo.update_time, bo.order_num, bo.state, bo.deleted, bo.remark, bo.pic, bo.dt_begin AS dt_begin, IF( bot.update_time IS NULL, toDateTime( fromUnixTimestamp64Milli ( bot.create_time ) ), toDateTime( fromUnixTimestamp64Milli ( bot.update_time ) ) ) AS dt_end, 1 AS sign FROM ( SELECT * FROM base_org WHERE sign = 1 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ ) ) bo JOIN base_org_tmp bot ON bot.id = bo.id”]
}],
“username”: “licl”,
“password”: “123456”,
“fetchSize”: 1024
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“base_org”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“parent_code”,
“code”,
“name”,
“short_name”,
“level”,
“hos_code”,
“type”,
“invoice_mode”,
“invoice_type”,
“invoice_title”,
“address”,
“telephone”,
“business_hours”,
“longitude”,
“latitude”,
“manager”,
“ext_fld1”,
“ext_fld2”,
“ext_fld3”,
“create_by”,
“create_time”,
“update_by”,
“update_time”,
“order_num”,
“state”,
“deleted”,
“remark”,
“pic”,
“dt_begin”,
“dt_end”,
“sign”
],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

修改原有数据的sign字段
{
“job”: {
“content”: [{
“reader”: {
“name”: “rdbmsreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:clickhouse://192.168.17.61:8123/edc181”],
“querySql”: [“SELECT bo.id, bo.parent_code, bo.code, bo.name, bo.short_name, bo.level, bo.hos_code, bo.type, bo.invoice_mode, bo.invoice_type, bo.invoice_title, bo.address, bo.telephone, bo.business_hours, bo.longitude, bo.latitude, bo.manager, bo.ext_fld1, bo.ext_fld2, bo.ext_fld3, bo.create_by, bo.create_time, bo.update_by, bo.update_time, bo.order_num, bo.state, bo.deleted, bo.remark, bo.pic, bo.dt_begin AS dt_begin, bo.dt_end AS dt_end, -1 AS sign FROM ( SELECT * FROM base_org WHERE sign = 1 AND dt_end = toDateTime ( ‘2100-01-01 00:00:00’ ) ) bo JOIN base_org_tmp bot ON bot.id = bo.id”]
}],
“username”: “licl”,
“password”: “123456”,
“fetchSize”: 1024
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“base_org”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“parent_code”,
“code”,
“name”,
“short_name”,
“level”,
“hos_code”,
“type”,
“invoice_mode”,
“invoice_type”,
“invoice_title”,
“address”,
“telephone”,
“business_hours”,
“longitude”,
“latitude”,
“manager”,
“ext_fld1”,
“ext_fld2”,
“ext_fld3”,
“create_by”,
“create_time”,
“update_by”,
“update_time”,
“order_num”,
“state”,
“deleted”,
“remark”,
“pic”,
“dt_begin”,
“dt_end”,
“sign”
],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

插入新增的数据和修改后的数据
{
“job”: {
“content”: [{
“reader”: {
“name”: “rdbmsreader”,
“parameter”: {
“connection”: [{
“jdbcUrl”: [“jdbc:clickhouse://192.168.17.61:8123/edc181”],
“querySql”: [“SELECT id, parent_code, code, name, short_name, level, hos_code, type, invoice_mode, invoice_type, invoice_title, address, telephone, business_hours, longitude, latitude, manager, ext_fld1, ext_fld2, ext_fld3, create_by, create_time, update_by, update_time, order_num, state, deleted, remark, pic, IF( update_time IS NULL, toDateTime( fromUnixTimestamp64Milli ( create_time ) ), toDateTime( fromUnixTimestamp64Milli ( update_time ) ) ) AS dt_begin, toDateTime ( ‘2100-01-01 00:00:00’ ) AS dt_end, 1 AS sign FROM base_org_tmp”]
}],
“username”: “licl”,
“password”: “123456”,
“fetchSize”: 1024
}
},
“writer”: {
“name”: “clickhousewriter”,
“parameter”: {
“connection”: [{
“jdbcUrl”: “jdbc:clickhouse://192.168.17.61:8123/edc181?socket_timeout=300000”,
“table”: [“base_org”]
}],
“username”: “licl”,
“password”: “123456”,
“column”: [
“id”,
“parent_code”,
“code”,
“name”,
“short_name”,
“level”,
“hos_code”,
“type”,
“invoice_mode”,
“invoice_type”,
“invoice_title”,
“address”,
“telephone”,
“business_hours”,
“longitude”,
“latitude”,
“manager”,
“ext_fld1”,
“ext_fld2”,
“ext_fld3”,
“create_by”,
“create_time”,
“update_by”,
“update_time”,
“order_num”,
“state”,
“deleted”,
“remark”,
“pic”,
“dt_begin”,
“dt_end”,
“sign”
],
“postSql”: [“OPTIMIZE TABLE base_org FINAL”],
“batchSize”: 65536,
“batchByteSize”: 134217728,
“dryRun”: false,
“writeMode”: “insert”
}
}
}],
“setting”: {
“speed”: {
“channel”: 5
}
}
}
}

前提:此方法仅适用于可以把表中的更新和插入的数据筛选出来的表。
实在没有的话,就只能进行所有字段的关联比较出不同的记录了。

此文章是来说下 ClickHouse 的折叠表引擎与数据仓库中常用的拉链表的结合使用。
当然也不是所有业务场景的表都适合,如下:
1.表中没有 update_time 字段的,也就是说没法知道数据更新和插入的时间,要是有binlog也行;
2.需要有个字段能把数据分段,这个可能也会做为分区用,总之,就是可进行周期处理的。

版本:ClickHouse 21.4.6.55
SELECT * FROM system.build_options

准备环境
创建表

首先来创建几个表:

表1:MySQL 中的订单表
CREATE TABLE order (
id bigint(20) NOT NULL COMMENT ‘主键’,
status varchar(20) COMMENT ‘订单状态’,
create_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT ‘创建时间’,
update_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT ‘更新时间’,
PRIMARY KEY (id)
)

表2:ClickHouse 中的订单历史临时表
CREATE TABLE order_tmp (
id Int64 COMMENT ‘主键’,
status String COMMENT ‘订单状态’,
create_time DateTime COMMENT ‘创建时间’,
update_time DateTime COMMENT ‘更新时间’
)
ENGINE = MergeTree
ORDER BY (id)

表3:ClickHouse 中的订单历史表
CREATE TABLE order_hist (
id Int64 COMMENT ‘主键’,
status String COMMENT ‘订单状态’,
create_time DateTime COMMENT ‘创建时间’,
update_time DateTime COMMENT ‘更新时间’,
start_date Date COMMENT ‘有效开始日期’,
end_date Date COMMENT ‘有效结束日期’,
sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (id, start_date)

模拟场景

要模拟的业务场景:
同步一:表1 中有1条 订单数据,把这个数据经过几个步骤最终同步到 表3 中。
同步二:表1 中更新上面那1条数据,再放入2条新的数据,再次使用同步。

同步一

任务目标:将表1 中 update_time 字段日期等于 2021-07-05 的数据迁移 表3 中。
执行时间:2021-07-06 01:00:00
日期参数:2021-07-05
准备1:表1 中增加一条数据
创建1条订单记录
INSERT INTO order VALUES (‘1’, ‘0’, ‘2021-07-05 19:48:20’, ‘2021-07-05 19:48:20’);

准备2:将 表1 中的数据同步到 表2 中
按任务目标查询 表1 中 update_time 字段日期等于 2021-07-05
SELECT * FROM order WHERE update_time >= ‘ 日期参数 ′ A N D u p d a t e t i m e < = ′ {日期参数}' AND update_time <= ' 日期参数ANDupdatetime<={日期参数}’;
表2 为临时表,每次用之前要先清空下
TRUNCATE TABLE order_tmp;
模拟 datax 等工具把查询到的数据同步到 表2 中
INSERT INTO order_tmp VALUES (‘1’, ‘0’, toDateTime(‘2021-07-05 19:48:20’), toDateTime(‘2021-07-05 19:48:20’));

操作1:更新数据:被更新数据sign标记-1
通过使用表3关联表2的方式找到 已经有了并且有效的数据进行 -1 标记,就是更新数据中隐藏的删除操作(找出需要更新的数据)
如果实在无法找到被更新的数据,可能就只能用所有字段关联的方式了
INSERT INTO order_hist
SELECT id, status, create_time, update_time, start_date, end_date, -1
FROM order_hist hist
INNER JOIN order_tmp tmp
ON tmp.id = hist.id
WHERE end_date = ‘2148-12-31’;

操作2:更新数据:标记有效结束日期
修改被更新记录的有效结束时间,这标记记录有效 结束日期是个关键,正常运行和补数重跑的时候都要考虑进去,这里使用日期参数作为 记录有效结束日期:
INSERT INTO order_hist
SELECT id, status, create_time, update_time, start_date, addDays(toDate(‘${日期参数}’) , -1) , 1
FROM order_hist hist
INNER JOIN order_tmp tmp
ON tmp.id = hist.id
WHERE hist.end_date = ‘2148-12-31’;

操作3:插入数据:放入最新的记录
将新记录放入历史表中
INSERT INTO order_hist
SELECT *, ‘${日期参数}’ , ‘2148-12-31’ , 1
FROM order_tmp;
注意:当前版本Date最远可到2149年,但全年日期的支持是2148年。如果年份超出 2149 数据将溢出到 1970 年。

操作完 表3 中的数据如下:
id status create_time update_time start_date end_date sign
1 0 2021-7-5 19:48 2021-7-5 19:48 2021-7-5 2148-12-31 1

同步二

任务目标:将表1 中 update_time 字段日期等于 2021-07-07 的数据迁移 表3 中。
执行时间:2021-07-07 01:00:00
日期参数:2021-07-06

准备1:表1中修改1条增加2条数据

update order set status = 1, update_time = ‘2021-07-06 13:12:43’ where id = 1;
INSERT INTO order VALUES (‘2’, ‘0’, ‘2021-07-06 19:48:27’, ‘2021-07-06 19:48:27’);
INSERT INTO order VALUES (‘3’, ‘0’, ‘2021-07-06 19:48:34’, ‘2021-07-06 19:48:34’);
表数据如下:
id status create_time update_time
1 1 2021-7-5 19:48 2021-7-6 13:12
2 0 2021-7-6 19:48 2021-7-6 19:48
3 0 2021-7-6 19:48 2021-7-6 19:48

准备2:将 表1 中的数据同步到 表2 中
按任务目标查询 表1 中 update_time 字段日期等于 2021-07-05
SELECT * FROM order WHERE update_time >= ‘ 日期参数 ′ A N D u p d a t e t i m e < = ′ {日期参数}' AND update_time <= ' 日期参数ANDupdatetime<={日期参数}’;
表2 为临时表,每次用之前要先清空下
TRUNCATE TABLE order_tmp;
模拟 datax 等工具把查询到的数据同步到 表2 中
INSERT INTO order_tmp (id, status, create_time, update_time) VALUES (‘1’, ‘1’, toDateTime(‘2021-07-05 19:48:20’), toDateTime(‘2021-07-06 13:12:43’));
INSERT INTO order_tmp (id, status, create_time, update_time) VALUES (‘2’, ‘0’, toDateTime(‘2021-07-06 19:48:27’), toDateTime(‘2021-07-06 19:48:27’));
INSERT INTO order_tmp (id, status, create_time, update_time) VALUES (‘3’, ‘0’, toDateTime(‘2021-07-06 19:48:34’), toDateTime(‘2021-07-06 19:48:34’));

操作1:更新数据:被更新数据sign标记-1
通过使用表3关联表2的方式找到 已经有了并且有效的数据进行 -1 标记,就是更新数据中隐藏的删除操作(找出需要更新的数据)
如果实在无法找到被更新的数据,可能就只能用所有字段关联的方式了
INSERT INTO order_hist
SELECT id, status, create_time, update_time, start_date, end_date, -1
FROM order_hist hist
INNER JOIN order_tmp tmp
ON tmp.id = hist.id
WHERE end_date = ‘2148-12-31’;

操作2:更新数据:标记有效结束日期
修改被更新记录的有效结束时间,这标记记录有效 结束日期是个关键,正常运行和补数重跑的时候都要考虑进去,这里使用日期参数作为 记录有效结束日期:
INSERT INTO order_hist
SELECT id, status, create_time, update_time, start_date, addDays(toDate(‘${日期参数}’) , -1) , 1
FROM order_hist hist
INNER JOIN order_tmp tmp
ON tmp.id = hist.id
WHERE hist.end_date = ‘2148-12-31’;

操作3:插入数据:放入最新的记录
将新记录放入历史表中
INSERT INTO order_hist
SELECT *, ‘${日期参数}’ , ‘2148-12-31’ , 1
FROM order_tmp;
注意:当前版本Date最远可到2149年,但全年日期的支持是2148年。如果年份超出 2149 数据将溢出到 1970 年。

OPTIMIZE 前
id status create_time update_time start_date end_date sign
1 1 2021-07-05 19:48 2021-07-06 13:12 2021-07-07 2148-12-31 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-06 2021-07-07 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-06 2021-07-06 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-06 2148-12-31 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-06 2148-12-31 -1
2 0 2021-07-06 19:48 2021-07-06 19:48 2021-07-07 2148-12-31 1
3 0 2021-07-06 19:48 2021-07-06 19:48 2021-07-07 2148-12-31 1

id status create_time update_time start_date end_date sign
1 1 2021-07-05 19:48 2021-07-06 13:12 2021-07-06 2148-12-31 1
2 0 2021-07-06 19:48 2021-07-06 19:48 2021-07-06 2148-12-31 1
3 0 2021-07-06 19:48 2021-07-06 19:48 2021-07-06 2148-12-31 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-05 2148-12-31 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-05 2148-12-31 -1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-05 2021-07-07 1
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-05 2021-07-06 1

OPTIMIZE 后
id status create_time update_time start_date end_date sign
1 0 2021-07-05 19:48 2021-07-05 19:48 2021-07-05 2021-07-05 1
1 1 2021-07-05 19:48 2021-07-06 13:12 2021-07-06 2148-12-31 1
2 0 2021-07-06 19:48 2021-07-06 19:48 2021-07-06 2148-12-31 1
3 0 2021-07-06 19:48 2021-07-06 19:48 2021-07-06 2148-12-31 1

特殊情况
上面的这种操作方式,在下面几种特殊情况下的效果:
1.已经处理的最近日期重跑:没有问题,只是相当于把数据重新处理了一遍;
2.已经处理的最近日期之前的数据重跑:可以重新跑,但在之后的日期跑处理的历史记录会标记为失效,所以要把后面的都重新跑一下;
3.中间断天了,重跑断的这天:还没试过,估计和上面的第二条是一个效果。

  • 2
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Mathcad 是一种功能强大的数学软件,它可以让我们非常方便地进行数学计算和分析。学习使用Mathcad 对我来说是一种挑战,但同时也是一次非常有意义的经历。 首先,通过学习Mathcad,我深刻认识到数学公式和计算的重要性。在过去,我常常依赖计算器和纸笔进行数学计算,但这种方式往往效率不高,并且难以避免出错。而Mathcad 提供了一个直观且可视化的界面,让我们可以更加直观地理解和应用数学原理。 其次,学习Mathcad 也提高了我的数学建模和问题解决能力。在学习过程中,我发现Mathcad 的强大之处在于它能够将数学公式和实际问题相结合。通过输入不同的数学公式和数据,我可以揭示和解决一些实际世界中的问题,例如电路分析、力学运动、概率统计等,这为我提供了宝贵的数学建模实践经验。 此外,通过学习Mathcad,我也掌握了一种全新的数学表达方式。通过使用Mathcad的函数和符号库,我可以更加准确地表达数学概念和计算过程。这种表达方式简洁明了,易于理解和阅读,有助于提高我的数学思维能力和表达能力。 尽管学习Mathcad可能会遇到一些挑战,例如复杂的函数和符号操作,对于初学者来说不太友好,但通过不断实践和学习,我相信我可以克服这些问题,并不断进步。 总的来说,学习Mathcad 是一次充满挑战和收获的过程。它帮助我提高了数学知识和技能,提升了我的数学建模和问题解决能力。我相信在未来的学习和工作中,Mathcad 将成为我强大的数学工具。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值