需求
- 统计系统数据,日均数据量:100亿
- topic:16个
- 落库clickhouse,分布式表
- 支持数据采样查询:SAMPLE BY(下面详解)
- 预计设计随机列作为SAMP BY 对象,但是clickhouse---->20.5版本不支持,估计以后也不会支持
创建分布式表
-
引擎:GraphiteMergeTree随机数生成器,属于MergeTree子类
-
使用方法:CK链接
ENGINE = GenerateRandom(random_seed, max_string_length, max_array_length)
-
但是此表引擎不适用SAMPLE BY
-
继续选用MergeTree创建分布式表:MergeTree
-
语句
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] ( name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], ... INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1, INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2 ) ENGINE = MergeTree() [PARTITION BY expr] #分区键 [ORDER BY expr] #表的排序键 [PRIMARY KEY expr] #主键,如果要设置,跟排序键不相同 [SAMPLE BY expr] #用于抽样的表达式,数据抽样 [SETTINGS name=value, ...] #影响 MergeTree 性能的额外参数,这个参数CK有详细解释
-
分布式创建
-
指定zookeeper地址:
ENGINE =ReplicatedMergeTree('zoo_path/table_name', 'replica_name') 例: ENGINE =ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_name', '{replica}')
-
需要加入zookeeper集群地址来进行管理元数据,需要ZooKeeper 3.4.5或以上
-
关于zookeeper的用途CK给了更好的解释zookeeper在clickhouse中的应用
-
建表
CREATE TABLE dmp.ssp_census_01 ON CLUSTER ssp_census_cluster(request_id String,date Date,date_hour String,adx_id String,publish_id String,media_id String,pos_id String,dsp_id String,dsp_pos_id String,ad_id String,oem_id String,token String,media_type String,settle_type String,platform String,system_type String,slot_type String,slot_tag String,media_tag String,version String,plant_id String,cost_time Int64) ENGINE=ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/ssp_census_01', '{replica}') PARTITION BY toYYYYMMDD(date) ORDER BY (date,date_hour,adx_id,ad_id,intHash32(request_id)) SAMPLE BY intHash32(request_id) CREATE TABLE dmp.ssp_censun_all(分布式表名) ON CLUSTER ssp_census_cluster AS dmp.ssp_census_01(本地表) ENGINE = Distributed(ssp_census_cluster, dmp, ssp_census_01, rand())
-
-
SAMPLE BY
-
使用sample语句,需要启用数据采样,CK链接
-
启用数据采样:sample by(column),其中column必须为主键
-
启用数据采样后可在查询语句中指定
SELECT Title, count() * 10 AS PageViews FROM hits_distributed SAMPLE 0.1 WHERE CounterID = 34 GROUP BY Title ORDER BY PageViews DESC LIMIT 1000
-
以上语句只是针对于要求的所有数据的随机10%进行执行,
-
SAMPLE K OFFSET M,K,M都为0-1的小数,表示从要求数据中的后100*M%数据中随机抽取100*K%进行采样
-