ClickHouse

最新推荐文章于 2023-11-14 19:19:56 发布

超短腿迪迦

最新推荐文章于 2023-11-14 19:19:56 发布

阅读量307

点赞数 2

分类专栏： clickhouse 大数据文章标签： clickhouse 大数据数据库

本文链接：https://blog.csdn.net/m0_65077254/article/details/130117736

版权

大数据同时被 2 个专栏收录

21 篇文章 0 订阅

订阅专栏

clickhouse

1 篇文章 0 订阅

订阅专栏

ClickHouse

介绍

Yandx开源的一个用于实时数据分析的基于列存储的数据库

超过目前市面上可比的面向列的DBMS

1.特性

快速，容错，可伸缩，易用，

充分利用硬件:列式储存格式允许在RAM中容纳更多热数据从而缩短了响应时间

提高cpu效率，优化磁盘访问（可以最大程度减少范围查询次数）

安装包含义

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-iFRkwBnC-1681307558081)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411193955164.png)]

设置hosts文件解析，ck同步表数据会去zookeeper查询，zookeeper储存的是hosts域名不是ip，不解析会导致同步表但是无法同步数据

2.安装步骤

其他安装方式可以参考官网

1.解压

依此将四个安装包解压，并且执行install下的doinst.sh脚本（按顺序）

clickhouse-common-static-21.9.4.35.tgz
clickhouse-common-static-dbg-21.9.4.35.tgz 
clickhouse-server-21.9.4.35.tgz
clickhouse-client-21.9.4.35.tgz

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-C4CY2N7U-1681307558082)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411093027442.png)]

2.运行doinst.sh

在解压clickhouse-server-21.9.4.35并运行 install doinst.sh时，会默认创建一个default用户，并设置密码(不需要可以直接回车)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vyI8qF8T-1681307558082)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411093424076.png)]

3.启动

(可通过clickhouse --help 查看帮忙)

clickhouse start

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-EnYdUZ69-1681307558082)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411093818142.png)]

4.连接clickhouse

如果在步骤2设置了密码

clickhouse-client --passsword回车输入密码

3.示例

查库命令

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-YHh4UUNi-1681307558083)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411094426309.png)]

配置解析

config.xml：服务端配置

users.xml：用户配置

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-YWRbChsu-1681307558083)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411091437356.png)]

其他启动命令

systemctl start clickhouse-server.service

systemctl enable clickhouse-server.service
查库
clickhouse-client --query "show databases"

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-pH2Qyvgs-1681307558083)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230410094759996.png)]

-------------------------------------------------安装完成------------------------------------------

远程访问

将listen的注释打开，并重启clickhouse

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Yvnzl13M-1681307558083)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411100131873.png)]

远程访问成功

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-G5IFLpNg-1681307558084)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411100358650.png)]

(38条消息) ClickHouse安装部署【非常详细】_苡~的博客-CSDN博客

官网

Hive | ClickHouse Docs

[ALL Clause | ClickHouse Docs

[Install ClickHouse | ClickHouse Docs](https://clickhouse.com/docs/en/sql-reference/statements/select/all)

语法

选择查询

[WITH expr_list|(subquery)]
SELECT [DISTINCT [ON (column1, column2, ...)]] expr_list
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[ARRAY JOIN ...]
[GLOBAL] [ANY|ALL|ASOF] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI] JOIN (subquery)|table (ON <expr_list>)|(USING <column_list>)
[PREWHERE expr]
[WHERE expr]
[GROUP BY expr_list] [WITH ROLLUP|WITH CUBE] [WITH TOTALS]
[HAVING expr]
[ORDER BY expr_list] [WITH FILL] [FROM expr] [TO expr] [STEP expr] [INTERPOLATE [(expr_list)]]
[LIMIT [offset_value, ]n BY columns]
[LIMIT [n, ]m] [WITH TIES]
[SETTINGS ...]
[UNION  ...]
[INTO OUTFILE filename [COMPRESSION type [LEVEL level]] ]
[FORMAT format]

创表

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [ALIAS expr1],
    name2 [type2] [ALIAS expr2],
    ...
) ENGINE = Hive('thrift://host:port', 'database', 'table');//与hive集成，可参考后面
PARTITION BY expr

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-H7kHPmJA-1681307558084)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411105844642.png)]

与hive集成

暂时不可用

不能直接复制黏贴多行语句时，在连接clickhouse时加-m

clickhouse-client -m

对hdfs系统使用本地缓存

<local_cache_for_remote_fs>
ClickHouse 将在启动后维护远程文件系统 （HDFS） 的本地缓存
    <enable>true</enable>
    用于存储远程文件系统的本地缓存文件的根目录。
    <root_dir>local_cache</root_dir>
    本地缓存文件的最大大小（以字节为单位）。
    <limit_size>559096952</limit_size>
   从远程文件系统下载文件时，在刷新到本地文件系统之前控制字节。默认值为 1MB。 <bytes_read_before_flush>1048576</bytes_read_before_flush>
</local_cache_for_remote_fs>

当 ClickHouse 在启用远程文件系统的本地缓存的情况下启动时，用户仍然可以选择在查询中不使用缓存。处于默认值。

`settings use_local_cache_for_remote_fs0``

use_local_cache_for_remote_fsfalse`

1.使用orc输入查询配置单元表

1.hive创建表

hive > CREATE TABLE `test`.`test_orc`(
  `f_tinyint` tinyint,
  `f_smallint` smallint,
  `f_int` int,
  `f_integer` int,
  `f_bigint` bigint,
  `f_float` float,
  `f_double` double,
  `f_decimal` decimal(10,0),
  `f_timestamp` timestamp,
  `f_date` date,
  `f_string` string,
  `f_varchar` varchar(100),
  `f_bool` boolean,
  `f_binary` binary,
  `f_array_int` array<int>,
  `f_array_string` array<string>,
  `f_array_float` array<float>,
  `f_array_array_int` array<array<int>>,
  `f_array_array_string` array<array<string>>,
  `f_array_array_float` array<array<float>>)
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://testcluster/data/hive/test.db/test_orc'

OK
Time taken: 0.51 seconds

hive > insert into test.test_orc partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));
OK
Time taken: 36.025 seconds

hive > select * from test.test_orc;
OK
1   2   3   4   5   6.11    7.22    8   2021-11-05 12:38:16.314 2021-11-05  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18
Time taken: 0.295 seconds, Fetched: 1 row(s)

clickhouse创建表

CREATE TABLE test.test_orc
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_bool` Bool,
    `f_binary` String,
    `f_array_int` Array(Int32),
    `f_array_string` Array(String),
    `f_array_float` Array(Float32),
    `f_array_array_int` Array(Array(Int32)),
    `f_array_array_string` Array(Array(String)),
    `f_array_array_float` Array(Array(Float32)),
    `day` String
)
ENGINE = Hive('thrift://202.168.117.26:9083', 'test', 'test_orc')
PARTITION BY day

从上面创建的hive表检索数据

SELECT * FROM test.test_orc settings input_format_orc_allow_missing_columns = 1\G

SELECT *
FROM test.test_orc
SETTINGS input_format_orc_allow_missing_columns = 1

Query id: c3eaffdc-78ab-43cd-96a4-4acc5b480658

Row 1:
──────
f_tinyint:            1
f_smallint:           2
f_int:                3
f_integer:            4
f_bigint:             5
f_float:              6.11
f_double:             7.22
f_decimal:            8
f_timestamp:          2021-12-04 04:00:44
f_date:               2021-12-03
f_string:             hello world
f_varchar:            hello world
f_bool:               true
f_binary:             hello world
f_array_int:          [1,2,3]
f_array_string:       ['hello world','hello world']
f_array_float:        [1.1,1.2]
f_array_array_int:    [[1,2],[3,4]]
f_array_array_string: [['a','b'],['c','d']]
f_array_array_float:  [[1.11,2.22],[3.33,4.44]]
day:                  2021-09-18


1 rows in set. Elapsed: 0.078 sec.

2.使用文本输入格式查询配置单元表

hive

hive >
CREATE TABLE `test`.`test_text`(
  `f_tinyint` tinyint,
  `f_smallint` smallint,
  `f_int` int,
  `f_integer` int,
  `f_bigint` bigint,
  `f_float` float,
  `f_double` double,
  `f_decimal` decimal(10,0),
  `f_timestamp` timestamp,
  `f_date` date,
  `f_string` string,
  `f_varchar` varchar(100),
  `f_char` char(100),
  `f_bool` boolean,
  `f_binary` binary,
  `f_array_int` array<int>,
  `f_array_string` array<string>,
  `f_array_float` array<float>,
  `f_array_array_int` array<array<int>>,
  `f_array_array_string` array<array<string>>,
  `f_array_array_float` array<array<float>>)
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://testcluster/data/hive/test.db/test_text'
Time taken: 0.1 seconds, Fetched: 34 row(s)


hive >  insert into test.test_text partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));
OK
Time taken: 36.025 seconds

hive > select * from test.test_text;
OK
1   2   3   4   5   6.11    7.22    8   2021-12-14 18:11:17.239 2021-12-14  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18
Time taken: 0.624 seconds, Fetched: 1 row(s)

clickhouse

CREATE TABLE test.test_text
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_char` String,
    `f_bool` Bool,
    `day` String
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_text')
PARTITION BY day

查询

SELECT * FROM clickhouse.test_text settings input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'\G

SELECT *
FROM test.test_text
SETTINGS input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-PdUUSyIc-1681307558085)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411105643319.png)]

3.parquet格式

hive

hive >
CREATE TABLE `test`.`test_parquet`(
  `f_tinyint` tinyint,
  `f_smallint` smallint,
  `f_int` int,
  `f_integer` int,
  `f_bigint` bigint,
  `f_float` float,
  `f_double` double,
  `f_decimal` decimal(10,0),
  `f_timestamp` timestamp,
  `f_date` date,
  `f_string` string,
  `f_varchar` varchar(100),
  `f_char` char(100),
  `f_bool` boolean,
  `f_binary` binary,
  `f_array_int` array<int>,
  `f_array_string` array<string>,
  `f_array_float` array<float>,
  `f_array_array_int` array<array<int>>,
  `f_array_array_string` array<array<string>>,
  `f_array_array_float` array<array<float>>)
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://testcluster/data/hive/test.db/test_parquet'
OK
Time taken: 0.51 seconds

hive >  insert into test.test_parquet partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));
OK
Time taken: 36.025 seconds

hive > select * from test.test_parquet;
OK
1   2   3   4   5   6.11    7.22    8   2021-12-14 17:54:56.743 2021-12-14  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18
Time taken: 0.766 seconds, Fetched: 1 row(s)

CREATE TABLE test.test_parquet
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_char` String,
    `f_bool` Bool,
    `f_binary` String,
    `f_array_int` Array(Int32),
    `f_array_string` Array(String),
    `f_array_float` Array(Float32),
    `f_array_array_int` Array(Array(Int32)),
    `f_array_array_string` Array(Array(String)),
    `f_array_array_float` Array(Array(Float32)),
    `day` String
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_parquet')
PARTITION BY day

查询

SELECT * FROM test.test_parquet settings input_format_parquet_allow_missing_columns = 1\G

SELECT *
FROM test_parquet
SETTINGS input_format_parquet_allow_missing_columns = 1

Query id: 4e35cf02-c7b2-430d-9b81-16f438e5fca9

Row 1:
──────
f_tinyint:            1
f_smallint:           2
f_int:                3
f_integer:            4
f_bigint:             5
f_float:              6.11
f_double:             7.22
f_decimal:            8
f_timestamp:          2021-12-14 17:54:56
f_date:               2021-12-14
f_string:             hello world
f_varchar:            hello world
f_char:               hello world
f_bool:               true
f_binary:             hello world
f_array_int:          [1,2,3]
f_array_string:       ['hello world','hello world']
f_array_float:        [1.1,1.2]
f_array_array_int:    [[1,2],[3,4]]
f_array_array_string: [['a','b'],['c','d']]
f_array_array_float:  [[1.11,2.22],[3.33,4.44]]
day:                  2021-09-18

1 rows in set. Elapsed: 0.357 sec.

4.hive数据导入ck

1.利用工具waterdrop

支持从Kafka、HDFS、Kudu中读取数据，进行各种各样的数据处理，并将结果写入ClickHouse、Elasticsearch或者Kafka中。

input_file='hadoop fs -ls /user/hive/warehouse/clickhouse.db/${test_text}//|awk ‘{print $KaTeX parse error: Expected 'EOF', got '}' at position 2: 8}̲'>$ {current_dir}/input_file.txt’
:-]

(38条消息) 篇五|ClickHouse数据导入(Flink、Spark、Kafka、MySQL、Hive)_大数据技术与数仓的博客-CSDN博客

3.用thrift链接，失败

这个是官网写法，但是报错，说找不到这个库，但是给了hdfs绝对路径就好了

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Ae7NSZTq-1681307558085)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411135111812.png)]

分析：可能是版本原因

无法定位到文件路径

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FHMmfKHI-1681307558085)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411135424359.png)]

不支持partition by

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-g8yGkhTQ-1681307558085)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411135632216.png)]

5.以csv为中介转达

Hive是基于Hadoop的数据仓库工具，而ClickHouse是一个高性能的列式数据库管理系统，支持海量数据的快速查询和分析。下面介绍将Hive数据导入到ClickHouse的方法：

先检查是否有以下配置

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-DcR7b1Pw-1681307558086)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411173436517.png)]

1.将Hive数据导出为CSV文件

使用Hive的命令行工具或其他工具，将需要导入到ClickHouse的Hive表数据导出为CSV文件。例如，使用以下命令将Hive表数据导出为CSV文件：

hive -e "SELECT * FROM hive_table" | sed 's/[\t]/,/g' > /path/to/csv/file.csv

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0mopYyPQ-1681307558086)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411173250725.png)]

这个命令将Hive表hive_table的数据导出为CSV文件，并将制表符替换为逗号。

2.创建ClickHouse表

在ClickHouse中创建与Hive表对应的表结构。例如，如果Hive表包含3个列（id，name和age），则可以使用以下命令在ClickHouse中创建相应的表：

CREATE TABLE clickhouse_table (
    id UInt32,
    name String,
    age UInt8,
    aparment String
) ENGINE = MergeTree ORDER BY id;

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-zxb2uxSo-1681307558086)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411180420036.png)]

这个命令创建名为clickhouse_table的表，包含与Hive表相同的3个列，并按照id列排序。

3.将CSV文件导入到ClickHouse

使用ClickHouse的命令行工具或其他工具，将CSV文件导入到ClickHouse中。例如，使用以下命令将CSV文件导入到ClickHouse中：

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LzV3XoP0-1681307558086)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411180342806.png)]

 有列名
 clickhouse-client  --query="INSERT INTO test.student FORMAT CSVWithNames " --format_csv_with_names=1 < /usr/data/student.csv

没有列名
clickhouse-client  --query="INSERT INTO test.student FORMAT CSV " < /usr/data/student.csv

这个命令将CSV文件导入到名为clickhouse_table的ClickHouse表中。

完成以上步骤后，你就可以在ClickHouse中查询导入的数据了。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-M86GKBvD-1681307558087)(C:\Users\超长腿迪迦\AppData\Roaming\Typora\typora-user-images\image-20230411180205073.png)]

分区分片

分区

表分区

1.DDL操作词是partition by 按某一列数据进行分区

2.允许查询在指定了分区键的条件下，尽可能的少读取数据

分片shard

1.复用数据库的分区，相当于在原有的分区下，作为第二层分区，在不同的节点/机器上的体现

2.允许多台机器/节点同并行执行查询，实现了分布式并行计算

多样化的分片功能，为业务优化打开了想象空间。比如在hash sharding的情况下，JOIN计算能够避免数据shuffle，直接在本地进行local join；支持自定义sharding，可以为不同业务和SQL Pattern定制最适合的分片策略；利用自定义sharding功能，通过设置合理的sharding expression可以解决分片间数据倾斜问题等。
另外，sharding机制使得ClickHouse可以横向线性拓展，构建大规模分布式集群，从而具备处理海量数据的能力。

不过ClickHouse的集群的水平拓展目前是一个瓶颈，因为历史数据的存在，避免新增节点之后的数据倾斜是个难点。

1.random随机分片

写入数据会被随机分发到分布式集群的某个节点上

2.constat 固定分片

写入数据会被分发到固定一个节点上

3.column value分片

按照某一列的值进行hash分片

4.自定义表达式分片

指定任意合法表达式

数据partition

ClickHouse支持PARTITION BY子句，在建表时可以指定按照任意合法表达式进行数据分区操作，比如通过toYYYYMM()将数据按月进行分区、toMonday()将数据按照周几进行分区、对Enum类型的列直接每种取值作为一个分区等。

数据Partition在ClickHouse中主要有两方面应用：

在partition key上进行分区裁剪，只查询必要的数据。灵活的partition expression设置，使得可以根据SQL Pattern进行分区设置，最大化的贴合业务特点

对partition进行TTL管理，淘汰过期的分区数据。

1.关键字创建分区

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]

2.使用函数创建分区

es7_node2 :) create table tb_test_has_partitions(`id` Int64,`vipId` Int64,`brandId` Int32,`shopId` Int32, `saleDate` Date,saleMoney Float32) engine = MergeTree() PARTITION BY toYYYYMM(saleDate) ORDER BY (brandId,shopId)
 
CREATE TABLE tb_test_has_partitions
(
    `id` Int64,
    `vipId` Int64,
    `brandId` Int32,
    `shopId` Int32,
    `saleDate` Date,
    `saleMoney` Float32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(saleDate)
ORDER BY (brandId, shopId)
 
Query id: d33702ef-787d-4a27-94c9-e639d61b3012
 
Ok.
 
0 rows in set. Elapsed: 0.006 sec. 
 
es7_node2 :)

3.使用现有字段创建分区

es7_node2 :) create table tb_test_has_partitions2(`id` Int64,`vipId` Int64,`brandId` Int32,`shopId` Int32, `saleDate` Date,saleMoney Float32) engine = MergeTree() PARTITION BY (brandId) ORDER BY (brandId,shopId)
 
CREATE TABLE tb_test_has_partitions2
(
    `id` Int64,
    `vipId` Int64,
    `brandId` Int32,
    `shopId` Int32,
    `saleDate` Date,
    `saleMoney` Float32
)
ENGINE = MergeTree()
PARTITION BY brandId
ORDER BY (brandId, shopId)
 
Query id: 9af127e4-5b78-460c-ae66-f122a24a46e9
 
Ok.
 
0 rows in set. Elapsed: 0.008 sec. 
 
es7_node2 :)

(saleDate)
ORDER BY (brandId, shopId)

Query id: d33702ef-787d-4a27-94c9-e639d61b3012

Ok.

0 rows in set. Elapsed: 0.006 sec.

es7_node2 😃


##### 3.使用现有字段创建分区

```sql
es7_node2 :) create table tb_test_has_partitions2(`id` Int64,`vipId` Int64,`brandId` Int32,`shopId` Int32, `saleDate` Date,saleMoney Float32) engine = MergeTree() PARTITION BY (brandId) ORDER BY (brandId,shopId)
 
CREATE TABLE tb_test_has_partitions2
(
    `id` Int64,
    `vipId` Int64,
    `brandId` Int32,
    `shopId` Int32,
    `saleDate` Date,
    `saleMoney` Float32
)
ENGINE = MergeTree()
PARTITION BY brandId
ORDER BY (brandId, shopId)
 
Query id: 9af127e4-5b78-460c-ae66-f122a24a46e9
 
Ok.
 
0 rows in set. Elapsed: 0.008 sec. 
 
es7_node2 :)

超短腿迪迦

关注

2
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
ClickHouse

指定任意合法表达式数据partitionClickHouse支持PARTITION BY子句，在建表时可以指定按照任意合法表达式进行数据分区操作，比如通过toYYYYMM()将数据按月进行分区、toMonday()将数据按照周几进行分区、对Enum类型的列直接每种取值作为一个分区等。数据Partition在ClickHouse中主要有两方面应用：在partition key上进行分区裁剪，只查询必要的数据。
复制链接

扫一扫