The Pitfall of Rewriting LEFT JOIN as RIGHT JOIN on ClickHouse Distributed Tables


A problem uncovered by a slow query

We run a ClickHouse cluster in production: six servers, each 16 cores / 64 GB RAM with SSDs, laid out as three shards with two replicas each.

Two tables are involved, referred to here as small_table and big_table. Both are ReplicatedMergeTree tables (three shards, two replicas).

small_table has about 790k rows and big_table about 500 million (the data stays unchanged throughout the examples below). In what follows, small_table and big_table are Distributed tables that see the full data set, while small_table_local and big_table_local are the local tables on each node.
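The production DDL isn't reproduced in this post; purely for orientation, here is a minimal, hypothetical sketch of what the small_table pair might look like, assuming the same cluster name (ch_cluster_all) and macro layout as the test tables shown later, and a single UID column:

-- Hypothetical sketch only -- the real production DDL is not shown in the post.
-- Local (per-shard, replicated) table:
CREATE TABLE dwh.small_table_local
(
    `UID` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/dwh/tables/{layer}-{shard}/small_table', '{replica}')
ORDER BY UID;

-- The Distributed table holds no data itself; it routes reads/writes across
-- the three shards:
CREATE TABLE dwh.small_table AS dwh.small_table_local
ENGINE = Distributed('ch_cluster_all', 'dwh', 'small_table_local', rand());

Their on-disk footprint and row counts: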

SELECT 
    table, 
    formatReadableSize(sum(data_compressed_bytes)) AS tc, 
    formatReadableSize(sum(data_uncompressed_bytes)) AS tu, 
    sum(data_compressed_bytes) / sum(data_uncompressed_bytes) AS ratio
FROM system.columns
WHERE (database = currentDatabase()) AND (table IN ('small_table_local', 'big_table_local'))
GROUP BY table
ORDER BY table ASC

┌─table─────────────────────────┬─tc────────┬─tu────────┬──────────────ratio─┐
│ small_table_local             │ 12.87 MiB │ 14.91 MiB │ 0.8633041477100831 │
│ big_table_local               │ 15.46 GiB │ 57.31 GiB │ 0.2697742507036428 │
└───────────────────────────────┴───────────┴───────────┴────────────────────┘
SELECT count(*)
FROM small_table

┌─count()─┐
│  794469 │
└─────────┘

SELECT count(*)
FROM big_table

┌───count()─┐
│ 519898780 │
└───────────┘

Consider the following query:

SELECT a.UID, b.UID FROM dwh.small_table a LEFT JOIN dwh.big_table b ON a.UID = b.UID

This query takes nearly 300 seconds in ClickHouse:

#time clickhouse-client --time --progress --query="
SELECT 
    a.UID, b.UID
FROM
    dwh.small_table a
        LEFT JOIN
    dwh.big_table b ON a.UID = b.UID
" > /dev/null
293.769

real    4m53.798s
user    0m0.574s
sys     0m0.225s

![Memory usage during the query](https://raw.githubusercontent.com/Fanduzi/Figure_bed/master/img/%E6%9F%A5%E8%AF%A2%E5%8D%A0%E7%94%A8%E5%86%85%E5%AD%98.png)

On TiDB the same query takes only about 20 seconds (a cluster with slightly better node count and specs than the CH cluster, slightly more data than CH, and without TiFlash):

# time mysql -uroot -hxx.xx.xx -P4000 -p dwh -e "
SELECT 
    a.UID, b.UID
FROM
    dwh.small_table a
        LEFT JOIN
    dwh.big_table b ON a.UID = b.UID;
" > /dev/null
Enter password:

real    0m20.955s
user    0m11.292s
sys     0m2.321s

I hadn't been using ClickHouse for long and had little hands-on experience, so seeing this result, my first reaction was that I must be using it wrong.

In a JOIN, always put the smaller table on the right

After a round of Baidu and Google, I found a Ctrip article, 每天十亿级数据更新,秒出查询结果,ClickHouse在携程酒店的应用 (roughly: "Billions of row updates per day, sub-second query results: ClickHouse at Ctrip Hotels"), which contains this passage:

When performing a JOIN, always put the table with less data on the right. In ClickHouse, whether it is a Left Join, Right Join, or Inner Join, the engine always takes each record of the right table and checks whether it exists in the left table, so the right table must be the small one.

A bit magical… (For what it's worth, my later understanding is that ClickHouse's hash join loads the right-hand table into an in-memory hash table and streams the left table through it, which is why the right-hand table needs to be the small one.)

We know that in common relational databases such as Oracle and MySQL, LEFT JOIN and RIGHT JOIN can be rewritten into one another equivalently. So rewriting my query as a RIGHT JOIN should "put the small table on the right":

SELECT a.UID, b.UID FROM dwh.big_table b RIGHT JOIN dwh.small_table a ON a.UID = b.UID

Measured:

#time clickhouse-client --time --progress --query="
SELECT 
    a.UID, b.UID
FROM
    dwh.big_table b
        RIGHT JOIN
    dwh.small_table a ON a.UID = b.UID
" > /dev/null
19.588

real    0m19.609s
user    0m0.742s
sys     0m0.293s

To my surprise, it actually worked… Is the CH optimizer really that weak?

To be safe, I compared the two results with a simple count.

LEFT JOIN

#time clickhouse-client --time --progress --query="
SELECT 
    COUNT(*)
FROM
    dwh.small_table a
        LEFT JOIN
    dwh.big_table b ON a.UID = b.UID
"
6042735 -- row count
917.560 -- elapsed seconds

real    15m17.580s
user    0m0.253s
sys     0m0.489s

RIGHT JOIN

#time clickhouse-client --time --progress --query="
SELECT 
    COUNT(*)
FROM
    dwh.big_table b
        RIGHT JOIN
    dwh.small_table a ON a.UID = b.UID
"
6897617 -- row count
11.655 -- elapsed seconds

real    0m11.675s
user    0m0.014s
sys     0m0.017s

The RIGHT JOIN row count is wrong!

With ClickHouse distributed tables, A LEFT JOIN B != B RIGHT JOIN A

Create the test tables

ch-node-05 default@localhost:9000 [dwh]
:) show create table t1;

SHOW CREATE TABLE t1

┌─statement─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE dwh.t1 (`I_ID` String, `CTIME` DateTime) ENGINE = Distributed('ch_cluster_all', 'dwh', 't1_local', rand()) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) show create table t2;

SHOW CREATE TABLE t2

┌─statement─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE dwh.t2 (`I_ID` String, `CTIME` DateTime) ENGINE = Distributed('ch_cluster_all', 'dwh', 't2_local', rand()) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) show create table t1_local;

SHOW CREATE TABLE t1_local

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE dwh.t1_local (`I_ID` String, `CTIME` DateTime) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/dwh/tables/{layer}-{shard}/t1', '{replica}') PARTITION BY toDate(CTIME) ORDER BY I_ID SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) show create table t2_local;

SHOW CREATE TABLE t2_local

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE dwh.t2_local (`I_ID` String, `CTIME` DateTime) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/dwh/tables/{layer}-{shard}/t2', '{replica}') PARTITION BY toDate(CTIME) ORDER BY I_ID SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

Data

ch-node-05 default@localhost:9000 [dwh]
:) select * from t1;

SELECT *
FROM t1

┌─I_ID─┬───────────────CTIME─┐
│ 1    │ 2020-08-27 15:24:05 │
│ 2    │ 2020-08-27 15:24:50 │
│ 8    │ 2020-08-27 15:24:50 │
└──────┴─────────────────────┘
┌─I_ID─┬───────────────CTIME─┐
│ 3    │ 2020-08-27 15:24:50 │
│ 5    │ 2020-08-27 15:24:50 │
│ 9    │ 2020-08-27 15:24:50 │
└──────┴─────────────────────┘
┌─I_ID─┬───────────────CTIME─┐
│ 10   │ 2020-08-27 15:24:50 │
│ 3    │ 2020-08-27 15:24:50 │
│ 6    │ 2020-08-27 15:24:50 │
│ 7    │ 2020-08-27 15:24:50 │
└──────┴─────────────────────┘

10 rows in set. Elapsed: 0.003 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) select * from t2;

SELECT *
FROM t2

┌─I_ID─┬───────────────CTIME─┐
│ 1    │ 2020-08-27 15:25:14 │
└──────┴─────────────────────┘
┌─I_ID─┬───────────────CTIME─┐
│ 2    │ 2020-08-27 15:25:33 │
│ 5    │ 2020-08-27 15:25:33 │
└──────┴─────────────────────┘
┌─I_ID─┬───────────────CTIME─┐
│ 3    │ 2020-08-27 15:25:33 │
│ 3    │ 2020-08-27 15:25:33 │
└──────┴─────────────────────┘

5 rows in set. Elapsed: 0.003 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) SELECT 
:-]     _shard_num, 
:-]     count(*)
:-] FROM 
:-] (
:-]     SELECT 
:-]         _shard_num, 
:-]         a.*
:-]     FROM dwh.t1 AS a
:-] )
:-] GROUP BY _shard_num
:-]     WITH ROLLUP;

SELECT 
    _shard_num, 
    count(*)
FROM 
(
    SELECT 
        _shard_num, 
        a.*
    FROM dwh.t1 AS a
)
GROUP BY _shard_num
    WITH ROLLUP

┌─_shard_num─┬─count()─┐
│          3 │       3 │
│          2 │       3 │
│          1 │       4 │
└────────────┴─────────┘
┌─_shard_num─┬─count()─┐
│          0 │      10 │
└────────────┴─────────┘

4 rows in set. Elapsed: 0.004 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) SELECT 
:-]     _shard_num, 
:-]     count(*)
:-] FROM 
:-] (
:-]     SELECT 
:-]         _shard_num, 
:-]         a.*
:-]     FROM dwh.t2 AS a
:-] )
:-] GROUP BY _shard_num
:-]     WITH ROLLUP;

SELECT 
    _shard_num, 
    count(*)
FROM 
(
    SELECT 
        _shard_num, 
        a.*
    FROM dwh.t2 AS a
)
GROUP BY _shard_num
    WITH ROLLUP

┌─_shard_num─┬─count()─┐
│          3 │       2 │
│          2 │       1 │
│          1 │       2 │
└────────────┴─────────┘
┌─_shard_num─┬─count()─┐
│          0 │       5 │
└────────────┴─────────┘

4 rows in set. Elapsed: 0.005 sec. 

Testing LEFT JOIN vs. RIGHT JOIN

ch-node-05 default@localhost:9000 [dwh]
:) SELECT 
:-]     a.I_ID, 
:-]     b.I_ID
:-] FROM dwh.t2 AS a
:-] LEFT JOIN dwh.t1 AS b ON a.I_ID = b.I_ID
:-] ORDER BY a.I_ID ASC;

SELECT 
    a.I_ID, 
    b.I_ID
FROM dwh.t2 AS a
LEFT JOIN dwh.t1 AS b ON a.I_ID = b.I_ID
ORDER BY a.I_ID ASC

┌─I_ID─┬─b.I_ID─┐
│ 1    │ 1      │
└──────┴────────┘
┌─I_ID─┬─b.I_ID─┐
│ 2    │ 2      │
│ 3    │ 3      │
│ 3    │ 3      │
│ 3    │ 3      │
│ 3    │ 3      │
└──────┴────────┘
┌─I_ID─┬─b.I_ID─┐
│ 5    │ 5      │
└──────┴────────┘

7 rows in set. Elapsed: 0.006 sec. 

ch-node-05 default@localhost:9000 [dwh]
:) SELECT 
:-]     a.I_ID, 
:-]     b.I_ID
:-] FROM dwh.t1 AS b
:-] RIGHT JOIN dwh.t2 AS a ON a.I_ID = b.I_ID
:-] ORDER BY a.I_ID ASC;

SELECT 
    a.I_ID, 
    b.I_ID
FROM dwh.t1 AS b
RIGHT JOIN dwh.t2 AS a ON a.I_ID = b.I_ID
ORDER BY a.I_ID ASC

┌─a.I_ID─┬─I_ID─┐
│ 1      │      │
│ 1      │ 1    │
│ 1      │      │
│ 2      │      │
│ 2      │ 2    │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 2      │      │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 3      │ 3    │
│ 3      │ 3    │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 3      │      │
│ 3      │      │
│ 3      │ 3    │
│ 3      │ 3    │
│ 5      │      │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 5      │      │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 5      │ 5    │
└────────┴──────┘

You can see that the RIGHT JOIN returned some incorrect rows. The cause (confirmed in the issue linked at the end): for a join between two Distributed tables, each shard runs the join between its local slice of the left table and the full right-hand Distributed table, and the initiator merely concatenates the per-shard results. Every t2 row that has no match in a given shard's slice of t1 is therefore emitted once per shard, padded with default values.
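As a sketch, this is what each of the three shards effectively executes (my reading of the behavior, not an official rewrite rule):

-- Effectively run on every shard: the leftmost Distributed table is swapped
-- for its local table, while the right-hand Distributed table is re-read in
-- full on each shard.
SELECT
    a.I_ID,
    b.I_ID
FROM dwh.t1_local AS b
RIGHT JOIN dwh.t2 AS a ON a.I_ID = b.I_ID;
-- With 3 shards, every t2 row comes back 3 times: matched on the shards whose
-- t1 slice contains the key, default-padded on the others.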

Does that mean distributed tables simply can't be used for this SQL? If I want a RIGHT JOIN, am I limited to a single machine? What happened to the promised horizontal scaling?

Let's look at single-machine query speed, then…

For this I created two tables on one CH node, small_table_total and big_table_total. Neither is a Distributed table, and both hold the full data set.
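The DDL for these tables isn't shown in the post; a minimal sketch of one way to build them (assuming UID as the sort key) is CREATE ... AS SELECT from the Distributed tables, which read all shards:

-- Hypothetical sketch: materialize full copies on a single node.
CREATE TABLE dwh.small_table_total
ENGINE = MergeTree
ORDER BY UID AS
SELECT * FROM dwh.small_table;

CREATE TABLE dwh.big_table_total
ENGINE = MergeTree
ORDER BY UID AS
SELECT * FROM dwh.big_table;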

SELECT count(*)
FROM `big_table_total`

┌───count()─┐
│ 519898780 │
└───────────┘

1 rows in set. Elapsed: 0.001 sec. 


SELECT count(*)
FROM `small_table_total`

┌─count()─┐
│  794469 │
└─────────┘

A Distributed table can only be joined with another Distributed table; it cannot be joined with a local table that exists on only one node

#clickhouse-client --time --progress --query="SELECT count(*) from dwh.big_table b RIGHT JOIN dwh.small_table_total a on a.UID = b.UID"      
→ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.) Received exception from server (version 20.3.5):
Code: 60. DB::Exception: Received from localhost:9000. DB::Exception: Received from bj2-ch-node-04:9000. DB::Exception: Table dwh.small_table_total doesn't exist.. 
0.107

Testing the query speed

#clickhouse-client --time --progress --query="
SELECT 
    COUNT(*)
FROM
    dwh.big_table_total b
        RIGHT JOIN
    dwh.small_table_total a ON a.UID = b.UID
"
6042735 -- row count
7.262 -- elapsed seconds

Well now: the count is correct, and it's even faster than the sharded query.

So are local tables the only option?

To get correct RIGHT JOIN results from distributed tables, the SQL has to be rewritten

Original statement

SELECT 
    a.UID, b.UID
FROM
    dwh.small_table a
        LEFT JOIN
    dwh.big_table b ON a.UID = b.UID

Rewritten as an INNER JOIN plus UNION ALL, but without swapping the table order (poor performance):

SELECT 
    a.UID, b.UID
FROM
    dwh.small_table a
        GLOBAL INNER JOIN
    dwh.big_table b ON a.UID = b.UID 
UNION ALL 
SELECT 
    a.UID, NULL
FROM
    dwh.small_table a
WHERE
    a.UID GLOBAL NOT IN (SELECT UID FROM dwh.big_table)

At this point I hadn't understood why GLOBAL is needed here. (My best understanding: without GLOBAL, every shard re-evaluates the right-hand side or IN subquery against the Distributed table, so the big table gets scanned once per shard; with GLOBAL, the initiator evaluates it once into a temporary table and broadcasts that to all shards.)
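A sketch of the contrast (the comments reflect my understanding, assuming the cluster permits double-distributed subqueries, as this one evidently does):

-- Plain NOT IN: the subquery is re-executed on every shard, and on each shard
-- dwh.big_table is again the Distributed table -- with N shards the big table
-- is read roughly N times.
SELECT a.UID
FROM dwh.small_table AS a
WHERE a.UID NOT IN (SELECT UID FROM dwh.big_table);

-- GLOBAL NOT IN: the initiator runs the subquery once, stores the result in a
-- temporary table, and ships that table to every shard.
SELECT a.UID
FROM dwh.small_table AS a
WHERE a.UID GLOBAL NOT IN (SELECT UID FROM dwh.big_table);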

In my case this statement never ran to completion at all: GLOBAL is far too memory-hungry.

SELECT 
    a.UID, 
    b.UID
FROM dwh.small_table AS a
GLOBAL INNER JOIN dwh.big_table AS b ON a.UID = b.UID
UNION ALL
SELECT 
    a.UID, 
    NULL
FROM dwh.small_table AS a
WHERE a.UID GLOBAL NOT IN 
(
    SELECT UID
    FROM dwh.big_table
)

↑ Progress: 220.53 million rows, 29.82 GB (51.24 million rows/s., 6.93 GB/s.) ████████████████████████████████████▋                                                                                                                                          20%Received exception from server (version 20.3.5):
Code: 241. DB::Exception: Received from localhost:9000. DB::Exception: Memory limit (for query) exceeded: would use 50.00 GiB (attempt to allocate chunk of 4249200 bytes), maximum: 50.00 GiB: (while reading column CH_BILL_USER_TELEPHONE): (while reading from part /data/clickhouse/ch_9000/data/dwh/big_table_local/201910_0_5_1/ from mark 216 with max_rows_to_read = 8192): 
Code: 241, e.displayText() = DB::Exception: Memory limit (for query) exceeded: would use 50.00 GiB (attempt to allocate chunk of 4227680 bytes), maximum: 50.00 GiB: (while reading column CH_XXX_NAME): (while reading from part /data/clickhouse/ch_9000/data/dwh/big_table_local/202001_6_11_1/ from mark 240 with max_rows_to_read = 8192) (version 20.3.5.21 (official build)): 
Code: 241, e.displayText() = DB::Exception: Memory limit (for query) exceeded: would use 50.00 GiB (attempt to allocate chunk of 5211280 bytes), maximum: 50.00 GiB: (avg_value_size_hint = 66, avg_chars_size = 69.6, limit = 8192): (while reading column CH_BROKER_NAME): (while reading from part /data/clickhouse/ch_9000/data/dwh/big_table_local/202007_6_11_1/ from mark 24 with max_rows_to_read = 8192) (version 20.3.5.21 (official build)): 
Code: 241, e.displayText() = DB::Exception: Memory limit (for query) exceeded: would use 50.00 GiB (attempt to allocate chunk of 4572064 bytes), maximum: 50.00 GiB: (avg_value_size_hint = 66.00048828125, avg_chars_size = 69.6005859375, limit = 8192): (while reading column CH_XX_NAME): (while reading from part /data/clickhouse/ch_9000/data/dwh/big_table_local/201805_2_2_0/ from mark 24 with max_rows_to_read = 8192) (version 20.3.5.21 (official build)): While executing CreatingSetsTransform. 

0 rows in set. Elapsed: 4.404 sec. Processed 220.53 million rows, 29.82 GB (50.07 million rows/s., 6.77 GB/s.) 

![GLOBAL JOIN memory usage](https://raw.githubusercontent.com/Fanduzi/Figure_bed/master/img/GLOBAL_JOIN_memory.png)

Dropping GLOBAL from the JOIN doesn't work either (the GLOBAL on the NOT IN cannot be dropped):

SELECT 
    a.UID, 
    b.UID
FROM dwh.small_table AS a
INNER JOIN dwh.big_table AS b ON a.UID = b.UID
UNION ALL
SELECT 
    a.UID, 
    NULL
FROM dwh.small_table AS a
WHERE a.UID GLOBAL NOT IN 
(
    SELECT UID
    FROM dwh.big_table
)

↑ Progress: 1.91 billion rows, 105.59 GB (6.36 million rows/s., 352.30 MB/s.) ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                90%Received exception from server (version 20.3.5):
Code: 241. DB::Exception: Received from localhost:9000. DB::Exception: Received from clickhouse-node-01:9000. DB::Exception: Memory limit (total) exceeded: would use 50.10 GiB (attempt to allocate chunk of 133829856 bytes), maximum: 50.00 GiB: While executing CreatingSetsTransform. 

0 rows in set. Elapsed: 299.809 sec. Processed 1.91 billion rows, 105.59 GB (6.35 million rows/s., 352.18 MB/s.) 

![JOIN memory usage](https://raw.githubusercontent.com/Fanduzi/Figure_bed/master/img/JOIN_memory.png)

Rewriting the table order so the small table is on the "right"

SELECT a.UID, b.UID FROM dwh.big_table b
GLOBAL INNER JOIN dwh.small_table a
ON a.UID = b.UID
UNION ALL
SELECT a.UID, NULL FROM dwh.small_table a
WHERE a.UID GLOBAL NOT IN
(
    SELECT UID FROM dwh.big_table
    WHERE UID GLOBAL IN (SELECT UID FROM dwh.small_table)
)

Measured. (Note the inner GLOBAL IN pre-filter: it shrinks the NOT IN set to the UIDs that actually occur in small_table, which is what keeps memory usage manageable.)

time clickhouse-client --time --progress --query="
SELECT a.UID,b.UID FROM dwh.big_table b
GLOBAL INNER JOIN dwh.small_table a
on a.UID = b.UID
UNION ALL
SELECT a.UID,null FROM dwh.small_table a
WHERE a.UID GLOBAL NOT IN
(
    SELECT UID FROM dwh.big_table
    WHERE UID GLOBAL IN (SELECT UID FROM dwh.small_table)
)" >/dev/null
21.142

real    0m21.164s
user    0m1.133s
sys     0m0.378s

![Rewritten SQL memory usage](https://raw.githubusercontent.com/Fanduzi/Figure_bed/master/img/rewrite_sql_memory.png)

Check that the row count is correct:

time clickhouse-client --time --progress --query="
SELECT sum(cnt) FROM (
SELECT count(*)  cnt FROM dwh.big_table b
GLOBAL INNER JOIN dwh.small_table a
on a.UID = b.UID
UNION ALL
SELECT count(*) cnt FROM dwh.small_table a
WHERE a.UID GLOBAL NOT IN
(
    SELECT UID FROM dwh.big_table
    WHERE UID GLOBAL IN (SELECT UID FROM dwh.small_table)
))"
6042735 -- row count
12.525 -- elapsed seconds

real    0m12.545s
user    0m0.018s
sys     0m0.012s

One last issue

I wonder whether you noticed:

ClickHouse

SELECT 
    a.I_ID, 
    b.I_ID
FROM dwh.t1 AS b
RIGHT JOIN dwh.t2 AS a ON a.I_ID = b.I_ID
ORDER BY a.I_ID ASC

┌─a.I_ID─┬─I_ID─┐
│ 1      │      │
│ 1      │ 1    │
│ 1      │      │
│ 2      │      │
│ 2      │ 2    │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 2      │      │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 3      │ 3    │
│ 3      │ 3    │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 3      │      │
│ 3      │      │
│ 3      │ 3    │
│ 3      │ 3    │
│ 5      │      │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 5      │      │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 5      │ 5    │
└────────┴──────┘

MySQL (an unrelated example, just to illustrate the behavior):

root@localhost 15:15:26 [fanboshi]> select a.id,b.id from t1 a left join t3 b on a.id=b.id; 
+----+------+
| id | id   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 | NULL |
|  6 | NULL |
|  7 | NULL |
|  8 | NULL |
|  9 | NULL |
| 11 | NULL |
| 13 | NULL |
| 15 | NULL |
| 17 | NULL |
| 19 | NULL |
| 21 | NULL |
| 23 | NULL |
| 25 | NULL |
| 27 | NULL |
| 29 | NULL |
| 30 | NULL |
| 31 | NULL |
+----+------+
21 rows in set (0.00 sec)

Unlike MySQL, outer joins in CH do not pad unmatched rows with NULL; they fill in the default value of the column's data type instead (empty string for String, 0 for numeric types, and so on).

https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/Settings.h#L189

join_use_nulls can be set per statement, or persistently in a user profile:
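For instance, at the session level (the per-query SETTINGS form is used below):

-- Enable NULL padding for outer joins in the current session:
SET join_use_nulls = 1;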

SELECT 
    a.I_ID, 
    b.I_ID
FROM dwh.t1 AS b
RIGHT JOIN dwh.t2 AS a ON a.I_ID = b.I_ID
ORDER BY toUInt32(a.I_ID) ASC
SETTINGS join_use_nulls = 1

┌─a.I_ID─┬─I_ID─┐
│ 1      │ ᴺᵁᴸᴸ │
│ 1      │ 1    │
│ 1      │ ᴺᵁᴸᴸ │
│ 2      │ ᴺᵁᴸᴸ │
│ 2      │ 2    │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 2      │ ᴺᵁᴸᴸ │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 3      │ 3    │
│ 3      │ 3    │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 3      │ ᴺᵁᴸᴸ │
│ 3      │ ᴺᵁᴸᴸ │
│ 3      │ 3    │
│ 3      │ 3    │
│ 5      │ ᴺᵁᴸᴸ │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 5      │ ᴺᵁᴸᴸ │
└────────┴──────┘
┌─a.I_ID─┬─I_ID─┐
│ 5      │ 5    │
└────────┴──────┘

15 rows in set. Elapsed: 0.015 sec. 

Conclusion

Nothing earth-shattering, really:

  1. Write the small table last, i.e. on the right.
  2. Don't use RIGHT JOIN against distributed tables; rewrite the SQL as in the example above (a generic template follows).

References

I filed an issue; for the detailed explanation see https://github.com/ClickHouse/ClickHouse/issues/14160

In short, apart from LEFT JOIN: "For other OUTER JOINs there's no general solution to return expected result yet."
