Preface
While using Doris, I ran `help create table` and noticed a CREATE TABLE example for "Colocate Join", a term I had never come across before. I wanted to understand what it meant; Baidu Translate renders it as "共置联结" (co-located join), which didn't help much, so I searched Bing and pieced together the rough explanation below.
-- Doris help create table
7. Create two tables t1 and t2 that support Colocate Join
CREATE TABLE `t1` (
`id` int(11) COMMENT "",
`value` varchar(8) COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"colocate_with" = "t1"
);
CREATE TABLE `t2` (
`id` int(11) COMMENT "",
`value` varchar(8) COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"colocate_with" = "t1"
);
Main text
"When two tables are partitioned on columns and colocated, it forces partitions having the same values for those columns in both tables to be located on the same SnappyData server."
In other words: when two partitioned tables are colocated, partitions whose partitioning columns hold the same values are placed on the same node.
"When colocated tables are joined on the partitioning columns, the join happens locally on the node where data is present, without the need of shuffling the data."
In other words: when colocated tables are joined on their partitioning columns, the join runs locally on the node that already holds the data, so no shuffle is needed. (Many distributed databases must shuffle data to perform a join, and shuffling is expensive.)
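The shuffle-free property comes from both tables using the same distribution function: a row with join key k lands in bucket hash(k) % N in both tables, so every matching pair of rows already sits on the same node. Here is a minimal Python simulation of that idea (the toy tables, bucket count, and variable names are all made up for illustration, not part of any real engine):

```python
N_BUCKETS = 4  # both tables must use the same bucket count

def bucket(key: int) -> int:
    # Identical hash distribution for both tables -> colocation
    return hash(key) % N_BUCKETS

# Toy "customer" and "orders" rows keyed by custkey
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]       # (custkey, name)
orders = [(101, 1), (102, 3), (103, 1)]                    # (orderkey, custkey)

# Distribute each table into buckets by the join key
cust_buckets = [[] for _ in range(N_BUCKETS)]
for custkey, name in customers:
    cust_buckets[bucket(custkey)].append((custkey, name))

order_buckets = [[] for _ in range(N_BUCKETS)]
for orderkey, custkey in orders:
    order_buckets[bucket(custkey)].append((orderkey, custkey))

# Each "node" joins only its own pair of buckets -- no data movement
local_join = [
    (custkey, name, orderkey)
    for b in range(N_BUCKETS)
    for custkey, name in cust_buckets[b]
    for orderkey, ck in order_buckets[b]
    if ck == custkey
]

# The union of the per-bucket joins equals the global join
global_join = [
    (custkey, name, orderkey)
    for custkey, name in customers
    for orderkey, ck in orders
    if ck == custkey
]
assert sorted(local_join) == sorted(global_join)
```

This is why both Doris (`DISTRIBUTED BY HASH(...) BUCKETS 10` plus `colocate_with`) and SnappyData (`PARTITION_BY` plus `COLOCATE_WITH`) require the tables to share the same partitioning scheme on the join columns.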
Below is a SnappyData (Spark) colocated-join example.
// Get a SnappySession:
import org.apache.spark.sql.{SnappySession, SparkSession}
val spark: SparkSession = SparkSession
.builder
.appName("ColocatedJoinExample")
.master("local[*]")
.getOrCreate
val snSession = new SnappySession(spark.sparkContext)
//Create Table Customer:
snSession.sql("CREATE TABLE CUSTOMER ( " +
"C_CUSTKEY INTEGER NOT NULL," +
"C_NAME VARCHAR(25) NOT NULL," +
"C_ADDRESS VARCHAR(40) NOT NULL," +
"C_NATIONKEY INTEGER NOT NULL," +
"C_PHONE VARCHAR(15) NOT NULL," +
"C_ACCTBAL DECIMAL(15,2) NOT NULL," +
"C_MKTSEGMENT VARCHAR(10) NOT NULL," +
"C_COMMENT VARCHAR(117) NOT NULL)" +
"USING COLUMN OPTIONS (PARTITION_BY 'C_CUSTKEY')")
//Create Table Orders:
snSession.sql("CREATE TABLE ORDERS ( " +
"O_ORDERKEY INTEGER NOT NULL," +
"O_CUSTKEY INTEGER NOT NULL," +
"O_ORDERSTATUS CHAR(1) NOT NULL," +
"O_TOTALPRICE DECIMAL(15,2) NOT NULL," +
"O_ORDERDATE DATE NOT NULL," +
"O_ORDERPRIORITY CHAR(15) NOT NULL," +
"O_CLERK CHAR(15) NOT NULL," +
"O_SHIPPRIORITY INTEGER NOT NULL," +
"O_COMMENT VARCHAR(79) NOT NULL) " +
"USING COLUMN OPTIONS (PARTITION_BY 'O_CUSTKEY', " +
"COLOCATE_WITH 'CUSTOMER' )")
//Perform a Colocate join:
// Selecting orders for all customers
val result = snSession.sql("SELECT C_CUSTKEY, C_NAME, O_ORDERKEY, O_ORDERSTATUS, O_ORDERDATE, " +
"O_TOTALPRICE FROM CUSTOMER, ORDERS WHERE C_CUSTKEY = O_CUSTKEY").collect()
Judging from the Spark code above, the application simply performs a join; whether the engine actually executes it as a colocate join depends on the underlying data source, and the application is not aware of it.