Colocat Join 理解

前言

         在使用Doris的时候,help create table发现colocat join的建表语句,但是从未见过colocat join这个词语。所以想了解什么意思,但是百度翻译中文“共置联结”,搜了下bing,来勉强解释下。

//Doris help create table
 7. 创建两张支持Colocat Join的表t1 和t2
        CREATE TABLE `t1` (
        `id` int(11) COMMENT "",
        `value` varchar(8) COMMENT ""
        ) ENGINE=OLAP
        DUPLICATE KEY(`id`)
        DISTRIBUTED BY HASH(`id`) BUCKETS 10
        PROPERTIES (
        "colocate_with" = "t1"
        );

        CREATE TABLE `t2` (
        `id` int(11) COMMENT "",
        `value` varchar(8) COMMENT ""
        ) ENGINE=OLAP
        DUPLICATE KEY(`id`)
        DISTRIBUTED BY HASH(`id`) BUCKETS 10
        PROPERTIES (
        "colocate_with" = "t1"
        );

正文

”When two tables are partitioned on columns and colocated, it forces partitions having the same values for those columns in both tables to be located on the same SnappyData server. “

翻译:当两张表按列分区场景下,如果设置colocated,那么分区中相同列如果value相同,会放在一个节点上。

“When colocated tables are joined on the partitioning columns, the join happens locally on the node where data is present, without the need of shuffling the data.”

翻译:分区场景下,当两张表被设置colocate join的话,本节点发生join操作,从而避免shuffle数据。(因为很多分布式数据库做join的时候会伴随shuffle产生,shuffle很消耗资源)

 

下面是spark的colocated样例。

//Get a SnappySession:

val spark: SparkSession = SparkSession
    .builder
    .appName("ColocatedJoinExample")
    .master("local[*]")
    .getOrCreate

val snSession = new SnappySession(spark.sparkContext)

//Create Table Customer:

snSession.sql("CREATE TABLE CUSTOMER ( " +
    "C_CUSTKEY     INTEGER NOT NULL," +
    "C_NAME        VARCHAR(25) NOT NULL," +
    "C_ADDRESS     VARCHAR(40) NOT NULL," +
    "C_NATIONKEY   INTEGER NOT NULL," +
    "C_PHONE       VARCHAR(15) NOT NULL," +
    "C_ACCTBAL     DECIMAL(15,2)   NOT NULL," +
    "C_MKTSEGMENT  VARCHAR(10) NOT NULL," +
    "C_COMMENT     VARCHAR(117) NOT NULL)" +
    "USING COLUMN OPTIONS (PARTITION_BY 'C_CUSTKEY')")

//Create Table Orders:

snSession.sql("CREATE TABLE ORDERS  ( " +
    "O_ORDERKEY       INTEGER NOT NULL," +
    "O_CUSTKEY        INTEGER NOT NULL," +
    "O_ORDERSTATUS    CHAR(1) NOT NULL," +
    "O_TOTALPRICE     DECIMAL(15,2) NOT NULL," +
    "O_ORDERDATE      DATE NOT NULL," +
    "O_ORDERPRIORITY  CHAR(15) NOT NULL," +
    "O_CLERK          CHAR(15) NOT NULL," +
    "O_SHIPPRIORITY   INTEGER NOT NULL," +
    "O_COMMENT        VARCHAR(79) NOT NULL) " +
    "USING COLUMN OPTIONS (PARTITION_BY 'O_CUSTKEY', " +
    "COLOCATE_WITH 'CUSTOMER' )")
//Perform a Colocate join:

// Selecting orders for all customers
val result = snSession.sql("SELECT C_CUSTKEY, C_NAME, O_ORDERKEY, O_ORDERSTATUS, O_ORDERDATE, " +
    "O_TOTALPRICE FROM CUSTOMER, ORDERS WHERE C_CUSTKEY = O_CUSTKEY").collect()

从以上的spark代码来看,业务就是做了join操作,至于底层是否做了colocate,那和支持的数据源有关,业务不感知。

©️2020 CSDN 皮肤主题: 大白 设计师:CSDN官方博客 返回首页