StarRocks分区表历史数据删除与管理

一、背景介绍

  • 在使用 StarRocks 时,可能会遇到需要删除大批量数据的情况。然而,StarRocks 对 DELETE 操作的支持并不理想,主要存在以下问题
  1. 不建议执行高频的 DELETE 操作:删除的数据会标记为“Deleted”,暂时保留在 Segment 中,不会立即进行物理删除。Compaction(数据版本合并)完成之后会被回收。
  2. 查询效率可能降低:执行 DELETE 语句后,可能会导致接下来一段时间内(Compaction 完成之前)的查询效率降低。
  3. DELETE 语句不支持函数传参:例如:
DELETE FROM xxx  WHERE DATE(time) < DATE_SUB(CURDATE(), INTERVAL 2 DAY); 

报错:ERROR 1064 (HY000): Getting analyzing error from line 1, column 48 to line 1, column 57. Detail message: Left expr of binary predicate should be column name.

为了更高效地管理分区表中的历史数据,我们可以使用 truncate table partition 操作。然而,StarRocks不支持在 truncate table partition 中使用 WHERE 条件。因此,我们需要编写一个 Shell 脚本来生成相应的 SQL 语句。

二、使用示例

1、创建分区测试表

CREATE  TABLE example_db.test_tb
(   
   `partition_date` date NULL COMMENT "分区日期(yyyy-mm-dd)",
    `id` int NULL COMMENT "主键id",
   `name` STRING NULL COMMENT "姓名",
  `age`  STRING NULL COMMENT "年龄",
   `time` datetime NULL  COMMENT "etl_时间"
) ENGINE = OLAP
DUPLICATE KEY(`partition_date`,`id`)
PARTITION BY RANGE(`partition_date`)
(START ("2024-01-01") END ("2024-06-01") EVERY (INTERVAL 1 MONTH))
DISTRIBUTED BY HASH(`partition_date`) BUCKETS 3
PROPERTIES (
"replication_num" = "3",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.prefix" = "p",
"dynamic_partition.end" = "2",
"dynamic_partition.buckets" = "1"
);

dynamic_partition.enable:开启动态分区特性
dynamic_partition.time_unit:动态分区的时间粒度,取值为 HOURDAY、WEEK、MONTHYEAR
dynamic_partition.prefix:动态分区的前缀名,默认值为 p
dynamic_partition.end:提前创建的分区数量,取值范围为正整数。根据 dynamic_partition.time_unit 属性的不同,以当天(周/月)为基准,提前创建对应范围的分区
dynamic_partition.start:TTL保留的动态分区的起始偏移,取值范围为负整数。根据 dynamic_partition.time_unit 属性的不同,以当天(周/月)为基准,分区范围在此偏移之前的分区将会被删除。(如果不填写,则默认为 Integer.MIN_VALUE,即 -2147483648,表示不删除历史分区。)
dynamic_partition.buckets:分区创建的分桶数

2、插入测试数据

mysql> INSERT INTO  example_db.test_tb VALUES
    -> ('2024-01-01','1', '张三', '20','2024-01-01 10:00:00'),
    -> ('2024-02-01','2', '李四', '20','2024-02-01 10:00:00'),
    -> ('2024-03-01','3', '王五', '20','2024-03-01 10:00:00');
Query OK, 3 rows affected (0.41 sec)
{'label':'insert_ff644af6-2caa-11ef-80c1-1aa2745601c2', 'status':'VISIBLE', 'txnId':'4964505'}

mysql> select * from example_db.test_tb ;
+----------------+------+--------+------+---------------------+
| partition_date | id   | name   | age  | time                |
+----------------+------+--------+------+---------------------+
| 2024-02-01     |    2 | 李四   | 20   | 2024-02-01 10:00:00 |
| 2024-01-01     |    1 | 张三   | 20   | 2024-01-01 10:00:00 |
| 2024-03-01     |    3 | 王五   | 20   | 2024-03-01 10:00:00 |
+----------------+------+--------+------+---------------------+
3 rows in set (0.02 sec)

3、查看分区数据

mysql> show partitions from example_db.test_tb;
+-------------+---------------+----------------+---------------------+--------------------+--------+----------------+----------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+--------------------------+----------+------------+----------+
| PartitionId | PartitionName | VisibleVersion | VisibleVersionTime  | VisibleVersionHash | State  | PartitionKey   | Range                                                                      | DistributionKey | Buckets | ReplicationNum | StorageMedium | CooldownTime        | LastConsistencyCheckTime | DataSize | IsInMemory | RowCount |
+-------------+---------------+----------------+---------------------+--------------------+--------+----------------+----------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+--------------------------+----------+------------+----------+
| 16077719    | p202401       | 2              | 2024-06-17 21:10:41 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-01-01]; ..types: [DATE]; keys: [2024-02-01]; ) | partition_date  | 3       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077720    | p202402       | 2              | 2024-06-17 21:10:41 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-02-01]; ..types: [DATE]; keys: [2024-03-01]; ) | partition_date  | 3       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077721    | p202403       | 2              | 2024-06-17 21:10:41 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-03-01]; ..types: [DATE]; keys: [2024-04-01]; ) | partition_date  | 3       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077722    | p202404       | 1              | 2024-06-17 21:10:25 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-04-01]; ..types: [DATE]; keys: [2024-05-01]; ) | partition_date  | 3       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077723    | p202405       | 1              | 2024-06-17 21:10:25 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-05-01]; ..types: [DATE]; keys: [2024-06-01]; ) | partition_date  | 3       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077786    | p202406       | 1              | 2024-06-17 21:10:25 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-06-01]; ..types: [DATE]; keys: [2024-07-01]; ) | partition_date  | 1       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077791    | p202407       | 1              | 2024-06-17 21:10:25 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-07-01]; ..types: [DATE]; keys: [2024-08-01]; ) | partition_date  | 1       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
| 16077796    | p202408       | 1              | 2024-06-17 21:10:25 | 0                  | NORMAL | partition_date | [types: [DATE]; keys: [2024-08-01]; ..types: [DATE]; keys: [2024-09-01]; ) | partition_date  | 1       | 3              | HDD           | 9999-12-31 23:59:59 | NULL                     | 0B       | false      | 0        |
+-------------+---------------+----------------+---------------------+--------------------+--------+----------------+----------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+--------------------------+----------+------------+----------+
8 rows in set (0.00 sec)

三、Shell脚本示例

  • 以下是一个示例Shell脚本,用于根据当前时间删除指定时间段前的历史分区数据。该脚本支持按年、月、日分区进行删除。
#!/bin/bash

# 配置数据库连接信息
ip="xxx"
port="xxx"
user="xxx"
passwd="xxx"
db="xxx"
tb="xxx"

# 检查输入参数
if [ $# -ne 3 ]; then
    echo "Usage: $0 {DAY|YEAR|MONTH} {start_date} {end_date}"
    echo "Example: $0 DAY 2024-06-08 2024-06-14"
    exit 1
fi

# 获取参数
partition_type=$1
start_date=$2
end_date=$3

# 转换日期格式
start_date=$(date -d "$start_date" +%Y%m%d)
end_date=$(date -d "$end_date" +%Y%m%d)

# 生成分区列表
partitions=""

case $partition_type in
    DAY)
        current_date=$start_date
        while [ "$current_date" -le "$end_date" ]; do
            partitions+="p$current_date,"
            current_date=$(date -d "$current_date + 1 day" +%Y%m%d)
        done
        ;;

    YEAR)
        start_year=$(date -d "$start_date" +%Y)
        end_year=$(date -d "$end_date" +%Y)
        for ((year=$start_year; year<=$end_year; year++)); do
            partitions+="p$year,"
        done
        ;;

    MONTH)
        start_month=$(date -d "$start_date" +%Y%m)
        end_month=$(date -d "$end_date" +%Y%m)
        current_month=$start_month
        while [ "$(date -d "${current_month}01" +%Y%m)" -le "$end_month" ]; do
            partitions+="p$current_month,"
            current_month=$(date -d "${current_month}01 + 1 month" +%Y%m)
        done
        ;;

    *)
        echo "Invalid partition type: $partition_type"
        echo "Usage: $0 {DAY|YEAR|MONTH} {start_date} {end_date}"
        exit 1
        ;;
esac

# 去掉最后一个逗号
partitions=${partitions%,}

# 生成truncate table语句
truncate_stmt="truncate table $db.$tb partition ($partitions);"

# 输出truncate语句
echo $truncate_stmt

# 执行truncate语句并检查结果
if echo $truncate_stmt | mysql -h $ip -u$user -p"$passwd" -P$port $db; then
    echo "执行成功!"
else
    echo "执行失败!"
fi

1、使用示例

假设我们需要删除以下分区:

  1. 按年分区:从2022-01-01到2024-06-01
  2. 按月分区:从2024-01-01到2024-06-01
  3. 按日分区:从2024-01-01到2024-06-01

我们可以通过以下命令运行脚本:

  • 按年分区
./truncate_partitions.sh YEAR 2022-01-01 2024-06-01

生成truncate table语句:
truncate table example_db.test_tb partition (p2022,p2023,p2024);
执行成功!
  • 按月分区
./truncate_partitions.sh MONTH 2024-01-01 2024-06-01

生成truncate table语句:
truncate table example_db.test_tb partition (p202401,p202402,p202403,p202404,p202405,p202406);
执行成功!
  • 按日分区
./truncate_partitions.sh DAY 2024-01-01 2024-06-01

生成truncate table语句:
truncate table example_db.test_tb partition (p20240101,p20240102,p20240103,p20240104,p20240105,p20240106,p20240107,p20240108,p20240109,p20240110,p20240111,p20240112,p20240113,p20240114,p20240115,p20240116,p20240117,p20240118,p20240119,p20240120,p20240121,p20240122,p20240123,p20240124,p20240125,p20240126,p20240127,p20240128,p20240129,p20240130,p20240131,p20240201,p20240202,p20240203,p20240204,p20240205,p20240206,p20240207,p20240208,p20240209,p20240210,p20240211,p20240212,p20240213,p20240214,p20240215,p20240216,p20240217,p20240218,p20240219,p20240220,p20240221,p20240222,p20240223,p20240224,p20240225,p20240226,p20240227,p20240228,p20240229,p20240301,p20240302,p20240303,p20240304,p20240305,p20240306,p20240307,p20240308,p20240309,p20240310,p20240311,p20240312,p20240313,p20240314,p20240315,p20240316,p20240317,p20240318,p20240319,p20240320,p20240321,p20240322,p20240323,p20240324,p20240325,p20240326,p20240327,p20240328,p20240329,p20240330,p20240331,p20240401,p20240402,p20240403,p20240404,p20240405,p20240406,p20240407,p20240408,p20240409,p20240410,p20240411,p20240412,p20240413,p20240414,p20240415,p20240416,p20240417,p20240418,p20240419,p20240420,p20240421,p20240422,p20240423,p20240424,p20240425,p20240426,p20240427,p20240428,p20240429,p20240430,p20240501,p20240502,p20240503,p20240504,p20240505,p20240506,p20240507,p20240508,p20240509,p20240510,p20240511,p20240512,p20240513,p20240514,p20240515,p20240516,p20240517,p20240518,p20240519,p20240520,p20240521,p20240522,p20240523,p20240524,p20240525,p20240526,p20240527,p20240528,p20240529,p20240530,p20240531,p20240601);
执行成功!

四、总结

  • 通过使用 Shell 脚本,我们可以方便地生成 StarRocks 中 truncate table partition 的 SQL 语句,从而高效地管理分区表中的历史数据。
  • 6
    点赞
  • 10
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值