Broker Load is an asynchronous loading method based on the MySQL protocol. After you submit a load job, StarRocks executes it asynchronously. You can check the result of a Broker Load job with SELECT * FROM information_schema.loads, which is supported since v3.1.
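For example, a job's state can be looked up by label; this is a minimal sketch, assuming the JOB_ID, STATE, PROGRESS, ERROR_MSG, and TRACKING_SQL columns exposed by information_schema.loads:

```sql
-- Look up the most recent load job with a given label
-- (replace 'broker_csv_test01' with your own label)
SELECT JOB_ID, LABEL, STATE, PROGRESS, ERROR_MSG, TRACKING_SQL
FROM information_schema.loads
WHERE LABEL = 'broker_csv_test01'
ORDER BY JOB_ID DESC
LIMIT 1;
```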
Broker Load syntax
LOAD LABEL [<database_name>.]<label_name>
(
data_desc[, data_desc ...]
)
WITH BROKER
(
StorageCredentialParams
)
[PROPERTIES
(
opt_properties
)
]
label_name: the label of the load job
data_desc: describes a batch of data to be loaded. Each data_desc declares the source data address, ETL functions, target StarRocks table, and partitions for that batch. Broker Load supports loading multiple data files in a single job, and it guarantees the atomicity of a single load transaction: either all data files in the job are loaded successfully, or all of them fail.
data_desc syntax
DATA INFILE ("<file_path>"[, "<file_path>" ...])
[NEGATIVE]
INTO TABLE <table_name>
[PARTITION (<partition1_name>[, <partition2_name> ...])]
[TEMPORARY PARTITION (<temporary_partition1_name>[, <temporary_partition2_name> ...])]
[COLUMNS TERMINATED BY "<column_separator>"]
[ROWS TERMINATED BY "<row_separator>"]
[FORMAT AS "CSV | Parquet | ORC"]
[(format_type_options)]
[(column_list)]
[COLUMNS FROM PATH AS (<partition_field_name>[, <partition_field_name> ...])]
[SET <k1=f1(v1)>[, <k2=f2(v2)> ...]]
[WHERE predicate]
Parameters
file_path: the path of the source data file(s); wildcards can be used to load all data files under a path
INTO TABLE: the name of the target StarRocks table
NEGATIVE: revokes a batch of data that was already loaded successfully; applicable only when the target StarRocks table is an Aggregate table and all of its value columns use the sum aggregate function
PARTITION: the partitions to load data into
TEMPORARY PARTITION: the temporary partitions to load data into
COLUMNS TERMINATED BY: the column separator used in the source data file
ROWS TERMINATED BY: the row separator used in the source data file
FORMAT AS: the format of the source data file; valid values are CSV, Parquet, and ORC
format_type_options: CSV format options, used when the file format is CSV. The available parameters are:
skip_header: the number of rows to skip at the beginning of the CSV file
trim_space: whether to strip spaces before and after column separators in the CSV file
enclose: the character used to wrap field values in the CSV file
escape: the character used for escaping
column_list: the column mapping between the source data file and the StarRocks table
COLUMNS FROM PATH AS: extracts the values of one or more partition fields from the specified file path
SET: transforms a column of the source data file with the specified function and loads the transformed result into the StarRocks table
WHERE: filters the transformed data with the specified predicate
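Putting several of these clauses together, a data_desc could look like the sketch below; the path, table, and column names are made up for illustration:

```sql
DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/root/starrocks/*.csv")  -- wildcard: all CSV files under the path
INTO TABLE example_table
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(id, raw_city, age)                 -- column mapping between file and table
SET (city = upper(raw_city))        -- transform a source column before loading
WHERE age > 18                      -- keep only rows that pass the predicate
```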
WITH BROKER
The authentication configuration that StarRocks uses to access the storage system.
Taking HDFS as an example, HDFS supports two authentication methods: simple authentication and Kerberos authentication.
Simple authentication:
"hadoop.security.authentication" = "simple",
"username" = "<hdfs_username>",
"password" = "<hdfs_password>"
Kerberos authentication (kerberos_keytab and kerberos_keytab_content are two ways to supply the same keytab, so normally only one of them is set):
"hadoop.security.authentication" = "kerberos",
"kerberos_principal" = "nn/zelda1@ZELDA.COM",
"kerberos_keytab" = "/keytab/hive.keytab",
"kerberos_keytab_content" = "YWFhYWFh"
opt_properties
Optional parameters for the load job; the settings apply to the entire job.
Syntax:
PROPERTIES ("<key1>" = "<value1>"[, "<key2>" = "<value2>" ...])
Parameters
timeout: the timeout of the load job
max_filter_ratio: the maximum ratio of rows that can be filtered out due to inadequate data quality
log_rejected_record_num: the maximum number of rows filtered out for inadequate data quality that can be logged
load_mem_limit: the memory limit of the load job
strict_mode: whether to enable strict mode
timezone: the time zone used by the load job
priority: the priority of the load job
merge_condition: the name of the column used as the condition that determines whether an update takes effect
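As a sketch, a PROPERTIES clause that combines several of these options (all values are illustrative):

```sql
PROPERTIES
(
    "timeout" = "7200",           -- cancel the job after 7200 seconds
    "max_filter_ratio" = "0.1",   -- tolerate up to 10% unqualified rows
    "strict_mode" = "true",
    "timezone" = "Asia/Shanghai",
    "priority" = "HIGH"
)
```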
In addition, StarRocks supports loading JSON-format data since v3.2.3. The JSON-related parameters are:
jsonpaths: the names of the fields to load
strip_outer_array: whether to strip the outermost array structure
json_root: the root element of the JSON data to load
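To sketch how these three options work together: for a made-up file whose content is {"data": [{"a": 1, "b": 2}]}, the settings below first descend to the array under data, treat each array element as one row, and then pick fields a and b out of each element:

```sql
PROPERTIES
(
    "json_root" = "$.data",              -- root element: the array under "data"
    "strip_outer_array" = "true",        -- one row per array element
    "jsonpaths" = "[\"$.a\", \"$.b\"]"   -- fields to map to table columns
)
```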
Examples
Load a CSV file from HDFS into StarRocks
Prepare a CSV file:
"10001","18","张三丰","安徽省合肥市"
"10002","20","王凤山","四川省广元市"
"10003","22","周岐山","山西省运城市"
"10006","17","李诗敏","浙江省温州市"
"10010","19","刘德伟","福建省三明市"
"10100","29","xlz","xx省yy市"
Upload the CSV file to HDFS:
hdfs dfs -put broker_test_csv01.csv /user/root/starrocks
Check whether the upload succeeded:
hdfs dfs -ls /user/root/starrocks
Success:
-rw-r--r-- 1 root supergroup 262 2024-04-29 02:32 /user/root/starrocks/broker_test_csv01.csv
Create the table in StarRocks:
create table broker_csv_test01 (
id STRING comment "id",
name STRING comment "name",
age INT comment "age",
addr STRING comment "address"
) PROPERTIES (
"replication_num" = "1"
);
Load with Broker Load
First, try the following Broker Load statement:
LOAD LABEL sr_test.broker_csv_test01
(
DATA INFILE("hdfs://192.168.219.102:9000/user/root/starrocks/broker_test_csv01.csv")
INTO TABLE broker_csv_test01
)
WITH BROKER
(
"username" = "root",
"password" = "123456"
)
PROPERTIES
(
"timeout" = "3600"
);
Then check the job with SELECT * FROM information_schema.loads
and see that it failed. The job status alone does not show what went wrong, so download the data behind TRACKING_URL to take a look:
wget -O xx http://192.168.219.102:8040/api/_load_error_log?file=error_log_5f8610afb97d46f2_b23098b16b94540c
This shows the detailed error message.
You can also inspect the problem by running TRACKING_SQL:
select tracking_log from information_schema.load_tracking_logs where job_id=12144
Fix the load statement:
LOAD LABEL sr_test.broker_csv_test02 --label
(
DATA INFILE("hdfs://192.168.219.102:9000/user/root/starrocks/broker_test_csv01.csv") --file path
INTO TABLE broker_csv_test01 --target table
COLUMNS TERMINATED BY "," --column separator
ROWS TERMINATED BY "\n" --row separator
FORMAT AS CSV --data format
(
skip_header = 0
enclose = "\""
)
(id,age,name,addr)
)
WITH BROKER
(
"username" = "root",
"password" = "123456"
)
PROPERTIES
(
"timeout" = "3600"
);
Check the result:
OK, everything looks good.
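A quick sanity check can be done with an ordinary query; assuming the table created above:

```sql
-- Verify the loaded rows in the target table
SELECT * FROM sr_test.broker_csv_test01 ORDER BY id;
```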
Now try a JSON file
Create the table first:
create table sr_test.broker_json_test01 (
category STRING comment "category",
author STRING comment "author",
title STRING comment "title",
price INT comment "price",
timestamp INT comment "timestamp"
) PROPERTIES (
"replication_num" = "1"
);
Write the JSON file and upload it to HDFS:
{
"id": 10001,
"RECORDS":[
{"category":"11","title":"SayingsoftheCentury","price":895,"timestamp":1589191587},
{"category":"22","author":"2avc","price":895,"timestamp":1589191487},
{"category":"33","author":"3avc","title":"SayingsoftheCentury","timestamp":1589191387}
],
"comments": ["3 records", "there will be 3 rows"]
}
hdfs dfs -put broker_test_json01.json /user/root/starrocks
Load it into StarRocks:
LOAD LABEL sr_test.label02
(
DATA INFILE("hdfs://192.168.219.102:9000/user/root/starrocks/broker_test_json01.json")
INTO TABLE broker_json_test01
FORMAT AS "json"
(category, price, author,title,timestamp)
)
WITH BROKER
(
"username" = "root",
"password" = "123456"
)
PROPERTIES
(
"json_root"="$.RECORDS",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.category\",\"$.price\",\"$.author\",\"$.title\",\"$.timestamp\"]"
);
Query the table:
No problems.
Broker Load also supports loading into multiple tables in a single load job. Let's try loading both the CSV and JSON files above in one job:
LOAD LABEL sr_test.label03
(
DATA INFILE("hdfs://192.168.219.102:9000/user/root/starrocks/broker_test_json01.json")
INTO TABLE broker_json_test01
FORMAT AS "json"
(category, price, author,title,timestamp),
DATA INFILE("hdfs://192.168.219.102:9000/user/root/starrocks/broker_test_csv01.csv")
INTO TABLE broker_csv_test01
COLUMNS TERMINATED BY ","
ROWS TERMINATED BY "\n"
FORMAT AS CSV
(
skip_header = 0
enclose = "\""
)
(id,age,name,addr)
)
WITH BROKER
(
"username" = "root",
"password" = "123456"
)
PROPERTIES
(
"json_root"="$.RECORDS",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.category\",\"$.price\",\"$.author\",\"$.title\",\"$.timestamp\"]"
);
Query the results:
No problems.