背景
前段时间在做一个功能,中间有个环节是需要将一个json自由文本字段从大宽表中抽取出来,转换成列,以便后续的计算和分析。
json文本格式如下:
{
"launch_time": "3;5;1",
"date": "2021-01-02",
"total_time": "72;34;0",
"pkg": "com.android.launcher;com.android.settings;com.google.android.setupwizard"
}
需要将其转换成:
launch_time | total_time | pkg |
3 | 72 | com.android.launcher |
5 | 34 | com.android.settings |
1 | 0 | com.google.android.setupwizard |
这个需求并不难,但需要对clickhouse的函数有比较灵活的应用,mark一下。
问题分解
1. visitParamExtractString(params, name)解析提取json
2. splitByChar(separator, s)按 ; 将提取出来的value值分割成数组
3. arrayZip(arr1, arr2, ..., arrN)将上一步的三个数组拉链成一个元组数组
4. arrayJoin 这个函数很特殊,可以将一个数组拆分成列
5. untuple(x) 将元组纵向切割成列
模拟构建
--创建测试表
create table test(t_id String, t_json String) ENGINE=TinyLog;
--插入测试数据
insert into test(t_id,t_json) values('T00001','{"launch_time":"3;5;1","date":"2021-01-02","total_time":"72;34;0","pkg":"com.android.launcher;com.android.settings;com.google.android.setupwizard"}'),('T00002','{"launch_time":"15;8;6;5;1;2;1;1;1","date":"1970-01-01","total_time":"1875;106;80;6;4;2;2;2;1","pkg":"com.google.android.setupwizard;com.android.settings;com.tcl.android.launcher;com.setupwizard;com.google.android.gms;com.phone;com.gdpr;com.faceunlock;com.iris"}'),('T00003','{"launch_time":"15;13;16;5;2;1;2;1;1","date":"1970-01-01","total_time":"1875;326;157;6;6;4;2;2;1","pkg":"com.google.android.setupwizard;com.android.launcher;com.android.settings;com.setupwizard;com.gdpr;com.google.android.gms;com.phone;com.faceunlock;com.iris"}')
--查询测试数据
select * from test;
1. value值提取
SELECT
visitParamExtractString(t_json, 'launch_time') AS launch_time,
visitParamExtractString(t_json, 'total_time') AS total_time,
visitParamExtractString(t_json, 'pkg') AS pkg
FROM test
2. split分割
SELECT
splitByChar(';', visitParamExtractString(t_json, 'launch_time')) AS launch_time,
splitByChar(';', visitParamExtractString(t_json, 'total_time')) AS total_time,
splitByChar(';', visitParamExtractString(t_json, 'pkg')) AS pkg
FROM test
3. 拉链操作
arrayZip函数是把多个数组中相同下标的元素打包成一个元组,最终形成一个元组数组。举个栗子:arrayZip([1,2,3],['A','B','C']) = [(1,'A'),(2,'B'),(3,'C')]
select arrayZip(
splitByChar(';',visitParamExtractString(t_json,'launch_time')) as launch_time,
splitByChar(';',visitParamExtractString(t_json,'total_time')) as total_time ,
splitByChar(';',visitParamExtractString(t_json,'pkg')) as pkg )
from test
4. 将元组数组拆成多行
arrayJoin函数可将数组拆成多行,SELECT arrayJoin(['A', 'B', 'C'])的结果如下,因此,当需要把行专成列的时候,常常会使用到这个函数。
┌─arrayJoin(['A', 'B', 'C'])───┐
│ A │
│ B │
│ C │
└──────────────────┘
SELECT t_id,arrayJoin(['A', 'B', 'C']) FROM test的结果为:
┌─t_id───┬─arrayJoin(['A', 'B', 'C'])─┐
│ T00001 │ A │
│ T00001 │ B │
│ T00001 │ C │
│ T00002 │ A │
│ T00002 │ B │
│ T00002 │ C │
│ T00003 │ A │
│ T00003 │ B │
│ T00003 │ C │
└─────┴─────────────────┘
乍一看会以为是t_id和数组中每个元素的迪卡尔积,实际上不是的(只有当数组是一个常量的时候才是迪卡尔积),clickhouse会对结果进行正确的匹配。
select t_id,arrayJoin(arrayZip(
splitByChar(';',visitParamExtractString(t_json,'launch_time')) as launch_time,
splitByChar(';',visitParamExtractString(t_json,'total_time')) as total_time ,
splitByChar(';',visitParamExtractString(t_json,'pkg')) as pkg )) as tuple
from test
5. 元组拆成列
select t_id,untuple(arrayJoin(arrayZip(
splitByChar(';',visitParamExtractString(t_json,'launch_time')) as launch_time,
splitByChar(';',visitParamExtractString(t_json,'total_time')) as total_time ,
splitByChar(';',visitParamExtractString(t_json,'pkg')) as pkg )) as tuple)
from test
(2022.01.06 修改: 不需要进行untuple,直接从tuple中取值)
select t_id,tuple.1 as launch_time,tuple.2 as total_time ,tuple.3 as pkg
from
(select t_id,arrayJoin(arrayZip(
splitByChar(';',visitParamExtractString(t_json,'launch_time')) ,
splitByChar(';',visitParamExtractString(t_json,'total_time')),
splitByChar(';',visitParamExtractString(t_json,'pkg')) )) as tuple
from test)
6. 考虑健壮性
由于arrayZip函数和untuple函数都要求集合中的元素个数要保持一致,因此更加严谨的写法:
select t_id,untuple(arrayJoin(arrayZip(
splitByChar(';',visitParamExtractString(t_json,'launch_time')) as launch_time,
splitByChar(';',visitParamExtractString(t_json,'total_time')) as total_time ,
splitByChar(';',visitParamExtractString(t_json,'pkg')) as pkg )) as tuple)
from test
where t_json is not null and t_json !=''
and countSubstrings(visitParamExtractString(t_json,'launch_time'),';') = countSubstrings(visitParamExtractString(t_json,'pkg'),';')
and countSubstrings(visitParamExtractString(t_json,'launch_time'),';') = countSubstrings(visitParamExtractString(t_json,'total_time'),';')
(2022.01.06 修改: 不需要进行untuple,直接从tuple中取值)
select t_id,
tuple.1 as launch_time,
tuple.2 as total_time ,
tuple.3 as pkg
from (
select t_id,arrayJoin(arrayZip(
splitByChar(';',visitParamExtractString(t_json,'launch_time')),
splitByChar(';',visitParamExtractString(t_json,'total_time')) ,
splitByChar(';',visitParamExtractString(t_json,'pkg')) )) as tuple
from test
where t_json is not null and t_json !=''
and countSubstrings(visitParamExtractString(t_json,'launch_time'),';') = countSubstrings(visitParamExtractString(t_json,'pkg'),';')
and countSubstrings(visitParamExtractString(t_json,'launch_time'),';') = countSubstrings(visitParamExtractString(t_json,'total_time'),';')
)
总结
其实难点不在于解析json,而是在于将json中的元素重新组合成列。And,,Clickhouse的内置函数实在是丰富而强大!