Clickhouse--数组函数

Abner G

已于 2022-01-24 16:02:00 修改

阅读量5.4k

点赞数 2

分类专栏： Clickhouse 文章标签： mysql 数据库 sql

于 2022-01-22 23:30:56 首次发布

本文链接：https://blog.csdn.net/Abner_G/article/details/122638009

版权

Clickhouse 专栏收录该内容

20 篇文章 2 订阅

订阅专栏

Clickhouse--数组函数

☀️前言
⭐️ SQL冗余诟病
✨ arrayJoin
🍍arrayEnumerate
🍋arrayEnumerateDense
- 🔎 总结
🍑arrayEnumerateUniq
- 🎉 理解
🏆总结
👍实际开发使用

☀️前言

日常工作可能会遇到数据库一对多的字段，被用逗号或者其他特殊字符分割，存在一起。类似：
在这里插入图片描述
如果要查询‘信息技术’标签的数据，SQL会很麻烦。下文做具体实现。

⭐️ SQL冗余诟病

以Mysql为例，前言中表。需要查询标签名为"信息技术","现代服务"的数据。这样SQL 大家差不多会使用Like Or语句。具体实现如下

-- Mysql Like Or语句
select
 industry_label_str,id
-- COUNT(*)
from company_info where
  industry_label_str like '%,现代服务,%'
  or industry_label_str like '现代服务,%'
  or industry_label_str like '%,现代服务'
  or industry_label_str = '现代服务'
  or industry_label_str like '%,信息技术,%'
  or industry_label_str like '信息技术,%'
  or industry_label_str like '%,信息技术'
  or industry_label_str = '信息技术';

在这里插入图片描述
虽然完成查询需求，这样的SQL是不是会感觉有点不舒服。
对于上面的数据如果存储在Clickhouse中，可以使用Clickhouse中数组类型的存储。如下:

✨ arrayJoin

官方解释

普通函数不会更改结果集的行数，而只是计算每行中的值（map）。
聚合函数将多行压缩到一行中（fold或reduce）。
’arrayJoin’函数获取每一行并将他们展开到多行（unfold）。
— 来自官方文档
对于上文中的需求，在clickhouse 中使用arrayJoin函数就相对简单了。

select
id,industry_label_ary,
industry_label_str,
arrayJoin(industry_label_ary)
from company_info 
where  arrayJoin(industry_label_ary) in ('现代服务','信息技术');

在这里插入图片描述
会发现数组已经被拆分成多行，可以使用in语句来完成。但是唯一ID发生了重复，并且数据量也不对。

排查之后发现是同时有"信息技术",“现代服务” 的数据的影响。为什么呢？

官方文档有这么一句话
“此函数将数组作为参数，并将该行在结果集中复制数组元素个数。
除了应用此函数的列中的值之外，简单地复制列中的所有值;它被替换为相应的数组值”。
可以理解为，按数据元素个数，将数据分为了多条数据

所以在上述SQL中查到了重复数据。

如果使用arrayJoin ，记得要是用唯一字段去重一下。

修改之后为：

select distinct id,
                industry_label_ary,
                industry_label_str
from company_info
where arrayJoin(industry_label_ary) in ('现代服务', '信息技术');

在这里插入图片描述

🍍arrayEnumerate

🎉 理解

取数组下标
结合排序，可实现开窗函数ROW_NUMBER 开窗函数

建表语句

CREATE TABLE `school_score`
(
    `id`     Int8,
    `name`   Nullable(String),
   `course` Nullable(String),
   `score`  Nullable(Int8)
) engine =MergeTree()
order by (id);

INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (1, 'A', 'Chinese', 80);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (10, 'D', 'Chinese', 80);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (2, 'B', 'Chinese', 90);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (3, 'C', 'Chinese', 70);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (4, 'A', 'Math', 70);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (5, 'B', 'Math', 100);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (6, 'C', 'Math', 80);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (7, 'A', 'English', 90);
INSERT INTO school_score(`id`, `name`, `course`, `score`)
VALUES (8, 'B', 'English', 85);
INSERT INTO school_score (`id`, `name`, `course`, `score`)
VALUES (9, 'C', 'English', 99);

select
       --拆分数组
       course,
       score_array,
       score,
       score_sort
       from (
   -- 分组，转变数组获取排序          
   select course,
       groupArray(score) as score_array,
       arrayEnumerate(score_array) as score_sort
from (
      -- 成绩排序
     select * from school_score order by score desc
         ) group by course
                  )

array join score_array as score,
score_sort;

在这里插入图片描述

🔎 总结

实现了开窗函数语法 “ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )” ，但是有局限，无法和其他字段关联。成绩无法和考生关联起来。实际中更多获取下标

🍋arrayEnumerateDense

select
       --拆分数组
       course,
       score_array,
       score,
       score_sort
       from (
   -- 分组，转变数组获取排序          
   select course,
       groupArray(score) as score_array,
       arrayEnumerateDense(score_array) as score_sort
from (
      -- 成绩排序
     select * from school_score order by score desc
         ) group by course
                  )

array join score_array as score,
score_sort;

在这里插入图片描述

🔎 总结

实现了开窗函数语法 “DENSE_RANK() OVER( PARTITION BY id ORDER BY val )” ，但是有局限，无法和其他字段关联。成绩无法和考生关联起来。

🍑arrayEnumerateUniq

🎉 理解

获取每个元素在数组出现的次数

select course,
       groupArray(score) as score_array,
       arrayEnumerateUniq(score_array) as score_sort
from (
      -- 成绩排序
     select * from school_score order by score desc
         ) group by course

在这里插入图片描述

🏆总结

Clickhouse数组函数可以实现开窗函数语法，但是局限性比较大

👍实际开发使用

🎉 建表语句

drop table if exists tb_newhouse;
create table tb_newhouse(
name String,
cdata Date,
newhouse_count Float64
)engine=ReplacingMergeTree()
order by (name,cdata);

🧀导入数据

使用客户端导入文件的形式导入数据

clickhouse-client -q 'insert into tb_newhouse format CSV' < newhouse.csv

🍖 应用【取连续日期最长的日期】

在这里插入图片描述

分组变数组

select name,groupArray(cdata) from tb_newhouse group by name;

在这里插入图片描述

取下标，为了计算是否连续

select name,groupArray(cdata) as cdataArray,arrayEnumerate(cdataArray) from tb_newhouse group by name;

在这里插入图片描述

拆开数组

select name,CDATA,CDATAIndex  from
(select name,groupArray(cdata) as cdataArray,arrayEnumerate(cdataArray) as cdataArrayIndex from tb_newhouse group by name)
array join cdataArray AS CDATA,
cdataArrayIndex as CDATAIndex;

在这里插入图片描述

计算是否连续（连续-连续=相等）

-- 取是否连续
select name,CDATA,CDATAIndex,(CDATA-CDATAIndex) as diff  from
(select name,groupArray(cdata) as cdataArray,arrayEnumerate(cdataArray) as cdataArrayIndex from tb_newhouse group by name)
array join cdataArray AS CDATA,
cdataArrayIndex as CDATAIndex;

取开始日期，连续的天数

-- 取开始日期，连续的天数
select name,count(1),diff from
(select name,CDATA,CDATAIndex,(CDATA-CDATAIndex) as diff  from
(select name,groupArray(cdata) as cdataArray,arrayEnumerate(cdataArray) as cdataArrayIndex from tb_newhouse group by name)
array join cdataArray AS CDATA,
cdataArrayIndex as CDATAIndex)
group by name,diff;

在这里插入图片描述

总天数倒序，按城市分组，取每个城市的连续天数最大的一天

--  总天数倒序，按城市分组，取每个城市的连续天数最大的一天
select * from (
select name,count(1) as sort,diff from
(select name,CDATA,CDATAIndex,(CDATA-CDATAIndex) as diff  from
(select name,groupArray(cdata) as cdataArray,arrayEnumerate(cdataArray) as cdataArrayIndex from tb_newhouse group by name)
array join cdataArray AS CDATA,
cdataArrayIndex as CDATAIndex)
group by name,diff
                  )
order by sort desc
limit 1 by name;

在这里插入图片描述

Abner G

关注

2
点赞
踩
7

收藏

觉得还不错? 一键收藏
打赏
2
评论
Clickhouse--数组函数

Clickhouse--数组函数☀️前言⭐️ SQL冗余诟病✨ arrayJoin????arrayEnumerate???? 理解建表语句???? 总结????arrayEnumerateDense???? 总结????arrayEnumerateUniq???? 理解????总结????实际开发使用☀️前言日常工作可能会遇到数据库一对多的字段，被用逗号或者其他特殊字符分割，存在一起。类似：如果要查询‘信息技术’标签的数据，SQL会很麻烦。下文做具体实现。⭐️ SQL冗余诟病以Mysql为
复制链接

扫一扫