ClickHouse Data Dictionaries (Very Detailed)

Machine4869

Article link: https://blog.csdn.net/Machine4869/article/details/121381370

Data Dictionaries

Reference: 《ClickHouse原理解析与应用实践》, Chapter 5 and Section 8.4.2 of Chapter 8

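A data dictionary is a simple, practical storage medium provided by ClickHouse that defines data as a mapping from keys to attributes. Dictionary data is loaded into memory either proactively or on demand, and it can be updated dynamically. Because dictionary data stays resident in memory, it is well suited to holding constants or frequently used dimension-table data, which avoids unnecessary JOIN queries.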

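Dictionaries come in two forms: built-in and external (extended). Built-in dictionaries ship with ClickHouse by default; external dictionaries are defined by users through custom configuration.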

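Normally, dictionary data can only be accessed through dictionary functions. There is one exception: the special Dictionary table engine. With it, a dictionary can be mounted under a proxy table, which makes it possible to JOIN ordinary tables with dictionary data.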
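To make the key-to-attributes idea concrete, here is a minimal sketch in plain Python (not ClickHouse code) of how an in-memory dictionary can stand in for a JOIN against a small dimension table. The department codes mirror the organization.csv test data shown later in the article; the `facts` rows and the `enrich` helper are hypothetical names introduced only for illustration.

```python
# Dimension data kept resident in memory: key -> attributes.
org_dict = {
    1: {"code": "a0001"},
    2: {"code": "a0002"},
}

# Hypothetical fact rows that carry only the dimension key and a measure.
facts = [(1, 100), (2, 200), (3, 300)]


def enrich(rows, dictionary, attr, default=""):
    """Attach one attribute to each row by key lookup instead of a JOIN."""
    return [
        (key, value, dictionary.get(key, {}).get(attr, default))
        for key, value in rows
    ]


# Key 3 is absent from the dictionary, so it receives the default value.
enriched = enrich(facts, org_dict, "code")
```

Each lookup is a constant-time access to data already in memory, which is the reason dictionaries suit small, frequently used dimension tables.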

Built-in dictionary

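ClickHouse currently has only one built-in dictionary, the Yandex.Metrica dictionary. (The built-in dictionary only provides the definition mechanism and the lookup functions; you have to import the data yourself, following its format specification.)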

1. Enable the built-in dictionary

The built-in dictionary is disabled by default and must be enabled before it can be used.

Enable the path_to_regions_hierarchy_file and path_to_regions_names_files settings in config.xml:

<path_to_regions_hierarchy_file>/opt/geo/regions_hierarchy.txt</path_to_regions_hierarchy_file> 
<path_to_regions_names_files>/opt/geo/</path_to_regions_names_files>

2. Import the data

Copy the following test data files into the /opt/geo directory created earlier:

[machine@hadoop104 geo]$ pwd
/opt/geo
[machine@hadoop104 geo]$ ll
total 36
-rw-rw-r-- 1 machine machine 3096 Nov 12 18:04 regions_hierarchy_ru.txt
-rw-rw-r-- 1 machine machine 3096 Nov 12 18:04 regions_hierarchy.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_ar.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_by.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_en.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_kz.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_ru.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_tr.txt
-rw-rw-r-- 1 machine machine 3957 Nov 12 18:04 regions_names_ua.txt

3. Restart clickhouse-server

4. Access the dictionary data

hadoop104 :) SELECT regionToName(toUInt32(20009));

SELECT regionToName(toUInt32(20009))

Query id: abf6714b-843e-4741-807d-46e27b3cdf0e

┌─regionToName(toUInt32(20009))─┐
│ Buenos Aires Province         │
└───────────────────────────────┘

In ClickHouse, functions such as regionToName are called Yandex.Metrica functions.
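As a rough mental model of what a function like regionToName does, the Python sketch below loads a names file into a map and looks up a region ID. It assumes the names files are tab-separated with a region ID followed by a region name, which the article itself does not state; the sample row reuses the ID and name from the query output above, and `load_region_names` and `region_to_name` are hypothetical helpers, not ClickHouse APIs.

```python
# Assumed file format (not stated in the article): "<region id>\t<region name>".
SAMPLE = "20009\tBuenos Aires Province\n"


def load_region_names(text):
    """Parse tab-separated 'id<TAB>name' lines into an id -> name map."""
    names = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        region_id, name = line.split("\t", 1)
        names[int(region_id)] = name
    return names


def region_to_name(names, region_id):
    """Return the region name, or an empty string for an unknown id."""
    return names.get(region_id, "")


names = load_region_names(SAMPLE)
```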

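External dictionaries: preparing the data

External dictionaries are registered with ClickHouse as plug-ins; users define the data schema and the data source themselves. External dictionaries currently support 7 memory layouts and 4 kinds of data source. Compared with the built-in dictionary, whose content is very limited, external dictionaries are used far more often. The examples below rely on three test files: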

[machine@hadoop104 testdata]$ pwd
/opt/module/datas/testdata
[machine@hadoop104 testdata]$ ll
total 12
-rw-rw-r-- 1 machine machine 164 Nov 12 22:39 asn.csv
-rw-rw-r-- 1 machine machine 162 Nov 12 22:39 organization.csv
-rw-rw-r-- 1 machine machine 233 Nov 12 22:39 sales.csv

[machine@hadoop104 testdata]$ cat asn.csv 
"82.118.230.0/24","AS42831","GB"
"148.163.0.0/17","AS53755","US"
"178.93.0.0/18","AS6849","UA"
"200.69.95.0/24","AS262186","CO"
"154.9.160.0/20","AS174","US"

[machine@hadoop104 testdata]$ cat organization.csv 
1,"a0001","研发部"
2,"a0002","产品部"
3,"a0003","数据部"
4,"a0004","测试部"
5,"a0005","运维部"
6,"a0006","规划部"
7,"a0007","市场部"

[machine@hadoop104 testdata]$ cat sales.csv 
1,2016-01-01,2017-01-10,100
2,2016-05-01,2017-07-01,200
3,2014-03-05,2018-01-20,300
4,2018-08-01,2019-10-01,400
5,2017-03-01,2017-06-01,500
6,2017-04-09,2018-05-30,600
7,2018-06-01,2019-01-25,700
8,2019-08-01,2019-12-12,800

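Elements of an external dictionary configuration file

The configuration files for external dictionaries are specified by the dictionaries_config setting in config.xml: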

<dictionaries_config>*_dictionary.xml</dictionaries_config>

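By default, ClickHouse automatically detects and loads every configuration file in the /etc/clickhouse-server directory whose name ends in _dictionary.xml. ClickHouse also notices changes to the configuration files in this directory, so they can be updated online without restarting the server.

The complete configuration structure is as follows: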
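The naming rule can be checked with a small Python sketch. It only illustrates which file names the `*_dictionary.xml` pattern would select, using ordinary shell-style matching from Python's `fnmatch`; how ClickHouse itself expands the pattern is not shown here, and the candidate file names other than the two test files from this article are hypothetical.

```python
from fnmatch import fnmatch

# Value of dictionaries_config as given in the text.
PATTERN = "*_dictionary.xml"

# Hypothetical contents of /etc/clickhouse-server.
candidates = [
    "test_flat_dictionary.xml",
    "test_hashed_dictionary.xml",
    "config.xml",
    "users.xml",
    "dictionary_backup.xml",
]

# Only names that end in "_dictionary.xml" match the pattern.
picked_up = [name for name in candidates if fnmatch(name, PATTERN)]
```

A file such as dictionary_backup.xml is skipped because the suffix, not the mere presence of the word, is what the pattern tests.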

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>dict_name</name>
        <structure>
            <!-- data structure of the dictionary -->
        </structure>
        <layout>
            <!-- in-memory layout type -->
        </layout>
        <source>
            <!-- data source configuration -->
        </source>
        <lifetime>
            <!-- automatic update interval of the dictionary -->
        </lifetime>
    </dictionary>
</dictionaries>
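Before dropping a file into the configuration directory, it can help to check that each dictionary carries the five elements listed in the skeleton. The Python sketch below does that with the standard library XML parser. It is a hypothetical lint helper rather than anything ClickHouse provides, and `EXPECTED` and `missing_sections` are names introduced here.

```python
import xml.etree.ElementTree as ET

# A skeleton with the same five child elements as the structure above.
CONFIG = """<dictionaries>
  <dictionary>
    <name>dict_name</name>
    <structure/>
    <layout/>
    <source/>
    <lifetime/>
  </dictionary>
</dictionaries>"""

# Elements the skeleton lists for each <dictionary>.
EXPECTED = ("name", "structure", "layout", "source", "lifetime")


def missing_sections(xml_text):
    """Map each dictionary name to the expected child elements it lacks."""
    root = ET.fromstring(xml_text)
    report = {}
    for node in root.findall("dictionary"):
        name = (node.findtext("name") or "").strip()
        report[name] = [tag for tag in EXPECTED if node.find(tag) is None]
    return report


report = missing_sections(CONFIG)
```

An empty list for a dictionary means all five elements are present; the check says nothing about whether their contents are valid.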

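Configuring the 7 external dictionary types

The type of an external dictionary is defined by the layout element; there are currently 7 types. A dictionary's type determines both how its data is stored in memory and which key types it supports. By key type, they fall into two groups:

Single numeric key (flat, hashed, range_hashed, and cache)
Composite key (complex_key_hashed, complex_key_cache, and ip_trie)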

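1. flat

flat is the fastest of all the dictionary types, and it only accepts numeric keys of type UInt64. As the name suggests, a flat dictionary stores its data in memory as an array. The array starts at 1,024 elements and is capped at 500,000, which means it can hold at most 500,000 rows. If the data exceeds this limit when the dictionary is created, creation fails.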
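The array-based storage can be pictured with a toy Python class. This is a sketch of the idea only, not ClickHouse's implementation: it assumes the key is used directly as the array index (which is why bounding the array also bounds the row count) and that the array grows by doubling, neither of which the text spells out. `FlatDict` is a hypothetical name.

```python
INITIAL_SIZE = 1024   # initial array size stated in the text
MAX_SIZE = 500_000    # upper bound stated in the text


class FlatDict:
    """Toy flat layout: the numeric key is the array index."""

    def __init__(self):
        self.slots = [None] * INITIAL_SIZE

    def put(self, key, attrs):
        if not 0 <= key < MAX_SIZE:
            # Mirrors "creation fails when the limit is exceeded".
            raise ValueError(f"key {key} exceeds the flat layout limit")
        if key >= len(self.slots):
            # Growth by doubling is an assumption; the cap comes from the text.
            new_size = min(max(key + 1, len(self.slots) * 2), MAX_SIZE)
            self.slots.extend([None] * (new_size - len(self.slots)))
        self.slots[key] = attrs

    def get(self, key, default=None):
        if 0 <= key < len(self.slots) and self.slots[key] is not None:
            return self.slots[key]
        return default


flat = FlatDict()
flat.put(1, {"code": "a0001"})
```

A lookup is a single index operation, which is where the layout gets its speed; the price is that sparse or very large keys waste or exceed the array.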

/etc/clickhouse-server/test_flat_dictionary.xml

<?xml version="1.0"?>
<dictionaries>
       <dictionary>
              <name>test_flat_dict</name>
              <source>
                     <!-- the prepared test data -->
                     <file>
                            <path>/opt/module/datas/testdata/organization.csv</path>
                            <format>CSV</format>
                     </file>
              </source>
              <layout>
                     <flat/>
              </layout>
              <!-- matches the structure of the test data -->
              <structure>
                     <id>
                            <name>id</name>
                     </id>
                     <attribute>
                            <name>code</name>
                            <type>String</type>
                            <null_value/>
                     </attribute>
                     <attribute>
                            <name>name</name>
                            <type>String</type>
                            <null_value/>
                     </attribute>
              </structure>
              <lifetime>
                     <min>300</min>
                     <max>360</max>
              </lifetime>
       </dictionary>
</dictionaries>
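To see how the structure element lines up with the CSV source, the Python sketch below maps the first column to the key and the remaining columns to the attributes in the order they are declared. It is an illustration under that assumption, not ClickHouse code; the two sample rows come from organization.csv, the empty-string default stands in for the empty null_value elements, and `load_rows` and `lookup` are hypothetical helpers.

```python
import csv
import io

# First two rows of organization.csv from the test data.
SAMPLE = '1,"a0001","研发部"\n2,"a0002","产品部"\n'

# <attribute> names in the order they appear in <structure>.
ATTRIBUTES = ["code", "name"]


def load_rows(text, attributes):
    """Map column 0 to the numeric key and the rest to named attributes."""
    table = {}
    for row in csv.reader(io.StringIO(text)):
        key = int(row[0])  # the <id> column
        table[key] = dict(zip(attributes, row[1:]))
    return table


def lookup(table, key, attr, null_value=""):
    """Return the attribute, or null_value when the key is missing."""
    return table.get(key, {}).get(attr, null_value)


loaded = load_rows(SAMPLE, ATTRIBUTES)
```

With this mapping, `lookup(loaded, 1, "code")` yields the code of department 1, and a key that is not in the file falls back to the null value.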

Querying the system.dictionaries system table shows that the flat dictionary has been created successfully:

hadoop104 :) SELECT name, type, key, attribute.names, attribute.types FROM system.dictionaries;

SELECT
    name,
    type,
    key,
    attribute.names,
    attribute.types
FROM system.dictionaries

Query id: 24e3570b-cc74-486a-bf1c-529900ad6e7e

┌─name───────────┬─type─┬─key────┬─attribute.names─┬─attribute.types─────┐
│ test_flat_dict │      │ UInt64 │ ['code','name'] │ ['String','String'] │
└────────────────┴──────┴────────┴─────────────────┴─────────────────────┘

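2. hashed

A hashed dictionary likewise only accepts numeric keys of type UInt64. Unlike flat, however, it keeps its data in memory in a hash table and has no upper limit on the amount of data it can store.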
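The contrast with flat can be sketched in a few lines of Python. This toy class is only an illustration, not ClickHouse's implementation: it uses Python's built-in hash table, and `HashedDict` is a hypothetical name.

```python
class HashedDict:
    """Toy hashed layout: entries live in a hash table keyed by a UInt64 id."""

    UINT64_MAX = 2**64 - 1

    def __init__(self):
        self.table = {}

    def put(self, key, attrs):
        if not 0 <= key <= self.UINT64_MAX:
            raise ValueError("key must fit in UInt64")
        self.table[key] = attrs

    def get(self, key, default=None):
        return self.table.get(key, default)


hashed = HashedDict()
hashed.put(7, {"code": "a0007"})
# A sparse key far beyond the flat limit of 500,000 is stored without issue.
hashed.put(10**12, {"code": "sparse"})
```

Only the entries that exist occupy memory, so large or sparse keys are fine; each lookup pays for hashing instead of a direct array index.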

test_hashed_dictionary.xml

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>test_hashed_dict</name>
        <source>
            <file>
