ClickHouse HDFS Engine (with Kerberos Support)

Author: 普普通通程序猿

Translated from the official documentation: https://clickhouse.tech/docs/en/engines/table-engines/integrations/hdfs/

The HDFS engine lets ClickHouse manage data stored on HDFS, which integrates ClickHouse with the Apache Hadoop ecosystem. It is similar to the File and URL engines but adds Hadoop-specific features.

Usage

ENGINE = HDFS(URI, format)

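The URI parameter is the full URI of the file in HDFS. The path part may contain wildcards, in which case the table is read-only.
The format parameter specifies one of the available file formats. To run SELECT queries, the format must be supported for input; to run INSERT queries, it must be supported for output.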

Example

Create a table named hdfs_engine_table that uses the HDFS engine:

CREATE TABLE hdfs_engine_table (name String, value UInt32) ENGINE=HDFS('hdfs://hdfs1:9000/other_storage', 'TSV')

Insert data:

INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2), ('three', 3)

Query the data:

SELECT * FROM hdfs_engine_table LIMIT 2

┌─name─┬─value─┐
│ one  │     1 │
│ two  │     2 │
└──────┴───────┘

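Implementation details

Reads and writes can run in parallel.
Zero-copy replication is supported.
Not supported:
- ALTER and SELECT...SAMPLE operations
- Indexes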

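Wildcards

Multiple path components may contain wildcards. To be processed, a file must exist and match the whole path pattern. The list of files is determined when SELECT runs, not when the table is created.

* - matches any number of any characters except /
? - matches any single character
{some_string,another_string,yet_another_one} - matches any one of the strings some_string, another_string, yet_another_one
{N..M} - matches any number in the range from N to M, both bounds included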

Example

Suppose there are several TSV files on HDFS with the following URIs:

'hdfs://hdfs1:9000/some_dir/some_file_1'
'hdfs://hdfs1:9000/some_dir/some_file_2'
'hdfs://hdfs1:9000/some_dir/some_file_3'
'hdfs://hdfs1:9000/another_dir/some_file_1'
'hdfs://hdfs1:9000/another_dir/some_file_2'
'hdfs://hdfs1:9000/another_dir/some_file_3'

There are several ways to create a table that covers all six files.
Method 1:

CREATE TABLE table_with_range (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV')

Method 2:

CREATE TABLE table_with_question_mark (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_?', 'TSV')

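To match every file in both directories, use the following (all files must then conform to the format and schema declared in the query):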

CREATE TABLE table_with_asterisk (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV')
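
To see why all three patterns select the same six files, the wildcard rules can be emulated outside ClickHouse. The following is a minimal Python sketch, not ClickHouse's own matcher: the helper names glob_to_regex and matching are hypothetical, and it assumes that ? (like *) does not match a / character.

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Translate the wildcard syntax described above into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            out.append("[^/]*")              # any run of characters except '/'
        elif ch == "?":
            out.append("[^/]")               # one character (assumed not to be '/')
        elif ch == "{":
            end = pattern.index("}", i)      # closing brace of this expression
            body = pattern[i + 1:end]
            numeric = re.fullmatch(r"(\d+)\.\.(\d+)", body)
            if numeric:                      # {N..M}: inclusive numeric range
                lo, hi = int(numeric.group(1)), int(numeric.group(2))
                alternatives = [str(n) for n in range(lo, hi + 1)]
            else:                            # {a,b,c}: one of the listed strings
                alternatives = [re.escape(s) for s in body.split(",")]
            out.append("(?:" + "|".join(alternatives) + ")")
            i = end
        else:
            out.append(re.escape(ch))        # literal character
        i += 1
    return "".join(out)

def matching(pattern, paths):
    """Return the paths that match the whole pattern."""
    rx = re.compile(glob_to_regex(pattern))
    return [p for p in paths if rx.fullmatch(p)]

FILES = [
    "hdfs://hdfs1:9000/some_dir/some_file_1",
    "hdfs://hdfs1:9000/some_dir/some_file_2",
    "hdfs://hdfs1:9000/some_dir/some_file_3",
    "hdfs://hdfs1:9000/another_dir/some_file_1",
    "hdfs://hdfs1:9000/another_dir/some_file_2",
    "hdfs://hdfs1:9000/another_dir/some_file_3",
]

PATTERNS = {
    "range": "hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}",
    "question_mark": "hdfs://hdfs1:9000/{some,another}_dir/some_file_?",
    "asterisk": "hdfs://hdfs1:9000/{some,another}_dir/*",
}

for name, pattern in PATTERNS.items():
    print(name, len(matching(pattern, FILES)))
```

Each pattern selects all six paths here. The patterns differ only in what else they would accept: the asterisk form also picks up any other file placed directly in either directory, while the range form ignores a file such as some_file_4.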

Note: if the file names contain number ranges with leading zeros, use a separate {} expression for each digit, or use ?.

Example
Create a table over files named file000, file001, ..., file999:

CREATE TABLE big_table (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV')
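
The reason for one brace expression per digit can be illustrated by enumerating the names each form stands for. This is a small Python sketch under one stated assumption: a single range such as file{0..999} is taken to produce plain, unpadded numbers, which is what the note above implies.

```python
from itertools import product

digits = [str(d) for d in range(10)]   # the values one {0..9} expression stands for

# file{0..9}{0..9}{0..9}: one expression per digit, so leading zeros are kept.
per_digit = ["file" + "".join(combo) for combo in product(digits, repeat=3)]

# file{0..999}: a single numeric range, assumed here to yield unpadded numbers.
single_range = ["file" + str(n) for n in range(0, 1000)]

print(per_digit[:3], per_digit[-1])
print(single_range[:3], single_range[-1])
print("file007" in per_digit, "file007" in single_range)
```

Both lists contain 1000 names, but only the per-digit form contains zero-padded names such as file007.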

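Configuration

Similar to GraphiteMergeTree, the HDFS engine supports extended configuration through the ClickHouse configuration file. Two configuration keys are available: a global one (hdfs) and a user-level one (hdfs_*). The global configuration is applied first, and the user-level configuration is applied after it (if it exists).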

  <!-- Global configuration options for HDFS engine type -->
  <hdfs>
    <hadoop_kerberos_keytab>/tmp/keytab/clickhouse.keytab</hadoop_kerberos_keytab>
    <hadoop_kerberos_principal>clickuser@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
    <hadoop_security_authentication>kerberos</hadoop_security_authentication>
  </hdfs>

  <!-- Configuration specific for user "root" -->
  <hdfs_root>
    <hadoop_kerberos_principal>root@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
  </hdfs_root>
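
The load order (global first, then user-level) means a user-level section overrides only the keys it sets and inherits the rest. The Python sketch below illustrates that layering on the fragment above. It is an illustration of the described order, not ClickHouse's implementation; the <config> wrapper element, the effective_settings helper, and the user name alice are hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, wrapped in a hypothetical <config> root so it parses.
CONFIG = """
<config>
  <hdfs>
    <hadoop_kerberos_keytab>/tmp/keytab/clickhouse.keytab</hadoop_kerberos_keytab>
    <hadoop_kerberos_principal>clickuser@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
    <hadoop_security_authentication>kerberos</hadoop_security_authentication>
  </hdfs>
  <hdfs_root>
    <hadoop_kerberos_principal>root@TEST.CLICKHOUSE.TECH</hadoop_kerberos_principal>
  </hdfs_root>
</config>
"""

def effective_settings(root, user):
    """Apply the global <hdfs> section, then the <hdfs_USER> section if present."""
    settings = {}
    for section in ("hdfs", f"hdfs_{user}"):
        node = root.find(section)
        if node is None:
            continue                      # no such section: nothing to apply
        for option in node:
            settings[option.tag] = (option.text or "").strip()
    return settings

root = ET.fromstring(CONFIG)
for user in ("root", "alice"):
    print(user, effective_settings(root, user))
```

With this layering, user root ends up with its own principal but still uses the keytab and authentication mode from the global section, while a user without a dedicated section gets the global values unchanged.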

Configuration options

Options supported by the libhdfs3 library:

parameter	default value
rpc_client_connect_tcpnodelay	true
dfs_client_read_shortcircuit	true
output_replace-datanode-on-failure	true
input_notretry-another-node	false
input_localread_mappedfile	true
dfs_client_use_legacy_blockreader_local	false
rpc_client_ping_interval	10 * 1000
rpc_client_connect_timeout	600 * 1000
rpc_client_read_timeout	3600 * 1000
rpc_client_write_timeout	3600 * 1000
rpc_client_socket_linger_timeout	-1
rpc_client_connect_retry	10
rpc_client_timeout	3600 * 1000
dfs_default_replica	3
input_connect_timeout	600 * 1000
input_read_timeout	3600 * 1000
input_write_timeout	3600 * 1000
input_localread_default_buffersize	1 * 1024 * 1024
dfs_prefetchsize	10
input_read_getblockinfo_retry	3
input_localread_blockinfo_cachesize	1000
input_read_max_retry	60
output_default_chunksize	512
output_default_packetsize	64 * 1024
output_default_write_retry	10
output_connect_timeout	600 * 1000
output_read_timeout	3600 * 1000
output_write_timeout	3600 * 1000
output_close_timeout	3600 * 1000
output_packetpool_size	1024
output_heartbeat_interval	10 * 1000
dfs_client_failover_max_attempts	15
dfs_client_read_shortcircuit_streams_cache_size	256
dfs_client_socketcache_expiryMsec	3000
dfs_client_socketcache_capacity	16
dfs_default_blocksize	64 * 1024 * 1024
dfs_default_uri	"hdfs://localhost:9000"
hadoop_security_authentication	"simple"
hadoop_security_kerberos_ticket_cache_path	""
dfs_client_log_severity	"INFO"
dfs_domain_socket_path	""

ClickHouse extras

parameter	default value
hadoop_kerberos_keytab	""
hadoop_kerberos_principal	""
hadoop_kerberos_kinit_command	kinit

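Limitation: hadoop_security_kerberos_ticket_cache_path can be set only in the global configuration, not in a user-level one.

Kerberos support

When hadoop_security_authentication is set to kerberos, ClickHouse authenticates to HDFS through Kerberos. The other options that need to be configured are hadoop_kerberos_keytab and hadoop_kerberos_principal; see the configuration example above.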
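
As a quick sanity check before restarting the server, the rule above can be expressed as a small validation step. This is a hedged Python sketch over a plain dictionary of settings; the helper name missing_kerberos_options is hypothetical, and it does not read or validate a real ClickHouse configuration.

```python
# Options that the text says must accompany hadoop_security_authentication = kerberos.
REQUIRED_FOR_KERBEROS = ("hadoop_kerberos_keytab", "hadoop_kerberos_principal")

def missing_kerberos_options(settings: dict) -> list:
    """Return the Kerberos options that are still empty, or [] if none are missing."""
    # "simple" is the default authentication mode listed in the options table.
    if settings.get("hadoop_security_authentication", "simple") != "kerberos":
        return []                          # no Kerberos options needed
    return [key for key in REQUIRED_FOR_KERBEROS if not settings.get(key, "")]

complete = {
    "hadoop_security_authentication": "kerberos",
    "hadoop_kerberos_keytab": "/tmp/keytab/clickhouse.keytab",
    "hadoop_kerberos_principal": "clickuser@TEST.CLICKHOUSE.TECH",
}
incomplete = {"hadoop_security_authentication": "kerberos"}

print(missing_kerberos_options(complete))
print(missing_kerberos_options(incomplete))
```

The first call reports nothing missing; the second lists both the keytab and the principal.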

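Translator's note: Kerberos support is available starting from the 21.1 releases. To use this feature, upgrading to a version later than 21.1 is recommended.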
