学习clickhouse 集群搭建和分布式存储

为什么要用集群

使用集群的主要原因是为了提高系统的可扩展性、可用性和容错性。

  1. 可扩展性:当单个节点无法处理增加的负载时,可以通过添加更多的节点到集群来增加处理能力。这使得系统可以处理更大的数据量和更高的查询负载。
  2. 可用性:在集群中,如果一个节点出现故障,其他节点可以接管它的工作,从而保证系统的持续运行。这对于需要24/7运行的系统来说非常重要。
  3. 容错性:集群可以通过数据复制和故障转移来提高系统的容错性。如果一个节点出现故障,系统可以从其他节点获取数据,从而避免数据丢失。

为什么要分布式存储

使用分布式存储的主要原因有以下几点:

  1. 可扩展性:分布式存储系统可以通过添加更多的节点来扩展存储容量和处理能力,这使得它们可以处理大规模的数据。
  2. 高可用性:在分布式存储系统中,数据通常会在多个节点上进行复制,这意味着如果一个节点出现故障,其他节点可以提供数据,从而保证系统的可用性。
  3. 容错性:分布式存储系统通过数据复制和故障转移来提高系统的容错性。如果一个节点出现故障,系统可以从其他节点获取数据,从而避免数据丢失。
  4. 性能:分布式存储系统可以通过在多个节点上并行处理数据来提高性能。

那么今天我们今天用docker的形式搭建一个clickhouse集群来提供我们学习用。

dockerfile

docker-compose.yml

version: "3.7"
 
services:
  ck1:
    image: yandex/clickhouse-server
    ulimits:
      nofile:
        soft: 300001
        hard: 300002
    ports:
      - 9001:9000
    volumes:
      - ./conf/config.xml:/etc/clickhouse-server/config.xml
      - ./conf/users.xml:/etc/clickhouse-server/users.xml
      - ./conf/metrika1.xml:/etc/metrika.xml
    links:
      - "zk1"
    depends_on:
      - zk1
 
  ck2:
    image: yandex/clickhouse-server
    ulimits:
      nofile:
        soft: 300001
        hard: 300002
    volumes:
      - ./conf/metrika2.xml:/etc/metrika.xml
      - ./conf/config.xml:/etc/clickhouse-server/config.xml
      - ./conf/users.xml:/etc/clickhouse-server/users.xml
    ports:
      - 9002:9000
    depends_on:
      - zk1
 
  ck3:
    image: yandex/clickhouse-server
    ulimits:
      nofile:
        soft: 300001
        hard: 300002
    volumes:
      - ./conf/metrika3.xml:/etc/metrika.xml
      - ./conf/config.xml:/etc/clickhouse-server/config.xml
      - ./conf/users.xml:/etc/clickhouse-server/users.xml
    ports:
      - 9003:9000
    depends_on:
      - zk1
 
  zk1:
    image: zookeeper:3.5
    restart: always
    hostname: zk1
    expose:
      - "2181"
    ports:
      - 2181:2181

注意:这里clickhouse的版本要和zookeeper的版本相互匹配上,不然后续就会有兼容性的问题。我的clickhouse的版本是22.1.3 所以我的zookeeper 的版本是3.5

配置

  1. user.xml
<?xml version="1.0"?>
<clickhouse>
    <!-- Profiles of settings. -->
    <profiles>
        <!-- Default settings. -->
        <default>
            <!-- Maximum memory usage for processing single query, in bytes. -->
            <max_memory_usage>10000000000</max_memory_usage>


            <!-- How to choose between replicas during distributed query processing.
                 random - choose random replica from set of replicas with minimum number of errors
                 nearest_hostname - from set of replicas with minimum number of errors, choose replica
                  with minimum number of different symbols between replica's hostname and local hostname
                  (Hamming distance).
                 in_order - first live replica is chosen in specified order.
                 first_or_random - if first replica one has higher number of errors, pick a random one from replicas with minimum number of errors.
            -->
            <load_balancing>random</load_balancing>
            <allow_ddl>1</allow_ddl>
            <readonly>0</readonly>
        </default>

        <!-- Profile that allows only read queries. -->
        <readonly>
            <readonly>1</readonly>
        </readonly>
    </profiles>

    <!-- Users and ACL. -->
    <users>
        <!-- If user name was not specified, 'default' user is used. -->
        <default>
                <access_management>1</access_management>
            <password></password>

            <networks>
                <ip>::/0</ip>
            </networks>

            <!-- Settings profile for user. -->
            <profile>default</profile>

            <!-- Quota for user. -->
            <quota>default</quota>

            <!-- User can create other users and grant rights to them. -->
            <!-- <access_management>1</access_management> -->
        </default>
        <test>
            <password></password>
            <quota>default</quota>
            <profile>default</profile>
            <allow_databases>
                <database>default</database>
                <database>test_dictionaries</database></allow_databases>
            <allow_dictionaries>
                <dictionary>replicaTest_all</dictionary>
            </allow_dictionaries>
        </test>
    </users>

    <!-- Quotas. -->
    <quotas>
        <!-- Name of quota. -->
        <default>
            <!-- Limits for time interval. You could specify many intervals with different limits. -->
            <interval>
                <!-- Length of interval. -->
                <duration>3600</duration>

                <!-- No limits. Just calculate resource usage for time interval. -->
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</clickhouse>
  1. config.xml
<?xml version="1.0"?>
<!--
  NOTE: User and query level settings are set up in "users.xml" file.
  If you have accidentally specified user-level settings here, server won't start.
  You can either move the settings to the right place inside "users.xml" file
   or add <skip_check_for_incorrect_settings>1</skip_check_for_incorrect_settings> here.
-->
<clickhouse>
    <logger>
        <level>trace</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>1000M</size>
        <count>10</count>
    </logger>

    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <mysql_port>9004</mysql_port>
    <postgresql_port>9005</postgresql_port>
    <interserver_http_port>9009</interserver_http_port>
    <max_connections>4096</max_connections>
    <keep_alive_timeout>3</keep_alive_timeout>
    <grpc>
        <enable_ssl>false</enable_ssl>
        <ssl_cert_file>/path/to/ssl_cert_file</ssl_cert_file>
        <ssl_key_file>/path/to/ssl_key_file</ssl_key_file>
        <ssl_require_client_auth>false</ssl_require_client_auth>
        <ssl_ca_cert_file>/path/to/ssl_ca_cert_file</ssl_ca_cert_file>
        <compression>deflate</compression>
        <compression_level>medium</compression_level>
        <max_send_message_size>-1</max_send_message_size>
        <max_receive_message_size>-1</max_receive_message_size>
        <verbose_logs>false</verbose_logs>
    </grpc>
    <openSSL>
        <server>
            <certificateFile>/etc/clickhouse-server/server.crt</certificateFile>
            <privateKeyFile>/etc/clickhouse-server/server.key</privateKeyFile>
            <dhParamsFile>/etc/clickhouse-server/dhparam.pem</dhParamsFile>
            <verificationMode>none</verificationMode>
            <loadDefaultCAFile>true</loadDefaultCAFile>
            <cacheSessions>true</cacheSessions>
            <disableProtocols>sslv2,sslv3</disableProtocols>
            <preferServerCiphers>true</preferServerCiphers>
        </server>

        <client> 
            <loadDefaultCAFile>true</loadDefaultCAFile>
            <cacheSessions>true</cacheSessions>
            <disableProtocols>sslv2,sslv3</disableProtocols>
            <preferServerCiphers>true</preferServerCiphers>
            <invalidCertificateHandler>
                <name>RejectCertificateHandler</name>
            </invalidCertificateHandler>
        </client>
    </openSSL>
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_server_memory_usage>0</max_server_memory_usage>
    <max_thread_pool_size>10000</max_thread_pool_size>
    <max_server_memory_usage_to_ram_ratio>0.9</max_server_memory_usage_to_ram_ratio>
    <total_memory_profiler_step>4194304</total_memory_profiler_step>

    <total_memory_tracker_sample_probability>0</total_memory_tracker_sample_probability>
    <uncompressed_cache_size>8589934592</uncompressed_cache_size>
    <mark_cache_size>5368709120</mark_cache_size>
    <mmap_cache_size>1000</mmap_cache_size>
    <compiled_expression_cache_size>134217728</compiled_expression_cache_size>
    <compiled_expression_cache_elements_size>10000</compiled_expression_cache_elements_size>
    <path>/var/lib/clickhouse/</path>
    <tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
    <user_files_path>/var/lib/clickhouse/user_files/</user_files_path>
    <ldap_servers>
    </ldap_servers>
    <user_directories>
        <users_xml>
            <path>users.xml</path>
        </users_xml>
        <local_directory>
            <path>/var/lib/clickhouse/access/</path>
        </local_directory>
    </user_directories>
    <default_profile>default</default_profile>
    <custom_settings_prefixes></custom_settings_prefixes>
    <default_database>default</default_database>
    <mlock_executable>true</mlock_executable>
    <remap_executable>false</remap_executable>

    <![CDATA[
         Uncomment below in order to use JDBC table engine and function.

         To install and run JDBC bridge in background:
         * [Debian/Ubuntu]
           export MVN_URL=https://repo1.maven.org/maven2/ru/yandex/clickhouse/clickhouse-jdbc-bridge
           export PKG_VER=$(curl -sL $MVN_URL/maven-metadata.xml | grep '<release>' | sed -e 's|.*>\(.*\)<.*|\1|')
           wget https://github.com/ClickHouse/clickhouse-jdbc-bridge/releases/download/v$PKG_VER/clickhouse-jdbc-bridge_$PKG_VER-1_all.deb
           apt install --no-install-recommends -f ./clickhouse-jdbc-bridge_$PKG_VER-1_all.deb
           clickhouse-jdbc-bridge &

         * [CentOS/RHEL]
           export MVN_URL=https://repo1.maven.org/maven2/ru/yandex/clickhouse/clickhouse-jdbc-bridge
           export PKG_VER=$(curl -sL $MVN_URL/maven-metadata.xml | grep '<release>' | sed -e 's|.*>\(.*\)<.*|\1|')
           wget https://github.com/ClickHouse/clickhouse-jdbc-bridge/releases/download/v$PKG_VER/clickhouse-jdbc-bridge-$PKG_VER-1.noarch.rpm
           yum localinstall -y clickhouse-jdbc-bridge-$PKG_VER-1.noarch.rpm
           clickhouse-jdbc-bridge &

         Please refer to https://github.com/ClickHouse/clickhouse-jdbc-bridge#usage for more information.
    ]]>
    <remote_servers incl="clickhouse_remote_servers">
        <test_unavailable_shard>
            <shard>
                <replica>
                    <host>localhost</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>localhost</host>
                    <port>1</port>
                </replica>
            </shard>
        </test_unavailable_shard>
    </remote_servers>

    <zookeeper incl="zookeeper-servers">
    </zookeeper>
    <builtin_dictionaries_reload_interval>3600</builtin_dictionaries_reload_interval>
    <max_session_timeout>3600</max_session_timeout>
    <default_session_timeout>60</default_session_timeout>
    <query_log>
        <database>system</database>
        <table>query_log</table>
        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </query_log>
    <trace_log>
        <database>system</database>
        <table>trace_log</table>

        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </trace_log>
    <query_thread_log>
        <database>system</database>
        <table>query_thread_log</table>
        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </query_thread_log>
    <query_views_log>
        <database>system</database>
        <table>query_views_log</table>
        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </query_views_log>
    <part_log>
        <database>system</database>
        <table>part_log</table>
        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </part_log>
    <metric_log>
        <database>system</database>
        <table>metric_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <collect_interval_milliseconds>1000</collect_interval_milliseconds>
    </metric_log>
    <asynchronous_metric_log>
        <database>system</database>
        <table>asynchronous_metric_log</table>
        <flush_interval_milliseconds>7000</flush_interval_milliseconds>
    </asynchronous_metric_log>
    <opentelemetry_span_log>
        <engine>
            engine MergeTree
            partition by toYYYYMM(finish_date)
            order by (finish_date, finish_time_us, trace_id)
        </engine>
        <database>system</database>
        <table>opentelemetry_span_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </opentelemetry_span_log>
    <crash_log>
        <database>system</database>
        <table>crash_log</table>

        <partition_by />
        <flush_interval_milliseconds>1000</flush_interval_milliseconds>
    </crash_log>

    <session_log>
        <database>system</database>
        <table>session_log</table>

        <partition_by>toYYYYMM(event_date)</partition_by>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </session_log>
    <top_level_domains_lists>
    </top_level_domains_lists>
    <dictionaries_config>*_dictionary.xml</dictionaries_config>
<user_defined_executable_functions_config>*_function.xml</user_defined_executable_functions_config>
    <encryption_codecs>
       
    </encryption_codecs>

    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
    </distributed_ddl>
    <graphite_rollup_example>
        <pattern>
            <regexp>click_cost</regexp>
            <function>any</function>
            <retention>
                <age>0</age>
                <precision>3600</precision>
            </retention>
            <retention>
                <age>86400</age>
                <precision>60</precision>
            </retention>
        </pattern>
        <default>
            <function>max</function>
            <retention>
                <age>0</age>
                <precision>60</precision>
            </retention>
            <retention>
                <age>3600</age>
                <precision>300</precision>
            </retention>
            <retention>
                <age>86400</age>
                <precision>3600</precision>
            </retention>
        </default>
    </graphite_rollup_example>

    <format_schema_path>/var/lib/clickhouse/format_schemas/</format_schema_path>
    <query_masking_rules>
        <rule>
            <name>hide encrypt/decrypt arguments</name>
            <regexp>((?:aes_)?(?:encrypt|decrypt)(?:_mysql)?)\s*\(\s*(?:'(?:\\'|.)+'|.*?)\s*\)</regexp>
            <replace>\1(???)</replace>
        </rule>
    </query_masking_rules>



    <send_crash_reports>
        <enabled>false</enabled>
        <anonymize>false</anonymize>
        <endpoint>https://6f33034cfe684dd7a3ab9875e57b1c8d@o388870.ingest.sentry.io/5226277</endpoint>
    </send_crash_reports>
    <include_from>/etc/metrika.xml</include_from>
</clickhouse>
  1. metrika.xml
    这里我们3个metrika.xml都是一样的
<yandex>
  <clickhouse_remote_servers>
    <test_cluster>
      <shard>
        <internal_replication>true</internal_replication>
        <replica>
          <host>ck1</host>
          <port>9000</port>
        </replica>
      </shard>
      <shard>
        <internal_replication>true</internal_replication>
        <replica>
          <host>ck2</host>
          <port>9000</port>
        </replica>
      </shard>
      <shard>
        <internal_replication>true</internal_replication>
        <replica>
          <host>ck3</host>
          <port>9000</port>
        </replica>
      </shard>
    </test_cluster>
  </clickhouse_remote_servers>

  <zookeeper-servers>
    <node index="1">
      <host>zk1</host>
      <port>2181</port>
    </node>
  </zookeeper-servers>

</yandex>

我们将配置文件在conf文件夹里面创建好以后执行docker-compose up -d 就可以启动我们的集群了。

创建表

通过docker-compose ps 查看端口号
在这里插入图片描述
通过clickhouse-client --port 9001 连接数据库
在这里插入图片描述

创建数据表

分表在3台节点啥


CREATE TABLE access_logs
(
    `create_time` DateTime,
    `status` String,
    `runtime` String,
    `client_ip` String,
    `method` String,
    `url` String,
    `body` String,
    `response` String,
    `hander` String,
    `request_id` String
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(create_time)
PRIMARY KEY request_id
ORDER BY (request_id, create_time)
TTL create_time + toIntervalHour(48)

找一台节点执行一下sql

CREATE TABLE access_logs_all ENGINE = Distributed(test_cluster, big_data_all, access_logs, rand());

第二条sql的意思,我们通过on cluster在每个节点上创建了一个分布式的表,该表指向了我们已经有的表access_logs。

test_cluster : 集群的名称
big_data_all: 远程数据库的名称
access_logs: 远程数据表的名称
rand(): 分片的逻辑,我们这里使用的是随机分配,还有hivehash 等。

Distributed Engine

ClickHouse 的 Distributed 引擎并不存储数据,而是允许在物理上分布在多个节点上的表之间进行查询。当一个查询被发送到 Distributed 表时,它将查询分发到所有的本地表,然后汇总结果。

  1. 当一个查询被发送到 Distributed 表时,ClickHouse 首先将查询分解为两部分:一个主查询和一个子查询。

  2. 子查询被发送到集群中的每个节点。每个节点在其本地表上执行子查询,并返回结果。

  3. 主查询在 Distributed 表所在的节点上执行。它从所有的子查询中收集结果,然后汇总这些结果。

  4. 如果查询包含聚合函数,如 SUM()COUNT(),那么 ClickHouse 将在每个节点上进行部分聚合,然后在 Distributed 表所在的节点上进行最终聚合。

  5. 如果查询包含 ORDER BY 子句,那么排序将在每个节点上进行,然后在 Distributed 表所在的节点上进行最终排序。

通过这种方式,Distributed 引擎可以在多个节点上并行执行查询,从而提高查询性能。同时,由于数据是在多个节点上存储的,因此可以提供更大的存储容量。

在 ClickHouse 中,你可以直接向 Distributed 表插入数据。当你向 Distributed 表插入数据时,ClickHouse 会根据分片键将数据分发到相应的分片上。

  • 7
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值