ClickHouse Database Deployment and Python 3 Stress Testing in Practice

玉言心

Published 2023-10-26 16:13:42


Article link: https://blog.csdn.net/weixin_43563169/article/details/134058827


I. ClickHouse Database Deployment

Version: yandex/clickhouse-server:latest
Deployment method: Docker

Contents of the Docker Compose file:

version: "3"

services:
  clickhouse:
    image: yandex/clickhouse-server:latest
    container_name: clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    volumes:
      - ./data/config:/var/lib/clickhouse
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
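
The healthcheck in the Compose file polls the HTTP interface at /ping. The same check can be run from Python once the container is up. This is a minimal sketch: the function name clickhouse_is_up and the default address http://localhost:8123 are assumptions for illustration, and proxy settings are bypassed on the assumption that the server is local.

```python
import urllib.error
import urllib.request

# Opener that ignores HTTP proxy settings (assumption: the server is on a local address)
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def clickhouse_is_up(base_url="http://localhost:8123", timeout=5.0):
    """Return True if GET <base_url>/ping answers 200 with the body 'Ok.'."""
    try:
        with _opener.open(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer all count as "not up"
        return False

# Usage: print("ClickHouse reachable:", clickhouse_is_up())
```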

Table creation statement

-- The table is named ck_table here to match the Python scripts below.
-- MergeTree requires a sorting key; ORDER BY id is an assumption.
CREATE TABLE ck_table (id Int32,
    feild1 String, feild2 String, feild3 String
    , feild4 String, feild5 String, feild6 String
    , feild7 String, feild8 String, feild9 String
    , feild10 String, feild11 String, feild12 String
    , feild13 String, feild14 String, feild15 String
    , feild16 String, feild17 String, feild18 String
    , feild19 String, feild20 String
    ) ENGINE = MergeTree ORDER BY id;
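
Typing twenty near-identical columns by hand is error-prone, so the statement can also be generated. This is only a sketch: the helper name build_create_table_sql is hypothetical, and ORDER BY id is an assumed sorting key.

```python
def build_create_table_sql(table="ck_table", n_fields=20, order_by="id"):
    """Build a MergeTree CREATE TABLE statement: an id column plus n_fields String columns."""
    columns = ["id Int32"] + [f"feild{n} String" for n in range(1, n_fields + 1)]
    column_sql = ",\n    ".join(columns)
    return (
        f"CREATE TABLE {table} (\n    {column_sql}\n"
        f") ENGINE = MergeTree ORDER BY {order_by}"  # sorting key is an assumption
    )
```

The resulting string could then be passed to Client.execute() from clickhouse_driver.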

II. Python 3 Insert Stress Test

Key libraries: clickhouse_driver, concurrent.futures

Code:

import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor

from clickhouse_driver import Client

CLICKHOUSE_HOST = 'ip'    # replace with the ClickHouse server address
BATCH_SIZE = 1000         # rows per INSERT
TOTAL_BATCHES = 4000000   # total number of batches to submit

COLUMNS = ['id'] + [f'feild{n}' for n in range(1, 21)]
INSERT_SQL = f"INSERT INTO ck_table ({', '.join(COLUMNS)}) VALUES"

# Use several connections so that no single connection is overwhelmed.
# Each worker thread gets its own Client: a clickhouse_driver connection
# runs one query at a time, so sharing one across threads can fail.
_local = threading.local()


def get_client():
    if not hasattr(_local, 'client'):
        _local.client = Client(host=CLICKHOUSE_HOST)
    return _local.client


def build_rows(count):
    rows = []
    for row_no in range(count):
        row = [random.randint(1, 10000000), f"feild1-{random.randint(1, 10000000)}"]
        row.extend(f"feild{n}-{row_no}" for n in range(2, 21))
        rows.append(tuple(row))
    return rows


# Insert in batches: in testing, concurrent single-row inserts performed
# poorly, managing only 2-5 INSERTs per second.
def task(batch_no):
    rows = build_rows(BATCH_SIZE)
    get_client().execute(INSERT_SQL, rows)
    return f"Batch {batch_no}: inserted {len(rows)} rows"


if __name__ == '__main__':
    print("Program started")
    start_time = time.perf_counter()
    # Leaving the with-block waits for every submitted batch,
    # so the elapsed time covers the inserts, not just the submissions.
    with ThreadPoolExecutor(max_workers=2) as executor:
        for j in range(TOTAL_BATCHES):
            executor.submit(task, j)
    print("Elapsed:", time.perf_counter() - start_time, "s")
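
When the rows come from an existing data source rather than a generator loop, the same batch-insert idea can be written as a small helper. This is a sketch, not the author's code: iter_batches and insert_in_batches are hypothetical names, and insert_batch stands for whatever callable performs the actual INSERT.

```python
from itertools import islice


def iter_batches(rows, batch_size=1000):
    """Yield lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(rows, insert_batch, batch_size=1000):
    """Call insert_batch(batch) once per batch; return (batches_sent, rows_sent)."""
    batches = total = 0
    for batch in iter_batches(rows, batch_size):
        insert_batch(batch)
        batches += 1
        total += len(batch)
    return batches, total
```

With clickhouse_driver, insert_batch could be something like lambda batch: client.execute(sql, batch), so that each call sends one multi-row INSERT.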

III. Python 3 Query Test

Key libraries: clickhouse_driver, concurrent.futures

Code:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from clickhouse_driver import Client

# A single connection is enough here because the pool below has one worker
client = Client(host='10.10.16.110')

# Example filtered query; pass it to client.execute() in new_task
# to time a WHERE lookup instead of the count
query_sql = "SELECT * FROM ck_table WHERE feild2 = 'feild2-1009'"
count_sql = "SELECT count(*) FROM ck_table"


def new_task(i):
    time.sleep(1)  # pause one second before each query
    result = client.execute(count_sql)
    return f"Task {i} finished: {result}"


if __name__ == '__main__':
    print("Program started")
    start_time = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(new_task, j) for j in range(1000)]
        for future in as_completed(futures):
            print("Status:", future.result())
    print("Elapsed:", time.perf_counter() - start_time, "s")
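
The script above reports only the total elapsed time, which includes the one-second sleeps. To time individual queries, a helper along the following lines may be useful. It is a sketch under assumptions: measure_latency is a hypothetical name, and run_query is any zero-argument callable that executes one query.

```python
import statistics
import time


def measure_latency(run_query, repeats=10):
    """Run run_query() `repeats` times and return timing statistics in seconds."""
    samples = []
    for _ in range(repeats):
        started = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - started)
    return {
        "count": len(samples),
        "min": min(samples),
        "mean": statistics.mean(samples),
        "max": max(samples),
    }
```

For example, measure_latency(lambda: client.execute(query_sql)) would time the filtered query on its own.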

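IV. Test Conclusions

ClickHouse: insert and query test on a 21-column table. With up to 2 million rows, CPU usage (%CPU) stayed above 100, peaking at 133.6 with a mean of about 110.

1. Frequent inserts are not supported (generally no more than 1-2 per second); otherwise errors such as dropped connections occur, so only batch inserts are practical. (The script ran without errors using 2 workers inserting 1,000 rows per batch; with more than 2 workers, dropped connections and similar errors appeared.)

2. Frequent queries are not well supported. The official recommendation is to keep QPS under 100; otherwise CPU usage becomes very high and drives up server load.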

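3. Query performance:

1 WHERE condition (Memory): 600,000 rows, 0.33 s

5 WHERE conditions (Memory): 800,000 rows, 0.57 s

5 WHERE conditions (Memory): 1,000,000 rows, 0.54 s

5 WHERE conditions (Memory): 1,120,000 rows, 0.56 s

5 WHERE conditions (Memory): 2,000,000 rows, 0.565 s

5 WHERE conditions (Memory): 5,000,000 rows, 1.2 s (with inserts stopped)

5 WHERE conditions (Memory): 5,600,000 rows, 1.97 s (with inserts stopped)

5 WHERE conditions (TinyLog): 70,000,000 rows, 1 min 47 s

2 WHERE conditions (TinyLog): 104,600,000 rows, 89 s

5 WHERE conditions (TinyLog): 104,600,000 rows, 84 s

10 WHERE conditions (TinyLog): 104,600,000 rows, 87 s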
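
To compare these timings across table sizes, it can help to convert them to rows per second. The snippet below does that arithmetic for a few of the figures listed above. It is a rough derived rate only, and the labels and the helper name rows_per_second are illustrative.

```python
# (label, rows, seconds) copied from the timings listed above
RESULTS = [
    ("Memory, 1 condition", 600_000, 0.33),
    ("Memory, 5 conditions", 5_000_000, 1.2),
    ("TinyLog, 5 conditions", 70_000_000, 107.0),   # 1 min 47 s
    ("TinyLog, 5 conditions", 104_600_000, 84.0),
]


def rows_per_second(rows, seconds):
    """Rows in the table divided by the query time."""
    return rows / seconds


for label, rows, seconds in RESULTS:
    rate = rows_per_second(rows, seconds)
    print(f"{label}: {rows:,} rows in {seconds} s -> {rate:,.0f} rows/s")
```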

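Note: beyond 4.5 million rows, the insert thread and the query thread could not run at the same time. Slow queries consume a lot of memory, and 16 GB was not enough. Queries with 5 WHERE conditions still ran, taking 1-2 s.

(1) Server state at 5 million rows (mean CPU around 320; 500-800 MB of the 16 GB of memory free; after writes and queries stopped, CPU returned to normal and free memory was around 800 MB):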

total  used  free  shared  buff/cache  available
15G    5.9G  519M  9.2M    9.1G        9.2G

%CPU   %MEM
429.5  26.0

(2) Server state at 100 million rows (38% of the 1 TB disk in use in total, with an estimated 6% consumed by this data):

total  used  free  shared  buff/cache  available
15G    2.7G  181M  9.2M    12G         12G

%CPU   %MEM
103.7  3.6

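Summary:

1. Concurrent, frequent single-row inserts are not supported; they cause errors and dropped connections, which can lose data.
2. High-concurrency querying is not supported. The official recommendation is QPS <= 100; beyond that, server load rises and CPU and memory consumption become excessive.
3. Server requirements are high: for data on the order of 100 million rows, 16 or more CPU cores and 64 GB or more of memory are generally recommended.
4. The strengths are fast queries and efficient batch inserts; low-frequency, large-batch inserts are recommended.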
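
One way to stay within the QPS <= 100 guideline from the client side is to space out calls before they reach the server. The class below is a minimal single-threaded sketch. QpsLimiter is a hypothetical name, the clock and sleep parameters exist only to make it testable, and a multi-threaded caller would additionally need a lock around wait().

```python
import time


class QpsLimiter:
    """Space out calls so that no more than max_qps of them start per second."""

    def __init__(self, max_qps=100, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_qps
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Calling limiter.wait() immediately before each client.execute(...) keeps the request rate at or below the configured limit.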