Citus Distributed Solution (Part 4): Real-Time Data Processing


Citus supports real-time queries over large datasets. Take HTTP traffic monitoring as an example: every time a client makes an HTTP request, the service receives a log record. The application ingests all of these records and drives an HTTP analytics dashboard showing, for example, the number of HTTP errors. We want this data to appear on the dashboard with as little latency as possible.

Data Model

We will simulate a log data stream with a fixed structure and insert it directly into Citus. In a real production setting, this data would typically be routed through something like Kafka first, so that it can be pre-aggregated more easily under high concurrency.

1. Create the table

We use a simple model to simulate ingesting HTTP event data. Run the following on the coordinator node:

CREATE TABLE http_request (
  site_id INT,
  ingest_time TIMESTAMPTZ DEFAULT now(),

  url TEXT,
  request_country TEXT,
  ip_address TEXT,

  status_code INT,
  response_time_msec INT
);

SELECT create_distributed_table('http_request', 'site_id');

The create_distributed_table function uses the default configuration value for the shard count; Citus recommends 2-4 times as many shards as there are CPU cores in the cluster. Citus deployments typically use streaming replication for high availability, which makes maintaining shard replicas redundant. In production environments where streaming replication is unavailable, set citus.shard_replication_factor to 2 or higher for fault tolerance.

2. Simulate the data stream

Run the following loop in a psql console in the background to simulate a steady stream of incoming log data:

DO $$
  BEGIN LOOP
    INSERT INTO http_request (
      site_id, ingest_time, url, request_country,
      ip_address, status_code, response_time_msec
    ) VALUES (
      trunc(random()*32), clock_timestamp(),
      concat('http://example.com/', md5(random()::text)),
      ('{China,India,USA,Indonesia}'::text[])[ceil(random()*4)],
      concat(
        trunc(random()*250 + 2), '.',
        trunc(random()*250 + 2), '.',
        trunc(random()*250 + 2), '.',
        trunc(random()*250 + 2)
      )::inet,
      ('{200,404}'::int[])[ceil(random()*2)],
      5+trunc(random()*150)
    );
    COMMIT;
    PERFORM pg_sleep(random() * 0.25);
  END LOOP;
END $$;
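For reference, each generated row has the following shape. This Python sketch is purely illustrative (not part of the tutorial) and mirrors the value ranges used by the SQL generator above:

```python
import random
import hashlib
from datetime import datetime, timezone

COUNTRIES = ["China", "India", "USA", "Indonesia"]

def random_http_event():
    """Build one synthetic http_request row, mirroring the SQL generator:
    32 sites, a random URL path, one of four countries, a random IPv4
    address, a 200 or 404 status, and a 5-154 ms response time."""
    return {
        "site_id": random.randrange(32),
        "ingest_time": datetime.now(timezone.utc),
        "url": "http://example.com/"
               + hashlib.md5(str(random.random()).encode()).hexdigest(),
        "request_country": random.choice(COUNTRIES),
        "ip_address": ".".join(str(random.randrange(2, 252)) for _ in range(4)),
        "status_code": random.choice([200, 404]),
        "response_time_msec": 5 + random.randrange(150),
    }
```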

Run the following query to check that data is arriving:

SELECT
  site_id,
  date_trunc('minute', ingest_time) as minute,
  COUNT(1) AS request_count,
  SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
  SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
  SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
FROM http_request
WHERE date_trunc('minute', ingest_time) > now() - '5 minutes'::interval
GROUP BY site_id, minute
ORDER BY minute ASC;

Real-Time Aggregation

Aggregating this real-time data typically runs into two problems:

① Dashboards must scan every row of the data each time a chart is rendered. For example, a query over the past year's trend has to aggregate every row from the past year from scratch.

② Storage costs grow in proportion to both the ingest rate and the length of queryable history. In practice, we may want to keep raw events for a short period (say, one month) while keeping historical charts for years.

1. Roll up the data

We can address both problems by rolling the raw data up into a pre-aggregated form. As an example, we will roll the raw data up into a table that stores 1-minute aggregates.

CREATE TABLE http_request_1min (
  site_id INT,
  ingest_time TIMESTAMPTZ, -- which minute this row represents

  error_count INT,
  success_count INT,
  request_count INT,
  average_response_time_msec INT,
  CHECK (request_count = error_count + success_count),
  CHECK (ingest_time = date_trunc('minute', ingest_time))
);

SELECT create_distributed_table('http_request_1min', 'site_id');

CREATE INDEX http_request_1min_idx ON http_request_1min (site_id, ingest_time);

http_request_1min is also sharded on site_id and uses the same default configuration, so there is a one-to-one correspondence between the shards of http_request and the shards of http_request_1min. Citus places matching shards on the same worker node; this is called co-location. Co-location makes queries such as joins much more efficient.
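Co-location can be pictured with a simplified placement model. The sketch below uses a plain hash-modulo; real Citus assigns each shard a range of hash values rather than a modulo, but the consequence is the same:

```python
import zlib

SHARD_COUNT = 32  # both tables use the same shard count

def shard_for(distribution_key: int) -> int:
    """Simplified placement: hash the distribution column and map it to a
    shard index. Citus actually assigns each shard a hash range, but the
    effect is the same: equal keys always land in matching shards."""
    return zlib.crc32(str(distribution_key).encode()) % SHARD_COUNT

# http_request and http_request_1min are both distributed by site_id with the
# same shard count, so rows for a given site_id live in shards with the same
# index in both tables. Citus places those matching shards on the same worker,
# which is why the rollup's INSERT INTO ... SELECT needs no network transfer.
shard_of_raw_row = shard_for(17)     # shard holding site 17's raw events
shard_of_rollup_row = shard_for(17)  # shard holding site 17's 1-minute rollups
```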

Because the two tables are co-located, the rollup can be implemented as a periodic INSERT INTO ... SELECT:

-- single-row table to store when we rolled up last
CREATE TABLE latest_rollup (
  minute timestamptz PRIMARY KEY,

  -- "minute" should be no more precise than a minute
  CHECK (minute = date_trunc('minute', minute))
);

-- initialize to a time long ago
INSERT INTO latest_rollup VALUES ('10-10-1901');

-- function to do the rollup
CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
  curr_rollup_time timestamptz := date_trunc('minute', now());
  last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
  INSERT INTO http_request_1min (
    site_id, ingest_time, request_count,
    success_count, error_count, average_response_time_msec
  ) SELECT
    site_id,
    date_trunc('minute', ingest_time),
    COUNT(1) as request_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
    SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
  FROM http_request
  -- roll up only data new since last_rollup_time
  WHERE date_trunc('minute', ingest_time) <@
          tstzrange(last_rollup_time, curr_rollup_time, '(]')
  GROUP BY 1, 2;

  -- update the value in latest_rollup so that next time we run the rollup it will operate on data newer than curr_rollup_time
  UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;
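For intuition, the incremental aggregation that this function performs can be sketched in Python. The event dicts below mirror the http_request columns; the function and variable names are illustrative only:

```python
from collections import defaultdict
from datetime import datetime

def truncate_to_minute(ts: datetime) -> datetime:
    return ts.replace(second=0, microsecond=0)

def rollup(events, last_rollup_time, curr_rollup_time):
    """Aggregate raw events into per-(site_id, minute) rows, covering only
    the half-open window (last_rollup_time, curr_rollup_time], like the
    tstzrange filter in the SQL function."""
    buckets = defaultdict(lambda: {"request_count": 0, "success_count": 0,
                                   "error_count": 0, "total_msec": 0})
    for e in events:
        minute = truncate_to_minute(e["ingest_time"])
        if not (last_rollup_time < minute <= curr_rollup_time):
            continue  # already rolled up, or not yet complete
        b = buckets[(e["site_id"], minute)]
        b["request_count"] += 1
        if 200 <= e["status_code"] <= 299:
            b["success_count"] += 1
        else:
            b["error_count"] += 1
        b["total_msec"] += e["response_time_msec"]
    return {
        key: {
            "request_count": b["request_count"],
            "success_count": b["success_count"],
            "error_count": b["error_count"],
            "average_response_time_msec": b["total_msec"] // b["request_count"],
        }
        for key, b in buckets.items()
    }
```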

The rollup_http_request() function can be invoked once per minute by adding a crontab entry on the coordinator node; extensions such as pg_cron also let you schedule recurring queries directly from the database.

Query the rollup results:

SELECT site_id, ingest_time as minute, request_count, success_count, error_count, average_response_time_msec
  FROM http_request_1min
  WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;

2. Expire old data

Decide how long to keep data at each granularity, and use standard queries to delete expired data. These queries can be wrapped in a function and called periodically from a cron job.

DELETE FROM http_request WHERE ingest_time < now() - interval '1 day';
DELETE FROM http_request_1min WHERE ingest_time < now() - interval '1 month';

On top of Citus's hash distribution, table range partitioning can make expiring data even more efficient: whole partitions can be dropped instead of deleting individual rows.

Optimizing Aggregation

A common problem in real-time analytics is approximate distinct counts, e.g. how many unique visitors visited a site in the past month. Answering this exactly would require storing the full list of previously-seen visitors in the rollup table, which is prohibitively expensive, so an approximate answer is usually what we want.

1. Apply the HLL approximation algorithm

Consider a global query such as: how many distinct IP addresses visited a site in the past month. Answered naively, this query ships lists of IP addresses from the worker nodes to the coordinator for deduplication, which costs a lot of network traffic and computation. With HLLs, we can speed this query up dramatically.
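The reason HLL helps is that its sketches can be merged: each worker aggregates its own shards locally, and the coordinator only has to union small fixed-size sketches instead of deduplicating raw IP lists. The toy Python sketch below illustrates the idea; it is not the actual hll extension's algorithm, which uses more refined bias correction:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct counts from a fixed-size register
    array and supports lossless merging of two sketches."""

    def __init__(self, p: int = 14):
        self.p = p          # 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: top p bits pick a register, the rest feed the rank
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        # Register-wise max merges two sketches exactly; this is what lets
        # per-shard sketches be combined without shipping raw values.
        out = HyperLogLog(self.p)
        out.registers = [max(a, b)
                         for a, b in zip(self.registers, other.registers)]
        return out

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw
```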

As a PostgreSQL extension, Citus can take full advantage of existing PostgreSQL extensions, such as HLL:

CREATE EXTENSION hll;

We are now ready to track IP addresses with HLL. First, add a column to the http_request_1min table:

ALTER TABLE http_request_1min ADD COLUMN distinct_ip_addresses hll;

Adjust the rollup_http_request() function accordingly:

CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
  curr_rollup_time timestamptz := date_trunc('minute', now());
  last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
  INSERT INTO http_request_1min (
    site_id, ingest_time, request_count,
    success_count, error_count, average_response_time_msec,
    distinct_ip_addresses
  ) SELECT
    site_id,
    date_trunc('minute', ingest_time),
    COUNT(1) as request_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
    SUM(response_time_msec) / COUNT(1) AS average_response_time_msec,
    hll_add_agg(hll_hash_text(ip_address)) AS distinct_ip_addresses
  FROM http_request
  WHERE date_trunc('minute', ingest_time) <@
          tstzrange(last_rollup_time, curr_rollup_time, '(]')
  GROUP BY 1, 2;
  
  UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;

Read out the distinct IP address counts by calling hll_cardinality():

SELECT site_id, ingest_time as minute, request_count,
       success_count, error_count, average_response_time_msec,
       hll_cardinality(distinct_ip_addresses) AS distinct_ip_address_count
  FROM http_request_1min
 WHERE ingest_time > date_trunc('minute', now()) - interval '5 minutes';

Because HLL sketches can be unioned, they can also answer questions like how many distinct IP addresses were seen across an entire time window, even after the raw data has been discarded:

SELECT hll_cardinality(hll_union_agg(distinct_ip_addresses))
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;

2. Handle unstructured data

Postgres has excellent support for unstructured data types. Using them avoids adding a column for every feature, and thus avoids rows with hundreds of sparsely populated columns. Continuing this section's rollup example, we will record the number of visitors from each country by adding a JSONB column to the data model.
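For intuition, the per-country counters the JSONB column will store can be sketched in Python (illustrative names; the dicts correspond to JSONB values):

```python
from collections import Counter, defaultdict
from datetime import datetime

def country_counters(events):
    """Count requests per country within each (site_id, minute) bucket;
    these dicts are what the JSONB country_counters column would store."""
    buckets = defaultdict(Counter)
    for e in events:
        minute = e["ingest_time"].replace(second=0, microsecond=0)
        buckets[(e["site_id"], minute)][e["request_country"]] += 1
    return {key: dict(counts) for key, counts in buckets.items()}

# Three requests for site 1 within the same minute
t0 = datetime(2024, 1, 1, 12, 0)
rows = [
    {"site_id": 1, "ingest_time": t0.replace(second=1), "request_country": "USA"},
    {"site_id": 1, "ingest_time": t0.replace(second=2), "request_country": "USA"},
    {"site_id": 1, "ingest_time": t0.replace(second=3), "request_country": "India"},
]
counters = country_counters(rows)  # {(1, t0): {"USA": 2, "India": 1}}
```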

First, add the country counter column:

ALTER TABLE http_request_1min ADD COLUMN country_counters JSONB;

Adjust the rollup_http_request() function:

CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
  curr_rollup_time timestamptz := date_trunc('minute', now());
  last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
  INSERT INTO http_request_1min (
    site_id, ingest_time, request_count,
    success_count, error_count, average_response_time_msec,
    country_counters
  ) SELECT
    site_id,
    date_trunc('minute', ingest_time),
    COUNT(1) as request_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
    SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
    SUM(response_time_msec) / COUNT(1) AS average_response_time_msec,
    jsonb_object_agg(request_country, country_count) AS country_counters
  FROM (
    SELECT *,
      count(1) OVER (
        PARTITION BY site_id, date_trunc('minute', ingest_time), request_country
      ) AS country_count
    FROM http_request
  ) h
  WHERE date_trunc('minute', ingest_time) <@
          tstzrange(last_rollup_time, curr_rollup_time, '(]')
  GROUP BY 1, 2;

  UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;

Query the number of requests that came from the USA:

SELECT
  request_count, success_count, error_count, average_response_time_msec,
  COALESCE(country_counters->>'USA', '0')::int AS american_visitors
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;
