(4) Real-Time Data Processing
Citus supports real-time queries over large datasets. As an example, consider monitoring HTTP traffic: whenever a client issues an HTTP request, the service receives a log record. The application ingests all of these records and builds an HTTP analytics dashboard showing metrics such as the number of HTTP errors. We want this data to show up on the dashboard with as little latency as possible.
Data Model
We will simulate a stream of log data with a fixed structure and insert it directly into Citus. In a real production scenario, the data would first be routed through something like Kafka, which makes it easier to pre-aggregate the data under high concurrency.
1. Create the Table
Use a simple model to simulate ingesting HTTP event data. Run the following on the coordinator node:
CREATE TABLE http_request (
site_id INT,
ingest_time TIMESTAMPTZ DEFAULT now(),
url TEXT,
request_country TEXT,
ip_address TEXT,
status_code INT,
response_time_msec INT
);
SELECT create_distributed_table('http_request', 'site_id');
The create_distributed_table function uses the default configuration value for the shard count; Citus recommends using 2-4 times as many shards as there are CPU cores in the cluster. Citus deployments typically use streaming replication for high availability, so maintaining shard replicas would be redundant. In production environments where streaming replication is unavailable, set citus.shard_replication_factor to 2 or higher for fault tolerance.
2. Simulate the Data Stream
Run the following loop in a background psql console to simulate a continuous stream of log records (note that the COMMIT inside the loop requires PostgreSQL 11 or later):
DO $$
BEGIN LOOP
INSERT INTO http_request (
site_id, ingest_time, url, request_country,
ip_address, status_code, response_time_msec
) VALUES (
trunc(random()*32), clock_timestamp(),
concat('http://example.com/', md5(random()::text)),
('{China,India,USA,Indonesia}'::text[])[ceil(random()*4)],
concat(
trunc(random()*250 + 2), '.',
trunc(random()*250 + 2), '.',
trunc(random()*250 + 2), '.',
trunc(random()*250 + 2)
)::inet,
('{200,404}'::int[])[ceil(random()*2)],
5+trunc(random()*150)
);
COMMIT;
PERFORM pg_sleep(random() * 0.25);
END LOOP;
END $$;
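The random expressions in the SQL loop above can be mirrored in an illustrative Python sketch (the function name is hypothetical; this reproduces the value distributions, not the psql loop itself):

```python
import hashlib
import random

def make_http_request_record():
    """Generate one fake log record, mirroring the random
    expressions used in the SQL generator loop."""
    return {
        "site_id": random.randrange(32),                       # trunc(random()*32)
        "url": "http://example.com/" + hashlib.md5(
            str(random.random()).encode()).hexdigest(),        # md5(random()::text)
        "request_country": random.choice(
            ["China", "India", "USA", "Indonesia"]),
        "ip_address": ".".join(
            str(random.randrange(2, 252)) for _ in range(4)),  # trunc(random()*250 + 2)
        "status_code": random.choice([200, 404]),
        "response_time_msec": 5 + random.randrange(150),       # 5+trunc(random()*150)
    }
```

Each call yields one row shaped like the INSERT above; a driver script could call it in a loop with a short sleep to mimic the stream.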
Run the following query to check on the ingested data:
SELECT
site_id,
date_trunc('minute', ingest_time) as minute,
COUNT(1) AS request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
FROM http_request
WHERE date_trunc('minute', ingest_time) > now() - '5 minutes'::interval
GROUP BY site_id, minute
ORDER BY minute ASC;
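The grouping logic of the dashboard query can be sketched as an in-memory Python simulation (an illustration of the aggregates, not of how Citus executes the query across shards):

```python
from collections import defaultdict

def aggregate_by_minute(rows):
    """Group rows by (site_id, minute) and compute the same aggregates
    as the dashboard query: request/success/error counts and the
    integer-averaged response time."""
    groups = defaultdict(lambda: {"request_count": 0, "success_count": 0,
                                  "error_count": 0, "total_msec": 0})
    for row in rows:
        # "YYYY-MM-DD HH:MM:SS"[:16] truncates to the minute,
        # like date_trunc('minute', ingest_time)
        key = (row["site_id"], row["ingest_time"][:16])
        g = groups[key]
        g["request_count"] += 1
        if 200 <= row["status_code"] <= 299:
            g["success_count"] += 1
        else:
            g["error_count"] += 1
        g["total_msec"] += row["response_time_msec"]
    return {
        key: {"request_count": g["request_count"],
              "success_count": g["success_count"],
              "error_count": g["error_count"],
              # SUM(...) / COUNT(1) over INT columns is integer division in SQL
              "average_response_time_msec": g["total_msec"] // g["request_count"]}
        for key, g in groups.items()
    }
```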
Real-Time Statistics
Computing statistics over this live data typically runs into two problems:
① Every time a chart is rendered, the dashboard must scan every relevant row. For example, charting trends over the past year means aggregating every row from the past year from scratch.
② Storage costs grow with the volume of data and the length of queryable history. In practice, we may want to keep raw events for a short period (say, one month) while keeping historical charts for much longer (say, N years).
1. Rolling Up Data
We can address both problems by rolling the raw data up into a pre-aggregated form. As an example, we will roll the raw data up into a table that stores aggregates at one-minute intervals.
CREATE TABLE http_request_1min (
site_id INT,
ingest_time TIMESTAMPTZ, -- which minute this row represents
error_count INT,
success_count INT,
request_count INT,
average_response_time_msec INT,
CHECK (request_count = error_count + success_count),
CHECK (ingest_time = date_trunc('minute', ingest_time))
);
SELECT create_distributed_table('http_request_1min', 'site_id');
CREATE INDEX http_request_1min_idx ON http_request_1min (site_id, ingest_time);
The http_request_1min table is also sharded on site_id and uses the same default shard count, so there is a one-to-one correspondence between the shards of http_request and the shards of http_request_1min. Citus places matching shards on the same worker node, which is called co-location. Co-location makes queries such as joins more efficient.
Because the two tables are co-located, we can periodically run an INSERT INTO ... SELECT to populate the rollup:
-- single-row table to store when we rolled up last
CREATE TABLE latest_rollup (
minute timestamptz PRIMARY KEY,
-- "minute" should be no more precise than a minute
CHECK (minute = date_trunc('minute', minute))
);
-- initialize to a time long ago
INSERT INTO latest_rollup VALUES ('10-10-1901');
-- function to do the rollup
CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
curr_rollup_time timestamptz := date_trunc('minute', now());
last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
INSERT INTO http_request_1min (
site_id, ingest_time, request_count,
success_count, error_count, average_response_time_msec
) SELECT
site_id,
date_trunc('minute', ingest_time),
COUNT(1) as request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec
FROM http_request
-- roll up only data new since last_rollup_time
WHERE date_trunc('minute', ingest_time) <@
tstzrange(last_rollup_time, curr_rollup_time, '(]')
GROUP BY 1, 2;
-- update the value in latest_rollup so that next time we run the rollup it will operate on data newer than curr_rollup_time
UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;
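The bookkeeping in rollup_http_request() — remembering the last rolled-up minute and processing only rows in the half-open window (last_rollup_time, curr_rollup_time], per tstzrange(..., '(]') — can be sketched in Python, using integers as stand-ins for minute timestamps (the class and function names are hypothetical):

```python
def select_minutes_to_roll_up(row_minutes, last_rollup, curr_rollup):
    """Mimic tstzrange(last_rollup_time, curr_rollup_time, '(]'):
    keep minutes strictly after the previous watermark and up to
    (and including) the current one, so repeated runs never
    double-count a minute."""
    return [m for m in row_minutes if last_rollup < m <= curr_rollup]

class Rollup:
    def __init__(self):
        self.latest_rollup = 0  # "a time long ago", like the 1901 seed row
        self.rolled = []        # minutes already aggregated

    def run(self, row_minutes, now_minute):
        window = select_minutes_to_roll_up(
            row_minutes, self.latest_rollup, now_minute)
        self.rolled.extend(window)
        self.latest_rollup = now_minute  # UPDATE latest_rollup SET minute = ...
        return window
```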
To invoke rollup_http_request() once a minute, add a crontab entry on the coordinator node; extensions such as pg_cron also allow scheduling recurring queries directly from within the database.
Query the rollup results:
SELECT site_id, ingest_time as minute, request_count, success_count, error_count, average_response_time_msec
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;
2. Expiring Old Data
Decide how long to retain data at each granularity, and use standard queries to delete expired data. These queries can be wrapped in a function and invoked at regular intervals from a cron job:
DELETE FROM http_request WHERE ingest_time < now() - interval '1 day';
DELETE FROM http_request_1min WHERE ingest_time < now() - interval '1 month';
Expiring data can be made even more efficient by using table range partitioning on top of Citus hash distribution.
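The two retention tiers above can be sketched as a pure function over timestamped rows (an illustration of the policy only, not of how Postgres deletes; the names are hypothetical):

```python
def expire(rows, now, retention_seconds):
    """Keep only rows within the retention window, mirroring
    DELETE ... WHERE ingest_time < now() - interval."""
    cutoff = now - retention_seconds
    return [r for r in rows if r["ingest_time"] >= cutoff]

DAY, MONTH = 86_400, 30 * 86_400

def apply_retention(raw_rows, rollup_rows, now):
    # raw events kept one day; one-minute rollups kept one month
    return expire(raw_rows, now, DAY), expire(rollup_rows, now, MONTH)
```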
Optimizing Statistics
A common problem in real-time analytics is approximate distinct counting: for example, how many unique visitors visited the site in the past month? Answering this exactly would require storing the complete list of historical visitors in the rollup tables, which is prohibitively expensive, so an approximate answer is what we need.
1. Applying the HLL Approximation Algorithm
Consider a global query such as the number of IP addresses that visited the site in the past month. Answered naively, this query ships lists of IP addresses from the worker nodes to the coordinator node for deduplication, which costs a lot of network traffic and computation. With HLLs (HyperLogLog sketches), we can speed the query up dramatically.
Because Citus is a PostgreSQL extension, it can take full advantage of the existing PostgreSQL extension ecosystem, for example hll:
CREATE EXTENSION hll;
Now we are ready to track IP addresses with HLL. First add a column to the http_request_1min table:
ALTER TABLE http_request_1min ADD COLUMN distinct_ip_addresses hll;
Adjust the rollup_http_request() function:
CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
curr_rollup_time timestamptz := date_trunc('minute', now());
last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
INSERT INTO http_request_1min (
site_id, ingest_time, request_count,
success_count, error_count, average_response_time_msec,
distinct_ip_addresses
) SELECT
site_id,
date_trunc('minute', ingest_time),
COUNT(1) as request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec,
hll_add_agg(hll_hash_text(ip_address)) AS distinct_ip_addresses
FROM http_request
WHERE date_trunc('minute', ingest_time) <@
tstzrange(last_rollup_time, curr_rollup_time, '(]')
GROUP BY 1, 2;
UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;
Read out the number of distinct IP addresses by calling the hll_cardinality() function:
SELECT site_id, ingest_time as minute, request_count,
success_count, error_count, average_response_time_msec,
hll_cardinality(distinct_ip_addresses) AS distinct_ip_address_count
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - interval '5 minutes';
HLL can also answer queries like the following, even after the raw data has been discarded: how many distinct IP addresses were seen over the past week?
SELECT hll_cardinality(hll_union_agg(distinct_ip_addresses))
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '1 week'::interval;
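To see why HLL sketches can be merged and still answer distinct counts, here is a minimal HyperLogLog in Python (a toy illustration of the idea behind the hll extension, not its actual algorithm variant or storage format):

```python
import hashlib
import math

P = 10                              # 2^10 = 1024 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard HLL bias constant

def _hash64(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def hll_new():
    return [0] * M

def hll_add(regs, value: str):
    """Like hll_add_agg(hll_hash_text(...)): bucket by the first P bits
    of the hash, record the longest run of leading zeros in the rest."""
    h = _hash64(value)
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    regs[idx] = max(regs[idx], rank)

def hll_union(a, b):
    """Like hll_union_agg: registerwise max merges two sketches."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_cardinality(regs) -> float:
    est = ALPHA * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * M and zeros:             # small-range correction
        est = M * math.log(M / zeros)        # fall back to linear counting
    return est
```

Because the union is a registerwise max, per-minute sketches stored in the rollup can be merged over any time window, which is exactly what the hll_union_agg query above relies on.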
2. Handling Unstructured Data
Postgres has good support for unstructured data types. Using them avoids adding a column for every feature, which would otherwise lead to rows with hundreds of sparsely filled columns. Continuing the rollup example from this section, we will record the number of visitors from each country by adding a JSONB column to the data model.
First add the country counters column:
ALTER TABLE http_request_1min ADD COLUMN country_counters JSONB;
Adjust the rollup_http_request() function:
CREATE OR REPLACE FUNCTION rollup_http_request() RETURNS void AS $$
DECLARE
curr_rollup_time timestamptz := date_trunc('minute', now());
last_rollup_time timestamptz := minute from latest_rollup;
BEGIN
INSERT INTO http_request_1min (
site_id, ingest_time, request_count,
success_count, error_count, average_response_time_msec,
country_counters
) SELECT
site_id,
date_trunc('minute', ingest_time),
COUNT(1) as request_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 1 ELSE 0 END) as success_count,
SUM(CASE WHEN (status_code between 200 and 299) THEN 0 ELSE 1 END) as error_count,
SUM(response_time_msec) / COUNT(1) AS average_response_time_msec,
jsonb_object_agg(request_country, country_count) AS country_counters
FROM (
SELECT *,
count(1) OVER (
PARTITION BY site_id, date_trunc('minute', ingest_time), request_country
) AS country_count
FROM http_request
) h
WHERE date_trunc('minute', ingest_time) <@
tstzrange(last_rollup_time, curr_rollup_time, '(]')
GROUP BY 1, 2;
UPDATE latest_rollup SET minute = curr_rollup_time;
END;
$$ LANGUAGE plpgsql;
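The jsonb_object_agg step above can be simulated in Python to show what ends up in country_counters (an in-memory sketch of the aggregation, not the SQL execution; the function name is hypothetical):

```python
from collections import Counter

def build_country_counters(rows):
    """Per (site_id, minute), count requests by request_country —
    the mapping that jsonb_object_agg stores in the JSONB column."""
    out = {}
    for row in rows:
        # truncate "YYYY-MM-DD HH:MM:SS" to the minute
        key = (row["site_id"], row["ingest_time"][:16])
        out.setdefault(key, Counter())[row["request_country"]] += 1
    return {key: dict(counter) for key, counter in out.items()}
```

Looking up a missing country with `.get(country, 0)` plays the role of COALESCE(country_counters->>'...', '0')::int in the query below.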
Query the number of requests from American visitors:
SELECT
request_count, success_count, error_count, average_response_time_msec,
COALESCE(country_counters->>'USA', '0')::int AS american_visitors
FROM http_request_1min
WHERE ingest_time > date_trunc('minute', now()) - '5 minutes'::interval;