Resource Isolation Between Applications
For example, while a user is working on streams under a given App, they can only see the StreamJobs belonging to that App; after leaving that App, they still cannot see StreamJobs under other Apps, only those of the App they are currently in.
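As a sketch of this isolation (the CREATE APPLICATION syntax follows the checkpoint example later in this document; treating APPPROPERTIES as optional, and the visibility behavior shown in the comments, are assumptions):

```sql
-- Two separate applications; StreamJobs are scoped to the application
-- under which they were created.
CREATE APPLICATION app_a;
CREATE APPLICATION app_b;
-- A StreamJob defined while working under app_a belongs to app_a only;
-- a user working under app_b cannot see or manage it.
```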
Statistics over Streams
Emily's first task from her boss: count the number of visits to the website. Assume the source data looks like this:
27.0.1.125,www.transwarp.io/home.html,2016-8-14 20:12:31.132
54.231.66.16,www.transwarp.io/product.html,2016-8-14 20:43:31.213
72.21.203.5,www.transwarp.io/case.html,2016-8-14 20:45:31.132
207.241.224.2,www.transwarp.io/product.html,2016-8-14 20:46:15.540
12.129.206.133,www.transwarp.io/product.html,2016-8-14 20:47:21.332
208.111.148.7,www.transwarp.io/home.html,2016-8-14 20:50:31.876
- Emily creates a stream and a table:
CREATE STREAM accesslog(ip STRING, url STRING, time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES("topic"="accesslog", "kafka.zookeeper"="172.16.1.128:2181",
"kafka.broker.list"="172.16.1.128:9092");
- Count, by system time, how many visits each url receives every 10 seconds:
CREATE TABLE result(url STRING, count INT);
CREATE STREAM waccesslog AS SELECT * FROM accesslog STREAMWINDOW (LENGTH '10' SECOND SLIDE '10'
SECOND);
INSERT INTO result SELECT url, COUNT(*) FROM waccesslog GROUP BY url;
- Count, by message (event) time, how many visits each url receives every 10 seconds:
CREATE TABLE result2(url STRING, count INT);
CREATE STREAM accesslog2(ip STRING, url STRING, time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("topic"="accesslog", "kafka.zookeeper"="172.16.1.128:2181",
"kafka.broker.list"="172.16.1.128:9092", "timefield"="time",
"timeformat"="yyyy-MM-dd HH:mm:ss.SSS");
CREATE STREAM waccesslog2 AS SELECT * FROM accesslog2
STREAMWINDOW sw AS (SEPARATED BY time LENGTH '10' SECOND SLIDE '10' SECOND);
INSERT INTO result2 SELECT url, COUNT(*) FROM waccesslog2 GROUP BY url;
- Multi-stream joins split by a time field
Since implementing on Slipstream is quick, Emily receives a new task: migrate the existing SQL, splitting two streams and one dimension table by a time field in the data. There are two streams, futures prices and spot prices, each with three fields: ID, time, and price. For example:
qihuo (futures):
2016-8-14 20:50:00,1,50.20
2016-8-14 20:50:00,2,65.40
2016-8-14 20:50:00,3,31.30
2016-8-14 20:50:01,1,50.80
2016-8-14 20:50:01,2,65.10
2016-8-14 20:50:01,3,29.90
…
xianhuo (spot):
2016-8-14 20:50:00,1,55.10
2016-8-14 20:50:00,2,67.20
2016-8-14 20:50:00,3,33.10
2016-8-14 20:50:01,1,55.20
2016-8-14 20:50:01,2,66.70
2016-8-14 20:50:01,3,30.30
Emily first creates the two streams:
CREATE STREAM qihuo2(timestamp STRING, id STRING, price DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('topic'='qihuo', 'timefield'='timestamp',
'timeformat'='yyyy-MM-dd HH:mm:ss',
"kafka.zookeeper"="tw-node127:2181", "kafka.broker.list"="172.16.1.128:9092");
CREATE STREAM xianhuo2(timestamp STRING, id STRING, price DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('topic'='xianhuo', 'timefield'='timestamp',
'timeformat'='yyyy-MM-dd HH:mm:ss',
"kafka.zookeeper"="tw-node127:2181", "kafka.broker.list"="172.16.1.128:9092");
Join the two streams by message time and compute the difference between the spot and futures prices, transforming them into a stream with the fields (time, ID, spot-futures difference):
CREATE STREAM transform_stream2 AS
SELECT
qihuo.timestamp,
qihuo.id,
(xianhuo.price - qihuo.price) AS diff
FROM
qihuo2 qihuo JOIN xianhuo2 xianhuo
ON
qihuo.timestamp=xianhuo.timestamp
AND
qihuo.id=xianhuo.id;
On the stream above, take a 3-minute window every 1 second, transforming it into a "last 3 minutes" stream:
CREATE STREAM window_diff_stream AS
SELECT * FROM transform_stream2
STREAMWINDOW w1 AS (LENGTH '3' MINUTE SLIDE '1' SECOND);
Finally, compute, over the last 3 minutes, the maximum and minimum spot-futures difference per ID, and the gap between them:
CREATE VIEW max_min_diff_window_stream AS
SELECT maxTable.id, maxTable.diff AS maxDiff, maxTable.timestamp AS maxTime,
       minTable.diff AS minDiff, minTable.timestamp AS minTime,
       (maxTable.diff - minTable.diff) AS maxMinDiff, minTable.maxtime AS calTime
FROM
  (SELECT id, timestamp, diff, maxtime FROM
    (SELECT id, timestamp, diff,
            row_number() OVER (PARTITION BY id ORDER BY diff) AS minDiff,
            max(timestamp) OVER (PARTITION BY id) AS maxtime
     FROM window_diff_stream) t1
   WHERE minDiff=1) minTable
JOIN
  (SELECT id, timestamp, diff FROM
    (SELECT id, timestamp, diff,
            row_number() OVER (PARTITION BY id ORDER BY diff DESC) AS maxDiff
     FROM window_diff_stream) t2
   WHERE maxDiff=1) maxTable
ON minTable.id = maxTable.id;
Emily triggers the computation:
CREATE TABLE result3(maxTableId STRING, maxDiff DOUBLE, maxTime STRING, minDiff DOUBLE,
minTime STRING, maxMinDiff DOUBLE, calTime STRING);
SET streamsql.use.eventtime=true;
INSERT INTO result3 SELECT * FROM max_min_diff_window_stream;
PL/SQL over Streams
When her boss learns that Slipstream also supports PL/SQL, he asks her to investigate migrating the existing PL/SQL code to Slipstream. Besides the futures and spot tables above, there is now a transaction table containing ID, time, and transaction volume; Emily needs to accumulate, per ID, the transaction volume from market open to the current time. A second requirement: when maxmindiff in the max_min_diff_window_stream stream is greater than or equal to 50, trigger an alert by selecting the top 20 records by accumulated value from day_sum and inserting them into the warning table:
CREATE STREAM transaction_stream(timestamp STRING, id STRING, transaction DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('topic'='transaction', "kafka.zookeeper"="tw-node127:2181");
CREATE TABLE day_sum(id STRING, sd STRING, total DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE warm_transaction (id STRING, timestamp STRING, total DOUBLE);
CREATE VIEW transaction_sum AS SELECT id, timestamp, sum(s.transaction) total FROM
transaction_stream s GROUP BY id, timestamp;
SET plsql.show.sqlresults=true;
SET stream.enabled=true;
SET stream.tables=qihuo2,xianhuo2,transaction_stream;
DECLARE
threshold_count INT := 0;
BEGIN
INSERT INTO day_sum
SELECT id, sd, CASE WHEN isnull(total2) THEN total1 ELSE (total1 + total2) END total
FROM (SELECT t1.id id, t1.timestamp sd, t1.total total1, t2.total total2
      FROM transaction_sum t1 LEFT JOIN day_sum t2
      ON t1.id=t2.id
      AND (to_unix_timestamp(t2.sd, 'yyyy-MM-dd HH:mm:ss') + 1)
          =unix_timestamp(t1.timestamp, 'yyyy-MM-dd HH:mm:ss')) t;
SELECT count(*) INTO threshold_count FROM max_min_diff_window_stream WHERE maxmindiff >= 50;
IF threshold_count > 0 THEN
  INSERT INTO warm_transaction SELECT id, sd, total FROM day_sum ORDER BY total DESC LIMIT 20;
END IF;
END;
Persisting StreamJobs
By now Emily can use Slipstream fluently, but one day, after she had configured some settings, the application terminated unexpectedly, so she began persisting her work with StreamJob:
CREATE STREAMJOB access_log_streamjob AS ("INSERT INTO result SELECT url, COUNT(*) FROM waccesslog
GROUP BY url");
START STREAMJOB access_log_streamjob;
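Because the job definition is persisted under a name, it can be managed after a restart. As a sketch (LIST STREAMJOBS and STOP STREAMJOB are assumed commands, not shown elsewhere in this walkthrough; the exact syntax may differ):

```sql
-- Inspect persisted jobs (assumed command)
LIST STREAMJOBS;
-- Stop the running job by name (assumed command)
STOP STREAMJOB access_log_streamjob;
-- The persisted definition survives, so the job can simply be started again
START STREAMJOB access_log_streamjob;
```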
Email Alerts
Configure /etc/inceptor1/conf/alert4j.properties on every machine:
alert4j.service=email
email.server.host=smtp.exmail.qq.com
email.server.port=465
email.server.ssl=true
email.validate=true
email.sender.username=prod@xxx.io
email.sender.password=password
email.from.address=prod@xxx.io
email.to.addresses=stream-alert@xxx.io
When a problem occurs, the system sends an email to stream-alert@xxx.io containing the key information about the problem.
High Availability
Through the Transwarp Manager interface, configure an additional InceptorServer and set its role to Standby. Once configured, start it.
Use checkpointing and a write-ahead log (WAL) to guarantee no data loss:
Step 1: through the Transwarp Manager interface, modify the InceptorServer configuration by adding the "stream.driver.checkpoint.dir" option and setting it to a directory on HDFS. Restart Inceptor after the change.
Step 2: when creating an application, specify the application's checkpoint directory, set to a directory on HDFS:
CREATE APPLICATION app1 WITH APPPROPERTIES("application.checkpoint.dir"="/tmp/app1/",
"application.enable.wal"="true");
Or, for an existing application, set the two properties:
ALTER APPLICATION app1 SET APPPROPERTIES("application.checkpoint.dir"="/tmp/app1/");
ALTER APPLICATION app1 SET APPPROPERTIES("application.enable.wal"="true");
Holodesk Configuration
Emily's boss tells her that the website-visit statistics are very useful and that several other teams are consuming the results. However, because the results currently live in a Hive table, queries are too slow, and he asks her to find a way to speed them up. Emily recalls that the Holodesk component supports fast analytical queries, so she switches the result table to Holodesk and sets the relevant parameters:
CREATE TABLE holo(url STRING, count INT) STORED AS HOLODESK
TBLPROPERTIES("holodesk.window.length"="100000", "holodesk.window.timeout"="100000");
CREATE STREAM w_accesslog AS SELECT * FROM accesslog
STREAMWINDOW sw AS (LENGTH '10' SECOND SLIDE '10' SECOND)
TBLPROPERTIES("kafka.zookeeper"="tw-node127:2181");
INSERT INTO holo SELECT url, COUNT(*) FROM w_accesslog GROUP BY url;
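The other teams can then query the Holodesk result table directly with ordinary SQL, for example (a minimal illustrative query against the `holo` table defined above):

```sql
-- Most-visited URLs in the latest results
SELECT url, count FROM holo ORDER BY count DESC;
```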
Slipstream Parameter Summary
Parameters are set with the generic syntax: SET <parameter>=<value>;
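For reference, the session parameters actually used in this walkthrough, collected in one place (values are the ones from the examples above):

```sql
-- Window by message (event) time instead of system time
SET streamsql.use.eventtime=true;
-- Show SQL results during PL/SQL execution
SET plsql.show.sqlresults=true;
-- Enable stream mode for PL/SQL and declare the participating streams
SET stream.enabled=true;
SET stream.tables=qihuo2,xianhuo2,transaction_stream;
```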