Note: this article is a summary of commonly used knowledge from my reading of the official Flink documentation.
Preface
Flink is a framework and processing engine for distributed processing of event data streams; a bounded data set is handled as a batch job. It is one of the mainstream compute engines in big data.
I. Flink SQL Applications
1. Examples
-- Create a table --
CREATE TABLE employee_information (
    emp_id INT,
    name VARCHAR,
    dept_id INT
) WITH (
    'connector' = 'filesystem',
    'path' = '/path/to/something.csv',
    'format' = 'csv'
);

SELECT * FROM employee_information WHERE dept_id = 1;
-- Windows & window aggregation --
-- TUMBLE assigns each row to exactly one fixed, non-overlapping 10-minute window.
SELECT window_start, window_end, SUM(price)
FROM TABLE(
    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;

-- HOP windows are 10 minutes long but slide every 5 minutes, so they overlap
-- and a row can belong to more than one window.
SELECT window_start, window_end, SUM(price)
FROM TABLE(
    HOP(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
-- Group aggregation --
-- Unbounded aggregation: state for every distinct order_id is kept until a
-- state TTL evicts it (see the note on state growth below).
SELECT COUNT(*)
FROM Orders
GROUP BY order_id;
--join--
SELECT *
FROM Orders
INNER JOIN Product
ON Orders.product_id = Product.id;
SELECT *
FROM Orders
LEFT JOIN Product
ON Orders.product_id = Product.id
The biggest difference between Flink SQL and Hive SQL is that the data carries time semantics, which must be kept in mind during development. Operations that hold intermediate state, such as joins, aggregations, and sorting, are prone to unbounded state growth on a stream, so setting a sensible state TTL and choosing appropriate window sizes and ranges is especially important; a configuration sketch follows below.
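As an illustration, here is a minimal sketch of configuring idle state retention on a TableEnvironment in Java; the one-hour value is an assumption to be tuned per workload, and in the SQL client the table.exec.state.ttl option plays the same role:

import java.time.Duration;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IdleStateRetentionExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Evict per-key state that has been idle for one hour (example value).
        // This bounds the state held by unbounded GROUP BY aggregations and
        // regular stream-stream joins like the ones shown above.
        tEnv.getConfig().setIdleStateRetention(Duration.ofHours(1));

        // Statements submitted through this environment now inherit the
        // retention setting, e.g.:
        // tEnv.executeSql("SELECT COUNT(*) FROM Orders GROUP BY order_id");
    }
}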
2. Application scenarios
II. Flink API Applications
1. Example
Quickly bootstrap a Maven project:
$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-walkthrough-datastream-java \
    -DarchetypeVersion=1.15.0 \
    -DgroupId=frauddetection \
    -DartifactId=frauddetection \
    -Dversion=0.1 \
    -Dpackage=spendreport \
    -DinteractiveMode=false
package spendreport;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;
import org.apache.flink.walkthrough.common.sink.AlertSink;
import org.apache.flink.walkthrough.common.source.TransactionSource;

public class FraudDetectionJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: an unbounded stream of credit card transactions.
        DataStream<Transaction> transactions = env
            .addSource(new TransactionSource())
            .name("transactions");

        // Key by account so each account's state is processed independently.
        DataStream<Alert> alerts = transactions
            .keyBy(Transaction::getAccountId)
            .process(new FraudDetector())
            .name("fraud-detector");

        // Sink: publish the alerts.
        alerts
            .addSink(new AlertSink())
            .name("send-alerts");

        env.execute("Fraud Detection");
    }
}
Reference: Fraud Detection with the DataStream API | Apache Flink
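The job above references a FraudDetector that the snippet does not define. Following the linked walkthrough, a minimal version (omitting the walkthrough's timer-based flag expiry) looks roughly like this:

package spendreport;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.walkthrough.common.entity.Alert;
import org.apache.flink.walkthrough.common.entity.Transaction;

public class FraudDetector extends KeyedProcessFunction<Long, Transaction, Alert> {

    private static final double SMALL_AMOUNT = 1.00;
    private static final double LARGE_AMOUNT = 500.00;

    // Per-account flag: was the previous transaction suspiciously small?
    private transient ValueState<Boolean> flagState;

    @Override
    public void open(Configuration parameters) {
        flagState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("flag", Types.BOOLEAN));
    }

    @Override
    public void processElement(
            Transaction transaction,
            Context context,
            Collector<Alert> collector) throws Exception {
        Boolean lastWasSmall = flagState.value();
        if (lastWasSmall != null) {
            // A small transaction immediately followed by a large one: alert.
            if (transaction.getAmount() > LARGE_AMOUNT) {
                Alert alert = new Alert();
                alert.setId(transaction.getAccountId());
                collector.collect(alert);
            }
            flagState.clear();
        }
        if (transaction.getAmount() < SMALL_AMOUNT) {
            flagState.update(true);
        }
    }
}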
III. Flink Architecture

JobManager: the central coordinating component of a Flink cluster. It schedules jobs, collects their status information, and manages the TaskManager worker nodes.
TaskManager: the worker process that actually executes the job. Each TaskManager provides a number of task slots (e.g. taskmanager.numberOfTaskSlots: 4 in the compose file below), and the subtasks of a job run as threads inside those slots.
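To make the relationship between parallelism, subtasks, and slots concrete, here is a minimal sketch (the job and the parallelism value of 4 are illustrative assumptions): with parallelism 4, each operator is split into 4 subtasks, and the cluster needs enough free slots across its TaskManagers to schedule them.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Operators default to 4 parallel subtasks (a non-parallel source
        // such as fromElements still runs with a single subtask). One
        // TaskManager with taskmanager.numberOfTaskSlots: 4 can host them.
        env.setParallelism(4);

        env.fromElements(1, 2, 3, 4, 5)
           .map(x -> x * x)
           .print();

        env.execute("parallelism-demo");
    }
}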
IV. Flink Deployment and Examples
1. Hadoop integration, using Flink as a SQL execution engine
2. Kubernetes integration, submitting applications through CI/CD
3. Local mode
Streaming ETL for MySQL and Postgres with Flink CDC — CDC Connectors for Apache Flink® documentation
1> Prepare the Docker environment
mkdir flink_cdc_etl && cd flink_cdc_etl
touch docker-compose.yml   # fill in the flink_cdc_etl/docker-compose.yml content below
docker-compose -f ./docker-compose.yml up -d --build
mkdir flink && cd flink
touch docker-compose.yml   # fill in the flink/docker-compose.yml content below
docker-compose -f ./docker-compose.yml up -d --build
docker ps                  # verify that all containers are running
flink_cdc_etl/docker-compose.yml
version: '2.1'
services:
  postgres:
    image: debezium/example-postgres:1.1
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  elasticsearch:
    image: elastic/elasticsearch:7.6.0
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ports:
      - "9200:9200"
      - "9300:9300"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  kibana:
    image: elastic/kibana:7.6.0
    ports:
      - "5601:5601"
flink/docker-compose.yml
version: "2.2"
networks:
  flink_cdc_etl_default:
    external: true
services:
  jobmanager:
    image: flink:1.13.2-java8
    networks:
      - "flink_cdc_etl_default"
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
  taskmanager:
    image: flink:1.13.2-java8
    networks:
      - "flink_cdc_etl_default"
    depends_on:
      - jobmanager
    command: taskmanager
    scale: 1
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 4
Open http://localhost:5601/ (Kibana) and http://localhost:8081/ (Flink Web UI) to verify that the containers started successfully.
Copy the required connector jars into the Flink containers and restart them (download the jars from the Aliyun Maven repository; use the xxx-cdc 2.x versions).
Execute the MySQL SQL, the Postgres SQL, and the Flink SQL from the tutorial in turn; a sketch of the Flink SQL step follows below.
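As an illustration of the Flink SQL step, here is a minimal sketch that registers the tutorial's MySQL CDC source table from a Java program rather than the SQL client. The schema is abbreviated, the connection options follow the linked tutorial, and the MySQL CDC connector jar copied above is assumed to be on the classpath:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcEtlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // MySQL CDC source: every insert/update/delete on mydb.orders becomes
        // a changelog row in this table (schema abbreviated).
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id INT," +
                "  customer_name STRING," +
                "  price DECIMAL(10, 5)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'localhost'," +
                "  'port' = '3306'," +
                "  'username' = 'root'," +
                "  'password' = '123456'," +
                "  'database-name' = 'mydb'," +
                "  'table-name' = 'orders'" +
                ")");

        // Continuously print the changelog derived from the MySQL binlog.
        tEnv.executeSql("SELECT * FROM orders").print();
    }
}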
(Screenshots: the running Flink jobs and the synchronized data in Kibana.)
After changing the source data, the Flink query and the Kibana query both show the updated results in real time (screenshots omitted).
V. Extensions
1. Flink real-time data warehouse
2. Flink + data lake (Hudi)