1、Flink overview and a Java quick start
1.1 Overview
Flink is a mainstream stream processor; official site: https://flink.apache.org/
It continuously picks up changing data, for example by listening on a socket port or watching a log, and processes each change as it arrives. Throughout the pipeline the data exists as a stream: unlike batch processing, every change is handled immediately rather than waiting for a complete data set.
Goals: low latency and high throughput.
1.2 Java example - DataSet API
/**
 * The classic DataSet API; no longer recommended since Flink 1.12
 * Batch processing
 *
 * @param args arguments
 * @throws Exception on failure
 */
public static void main(String[] args) throws Exception {
    // 1. Create an execution environment
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // 2. Read data from a file
    DataSource<String> dataSource = env.readTextFile("string/hellow.txt");
    // 3. Tokenize each line and convert the words into 2-tuples
    FlatMapOperator<String, Tuple2<String, Long>> wordTuple = dataSource.flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
        // Split the line into words
        String[] words = line.split(" ");
        // Emit each word as a (word, 1) tuple
        for (String word : words) {
            out.collect(Tuple2.of(word, 1L));
        }
    }).returns(Types.TUPLE(Types.STRING, Types.LONG));
    // 4. Group by word (tuple field 0)
    UnsortedGrouping<Tuple2<String, Long>> wordGroup = wordTuple.groupBy(0);
    // 5. Sum the counts (tuple field 1) within each group
    AggregateOperator<Tuple2<String, Long>> sum = wordGroup.sum(1);
    // 6. Print the result
    sum.print();
}
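To make the dataflow concrete, here is what flatMap → groupBy(0) → sum(1) computes, sketched in plain Java with no Flink dependency (the class name and input lines are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {
    // Mimics flatMap -> groupBy(0) -> sum(1): split every line into words,
    // then accumulate one running count per distinct word.
    public static Map<String, Long> count(String... lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello world", "hello flink"));
    }
}
```

The batch job produces exactly one final count per word, because the whole bounded input is available before the aggregation is printed.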
1.3 Java example - DataStream API
public static void main(String[] args) throws Exception {
    // 1. Create a streaming execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // 2. Read data from a socket
    DataStreamSource<String> dataStreamSource = env.socketTextStream("localhost", 10001);
    // 3. Tokenize each line and convert the words into 2-tuples
    SingleOutputStreamOperator<Tuple2<String, Long>> wordTuple = dataStreamSource.flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
        // Split the line into words
        String[] words = line.split(" ");
        // Emit each word as a (word, 1) tuple
        for (String word : words) {
            out.collect(Tuple2.of(word, 1L));
        }
    }).returns(Types.TUPLE(Types.STRING, Types.LONG));
    // 4. Key the stream by word (tuple field 0)
    KeyedStream<Tuple2<String, Long>, String> wordTupleKeyedStream = wordTuple.keyBy(data -> data.f0);
    // 5. Sum the counts (tuple field 1) within each key
    SingleOutputStreamOperator<Tuple2<String, Long>> sum = wordTupleKeyedStream.sum(1);
    // 6. Print the result
    sum.print();
    // 7. Start execution: a streaming job runs until cancelled, so execute() must be called explicitly to launch it
    env.execute();
}
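Unlike the batch job, the keyed stream emits an updated count for every incoming element rather than one final result. A plain-Java sketch of that incremental behavior (the class name is illustrative; no Flink dependency):

```java
import java.util.HashMap;
import java.util.Map;

public class RunningCountSketch {
    // Per-key state, as keyBy(f0).sum(1) maintains internally per word.
    private final Map<String, Long> state = new HashMap<>();

    // For each incoming word, update its state and emit the new
    // (word, count) pair downstream immediately.
    public String process(String word) {
        long c = state.merge(word, 1L, Long::sum);
        return word + "=" + c;
    }

    public static void main(String[] args) {
        RunningCountSketch op = new RunningCountSketch();
        // One output record per input event: counts grow as events arrive.
        for (String w : new String[]{"hello", "flink", "hello"}) {
            System.out.println(op.process(w));
        }
    }
}
```

This is why the streaming word count prints intermediate results like (hello, 1) followed later by (hello, 2), while the DataSet version prints only the final totals.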
2、Flink cluster installation and startup
Environment: Java - JDK 8
yum install -y java-1.8.0-openjdk-devel.x86_64
2.1 Download the matching version
https://flink.apache.org/downloads.html#all-stable-releases
2.2 Extract
tar -zxvf flink-1.16.0-bin-scala_2.12.tgz -C /opt/module/
Directory layout after extraction (the listing below was captured from an older 1.13.2 install; the layout is the same):
[root@hecs-74580 flink]# ls
flink-1.13.2 flink-1.13.2-bin-scala_2.11.tgz
[root@hecs-74580 flink]# cd flink-1.13.2
[root@hecs-74580 flink-1.13.2]# ll
total 496
drwxr-xr-x 2 502 games 4096 Dec 10 23:10 bin
drwxr-xr-x 2 502 games 4096 Jul 23 2021 conf
drwxr-xr-x 7 502 games 4096 Dec 10 23:10 examples
drwxr-xr-x 2 502 games 4096 Dec 10 23:10 lib
-rw-r--r-- 1 502 games 11357 May 31 2021 LICENSE
drwxr-xr-x 2 502 games 4096 Dec 10 23:10 licenses
drwxr-xr-x 2 502 games 4096 Jul 2 2021 log
-rw-r--r-- 1 502 games 455192 Jul 23 2021 NOTICE
drwxr-xr-x 3 502 games 4096 Dec 10 23:10 opt
drwxr-xr-x 10 502 games 4096 Dec 10 23:10 plugins
-rw-r--r-- 1 502 games 1309 May 31 2021 README.txt
2.3 Start
/opt/module/flink-1.16.0/bin/start-cluster.sh
Or, from inside the extracted directory:
cd flink-1.16.0/
bin/start-cluster.sh
2.4 Stop
bin/stop-cluster.sh
2.5 Cluster configuration files
/opt/module/flink-1.16.0/conf/
This directory contains the masters and workers files, which point to the master and worker nodes.
localhost:8081 is the address of the web UI once the cluster is started.
[root@hecs-74580 conf]# cat masters
localhost:8081
[root@hecs-74580 conf]# cat workers
localhost
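For a multi-node cluster, each file lists one node per line. A sketch of a hypothetical layout with one master and two workers (hostnames are placeholders):

```
# conf/masters: JobManager host and web UI port, one per line
node1:8081

# conf/workers: TaskManager hosts, one per line
node2
node3
```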
There are two kinds of processes: JobManager and TaskManager.
To configure the JobManager address, edit /opt/module/flink-1.16.0/conf/flink-conf.yaml:
cd conf/
vim flink-conf.yaml
In that file, set jobmanager.rpc.address to the JobManager's IP address or domain name (the RPC port is a separate key, jobmanager.rpc.port).
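A minimal sketch of the relevant flink-conf.yaml keys (the address value is a placeholder); note that the host and the port are separate keys:

```yaml
# host (IP or domain name) at which the JobManager is reachable
jobmanager.rpc.address: 192.168.1.10
# JobManager RPC port (default 6123)
jobmanager.rpc.port: 6123
# web UI port (default 8081)
rest.port: 8081
```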
3、Flink-API
3.1 API layers
From highest to lowest level of abstraction: SQL, Table API, DataStream/DataSet API, and the low-level ProcessFunction for stateful stream processing.
3.2 DataStream-API
See section 1.3.
3.3 SQL API
General form: CREATE TEMPORARY TABLE tableName ... WITH ( 'connector' = ... )
Example:
CREATE TABLE tableName(
id BIGINT,
name STRING,
updateTime STRING,
version BIGINT,
PRIMARY KEY(id) NOT ENFORCED
) WITH (
'connector' = 'mongodb-cdc',
'connection.options' = 'authSource=dbName',
'hosts' = '0.0.0.0:6795',
'username' = 'userName_r',
'password' = 'xxxxx',
'database' = 'dbName',
'collection' = 'mongodbTableName',
'copy.existing' = 'false'
);
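With the table registered, the change stream can be consumed with ordinary SQL. A minimal sketch (the sink table print_sink is hypothetical and assumed to be created separately):

```sql
-- continuous query over the CDC source
SELECT id, name, version FROM tableName;

-- or pipe the change stream into a sink table
INSERT INTO print_sink
SELECT id, name, updateTime, version FROM tableName;
```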