Flink website: https://flink.apache.org/
Software used: IntelliJ IDEA Community Edition
Core APIs:
DataSet: designed for batch (offline) data, with many APIs tailored specifically to batch processing. env: ExecutionEnvironment
DataStream: generally used for streaming data, but can also process batch data. env: StreamExecutionEnvironment
[This time we use DataStream]
Create SourceTest
package cn.tedu.datastream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class SourceTest {
public static void main(String[] args) throws Exception {
//1. Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2. Get the data source
DataStreamSource<Integer> source = env.fromElements(1, 2, 3, 4, 5);
//3. Transform the data
source.map(x -> x*10) // multiply every incoming number by 10
//4. Output the result
.print();
//5. Trigger program execution
env.execute();
}
}
Notes:
- Unlike DataSet output, each result line is prefixed with a number and "->". This is the ID of the parallel subtask (thread) that produced the line and can be ignored.
To save the output as a text file, change step 4 to:
.writeAsText("a.txt").setParallelism(1); // setParallelism(1) sets the parallelism so only one thread runs; without it the output is a directory named a.txt with one file per subtask
Create TransformationTest
Goal: take the even numbers from the data, multiply each by 10, and print the result.
package cn.tedu.datastream;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class TransformationTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> source = env.fromElements(1, 2, 3, 4);
source.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) throws Exception {
return value % 2 == 0; // keep values whose remainder is 0, i.e. the even numbers
}
}).map(x -> x * 10).print();
env.execute();
}
}
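To see the filter-then-map logic in isolation, here is a plain-Java (non-Flink) sketch of the same transformation using the Stream API; the class name `EvenTimesTen` is just illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

public class EvenTimesTen {
    // same logic as the Flink pipeline: keep even numbers, multiply by 10
    static List<Integer> apply(List<Integer> input) {
        return input.stream()
                .filter(v -> v % 2 == 0) // value % 2 == 0 keeps the even numbers
                .map(v -> v * 10)        // multiply each survivor by 10
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(apply(List.of(1, 2, 3, 4))); // prints [20, 40]
    }
}
```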
Processing streaming data
Exercise 1
Prerequisite: open three terminal windows and start Kafka, a console producer, and a console consumer.
Goal: anything typed into the producer is received by both the consumer and IDEA; each time a number is entered in the producer, IDEA prints the running sum of all numbers entered so far.
package cn.tedu.datastream;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class ConnKafkaTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2. Get the data source: read from Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.65.161:9092");
properties.setProperty("group.id", "test");
DataStream<String> source = env
.addSource(new FlinkKafkaConsumer<>("flux", new SimpleStringSchema(), properties));
//3. Transform the data
source.map(new MapFunction<String, Tuple2<String,Integer>>() { // convert to a tuple: keyBy uses fields f0/f1, which only tuples have
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
return new Tuple2<>("num", Integer.parseInt(value));
//("num",1)
//("num",2)
//("num",3)
//("num",4)
}
}).keyBy(0).sum(1) // keyBy(0) groups on "num"; keying on field 1 (the values 1, 2, 3, 4) would split the stream into four partitions
//4. Output the result
.print();
//5. Trigger program execution
env.execute();
}
}
However, as written the job throws an exception whenever a non-numeric string is entered, so it can be improved.
Add a filter in front of the map:
//3. Transform the data
source.filter(new FilterFunction<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.matches("[0-9]+"); // don't use * here: an empty input line would match [0-9]* and then crash Integer.parseInt
}
})
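The difference between "[0-9]+" and "[0-9]*" is easy to verify in plain Java; the class name `NumberFilter` is illustrative:

```java
public class NumberFilter {
    // the same check the FilterFunction performs
    static boolean isNumeric(String value) {
        return value.matches("[0-9]+"); // one or more digits, and nothing else
    }

    public static void main(String[] args) {
        System.out.println(isNumeric("42"));      // true
        System.out.println(isNumeric("4a2"));     // false: matches() requires the whole string to be digits
        System.out.println(isNumeric(""));        // false: + needs at least one digit
        System.out.println("".matches("[0-9]*")); // true: with * an empty line slips through, and Integer.parseInt("") then throws
    }
}
```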
Exercise 2
Windowed computation:
package cn.tedu.datastream;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class ConnKafkaTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2. Get the data source: read from Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.65.161:9092");
properties.setProperty("group.id", "test");
DataStream<String> source = env
.addSource(new FlinkKafkaConsumer<>("flux", new SimpleStringSchema(), properties));
//3. Transform the data
source.filter(new FilterFunction<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.matches("[0-9]+"); // don't use * here: an empty input line would match [0-9]* and then crash Integer.parseInt
}
})
.map(new MapFunction<String, Tuple2<String,Integer>>() { // convert to a tuple: keyBy uses fields f0/f1, which only tuples have
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
return new Tuple2<>("num", Integer.parseInt(value));
//("num",1)
//("num",2)
//("num",3)
//("num",4)
}
}).keyBy(0).timeWindow(Time.seconds(5)) // 5-second tumbling window; the sum is computed within each window
.sum(1)
//4. Output the result
.print();
//5. Trigger program execution
env.execute();
}
}
As you can see, after a number is entered in the producer the result appears up to five seconds later, and numbers from earlier input are no longer added in, unless two numbers are entered quickly enough to fall into the same window.
They are not added together because each tumbling window is summed independently of the windows before and after it.
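The tumbling-window behavior can be sketched without Flink: a tumbling window of size w assigns each element to exactly one bucket by integer-dividing its timestamp by w, so elements in different buckets are never summed together. The names below are illustrative:

```java
public class TumblingBucket {
    // a tumbling window of size windowMs assigns each timestamp to exactly one window index
    static long windowIndex(long timestampMs, long windowMs) {
        return timestampMs / windowMs; // integer division: same index => same window => summed together
    }

    public static void main(String[] args) {
        System.out.println(windowIndex(1000, 5000)); // 0: first window [0, 5000)
        System.out.println(windowIndex(4999, 5000)); // 0: same window, added to the previous value
        System.out.println(windowIndex(5001, 5000)); // 1: next window, the sum starts over
    }
}
```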
Exercise 3
Goal: sum the odd numbers and the even numbers separately.
package cn.tedu.datastream;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class ConnKafkaTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2. Get the data source: read from Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.65.161:9092");
properties.setProperty("group.id", "test");
DataStream<String> source = env
.addSource(new FlinkKafkaConsumer<>("flux", new SimpleStringSchema(), properties));
//3. Transform the data
source.filter(new FilterFunction<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.matches("[0-9]+"); // don't use * here: an empty input line would match [0-9]* and then crash Integer.parseInt
}
})
.map(new MapFunction<String, Tuple2<String,Integer>>() { // convert to a tuple: keyBy uses fields f0/f1, which only tuples have
@Override
public Tuple2<String, Integer> map(String value) throws Exception {
int v = Integer.parseInt(value); // parse the String to an int first so the remainder can be computed
if (v % 2 == 0){
return new Tuple2<>("even",v);
}else {
return new Tuple2<>("odd",v);
}
}
}).keyBy(0).timeWindow(Time.seconds(5)) // 5-second tumbling window; the sum is computed within each window
.sum(1)
//4. Output the result
.print();
//5. Trigger program execution
env.execute();
}
}
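The key-assignment branch of the map above can be checked on its own in plain Java; the class name `OddEvenKey` is illustrative:

```java
public class OddEvenKey {
    // same branch as the MapFunction: even values go to the "even" group, odd values to "odd"
    static String keyFor(String value) {
        int v = Integer.parseInt(value); // parse first, then test the remainder
        return v % 2 == 0 ? "even" : "odd";
    }

    public static void main(String[] args) {
        System.out.println(keyFor("4")); // even
        System.out.println(keyFor("7")); // odd
    }
}
```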
Exercise 4
Goal: add up the numbers that share the same place name in the input.
张飞|河北|1500
孙悟空|湖北|1550
唐僧|河北|2200
辛普森|河南|1900
奥特曼|河南|5000
蜘蛛侠|湖北|2200
package cn.tedu.datastream;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class KafkaTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.65.161:9092");
properties.setProperty("group.id", "test");
DataStream<String> source = env
.addSource(new FlinkKafkaConsumer<>("flux", new SimpleStringSchema(), properties));
source.map(new MapFunction<String, Tuple2<String,Integer>>() {
@Override
public Tuple2<String,Integer> map(String value) throws Exception {
String[] s = value.split("\\|");
return new Tuple2<>(s[1],Integer.parseInt(s[2]));
}
})
// .keyBy(0).sum(1)
.print();
env.execute();
}
}
Enter the data in the producer and the parsed tuples are printed in IDEA; uncomment the .keyBy(0).sum(1) line to get the per-province totals.
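What the commented-out .keyBy(0).sum(1) would compute can be sketched in plain Java with a map from province to running sum; the class name `ProvinceSum` is illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProvinceSum {
    // plain-Java equivalent of .keyBy(0).sum(1): group by the province field, add the amounts
    static Map<String, Integer> sumByProvince(String[] lines) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] s = line.split("\\|"); // name|province|amount
            sums.merge(s[1], Integer.parseInt(s[2]), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] data = {
                "张飞|河北|1500", "孙悟空|湖北|1550", "唐僧|河北|2200",
                "辛普森|河南|1900", "奥特曼|河南|5000", "蜘蛛侠|湖北|2200"
        };
        System.out.println(sumByProvince(data)); // {河北=3700, 湖北=3750, 河南=6900}
    }
}
```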