Flink Streaming Notes (4): Transform Operators

Contents

1  Notes

2  map

3  flatMap

4  filter

5  keyBy

6  shuffle

7  Connect and Union

8  Simple rolling aggregation operators

9  reduce

10  process

11  Operators that repartition a stream

1  Notes

        Some Transform operators come in rich-function variants, e.g. map's RichMapFunction. Most classes whose names carry "Rich" are rich functions: besides the core method, they let you override extra lifecycle methods such as open() and close().

        open: called once per parallel instance; suitable for initialization such as creating connections.

        close: called once per parallel instance; suitable for cleanup such as closing connections. (It is called twice only when reading from a file.)

        See the map code below for concrete usage.

2  map

        Purpose: transforms the elements of the stream to form a new stream; consumes one element and produces exactly one element.

import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Flink01_TransForm_Map {
    public static void main(String[] args) throws Exception {
        //1. Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(1);

        //2. Read the data (here from a file; the socket source is commented out)
//        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
        DataStreamSource<String> streamSource = env.readTextFile("input/sensor.txt");

        System.out.println("111111111111111111");

        SingleOutputStreamOperator<WaterSensor> map = streamSource.map(new MyMap());//.setParallelism(2);

        System.out.println("22222222222222222222");

        map.print();

        env.execute();
    }

    public static class MyMap extends RichMapFunction<String, WaterSensor> {

        /**
         * Lifecycle method, called first. Called once per parallel instance;
         * suitable for initialization such as creating connections.
         * @param parameters
         * @throws Exception
         */
        @Override
        public void open(Configuration parameters) throws Exception {
            System.out.println("open...");
        }

        /**
         * Lifecycle method, called last. Called once per parallel instance;
         * suitable for cleanup such as closing connections. (Called twice only when reading from a file.)
         * @throws Exception
         */
        @Override
        public void close() throws Exception {
            System.out.println("close....");
        }

        @Override
        public WaterSensor map(String value) throws Exception {
            // getRuntimeContext is the runtime context object; it exposes more information,
            // such as the task name, the job id, and state-related facilities
            System.out.println(getRuntimeContext().getTaskName());
            String[] split = value.split(",");
            return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
        }
    }
}

3  flatMap

        Purpose: consumes one element and produces zero or more elements (one in, many out, where "many" may also be zero).

        Note: lambda expressions are not recommended here. Because of type erasure, the concrete generic types cannot be recovered at runtime and everything is treated as Object, which is very inefficient; Flink therefore requires the generic types to be declared explicitly whenever a parameter is generic.

        The correct lambda form is shown after the code below.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class Flink02_TranForm_FlatMap {
    public static void main(String[] args) throws Exception {
        //1. Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(1);

        //2. Read data from a socket
        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);


        //TODO 3. Use flatMap to split each line on spaces into individual words
        streamSource.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                String[] words = value.split(" ");
                for (String word : words) {
                    out.collect(word);
                }
            }
        }).print();

        env.execute();


    }
}

The correct lambda form:

env
  .fromElements(1, 2, 3, 4, 5)
  .flatMap((Integer value, Collector<Integer> out) -> {
      out.collect(value * value);
      out.collect(value * value * value);
  }).returns(Types.INT)   // declare the result type explicitly, since erasure removes it
  .print();

4  filter

Purpose: filtering; keeps the elements for which the predicate returns true and drops those for which it returns false.


import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Filter {
    public static void main(String[] args) throws Exception {

        //1. Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //2. Read data from a file
        DataStreamSource<String> streamSource = env.readTextFile("input/yao.txt");
        streamSource.setParallelism(1);

        //3. Use map to convert each input line into a JavaBean
        SingleOutputStreamOperator<WaterSensor> waterSensorDStream = streamSource.map(new MapFunction<String, WaterSensor>() {
            @Override
            public WaterSensor map(String value) throws Exception {
                String[] split = value.split(" ");
                return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
            }
        });

        //TODO 4. Use filter to keep only the records whose id is s1
        waterSensorDStream.filter(new FilterFunction<WaterSensor>() {
            @Override
            public boolean filter(WaterSensor value) throws Exception {
                return "s1".equals(value.getId());
            }
        }).print();

        env.execute();

    }
}

5  keyBy

Purpose: elements with the same key go to the same partition; one partition may hold several different keys.

import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Flink04_TranForm_Keyby {
    public static void main(String[] args) throws Exception {
        //1. Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(4);

        //2. Read data from a socket
        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);


        //3. Use map to convert the socket data into a JavaBean
        SingleOutputStreamOperator<WaterSensor> waterSensorDStream = streamSource.map(new MapFunction<String, WaterSensor>() {
            @Override
            public WaterSensor map(String value) throws Exception {
                String[] split = value.split(",");
                return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
            }
        }).setParallelism(2);

        //TODO 4. Send records with the same id to the same group

        KeyedStream<WaterSensor, String> keyedStream = waterSensorDStream.keyBy(new KeySelector<WaterSensor, String>() {
            @Override
            public String getKey(WaterSensor value) throws Exception {
                return value.getId();
            }
        });
      

        waterSensorDStream.print("original partitioning").setParallelism(2);

        keyedStream.print("keyBy");

        env.execute();
    }
}

6  shuffle

Purpose: randomly redistributes the elements of the stream across partitions.

env
  .fromElements(10, 3, 5, 9, 20, 8)
  .shuffle()
  .print();
env.execute();

七  Connect和Union 

Connect相当于同床异梦,虽然进行数据匹配,但是保持他们类型的数据流;Union水乳交融,彻底合并

union之前两个流的类型必须是一样connect可以不一样

connect只能操作两个流,union可以操作多个。

import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;

public class Connect {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        DataStreamSource<String> stringStream = env.fromElements("a", "b", "c", "d");
        DataStreamSource<Integer> intStream = env.fromElements(1, 2, 3, 4);

        // the connected stream keeps both element types; a CoMapFunction maps each side separately
        ConnectedStreams<String, Integer> connectedStreams = stringStream.connect(intStream);
        connectedStreams.map(new CoMapFunction<String, Integer, String>() {
            @Override
            public String map1(String value) throws Exception {
                return value;
            }

            @Override
            public String map2(Integer value) throws Exception {
                return Integer.toString(value * 10);
            }
        }).print();

        env.execute();
    }
}
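union, by contrast, is just chained calls on streams of the same element type. A minimal snippet (stream1, stream2 and stream3 are assumed to be DataStreams of the same type):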
stream1
  .union(stream2)
  .union(stream3)
  .print();

8  Simple rolling aggregation operators

sum, min, max, minBy, maxBy; they are methods of KeyedStream, so they must be used after keyBy, and each returns a DataStream.

Notes:

        1  The aggregation is rolling: each incoming record produces an updated result.

        2  The scope is always the key group.

        3  The difference the "By" makes (e.g. max vs maxBy), illustrated by the sketch after this list:

                max: takes the running maximum of the specified field; any other, non-compared fields keep the values of the key's first record.

                maxBy: takes the running maximum of the specified field; the other fields come from the record that holds the maximum.

                If two records tie for the maximum, the second argument decides: true => the other fields keep the earlier record's values; false => they take the latest record's values.
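This section has no example of its own, so here is a minimal sketch contrasting max and maxBy. It assumes the same WaterSensor bean and comma-separated socket input as the other examples; the class name MaxVsMaxBy is made up for illustration:

import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MaxVsMaxBy {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        //1. Read from a socket, convert each line into a WaterSensor, then key by id
        KeyedStream<WaterSensor, Tuple> keyedStream = env
                .socketTextStream("localhost", 9999)
                .map(new MapFunction<String, WaterSensor>() {
                    @Override
                    public WaterSensor map(String value) throws Exception {
                        String[] split = value.split(",");
                        return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
                    }
                })
                .keyBy("id");

        //2. max: only vc is updated; ts keeps the value of the key's first record
        keyedStream.max("vc").print("max");

        //3. maxBy: the whole record holding the max vc is emitted;
        //   on a tie, true keeps the earlier record's other fields, false the latest one's
        keyedStream.maxBy("vc", true).print("maxBy");

        env.execute();
    }
}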

9  reduce

Purpose: a rolling aggregation on a keyed stream: it merges the current element with the previous aggregated value and produces a new value. The returned stream contains the result of every aggregation step, not just the final result of the last one.


import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Reduce {
    public static void main(String[] args) throws Exception {
        //1. Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(1);

        //2. Read data from a socket
        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);


        //3. Use map to convert the socket data into a JavaBean
        SingleOutputStreamOperator<WaterSensor> waterSensorDStream = streamSource.map(new MapFunction<String, WaterSensor>() {
            @Override
            public WaterSensor map(String value) throws Exception {
                String[] split = value.split(",");
                return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
            }
        });

        //4. Group records with the same id together
        KeyedStream<WaterSensor, Tuple> keyedStream = waterSensorDStream.keyBy("id");

        //TODO 5. Use reduce to compute the maximum vc
        keyedStream.reduce(new ReduceFunction<WaterSensor>() {
            @Override
            public WaterSensor reduce(WaterSensor value1, WaterSensor value2) throws Exception {
                System.out.println("recude....");
                return new WaterSensor(value1.getId(), value2.getTs(), Math.max(value1.getVc(), value2.getVc()));
            }
        }).print();
        env.execute();
    }
}

10  process

Purpose: can be called on many kinds of streams; gives access to more information than the element itself (not just the data); use it when Flink has no built-in operator for the logic, e.g. deduplication.

Note: on a stream before keyBy, pass a new ProcessFunction; on a stream after keyBy, pass a new KeyedProcessFunction.

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashMap;

public class Flink10_TransForm_Process {
    public static void main(String[] args) throws Exception {
        //1. Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(1);

        //2. Read data from a socket
        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);

        //3. TODO Use process to implement flatMap: split each line and emit Tuple2 elements
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordToOneStream = streamSource.process(new ProcessFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void processElement(String value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
                String[] words = value.split(" ");
                for (String word : words) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        });

        //4. Group identical words together
        KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = wordToOneStream.keyBy(0);

        //5. TODO Use process to implement sum
        keyedStream.process(new KeyedProcessFunction<Tuple, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            // Accumulator that stores the previous aggregated result.
            // Bug with a plain field: a single Integer would not distinguish keys;
            // every key could read and change the same value, hence the per-key HashMap.
//            private Integer lastSum = 0;
            private HashMap<String, Integer> lastSumMap = new HashMap<>();

            @Override
            public void processElement(Tuple2<String, Integer> value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
                //1. Check whether the current word already exists in the map as a key
                if (lastSumMap.containsKey(value.f0)){
                    //2. Take the previously accumulated result for this key
                    Integer lastSum = lastSumMap.get(value.f0);
                    //3. Add 1 to it
                    Integer curSum = lastSum + 1;
                    //4. Emit the new total and write it back to the map
                    out.collect(Tuple2.of(value.f0,curSum));
                    lastSumMap.put(value.f0, curSum);
                }else {
                    //If this key is not in the map yet, this is its first record
                    lastSumMap.put(value.f0, 1);
                    out.collect(Tuple2.of(value.f0, 1));
                }
            }
        }).print();

        env.execute();
    }
}

11  Operators that repartition a stream

keyBy: groups by key first; the target partition is chosen by double-hashing the key.

shuffle: distributes the elements of the stream randomly across partitions.

rebalance: distributes the elements evenly (round-robin) across all partitions; a performance optimization when the data is skewed.

rescale: also distributes data in an even, cyclic fashion, like rebalance, but more efficiently, because rescale does not need to go over the network and stays entirely within "pipelines".
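
None of these operators gets a standalone example above, so here is a minimal sketch, assuming a socket source; with parallelism 4, the number printed in front of each element is the subtask (partition) index, which makes the effect of each strategy visible (the class name Repartition is made up for illustration):

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Repartition {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);

        // each print label marks the strategy; the "n>" prefix is the target partition
        streamSource.keyBy(value -> value).print("keyBy");
        streamSource.shuffle().print("shuffle");
        streamSource.rebalance().print("rebalance");
        streamSource.rescale().print("rescale");

        env.execute();
    }
}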
