Big Data: Flink

Flink Learning Notes

What is Flink?

Flink is a framework and distributed processing engine for stateful computations over data streams; it works on both unbounded and bounded streams.

Unbounded streams: they have a start but no end, and data keeps arriving endlessly. You cannot wait for all the data to arrive before you start processing.

Bounded streams: they have a start and an end, so you can wait until all the data is available before processing it; this is essentially batch processing.

A Flink application can process data as soon as it arrives, or store the data first and process it later.

Layered APIs

The further down this list you go, the higher the level of abstraction, but the lower the expressiveness:

  • Stateful Event-Driven Applications: the lowest layer, built on ProcessFunction; you implement the methods it provides (a small sketch follows this list)
  • Stream & Batch Data Processing: the DataStream/DataSet core APIs, the layer used most in day-to-day work
  • High-level Analytics API, e.g. SQL / Table API (not yet fully mature)
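
As a taste of the lowest layer, here is a minimal sketch of a ProcessFunction; the class name and the pass-through logic are purely illustrative, not something from the original notes:

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

//lowest-level API: full access to every element, plus timers and state via the Context
public class PassThroughProcessFunction extends ProcessFunction<String, String> {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        //ctx exposes the element's timestamp and a TimerService; here we simply forward the element
        out.collect(value);
    }
}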

Flink runs in many environments

Flink can run on resource managers such as Hadoop YARN, Apache Mesos and Kubernetes, or on a standalone Flink cluster, and it detects these resource managers automatically.

Comparison of stream processing frameworks

  1. Spark: Structured Streaming, batch-first

    Stream processing is treated as a special case of batch processing (mini-batches)

  2. Flink: streaming-first; batch processing is a special case of stream processing

  3. Storm: streaming, based on Tuples

Use Cases

  • Event-driven applications
  • Data analytics applications
  • Data pipeline applications

How to learn Flink efficiently

  1. The official documentation
  2. The source code: attach the sources via Maven, or study the official examples, e.g. the flink-examples module on GitHub

Developing a batch application with Flink

Requirement: word count

Given a file, count how many times each word appears in it.

Separator: \t

We simply print the result to the console (in production you would of course sink it to a target system).

Implementation: Flink + Java; Maven 3.0.4 (or higher) is required.

OOTB: out of the box

First way to create the project

1. Run the following command to generate the project:
mvn archetype:generate  -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-java -DarchetypeVersion=1.7.0 -DarchetypeCatalog=local
2. Import the generated project into IDEA

Development workflow (the recurring boilerplate steps)

  1. set up the batch execution environment
  2. read data
  3. transform operations: the core of the work, this is where the business logic lives
  4. env.execute

Breaking the task down

  1. Read the data

    hello welcome

  2. Split each line on the given separator

    hello

    welcome

  3. Attach a count of 1 to every word

    (hello,1)

    (welcome,1)

  4. Merge: groupBy

Code example:

package com.wj;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * A Flink batch application written with the Java API
 */
public class BatchWCJavaApp {
    public static void main(String[] args) throws Exception {
        String input = "C:\\Users\\Administrator\\Desktop\\hello.txt";

        //step1: set up the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        //step2: read data
        DataSource<String> text = env.readTextFile(input);

        //step3: transform
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split("\t");
                for (String token : tokens) {
                    if (token.length()>0){
                        collector.collect(new Tuple2<String,Integer>(token,1));
                    }
                }
            }
        }).groupBy(0).sum(1).print(); //field 0 is the key, field 1 is the value

    }
}

Flink + Scala

Prerequisites: Maven 3.0.4 (or higher) and Java 8.x

mvn archetype:generate  -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=1.7.0 -DarchetypeCatalog=local

Scala code example:

package com.wj
import org.apache.flink.api.scala.ExecutionEnvironment
/**
 * A Flink batch application written in Scala
 */
object BatchWCScalaApp {
  def main(args: Array[String]): Unit = {
    val input = "C:\\Users\\Administrator\\Desktop\\hello.txt"
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile(input)
    //import the implicit conversions
    import org.apache.flink.api.scala._
    text.flatMap(_.toLowerCase.split("\t"))
      .filter(_.nonEmpty)
      .map((_,1))
      .groupBy(0)
      .sum(1)
      .print()
  }
}

Flink Java vs Scala

  1. Operators: map, filter, ...
  2. Conciseness

Streaming with Java

Start a local server listening on a port: nc -lk 9999

Java streaming code example:

package com.wj;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * A Flink real-time (streaming) application written with the Java API
 * The word-count data comes from a socket
 */
public class StreamingWCJavaApp {
    public static void main(String[] args) throws Exception {
        //step1: set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //step2: read data
        DataStreamSource<String> text = env.socketTextStream("localhost", 9999);

        //step3:transform
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(",");
                for (String token : tokens) {
                    if (token.length()>0){
                        collector.collect(new Tuple2<String,Integer>(token,1));
                    }
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(5)).sum(1).print().setParallelism(1);
        //in a streaming program you must call execute()
        env.execute("StreamingWCJavaApp");
    }
}

A refactored version of the code above:

package com.wj;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * A Flink real-time (streaming) application written with the Java API
 * The word-count data comes from a socket
 */
public class StreamingWCJava02App {
    public static void main(String[] args) throws Exception {
        //read the program arguments
        int port = 0;
        try {
            ParameterTool tool = ParameterTool.fromArgs(args);
            port = tool.getInt("port");
        } catch (Exception e) {
            System.err.println("port not set, falling back to the default port 9999");
            port = 9999;
        }
        //step1: set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //step2: read data
        DataStreamSource<String> text = env.socketTextStream("localhost", port);

        //step3:transform
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(",");
                for (String token : tokens) {
                    if (token.length()>0){
                        collector.collect(new Tuple2<String,Integer>(token,1));
                    }
                }
            }
       }).keyBy(0).timeWindow(Time.seconds(5)).sum(1).print().setParallelism(1);
        //in a streaming program you must call execute()
        env.execute("StreamingWCJavaApp");
    }
}

Run configuration:

(Screenshot of the IDEA run configuration omitted; the port is passed as a program argument, e.g. --port 9999.)

Streaming with Scala

Code example:

package com.wj

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
/**
 * A Flink real-time (streaming) application written in Scala
 */
object SteamingWCScalaApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //import the implicit conversions
    import org.apache.flink.api.scala._
    val text = env.socketTextStream("localhost", 9999)
    text
      .flatMap(_.split(","))
      .map((_,1))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
      .print()
      .setParallelism(1)

    env.execute("SteamingWCScalaApp")
  }
}

Programming model and core concepts

Flink core APIs

The general flow of big data processing:

MapReduce: input -> map(reduce) -> output

Storm: input -> Spout/Bolt -> output

Spark: input -> transformation/action -> output

Flink: input -> transformation/sink -> output

DataSet (batch) and DataStream (streaming) are immutable collections: elements cannot be added or removed. Every program starts from a DataSource, and transformation operators such as map and filter produce new DataSets and DataStreams.

The Flink programming model

  1. Obtain an execution environment (the execution context)
  2. Load or create the initial data
  3. Specify transformations on this data
  4. Specify where to put the results of your computations (the sink)
  5. Trigger the program execution (a compact skeleton of all five steps is sketched below)
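
For reference, a minimal sketch of those five steps for the streaming case; the class name, the socket source and the toy map transformation are only illustrative, mirroring the word-count examples above:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ProgrammingModelSkeleton {
    public static void main(String[] args) throws Exception {
        //1. obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //2. load the initial data
        DataStreamSource<String> text = env.socketTextStream("localhost", 9999);

        //3. specify transformations on the data
        DataStream<String> upper = text.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        });

        //4. specify where the results go (here: stdout; in production a real sink)
        upper.print();

        //5. trigger the execution
        env.execute("ProgrammingModelSkeleton");
    }
}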

Lazy execution in Flink

Nothing runs until you call execute(); only then are all of your operations actually executed, whether the program runs locally or on a cluster.

Specifying keys by field

The Java version is shown below; Scala is very similar, the main difference being the field offsets: in Scala _1 refers to the first tuple element, while in the Java API field positions start at 0.

package com.wj.course;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * A Flink real-time (streaming) application written with the Java API
 * The word-count data comes from a socket
 */
public class StreamingWCJavaApp {
    public static void main(String[] args) throws Exception {
        //step1: set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //step2: read data
        DataStreamSource<String> text = env.socketTextStream("localhost", 9999);

        //step3:transform
        text.flatMap(new FlatMapFunction<String, WC>() {
            @Override
            public void flatMap(String value, Collector<WC> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(",");
                for (String token : tokens) {
                    if (token.length()>0){
                        collector.collect(new WC(token.trim(),1));
                    }
                }
            }
        })//.keyBy("word")
                .keyBy(new KeySelector<WC, String>() {
                    @Override
                    public String getKey(WC wc) throws Exception {
                        return wc.word;
                    }
                })
                .timeWindow(Time.seconds(5))
                .sum("count")
                .print()
                .setParallelism(1);

        //in a streaming program you must call execute()
        env.execute("StreamingWCJavaApp");
    }

    public static class WC{
        private String word;
        private int count;

        public WC(){}

        public WC(String word,int count){
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WC{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }

        public String getWord() {
            return word;
        }

        public void setWord(String word) {
            this.word = word;
        }

        public int getCount() {
            return count;
        }

        public void setCount(int count) {
            this.count = count;
        }
    }
}

DataSet API development

Brief overview:

Source: where the data comes from

    reading files

    local collections

    Source => Flink (transformations) ==> Sink

Sink: where the data goes

    (distributed) files

    or standard output

DataSource

File-based

Collection-based

Code examples:

Scala version

package com.wj.flink.datasource

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.configuration.Configuration

object DataSetDataSourceApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    //fromCollection(env)

//    textFile(env)
    //csvFile(env)
//    readRecursiveFiles(env)
    readCompressionFiles(env)

  }

  def readCompressionFiles(env:ExecutionEnvironment): Unit ={
    val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\compression"
    env.readTextFile(filePath).print()
  }

  //reading files recursively
  def readRecursiveFiles(env:ExecutionEnvironment): Unit ={
    val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\nested"
    env.readTextFile(filePath).print()
    println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
    val parameters = new Configuration()
    parameters.setBoolean("recursive.file.enumeration",true)
    env.readTextFile(filePath).withParameters(parameters).print()
  }


  //reading a CSV file
  def csvFile(env: ExecutionEnvironment): Unit ={
    import org.apache.flink.api.scala._
    val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\people.csv"
//    env.readCsvFile[(String,Int,String)](filePath,ignoreFirstLine = true).print();
//    env.readCsvFile[(String,Int)](filePath,ignoreFirstLine = true,includedFields = Array(0,1)).print();

//    case class MyCaseClass(name:String,age:Int)
//    env.readCsvFile[MyCaseClass](filePath,ignoreFirstLine = true,includedFields = Array(0,1)).print();

    env.readCsvFile[Person](filePath,ignoreFirstLine = true,pojoFields = Array("name","age","work")).print();



  }

  def textFile(env:ExecutionEnvironment): Unit ={
    //read a single file
//    val filePath = "C:\\Users\\Administrator\\Desktop\\hello.txt"
//    env.readTextFile(filePath).print()

    //read a whole directory
    val filePath = "C:\\Users\\Administrator\\Desktop\\inputs"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env:ExecutionEnvironment): Unit ={

    import org.apache.flink.api.scala._
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Java version

package com.wj.flink.datasource;

import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.Arrays;
import java.util.concurrent.ExecutionException;

public class JavaDataSetDataSourceApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        formCollection(env);
        textFile(env);

    }


    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "C:\\Users\\Administrator\\Desktop\\hello.txt";
        env.readTextFile(filePath).print();
        System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~");
        filePath = "C:\\Users\\Administrator\\Desktop\\inputs";
        env.readTextFile(filePath).print();
    }

    public static void formCollection(ExecutionEnvironment env )throws Exception {
        env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10)).print();
    }
}

Transformation

DataSetTransformatiionApp.scala

package com.wj.flink.datasource

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

import scala.collection.mutable.ListBuffer

object DataSetTransformatiionApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
//    joinFunction(env)
//    outerJoinFunction(env)
    crossFunction(env)
  }

  //Cartesian product
  def crossFunction(env:ExecutionEnvironment): Unit ={
    val info1 = List("曼联","曼城")
    val info2 = List(3,1,0)
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.cross(data2).print()
  }

  //outer joins
  def outerJoinFunction(env:ExecutionEnvironment): Unit ={
    val info1 = ListBuffer[(Int,String)]()  //(id, name)
    info1.append((1,"PK哥"))
    info1.append((2,"J哥"))
    info1.append((3,"小队长"))
    info1.append((4,"猪头胡"))

    val info2 = ListBuffer[(Int,String)]()  //(id, city)
    info2.append((1,"北京"))
    info2.append((2,"上海"))
    info2.append((3,"成都"))
    info2.append((5,"杭州"))

    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)

    //note: where() refers to the field of the left dataset, equalTo() to the field of the right dataset
    //a left outer join takes the left dataset as the base: every record of the left side is emitted
//    data1.leftOuterJoin(data2).where(0).equalTo(0).apply((first,second)=>{
//      if (second == null){
//        (first._1,first._2,"-")
//      }else{
//        (first._1,first._2,second._2)
//      }
//    }).print()
    //right outer join
//    data1.rightOuterJoin(data2).where(0).equalTo(0).apply((first,second)=>{
//      if (first == null){
//        (second._1,"-",second._2)
//      }else{
//        (first._1,first._2,second._2)
//      }
//    }).print()
    //full outer join
    data1.fullOuterJoin(data2).where(0).equalTo(0).apply((first,second)=>{
      if (first == null){
        (second._1,"-",second._2)
      }else if(second==null){
        (first._1,first._2,"-")
      }else{
        (first._1,first._2,second._2)
      }
    }).print()

  }

  //inner join
  def joinFunction(env:ExecutionEnvironment): Unit ={
    val info1 = ListBuffer[(Int,String)]()  //(id, name)
    info1.append((1,"PK哥"))
    info1.append((2,"J哥"))
    info1.append((3,"小队长"))
    info1.append((4,"猪头胡"))

    val info2 = ListBuffer[(Int,String)]()  //(id, city)
    info2.append((1,"北京"))
    info2.append((2,"上海"))
    info2.append((3,"成都"))
    info2.append((5,"杭州"))

    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)

    //note: where() refers to the field of the left dataset, equalTo() to the field of the right dataset
    data1.join(data2).where(0).equalTo(0).apply((first,second)=>{
      (first._1,first._2,second._2)
    }).print()
  }

  //distinct (deduplicate)
  def distinctFunction(env:ExecutionEnvironment): Unit ={
    val info = ListBuffer[String]()
    info.append("hadoop,spark")
    info.append("hadoop,flink")
    info.append("flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()

  }

  //one element can produce multiple elements
  def flatMapFunction(env:ExecutionEnvironment): Unit ={
    val info = ListBuffer[String]()
    info.append("hadoop,spark")
    info.append("hadoop,flink")
    info.append("flink,flink")
    val data = env.fromCollection(info)
//    data.print()
//    data.map(_.split(",")).print()
//    data.flatMap(_.split(",")).print()
    data.flatMap(_.split(",")).map((_,1)).groupBy(0).sum(1).print()
  }

  def firstFunction(env:ExecutionEnvironment): Unit ={
    val info = ListBuffer[(Int,String)]()
    info.append((1,"Hadoop"))
    info.append((1,"Spark"))
    info.append((1,"Flink"))
    info.append((2,"Java"))
    info.append((2,"Spring Boot"))
    info.append((3,"Linux"))
    info.append((4,"Vue"))
    val data = env.fromCollection(info)
//    data.first(3).print()
//    data.groupBy(0).first(2).print()//after grouping, take the first n elements of each group
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  //a DataSource of 100 elements whose results should be stored in a database
  def mapPartitionFunction(env:ExecutionEnvironment): Unit ={

    val students = new ListBuffer[String]
    for(i<- 1 to 100){
      students.append("students: "+i)
    }

    val data = env.fromCollection(students).setParallelism(4)
//    data.map(x=>{
//      //to store every element in the database we first need to obtain a connection
//      val connection = DBUtils.getConnection()
//      println(connection+".......")
//
//      //TODO ... save the data to the DB
//      DBUtils.returnConnection(connection)
//    }).print()

    //note: with mapPartition a connection is opened once per partition instead of once per element,
    //which saves system resources; the number of partitions depends on setParallelism(4) on the env
    data.mapPartition(x=>{
      val connection = DBUtils.getConnection()

      println(connection+".......")
      //TODO ... save the data to the DB
      DBUtils.returnConnection(connection)
      x
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit ={
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_+1).filter(_>5).print()
  }

  def mapFunction(env:ExecutionEnvironment): Unit ={
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
//    data.map((x:Int)=>x+1).print()
//    data.map(x=>x+1).print()
    data.map(_+1).print()
  }
}

JavaDataSetTransformationApp.java

package com.wj.flink.datasource;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JavaDataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = new ArrayList<Tuple2<Integer, String>>();

        info.add(new Tuple2(1,"Hadoop"));
        info.add(new Tuple2(1,"Spark"));
        info.add(new Tuple2(1,"Flink"));
        info.add(new Tuple2(2,"Java"));
        info.add(new Tuple2(2,"Spring Boot"));
        info.add(new Tuple2(3,"Linux"));
        info.add(new Tuple2(4,"Vue"));

        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.first(3).print();
        System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~");
        data.groupBy(0).first(2).print();
        System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~");
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();


    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        DataSource<Integer> data = env.fromCollection(list);

//        data.map(new MapFunction<Integer, Integer>() {
//            public Integer map(Integer integer) throws Exception {
//                String connection = DBUtils.getConnection();
//                System.out.println("connection:"+connection);
//                DBUtils.returnConnection(connection);
//                return integer;
//            }
//        }).print();

        data.mapPartition(new MapPartitionFunction<Integer, Integer>() {
            public void mapPartition(Iterable<Integer> iterable, Collector<Integer> collector) throws Exception {
                String connection = DBUtils.getConnection();
                System.out.println("connection:"+connection);
                DBUtils.returnConnection(connection);
            }
        }).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        DataSource<Integer> data = env.fromCollection(list);
        data.map(new MapFunction<Integer, Integer>() {

            public Integer map(Integer integer) throws Exception {
                return integer+1;
            }
        }).filter(new FilterFunction<Integer>() {
            public boolean filter(Integer integer) throws Exception {
                return integer>5;
            }
        }).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        DataSource<Integer> data = env.fromCollection(list);
        data.map(new MapFunction<Integer, Integer>() {

            public Integer map(Integer integer) throws Exception {
                return integer+1;
            }
        }).print();
    }
}

DBUtils.scala

package com.wj.flink.datasource

import scala.util.Random

object DBUtils {
  def getConnection()={
    new Random().nextInt(10)+""
  }
  def returnConnection(connection:String): Unit ={
  }
}

Sink

Code example (Scala version):

package com.wj.flink.datasource

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object DataSetSinkApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "C:\\Users\\Administrator\\Desktop\\sinkout"
    text.writeAsText(filePath,WriteMode.OVERWRITE).setParallelism(2)

    env.execute("DataSetSinkApp")
  }

}

Flink counters (accumulators)

Code example (Scala version):

package com.wj.flink.datasource

import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.fs.FileSystem.WriteMode

/**
 * The three steps of using a counter in a Flink program:
 * step1: define a counter
 * step2: register the counter
 * step3: fetch the counter result from the job result
 */
object CounterApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val data = env.fromElements("hadoop","spark","flink","pyspark","strom")

//    data.map(new RichMapFunction[String,Long] {
//      var counter = 0l
//      override def map(in: String): Long = {
//        counter+=1
//        println("counter : "+counter)
//        counter
//      }
//    }).setParallelism(3).print()

//    data.print()
    val info = data.map(new RichMapFunction[String,String] {
      //step1: define a counter
      val counter = new LongCounter()

      override def open(parameters: Configuration): Unit = {
        //step2: register the counter
        getRuntimeContext.addAccumulator("ele-counts-scala",counter)
      }
      override def map(in: String): String = {
        counter.add(1)
        in
      }
    })

    info.writeAsText("C:\\Users\\Administrator\\Desktop\\sink-scala-counter-out",
      WriteMode.OVERWRITE).setParallelism(3)
    val jobResult = env.execute("CounterApp")
   //step3: fetch the counter result
    val num = jobResult.getAccumulatorResult[Long]("ele-counts-scala")
    println("num: "+num)

  }

}

Flink distributed cache

Code example (Scala version):

package com.wj.flink.datasource
import org.apache.commons.io.FileUtils
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.configuration.Configuration

/**
 * step1: register a local or HDFS file as a cached file
 *
 * step2: retrieve the distributed-cache file in the open() method
 */
object DistributedCacheApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\hello.txt"
    //step1: register a local or HDFS file
    env.registerCachedFile(filePath,"pk-scala-dc")

    import org.apache.flink.api.scala._
    val data = env.fromElements("hadoop","spark","flink","pyspark","storm")


    data.map(new RichMapFunction[String,String] {
      //step2: retrieve the distributed-cache file in the open() method
      override def open(parameters: Configuration): Unit = {
        val dcFile = getRuntimeContext.getDistributedCache().getFile("pk-scala-dc")
        val lines = FileUtils.readLines(dcFile) //returns a java.util.List
        /**
         * Java collections and Scala for-comprehensions do not mix directly,
         * so convert the Java list to a Scala collection first
         */
        import scala.collection.JavaConverters._
        for (ele <- lines.asScala){  //iterate with Scala
          println(ele)
        }
      }
      override def map(in: String): String = {
        in
      }
    }).print()

  }
}

DataStream API development

DataSource

Sources can be files, collections, individual elements, or a custom data source.

package com.wj.flink.datasource

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

import org.apache.flink.api.scala._
object DataStreamSourceApp{
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)

//    nonparallelSourceFunction(env)

//    parallelSourceFunction(env)
    richparallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  /**
   * Use a custom data source.
   * Note: the parallelism can be set greater than 1.
   * @param env
   */
  def richparallelSourceFunction(env:StreamExecutionEnvironment): Unit ={
    val data = env.addSource(new CustomRichParallelSourceFunction()).setParallelism(2)
    data.print()
  }

  /**
   * Use a custom data source.
   * Note: the parallelism can be set greater than 1.
   * @param env
   */
  def parallelSourceFunction(env:StreamExecutionEnvironment): Unit ={
    val data = env.addSource(new CustomParallelSourceFunction()).setParallelism(2)
    data.print()
  }

  /**
   * Use a custom data source.
   * Note: the parallelism can only be set to 1.
   * @param env
   */
  def nonparallelSourceFunction(env:StreamExecutionEnvironment): Unit ={
    val data = env.addSource(new CustomNonparallelSourceFunction()).setParallelism(1)
    data.print().setParallelism(1)
  }

  /**
   * The data source is a socket.
   * @param env the execution context
   *
   * Start a server with: nc -lk 9999
   */
  def socketFunction(env:StreamExecutionEnvironment): Unit ={
    val data = env.socketTextStream("localhost", 9999)
    data.print().setParallelism(1)  //the effect of setting the parallelism depends on where you set it!
  }
}

CustomNonparallelSourceFunction.scala

package com.wj.flink.datasource

import org.apache.flink.streaming.api.functions.source.SourceFunction

/**
 * A custom data source that is not parallel
 */
class CustomNonparallelSourceFunction extends SourceFunction[Long] {
  var count = 1L
  var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning){
      ctx.collect(count)
      count+=1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

CustomParallelSourceFunction.scala

package com.wj.flink.datasource

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

/**
 * A custom data source whose parallelism can be set
 */
class CustomParallelSourceFunction extends ParallelSourceFunction[Long]{
  var count = 1L
  var isRunning = true
  override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning){
      sourceContext.collect(count)
      count+=1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning=false
  }
}

CustomRichParallelSourceFunction.scala

package com.wj.flink.datasource

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class CustomRichParallelSourceFunction extends RichParallelSourceFunction[Long] {
  var count = 1L
  var isRunning = true

  override def open(parameters: Configuration): Unit = super.open(parameters)

  override def close(): Unit = super.close()

  override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning){
      sourceContext.collect(count)
      count+=1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

Transformation

Code example:

package com.wj.flink.datasource
import java.{lang, util}
import org.apache.flink.streaming.api.collector.selector.OutputSelector
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    filterFunction(env)
//    unionFunction(env)
    splitSelectunionFunction(env)
    env.execute("DataStreamTransformationApp")
  }
  def splitSelectunionFunction(env:StreamExecutionEnvironment): Unit ={
    import org.apache.flink.api.scala._
    val data = env.addSource(new CustomNonparallelSourceFunction)
   //the return type is SplitStream[T]
    var splits = data.split(new OutputSelector[Long] {
      val list = new util.ArrayList[String]()
      override def select(value: Long): lang.Iterable[String] = {
        if (value%2==0){
          list.add("even")
        }else{
          list.add("odd")
        }
      list
      }
    })
    splits.select("even","odd").print().setParallelism(1)
  }
  def unionFunction(env:StreamExecutionEnvironment): Unit ={
    import org.apache.flink.api.scala._
    val data1 = env.addSource(new CustomNonparallelSourceFunction)
    val data2 = env.addSource(new CustomNonparallelSourceFunction)
    data1.union(data2).print().setParallelism(1)
  }

  def filterFunction(env:StreamExecutionEnvironment): Unit ={
    import org.apache.flink.api.scala._
    val data = env.addSource(new CustomNonparallelSourceFunction)

    data.map(x=>{
      println("received: "+x)
      x
    }).filter(_%2==0).print().setParallelism(1)
  }
}

Custom Sink

Summary of writing a custom sink:
1. Extend RichSinkFunction<T>, where T is the type of the records you want to write
2. Override the methods:
open/close: lifecycle methods
invoke: called once per record

Code example:

First add the mysql-connector-java dependency:

<dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.25</version>
    </dependency>

Then define the SinkToMySQL class:

package com.wj.flink.datasource;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

/**
 * Write Flink data into MySQL
 *
 * Summary of writing a custom sink:
 * 1. Extend RichSinkFunction<T>, where T is the type of the records to write
 * 2. Override the methods:
 *         open/close  lifecycle methods
 *         invoke      called once per record
 */
public class SinkToMySQL extends RichSinkFunction<Student> {

    Connection connection;
    PreparedStatement pstmt;

    private Connection getConnection(){
        Connection conn = null;
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/imooc_flink";
            conn = DriverManager.getConnection(url,"root","root");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return conn;
    }

    /**
     * Establish the connection in the open method
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        connection = getConnection();
        String sql = "insert into student(id,name,age) values(?,?,?)";
        pstmt = connection.prepareStatement(sql);

        System.out.println("open");
    }

    //called once for every record
    public void invoke(Student value, Context context) throws Exception {
        System.out.println("invoke~~~~~~~~~~~~~~~~~~~~");
        //fill in the prepared-statement placeholders
        pstmt.setInt(1,value.getId());
        pstmt.setString(2,value.getName());
        pstmt.setInt(3,value.getAge());
        pstmt.executeUpdate();
    }

    /**
     * Release the resources in close
     * @throws Exception
     */
    @Override
    public void close() throws Exception {
        super.close();
        if (pstmt!=null){
            pstmt.close();
        }
        if (connection!=null){
            connection.close();
        }
    }
}
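
The Student class referenced above is not included in the original notes; a minimal sketch of what it presumably looks like (a plain POJO with the id, name and age fields that the code above gets and sets):

package com.wj.flink.datasource;

//hypothetical reconstruction of the missing Student POJO used by SinkToMySQL
public class Student {
    private int id;
    private String name;
    private int age;

    public Student() {}

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}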

Finally, write a main class to test it:

package com.wj.flink.datasource;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Test the custom sink by writing data into MySQL
 *
 * On Windows, listen on the port with: nc -l -p 7777
 */
public class JavaCustomSinkToMySQL {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> source = env.socketTextStream("localhost", 7777);
        SingleOutputStreamOperator<Student> studentStream = source.map(new MapFunction<String, Student>() {
            public Student map(String value) throws Exception {
                String[] splits = value.split(",");
                Student stu = new Student();
                stu.setId(Integer.parseInt(splits[0]));
                stu.setName(splits[1]);
                stu.setAge(Integer.parseInt(splits[2]));
                return stu;
            }
        });

        //write the data out to MySQL
        studentStream.addSink(new SinkToMySQL());
        env.execute("JavaCustomSinkToMySQL");
    }
}

Table & SQL API

DataSet & DataStream API

  1. You have to learn two sets of APIs, DataSet/DataStream, in Java and Scala

    MapReduce ==> Hive SQL

    Spark ==> Spark SQL

    Flink ==> SQL

  2. Flink supports both batch and stream processing; the Table & SQL API is how the two are unified at the API level

Setting up the dependency

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table_2.11</artifactId>
      <version>${flink.version}</version>
    </dependency>

Scala code example:

package com.wj.flink.datasource

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.types.Row

object TableSQLAPI {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tableEnv = TableEnvironment.getTableEnvironment(env)

    val filePath = "C:\\Users\\Administrator\\Desktop\\table-api\\sales.csv"

    import org.apache.flink.api.scala._

    //we now have a DataSet
    val csv = env.readCsvFile[SalesLog](filePath,ignoreFirstLine = true)
//    csv.print()

    //DataSet ==> Table
    val salesTable = tableEnv.fromDataSet(csv)
    //register the Table under the name "sales"
    tableEnv.registerTable("sales",salesTable)
    //SQL query
    val resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales group by customerId")

    //output the result
    tableEnv.toDataSet[Row](resultTable).print()

  }
  case class SalesLog(transactionId:String,
                      customerId:String,
                      itemId:String,
                      amountPaid:Double)

}

Java example:

package com.wj.flink.datasource;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;
public class JavaTableSQLAPI {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);

        String filePath = "C:\\Users\\Administrator\\Desktop\\table-api\\sales.csv";
        DataSource<Sales> csv = env.readCsvFile(filePath)
                .ignoreFirstLine()
                .pojoType(Sales.class,"transactionId","customerId","itemId","amountPaid");
//        csv.print();
        Table sales = tableEnv.fromDataSet(csv);
        tableEnv.registerTable("sales",sales);

        Table resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales group by customerId");
        DataSet<Row> rowDataSet = tableEnv.toDataSet(resultTable, Row.class);
        rowDataSet.print();

    }
    public static class Sales{
        public String transactionId;
        public String customerId;
        public String itemId;
        public Double amountPaid;
    }
}

Understanding time in Flink

  • Event time

    The timestamp carried in each log record; the best and most accurate notion of time

  • Ingestion time

    The time at which the record enters Flink, i.e. when the source picks it up; more reliable than processing time

  • Processing time

    Not necessarily accurate, since it depends on the local clock of the processing machine (a short snippet showing how to select the time characteristic follows)
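
For reference, a minimal sketch of how the time characteristic is selected on the execution environment (the class name and the socket source are illustrative; the project code later in these notes uses EventTime):

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //choose which notion of time windows and watermarks are based on
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        //alternatives: TimeCharacteristic.IngestionTime, TimeCharacteristic.ProcessingTime (the default)

        env.socketTextStream("localhost", 9999).print();
        env.execute("TimeCharacteristicDemo");
    }
}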

Window

Concept: windows are either keyed or non-keyed. Keyed windows process elements in parallel across multiple tasks; non-keyed windows run with a parallelism of 1 (see the sketch below).
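
A minimal sketch of the difference; the class name and the tiny bounded fromElements source are only for illustration (with processing-time windows and such a short-lived source the windows may not even fire before the job finishes; in practice the data comes from a socket or Kafka as in the examples in these notes):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeyedVsNonKeyedWindows {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> pairs =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        //keyed window: elements are partitioned by key and the windows are evaluated per key, in parallel
        pairs.keyBy(0).timeWindow(Time.seconds(5)).sum(1).print();

        //non-keyed window: every element goes through a single windowAll operator (parallelism 1)
        pairs.timeWindowAll(Time.seconds(5)).sum(1).print();

        env.execute("KeyedVsNonKeyedWindows");
    }
}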

Window assigner: defines how elements are assigned to windows.

An element may be assigned to one or more windows.

Window types:

  • tumbling windows

    Fixed size; windows do not overlap

  • sliding windows

    Fixed size; windows may overlap, so an element can end up in more than one window

  • session windows

  • global windows

Flink distinguishes two broad categories of windows.

(Figure omitted.)

Code example for tumbling and sliding windows:

package com.wj.flink.datasource

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._
object WindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
    text.flatMap(_.split(","))
      .map((_,1))
      .keyBy(0)
//      .timeWindow(Time.seconds(5))  //processing time by default; tumbling window, elements do not repeat
      .timeWindow(Time.seconds(10),Time.seconds(5))//sliding window, elements may fall into more than one window
      .sum(1)
      .print()
      .setParallelism(1)

    env.execute("WindowsApp")
  }
}

Window functions: ReduceFunction

Code example:

package com.wj.flink.datasource

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time

object WindowsReduceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
    //the input used to be arbitrary strings; here we use numbers so the incremental aggregation is visible
    text.flatMap(_.split(","))
      .map(x=>(1,x.toInt)) //1,2,3 ==> (1,1)(1,2)(1,3)
      .keyBy(0)  //every key is 1, so all elements go to the same task
      .timeWindow(Time.seconds(5))
      .reduce((v1,v2)=>{  //does not wait for the whole window; elements are reduced two at a time, incrementally
        println(v1+"......"+v2)
        (v1._1,v1._2+v2._2)
      })
      .print()
      .setParallelism(1)

    env.execute("WindowsReduceApp")
  }
}

Connectors: Kafka

Using Kafka as a source in Flink

Code example:

package com.wj.flink.datasource

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.CheckpointingMode

/**
 * Consume a Kafka data source
 */
object KafkaConnectorConsumerApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //common checkpoint settings; they can be used on both the producer and the consumer side
    env.enableCheckpointing(4000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(1000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

    val topic = "wjtest"

    val properties = new Properties()
    properties.setProperty("bootstrap.servers","192.168.162.128:9092")
    properties.setProperty("group.id","test")
    //FlinkKafkaConsumer manages offset committing for us
    val data = env.addSource(new FlinkKafkaConsumer[String](topic,new SimpleStringSchema(),
      properties))

    data.print()


    env.execute("KafkaConnectorConsumerApp")
  }

}

Using Kafka as a sink in Flink

Code example:

package com.wj.flink.datasource

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper


object KafkaConnectorProducerApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val data = env.socketTextStream("localhost", 9999)

    val topic = "wjtest"

    val properties = new Properties()
    properties.setProperty("bootstrap.servers","192.168.162.128:9092")
    val kafkaSink = new FlinkKafkaProducer[String](topic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),properties)
    //sink the data to Kafka
    data.addSink(kafkaSink)

    env.execute("KafkaConnectorProducerApp")
  }
}

Deploying Flink

Single-node deployment

Prerequisites: JDK 8, Maven 3

We build Flink from source rather than downloading the binary distribution.

Download:

  1. On a server: wget https://github.com/apache/flink/archive/release-1.7.0.tar.gz
  2. Locally: https://github.com/apache/flink/archive/release-1.7.0.tar.gz

Build command: mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.6.0-cdh5.15.1 -Dfast

Copy the build output (E:\IDEA_WORK_SPACE\flink-release-1.7.0\flink-release-1.7.0\flink-dist\target\flink-1.7.0-bin) to the server and run:

./bin/start-cluster.sh

Stop it with: ./bin/stop-cluster.sh

Open the web UI in a browser: http://192.168.162.128:8081/#/overview

Running an example:

  1. On the server, run nc -lk 9000
  2. Submit the job: ./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000

Standalone cluster deployment

Flink must be installed under the same directory on every machine of the cluster.

Every machine needs the IP-to-hostname mappings configured.

Requirements:

  1. Java 1.8.x or higher

  2. SSH connectivity between all machines

    ping hadoop000

    ping hadoop001

    ping hadoop002

  3. Configure flink-conf.yaml

    jobmanager.rpc.address: 10.0.0.1 (the IP of the master node)

    jobmanager: the master node

    taskmanager: the worker nodes

    slaves: one IP/hostname per line

  4. Common configuration keys (a sample flink-conf.yaml sketch follows this list)

    jobmanager.rpc.address: address of the master (JobManager) node

    jobmanager.heap.mb: memory available to the JobManager

    taskmanager.heap.mb: memory available to each TaskManager

    taskmanager.numberOfTaskSlots: number of slots per machine (often the number of CPU cores); determines the available parallelism

    parallelism.default: default parallelism of jobs

    taskmanager.tmp.dirs: temporary data directories of the TaskManagers
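
A sketch of what a minimal flink-conf.yaml for such a cluster might look like; all of the values below are illustrative assumptions, not taken from the original notes:

jobmanager.rpc.address: 10.0.0.1
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
parallelism.default: 1
taskmanager.tmp.dirs: /tmp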

Running Flink ON YARN is by far the most common setup in production *****

There are two modes: start a long-running Flink session on YARN that accepts many jobs, or submit each Flink job as its own YARN application.

(Figure omitted.)

Common optimization strategies in Flink

  1. Resources

  2. Parallelism

    The default is 1; tune it appropriately, and there are several places to set it (covered in the project part)

  3. Data skew

    Out of 100 tasks, 98 or 99 finish quickly while 1 or 2 are very slow, or never finish

    group by: two-stage aggregation (see the sketch below)

      first aggregate on key + random prefix

      then strip the random prefix and aggregate again on the original key

    join on xxx=xxx

      repartition-repartition strategy: large table joined with large table

      broadcast-forward strategy: large table joined with small table

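A minimal sketch of the two-stage (salted) aggregation idea for a skewed group-by, written with the DataSet API; the data, the salt range and the key names are illustrative assumptions:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.Random;

public class TwoStageAggregation {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> data = env.fromElements(
                Tuple2.of("hot", 1), Tuple2.of("hot", 1), Tuple2.of("hot", 1), Tuple2.of("cold", 1));

        //stage 1: prepend a random salt so one hot key is spread over several groups
        DataSet<Tuple2<String, Integer>> partial = data
                .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    private final Random random = new Random();
                    public Tuple2<String, Integer> map(Tuple2<String, Integer> value) {
                        return Tuple2.of(random.nextInt(10) + "_" + value.f0, value.f1);
                    }
                })
                .groupBy(0).sum(1);

        //stage 2: strip the salt and aggregate again to get the final result per original key
        partial
                .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    public Tuple2<String, Integer> map(Tuple2<String, Integer> value) {
                        return Tuple2.of(value.f0.substring(value.f0.indexOf('_') + 1), value.f1);
                    }
                })
                .groupBy(0).sum(1).print();
    }
}
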
Putting it all together: the project

The input data is log data.

Offline: Flume ==> HDFS

Real time: Kafka ==> stream processing engine ==> ES ==> Kibana

Project features

  1. Count the traffic generated per domain within one minute

    Flink consumes the data from Kafka and processes it

  2. Count the traffic generated per user within one minute

    Domains and users have a mapping between them

    Flink consumes the Kafka data and also reads the domain-to-user mapping, then processes both together

Data source

Mock data ******

Project architecture

(Architecture diagram omitted: mock producer ==> Kafka ==> Flink ==> ES ==> Kibana.)

Mock data: a skill you must master

    the real data is sensitive

    with multiple teams collaborating, you depend on services or interfaces provided by other teams

    here we mock data by sending it to the Kafka broker

    Java/Scala code: the producer

    Kafka console consumer: the consumer (a command sketch follows)

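To check what the producer writes, a Kafka console consumer can be attached to the same topic; a sketch (the exact flags depend on the Kafka version, and very old releases use --zookeeper instead of --bootstrap-server):

kafka-console-consumer.sh --bootstrap-server 192.168.162.128:9092 --topic wjtest
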
Project requirement: the traffic per domain over the last minute

Project code

Mocking data into Kafka

package com.wj.flink.project;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;

public class PKKafkaProducer {
    public static void main(String[] args) throws Exception {

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers","192.168.162.128:9092");
        properties.setProperty("key.serializer", StringSerializer.class.getName());
        properties.setProperty("value.serializer", StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);

        String topic = "wjtest";

        //keep producing data into the Kafka broker in an endless loop
        while (true){
            //build a random record
            StringBuilder builder = new StringBuilder();
            builder.append("imooc").append("\t")
                    .append("CN").append("\t")
                    .append(getLevels()).append("\t")
                    .append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date())).append("\t")
                    .append(getIps()).append("\t")
                    .append(getDomains()).append("\t")
                    .append(getTrafffic()).append("\t");

            System.out.println(builder.toString());
            //send the record to Kafka
            producer.send(new ProducerRecord<String, String>(topic,builder.toString()));

            Thread.sleep(2000);
        }
    }

    private static int getTrafffic() {
        return new Random().nextInt(10000);
    }

    private static String getDomains() {
        String[] domins = {
                "v1.go2yd.com",
                "v2.go2yd.com",
                "v3.go2yd.com",
                "v4.go2yd.com",
                "vmi.go2yd.com"
        };
        return domins[new Random().nextInt(domins.length)];
    }

    private static String getIps() {
        String[] ips = {
                "223.104.18.110",
                "113.101.75.194",
                "27.17.127.135",
                "183.225.139.16",
                "112.1.66.34",
                "175.148.211.190",
                "183.227.58.21",
                "59.83.198.84",
                "117.28.38.28",
                "117.59.39.169"
        };
        return ips[new Random().nextInt(ips.length)];
    }

    //generate the level field
    public static String getLevels(){
        String[] levels = {"M", "E"};
        return levels[new Random().nextInt(levels.length)];
    }
}

Cleansing the data and writing it to ES

package com.wj.flink.project

import java.text.SimpleDateFormat
import java.util
import java.util.Properties

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.shaded.zookeeper.org.apache.zookeeper.jute.compiler.generated.SimpleCharStream
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.flink.util.Collector
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.slf4j.LoggerFactory
object LogAnalysis {

  def main(args: Array[String]): Unit = {
    //use a proper logger in production
    val logger = LoggerFactory.getLogger("LogAnalysis")

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //use event time, i.e. the time at which the log record was produced
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val topic = "wjtest"
    val properties = new Properties()
    properties.setProperty("bootstrap.servers","192.168.162.128:9092")
    properties.setProperty("group.id","test")

    val consumer = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), properties)
    //consume the Kafka data
    val data = env.addSource(consumer)

    val logData = data.map(x => {
      val splits = x.split("\t")
      val level = splits(2)
      val timeStr = splits(3)
      var time = 0l
      try {
        val sourceFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        time = sourceFormat.parse(timeStr).getTime
      } catch {
        case e: Exception =>{
          logger.error(s"time parse error: $timeStr",e.getMessage)
        }
      }
      val domain = splits(5)
      val traffic = splits(6).toLong
      (level, time, domain, traffic)
    }).filter(_._2!=0).filter(_._1=="E")
        .map(x=>{
          (x._2,x._3,x._4)   //drop field 1 (level); keep field 2 time, field 3 domain, field 4 traffic
        })

    /**
     * When doing business processing in production, always think about the robustness
     * of the processing and the correctness of your data: dirty data, or data that does
     * not match the business rules, must be filtered out completely before the actual
     * business logic runs.
     *
     * For this use case we only need to count records with level = "E";
     * records with any other level are not part of the business metric.
     *
     * Data cleansing simply means processing the raw input according to the business
     * rules until it satisfies the business requirements.
     */

//    logData.print().setParallelism(1)

    //use watermarks (created via an anonymous class) to handle out-of-order data
    var resultData = logData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(Long, String, Long)] {
      val maxOutOfOrderness = 10000L

      var currentMaxTimestamp: Long = _  //in Scala, _ initializes the field with its default value
      override def getCurrentWatermark: Watermark = {
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
      }

      override def extractTimestamp(t: (Long, String, Long), l: Long): Long = {
        val timestamp = t._1
        currentMaxTimestamp = Math.max(timestamp,currentMaxTimestamp)
        timestamp
      }
    }).keyBy(1) //key by the domain field
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .apply(new WindowFunction[(Long, String, Long),(String, String, Long),Tuple,TimeWindow] { //type parameters: input, output, key, window
          override def apply(key: Tuple, window: TimeWindow, input: Iterable[(Long, String, Long)], out: Collector[(String, String, Long)]): Unit = {

            val domain = key.getField(0).toString
            var sum = 0l

            val iterator = input.iterator
            val timeArr = new util.ArrayList[Long]()
            val sourceFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm")
            while (iterator.hasNext){
              val next = iterator.next()
              sum+=next._3  //sum up the traffic

              //TODO: next._1 gives us the timestamps that fall into this window
              timeArr.add(next._1)
            }

            /**
             * first element:  the minute, e.g. 2019-09-09 20:20
             * second element: the domain
             * third element:  the sum of the traffic
             */
            val time = sourceFormat.format(timeArr.get(0))
            out.collect((time,domain,sum))
          }
        })  //.print().setParallelism(1)

    //finally, write the results into ES and visualize them in Kibana
    val httpHosts = new util.ArrayList[HttpHost]
    httpHosts.add(new HttpHost("192.168.162.128",9200,"http"))

    val esSinkBuilder = new ElasticsearchSink.Builder[(String,String,Long)](httpHosts,
      new ElasticsearchSinkFunction[(String, String, Long)] {
        def createIndexRequest(element: (String, String, Long)):IndexRequest={
          val json = new util.HashMap[String, Any]()
          json.put("time",element._1)
          json.put("domain",element._2)
          json.put("traffics",element._3)

          val id = element._1+"-"+element._2

          Requests.indexRequest()
            .index("cdn")
            .`type`("traffic")
            .id(id)
            .source(json)
        }

        override def process(t: (String, String, Long), runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
          requestIndexer.add(createIndexRequest(t))
        }
      })

    esSinkBuilder.setBulkFlushMaxActions(1)
    //write the result stream out to ES
    resultData.addSink(esSinkBuilder.build())
    env.execute("LogAnalysis")
  }
}

Deploying ES

Requirements:

  1. CentOS 7.x
  2. A non-root user (here: hadoop)

Add the dependency:

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
      <version>${flink.version}</version>
    </dependency>

Download and extract ES (do not use the root user):

tar -zxvf elasticsearch-6.6.2.tar.gz -C ../install/

Edit the configuration file elasticsearch.yml:

network.host: 0.0.0.0

Start it:

./elasticsearch

./elasticsearch -d   (start in the background)

Finally, open http://192.168.162.128:9200/ and check that it responds.

Deploying Kibana

Download and extract:

tar zxvf kibana-6.6.2-linux-x86_64.tar.gz -C ../install/

Edit kibana.yml:

server.host: "wangjun"

elasticsearch.hosts: ["http://wangjun:9200"]

Start it:

./kibana

Finally,

open http://192.168.162.128:5601/ and check that it works.

Commands for creating the index:

curl -XPUT 'http://wangjun:9200/cdn'    #create the index
curl -H "Content-Type: application/json" -XPOST 'http://wangjun:9200/cdn/traffic/_mapping' -d '{
  "traffic": {
    "properties": {
      "domain":   {"type": "text"},
      "traffics": {"type": "long"},
      "time":     {"type": "date", "format": "yyyy-MM-dd HH:mm"}
    }
  }
}'
#create the type mapping (note: types are no longer recommended from ES 6.x on)

Requirement: CDN business

One user id corresponds to several domains.

The mapping between user id and domain:

The log only contains the domain, so the mapping between user id and domain has to be fetched from another place, a MySQL table.

SQL statements:

create table user_domain_config(
    id int unsigned auto_increment,
    user_id varchar(40) not null,
    domain varchar(40) not null,
    primary key (id)
);
insert into user_domain_config(user_id,domain) values 
('8000000','v1.go2yd.com');
insert into user_domain_config(user_id,domain) values 
('8000000','v2.go2yd.com');
insert into user_domain_config(user_id,domain) values 
('8000000','v3.go2yd.com');
insert into user_domain_config(user_id,domain) values 
('8000000','v4.go2yd.com');
insert into user_domain_config(user_id,domain) values 
('8000000','vmi.go2yd.com');

When cleansing the data in real time we not only process the raw logs, we also have to join them with the data in the MySQL table.

So we write a custom Flink source that reads the MySQL data, and then connect the two streams.

Code example:

package com.wj.flink.project

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

import scala.collection.mutable

/**
 * A custom source that reads from MySQL
 */
class PKMySQLSource extends RichParallelSourceFunction[mutable.HashMap[String,String]]{

  var connection:Connection = null
  var ps:PreparedStatement = null

  //open(): establish the connection
  override def open(parameters: Configuration): Unit = {

    val driver = "com.mysql.jdbc.Driver"
    Class.forName(driver)
    var url = "jdbc:mysql://localhost:3306/flink"
    var user = "root"
    var password = "root"
    connection = DriverManager.getConnection(url,user,password)

    val sql = "select user_id,domain from user_domain_config"
    ps = connection.prepareStatement(sql)
  }

  //release the resources
  override def close(): Unit = {
    if (ps!=null){
      ps.close()
    }
    if (connection!=null){
      connection.close()
    }
  }

  /**
   * The key part: read the rows from the MySQL table and wrap them into a Map
   * @param sourceContext
   */
  override def run(sourceContext: SourceFunction.SourceContext[mutable.HashMap[String, String]]): Unit = {
    println("run function invoke.....")
    val set = ps.executeQuery()
    val map = new mutable.HashMap[String, String]()
    while (set.next()){
      map.put(set.getString(2),set.getString(1));
    }
    sourceContext.collect(map)
  }

  override def cancel(): Unit = {
    //nothing to cancel here: the query is executed only once in run()
  }
}

LogAnalysis02.scala

package com.wj.flink.project

import java.text.SimpleDateFormat
import java.util
import java.util.Properties

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.slf4j.LoggerFactory

import scala.collection.mutable

object LogAnalysis02 {

  def main(args: Array[String]): Unit = {
    //use a proper logger in production
    val logger = LoggerFactory.getLogger("LogAnalysis")

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //use event time, i.e. the time at which the log record was produced
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val topic = "wjtest"
    val properties = new Properties()
    properties.setProperty("bootstrap.servers","192.168.162.128:9092")
    properties.setProperty("group.id","test")

    val consumer = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), properties)
    //consume the Kafka data
    val data = env.addSource(consumer)

    val logData = data.map(x => {
      val splits = x.split("\t")
      val level = splits(2)
      val timeStr = splits(3)
      var time = 0l
      try {
        val sourceFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        time = sourceFormat.parse(timeStr).getTime
      } catch {
        case e: Exception =>{
          logger.error(s"time parse error: $timeStr",e.getMessage)
        }
      }
      val domain = splits(5)
      val traffic = splits(6).toLong
      (level, time, domain, traffic)
    }).filter(_._2!=0).filter(_._1=="E")
        .map(x=>{
          (x._2,x._3,x._4)   //drop field 1 (level); keep field 2 time, field 3 domain, field 4 traffic
        })

    /**
     * When doing business processing in production, always think about the robustness
     * of the processing and the correctness of your data: dirty data, or data that does
     * not match the business rules, must be filtered out completely before the actual
     * business logic runs.
     *
     * For this use case we only need to count records with level = "E";
     * records with any other level are not part of the business metric.
     *
     * Data cleansing simply means processing the raw input according to the business
     * rules until it satisfies the business requirements.
     */

//    logData.print().setParallelism(1)

    val mysqlData = env.addSource(new PKMySQLSource).setParallelism(1)
//    mysqlData.print().setParallelism(1)

    val connectData = logData.connect(mysqlData)
        .flatMap(new CoFlatMapFunction[(Long,String,Long),mutable.HashMap[String, String],String] {

          var userDomainMap = mutable.HashMap[String,String]()
          //log
          override def flatMap1(value: (Long, String, Long), collector: Collector[String]): Unit = {
            print("flatMap1 invoke .....")
            val domain = value._2
            val userId = userDomainMap.getOrElse(domain,"")
            println("~~~~~~~~~~~~~"+userId)

            collector.collect(value._1+"\t"+value._2+"\t"+value._3+"\t"+userId)
          }

          //mysql
          override def flatMap2(value: mutable.HashMap[String, String], collector: Collector[String]): Unit = {
            userDomainMap = value
          }
        })

    connectData.print()
    env.execute("LogAnalysis02")
  }

}

Summary of the data cleansing with Flink:

  1. Read the data from Kafka

  2. Read the data from MySQL

  3. connect the two streams

  4. Business logic: watermarks, WindowFunction

    ==> ES (mind the data types) <== Kibana for the graphical display of the statistics

  5. Kibana: monitoring of every stage, visualized
