Flink demo

1. pom.xml

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-java</artifactId>
  <version>1.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.11</artifactId>
  <version>1.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients_2.11</artifactId>
  <version>1.6.0</version>
</dependency>

Scala API: To use the Scala API, replace the flink-java artifact id with flink-scala_2.11 and flink-streaming-java_2.11 with flink-streaming-scala_2.11.
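
For reference, the Scala equivalents of the dependencies above would look like this (same 1.6.0 version as the rest of this post):

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-scala_2.11</artifactId>
  <version>1.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.11</artifactId>
  <version>1.6.0</version>
</dependency>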

2. Using files as a data source

readTextFile(String path): reads the file with TextInputFormat by default and returns each line as a String;

readTextFileWithValue(String path): returns StringValues; a StringValue is a mutable string;

readCsvFile(String path): returns Java POJOs or tuples (see the sketch after the WordCount example below);

readFileOfPrimitives(path, delimiter, class): parses each line into the given class;

readHadoopFile(FileInputFormat, Key, Value, path): reads a Hadoop file, specifying the path, input format, and key/value classes;

readSequenceFile(Key, Value, path): reads a file in SequenceFile format; the key and value classes must be specified as well.

For example, a batch WordCount that reads its input with readTextFile:
val env = ExecutionEnvironment.getExecutionEnvironment

// get input data
val text = env.readTextFile("/path/to/file")

val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)

counts.writeAsCsv(outputPath, "\n", " ")
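
To complement the readTextFile example above, a small sketch of readCsvFile using the Java API (the file path and column layout are hypothetical):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Hypothetical CSV file with lines like "alice,42"; the two columns map to a Tuple2<String, Integer>.
DataSet<Tuple2<String, Integer>> csv = env
        .readCsvFile("/path/to/file.csv")
        .fieldDelimiter(",")
        .types(String.class, Integer.class);

csv.print();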

3. Using a socket (console input) as a data source

$ nc -l 9000

abc,sad,as

asd,a

bv

Then submit the Flink program (the SocketWordCount source follows the commands):

$ ./bin/flink run examples/streaming/SocketWordCount.jar --port 9000

$ bin/flink run examples/streaming/SocketWordCount.jar \
  --hostname slave01 \
  --port 9000
package cn.com.xxx.zzy;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * Streaming word count: reads text from a socket, splits it into words,
 * and counts the words over a 5-second window that slides every second.
 */
public class SocketWordCount {

    public static void main(String[] args) throws Exception {
        // the host and port to connect to (--hostname is optional and defaults to localhost)
        final String hostname;
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            hostname = params.get("hostname", "localhost");
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWordCount [--hostname <hostname>] --port <port>'");
            return;
        }

        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // get input data by connecting to the socket
        DataStream<String> text = env.socketTextStream(hostname, port, "\n");

        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                .keyBy("word")
                .timeWindow(Time.seconds(5), Time.seconds(1))
                .reduce(new ReduceFunction<WordWithCount>() {
                    @Override
                    public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                        return new WordWithCount(a.word, a.count + b.count);
                    }
                });

        // print the results with a single thread, rather than in parallel
        windowCounts.print().setParallelism(1);

        env.execute("Socket WordCount");

    }

    // Data type for words with count
    public static class WordWithCount {

        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}

4. Using Java collections as a data source

fromCollection(Collection): creates a data set from a Java collection;

fromCollection(Iterator, Class): can also read from an iterator, with the given class as the element type;

fromElements(T ...): creates a data set from a sequence of objects;

fromParallelCollection(SplittableIterator, Class): reads from an iterator in parallel;

generateSequence(from, to): generates a sequence of numbers in the given range.

For example, using fromElements:
package com.gr.dologic

import java.util

import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala._

object flink1 {
  def main(args: Array[String]): Unit = {

    val env = ExecutionEnvironment.getExecutionEnvironment

    // An in-memory Java collection; fromCollection can also turn a collection like this into a DataSet
    val list = new util.ArrayList[Int]()
    list.add(1)
    list.add(2)
    list.add(3)

    val stream = env.fromElements(1, 2, 3, 4, 3, 4, 3, 3, 5)
      .filter(_ >= 2)
      .map(x => (x, 1))
      .groupBy(0)
      .aggregate(Aggregations.SUM, 1) // equivalent to .sum(1)

    stream.print()
    // env.execute()
    /* Calling env.execute() after print() fails with:
       "Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined
       since the last execution. The last execution refers to the latest call to 'execute()',
       'count()', 'collect()', or 'print()'."
       print() already triggers execution, so the explicit env.execute() call must be left out. */

  }
}

5. Using Kafka as a data source

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")

val stream = env.addSource(new FlinkKafkaConsumer09[String]("mytopic", new SimpleStringSchema(), properties)) //.print

Computing the average temperature per key over a 300-second window:

DataStream<Tuple2<String, Double>> keyedStream = env
        .addSource(new FlinkKafkaConsumer09<String>("mytopic", new SimpleStringSchema(), properties))
        .flatMap(new Splitter())
        .keyBy(0)
        .timeWindow(Time.seconds(300))
        .apply(new WindowFunction<Tuple2<String, Double>, Tuple2<String, Double>, Tuple, TimeWindow>() {
            @Override
            public void apply(Tuple key, TimeWindow window, Iterable<Tuple2<String, Double>> input,
                              Collector<Tuple2<String, Double>> out) throws Exception {
                double sum = 0;
                int count = 0;
                for (Tuple2<String, Double> record : input) {
                    sum += record.f1;
                    count++;
                }
                Tuple2<String, Double> result = input.iterator().next();
                result.f1 = sum / count;
                out.collect(result);
            }
        });
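
Splitter is referenced above but never defined in the original; a minimal sketch, assuming each Kafka record is a comma-separated "sensorId,temperature" line (uses FlatMapFunction, Collector, and org.apache.flink.api.java.tuple.Tuple2):

// Hypothetical implementation of the Splitter used above:
// parses lines like "sensor_1,36.5" into (sensorId, temperature) tuples.
public static class Splitter implements FlatMapFunction<String, Tuple2<String, Double>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Double>> out) {
        String[] fields = value.split(",");
        if (fields.length == 2) {
            out.collect(new Tuple2<>(fields[0], Double.parseDouble(fields[1])));
        }
    }
}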

For fault tolerance, remember to enable checkpointing.
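
A minimal sketch of enabling checkpointing on the streaming environment (the 5-second interval is just an illustrative value; CheckpointingMode comes from org.apache.flink.streaming.api):

// Checkpoint the job state every 5 seconds; EXACTLY_ONCE is the default guarantee.
env.enableCheckpointing(5000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);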

A Kafka producer can also be used as a sink:

stream.addSink(new FlinkKafkaProducer09<String>("localhost:9092","mytopic",new SimpleStringSchema()));

6. Using a relational database as a data source
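
The original leaves this section empty; a minimal sketch of reading from MySQL with JDBCInputFormat, assuming the flink-jdbc dependency and a JDBC driver are on the classpath (the URL, table, and credentials are placeholders):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Build a JDBC input format: driver, connection URL, query, and the row type of the result.
JDBCInputFormat jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername("com.mysql.jdbc.Driver")
        .setDBUrl("jdbc:mysql://localhost:3306/test")   // placeholder database
        .setUsername("user")                            // placeholder credentials
        .setPassword("password")
        .setQuery("SELECT id, name FROM person")        // placeholder table
        .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO))
        .finish();

DataSet<Row> rows = env.createInput(jdbcInput);
rows.print();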

 

7. Table API

Flink provides a relational interface for both batch and stream processing, called the Table API. Once a DataSet/DataStream has been registered as a table, relational operations such as aggregation, join, and select can be applied to it.

Tables can also be queried with standard SQL; once the query has been applied, the resulting table is converted back into a DataSet/DataStream (see the sketch after the dependency below). Internally, Flink uses the open-source framework Apache Calcite to optimize these queries.
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-table_2.11</artifactId>
  <version>1.6.0</version>
</dependency>
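
A minimal sketch of registering a DataSet as a table and querying it with SQL, using the Java Table API as of Flink 1.6 (the table name, column names, and sample data are made up for illustration):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

// Register a DataSet of (word, count) pairs as a table with named columns.
DataSet<Tuple2<String, Integer>> words = env.fromElements(
        Tuple2.of("flink", 1), Tuple2.of("kafka", 2), Tuple2.of("flink", 3));
tableEnv.registerDataSet("WordCounts", words, "word, cnt");

// Run standard SQL on the registered table, then convert the result back to a DataSet.
Table result = tableEnv.sqlQuery("SELECT word, SUM(cnt) AS total FROM WordCounts GROUP BY word");
DataSet<Row> out = tableEnv.toDataSet(result, Row.class);
out.print();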

8. Reference: Mastering Apache Flink reading notes, parts 1-5

https://blog.csdn.net/lmalds/article/details/60867262
