Building Your First Flink Application: WordCount

This article is about 5,143 characters long and takes roughly 13 minutes to read.

A hands-on "hello world" with Flink.

Use Maven to scaffold a first Flink WordCount application, package it, upload it to a Flink standalone cluster, and run it.

1

Purpose of This Document

  • Generate a Flink template project with Maven

  • Develop the WordCount application

2

Building the Maven Project

In the directory where the project should live, generate a Maven project from the Flink quickstart archetype:

mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.10.1

Running the command prompts for the Maven project's groupId, artifactId, and version; enter them as prompted, for example as sketched below.
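
The prompts look roughly like the following (the groupId here matches the package used in the later Java samples; the artifactId and version are illustrative placeholders, so use whatever fits your project):

Define value for property 'groupId': com.eights
Define value for property 'artifactId': flink-wordcount
Define value for property 'version': 1.0-SNAPSHOT
Define value for property 'package': com.eights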

Import the project into IDEA, add the flink-scala dependencies, and remove the provided scope from the template's Flink Java dependencies so the job can run directly in the IDE:

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
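
Both dependencies reference ${scala.binary.version} and ${flink.version}. The quickstart pom normally defines these properties already; if yours does not, a matching properties block would look roughly like this (versions assumed to match the archetype used above):

    <properties>
      <flink.version>1.10.1</flink.version>
      <scala.binary.version>2.11</scala.binary.version>
    </properties>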

Add the Scala compiler plugin, plus the maven-assembly-plugin so that package produces a fat jar:

      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.4.6</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.0.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

3

Scala

StreamingWordCount

Local debugging

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object StreamingWordCount {

  val HOST:String = "localhost"
  val PORT:Int = 9001

  /**
   * stream word count
   * @param args input params
   */
  def main(args: Array[String]): Unit = {

    //get streaming env
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //get socket text stream
    val wordsDstream: DataStream[String] = env.socketTextStream(HOST, PORT)

    import org.apache.flink.api.scala._

    //word count
    val wordRes: DataStream[(String, Int)] = wordsDstream.flatMap(_.split(","))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    wordRes.print()
    
    env.execute("Flink Streaming WordCount!")
  }
}

Start the application, then open a socket in a terminal and type words into it:

nc -lk 9001

Type a stream of comma-separated words into the terminal, and the running counts appear in the streaming application's console; a sketch of the expected input and output is shown below. With that, the local debugging of the streaming word count is done.
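
As a rough sketch of what to expect (the exact subtask prefixes and line order depend on your parallelism), typing these lines into the nc terminal:

flink,spark
flink

produces rolling counts in the application's console similar to:

3> (flink,1)
1> (spark,1)
3> (flink,2)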

Running on the cluster

Start the cluster using the Flink 1.10.1 package built in the earlier article:

./bin/start-cluster.sh

Open localhost:8081 in a browser to reach the Flink web UI.

Under Submit New Job, upload the application jar you just built (running package in Maven is enough to produce it), then click Submit to run the job; see the sketch below.
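
A minimal packaging and submission sketch (the jar name depends on the artifactId and version you chose when generating the project, and com.eights.StreamingJob assumes the Java streaming class from section 4; adjust both to your setup):

mvn clean package

# upload target/flink-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar in the web UI,
# or submit it from the command line:
./bin/flink run -c com.eights.StreamingJob target/flink-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar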

Type comma-separated words into the nc terminal.

The counts appear under stdout in the Task Managers page.

BatchWordCount

import org.apache.flink.api.scala.ExecutionEnvironment

object BatchWordCount {

  /**
   * batch word count
   *
   * @param args input params
   */
  def main(args: Array[String]): Unit = {

    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    import org.apache.flink.api.scala._

    val words: DataSet[String] = env.fromElements("spark,flink,hbase", "impala,hbase,kudu", "flink,flink,flink")

    //word count
    val wordRes: AggregateDataSet[(String, Int)] = words.flatMap(_.split(","))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    wordRes.print()
  }
}

The output looks like the following.
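
Since the input elements are fixed, the counts are deterministic, so the printed result should be close to this (line order may vary):

(spark,1)
(impala,1)
(kudu,1)
(hbase,2)
(flink,4)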

4

Java

BatchWordCount

package com.eights;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.apache.flink.util.StringUtils;

public class BatchJob {

    public static void main(String[] args) throws Exception {
        // set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSource<String> words = env.fromElements("spark,flink,hbase", "impala,hbase,kudu", "flink,flink,flink");

        AggregateOperator<Tuple2<String, Integer>> wordCount = words.flatMap(new WordLineSplitter())
                .groupBy(0)
                .sum(1);

        wordCount.print();

    }

    public static final class WordLineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
            String[] wordsArr = s.split(",");

            for (String word : wordsArr) {
                if (!StringUtils.isNullOrWhitespaceOnly(word)) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }

        }
    }
}

The output should match the Scala batch job above.

StreamingWordCount

package com.eights;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.apache.flink.util.StringUtils;

public class StreamingJob {

    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String HOST = "localhost";
        int PORT = 9001;

        DataStreamSource<String> wordsSocketStream = env.socketTextStream(HOST, PORT);

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordRes = wordsSocketStream.flatMap(new WordsLineSplitter())
                .keyBy(0)
                .sum(1);

        wordRes.print();

        // execute program
        env.execute("Flink Streaming Java API Word Count");
    }

    private static class WordsLineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
            String[] wordsArr = s.split(",");

            for (String word : wordsArr) {
                if (!StringUtils.isNullOrWhitespaceOnly(word)) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}

The output is analogous to the local Scala streaming run shown earlier; each printed tuple is prefixed with the subtask index (e.g. 3>).

Ps:

The main point of writing these posts is to keep a record of my own big-data learning path, noting the pitfalls I hit and how I worked through them. I'll try to keep publishing two posts a week and look back on this big-data journey a year from now.
