Flink comes with an integrated interactive Scala Shell. It can be used in a local setup as well as in a cluster setup.
To use the shell, simply execute:
bin/start-scala-shell.sh local
in the root directory of your binary Flink distribution. To run the shell on a cluster, please see the Setup section below.
Usage
The shell supports the DataSet, DataStream, Table API and SQL. After startup, four different environments are automatically prebound. Use "benv" and "senv" to access the batch and streaming execution environment respectively. Use "btenv" and "stenv" to access the BatchTableEnvironment and StreamTableEnvironment respectively.
DataSet API
The following example will execute the word count program in the Scala shell:
Scala-Flink> val text = benv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,")
Scala-Flink> val counts = text
.flatMap { _.toLowerCase.split("\\W+") }
.map { (_, 1) }.groupBy(0).sum(1)
Scala-Flink> counts.print()
The print() command automatically sends the specified task to the JobManager for execution and shows the result of the computation in the terminal.
It is also possible to write results to a file. However, in that case you need to run your program explicitly by calling:
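The transformation chain above mirrors ordinary Scala collection operations. As a rough, Flink-free sketch (the object name is made up for illustration), the same counts can be computed with plain Scala collections:

```scala
// Plain-Scala sketch of the word count pipeline above (no Flink involved).
object WordCountSketch {
  val lines = Seq(
    "To be, or not to be,--that is the question:--",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune",
    "Or to take arms against a sea of troubles,")

  // flatMap + map + group + sum, matching the shell example step by step
  val counts: Map[String, Int] = lines
    .flatMap(_.toLowerCase.split("\\W+"))          // tokenize, like the Flink flatMap
    .map(w => (w, 1))                              // pair each word with a count of 1
    .groupBy(_._1)                                 // group by word (Flink: groupBy(0))
    .map { case (w, ps) => (w, ps.map(_._2).sum) } // sum the counts (Flink: sum(1))

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(-_._2).foreach(println)
}
```

Flink's groupBy(0).sum(1) addresses tuple fields by position, which is what the `_._1`/`_._2` accesses stand in for here.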
Scala-Flink> benv.execute("MyProgram")
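As a sketch, assuming a writable output location (the path below is purely illustrative), the result could be written with writeAsText before calling execute:

```
Scala-Flink> counts.writeAsText("/tmp/wordcount-result")
Scala-Flink> benv.execute("MyProgram")
```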
DataStream API
Similar to the batch program above, we can execute a streaming program through the DataStream API:
Scala-Flink> val textStreaming = senv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,")
Scala-Flink> val countsStreaming = textStreaming
.flatMap { _.toLowerCase.split("\\W+") }
.map { (_, 1) }.keyBy(0).sum(1)
Scala-Flink> countsStreaming.print()
Scala-Flink> senv.execute("Streaming Wordcount")
Note that in the streaming case, the print operation does not trigger execution directly.
The Flink Shell comes with command history and auto-completion.
Table API
The following example is a word count program using the Table API (in both streaming and batch):
// Stream
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = stenv.fromDataStream(
senv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> val split = new $Split
Scala-Flink> textSource.join(split('text) as 'word).
groupBy('word).select('word, 'word.count as 'count).
toRetractStream[(String, Long)].print
Scala-Flink> senv.execute("Table Wordcount")
// Batch
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = btenv.fromDataSet(
benv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> val split = new $Split
Scala-Flink> textSource.join(split('text) as 'word).
groupBy('word).select('word, 'word.count as 'count).
toDataSet[(String, Long)].print
Note that using $ as a prefix for the TableFunction class name works around a Scala issue with incorrectly generated inner class names.
SQL
The following example is a word count program written in SQL (in both streaming and batch):
// Stream
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = stenv.fromDataStream(
senv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> stenv.registerTable("text_source", textSource)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> stenv.registerFunction("split", new $Split)
Scala-Flink> val result = stenv.sqlQuery("""SELECT T.word, count(T.word) AS `count`
FROM text_source
JOIN LATERAL table(split(text)) AS T(word)
ON TRUE
GROUP BY T.word""")
Scala-Flink> result.toRetractStream[(String, Long)].print
Scala-Flink> senv.execute("SQL Wordcount")
// Batch
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = btenv.fromDataSet(
benv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> btenv.registerTable("text_source", textSource)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> btenv.registerFunction("split", new $Split)
Scala-Flink> val result = btenv.sqlQuery("""SELECT T.word, count(T.word) AS `count`
FROM text_source
JOIN LATERAL table(split(text)) AS T(word)
ON TRUE
GROUP BY T.word""")
Scala-Flink> result.toDataSet[(String, Long)].print
Adding external dependencies
It is possible to add external classpaths to the Scala Shell. These will be sent to the JobManager automatically together with your shell program when calling execute.
Use the parameter -a <path/to/jar.jar> or --addclasspath <path/to/jar.jar> to load additional classes.
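For example, assuming a hypothetical connector jar, a local shell with an extra classpath entry could be started like this (the jar path is illustrative):

```
bin/start-scala-shell.sh local --addclasspath /path/to/my-connector.jar
```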
Setup
To get an overview of the options the Scala Shell provides, use:
bin/start-scala-shell.sh --help
Local
To use the shell with a local Flink cluster, simply execute:
bin/start-scala-shell.sh local
Remote
To use it with a running cluster, start the Scala Shell with the keyword remote and supply the host and port of the JobManager:
bin/start-scala-shell.sh remote <hostname> <portnumber>
Yarn Scala Shell cluster
The shell can deploy a Flink cluster on YARN for its exclusive use. The number of YARN containers can be controlled by the parameter -n <arg>. The shell deploys a new Flink cluster on YARN and connects to it. You can also specify options for the YARN cluster, such as the memory for the JobManager, the name of the YARN application, and so on.
For example, to start a Scala Shell with a YARN cluster of two TaskManagers:
bin/start-scala-shell.sh yarn -n 2
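As a fuller sketch, the yarn mode accepts several of the options listed in the reference below in one invocation; all values here are illustrative:

```
bin/start-scala-shell.sh yarn -n 2 -s 4 -tm 2048 -nm scala-shell-demo
```

This requests two TaskManager containers with four slots and 2048 MB of memory each, and names the YARN application scala-shell-demo.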
For all other options, see the full reference at the bottom of this section.
Yarn Session
If you have previously deployed a Flink cluster using a Flink YARN session, the Scala Shell can connect to it with the following command:
bin/start-scala-shell.sh yarn
Full Reference
Flink Scala Shell
Usage: start-scala-shell.sh [local|remote|yarn] [options] <args>...
Command: local [options]
Starts Flink scala shell with a local Flink cluster
-a <path/to/jar> | --addclasspath <path/to/jar>
Specifies additional jars to be used in Flink
Command: remote [options] <host> <port>
Starts Flink scala shell connecting to a remote cluster
<host>
Remote host name as string
<port>
Remote port as integer
-a <path/to/jar> | --addclasspath <path/to/jar>
Specifies additional jars to be used in Flink
Command: yarn [options]
Starts Flink scala shell connecting to a yarn cluster
-n arg | --container arg
Number of YARN containers to allocate (= Number of TaskManagers)
-jm arg | --jobManagerMemory arg
Memory for JobManager container with optional unit (default: MB)
-nm <value> | --name <value>
Set a custom name for the application on YARN
-qu <arg> | --queue <arg>
Specifies YARN queue
-s <arg> | --slots <arg>
Number of slots per TaskManager
-tm <arg> | --taskManagerMemory <arg>
Memory per TaskManager container with optional unit (default: MB)
-a <path/to/jar> | --addclasspath <path/to/jar>
Specifies additional jars to be used in Flink
--configDir <value>
The configuration directory.
-h | --help
Prints this usage text