Flink comes with an integrated interactive Scala Shell. It can be used in a local setup as well as in a cluster setup.
To use the shell, simply execute:
bin/start-scala-shell.sh local
in the root directory of your binary Flink distribution. To run the shell on a cluster, please see the Setup section below.
Usage
The shell supports the DataSet, DataStream, Table API and SQL. After startup, four different environments are automatically prebound. Use "benv" and "senv" to access the batch and streaming execution environment respectively. Use "btenv" and "stenv" to access the BatchTableEnvironment and StreamTableEnvironment respectively.
DataSet API
The following example will execute the word count program in the Scala shell:
Scala-Flink> val text = benv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,")
Scala-Flink> val counts = text
.flatMap { _.toLowerCase.split("\\W+") }
.map { (_, 1) }.groupBy(0).sum(1)
Scala-Flink> counts.print()
The print() command automatically sends the specified task to the JobManager for execution and shows the result of the computation in the terminal.
It is also possible to write results to a file. However, in that case you need to run your program explicitly by calling:
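The transformation chain above mirrors ordinary Scala collection operations. As a rough, Flink-free sketch (the object name is made up for illustration), the same counts can be computed with plain Scala collections:

```scala
// Plain-Scala sketch of the word count pipeline above (no Flink involved).
object WordCountSketch {
  val lines = Seq(
    "To be, or not to be,--that is the question:--",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune",
    "Or to take arms against a sea of troubles,")

  // flatMap + map + group + sum, matching the shell example step by step
  val counts: Map[String, Int] = lines
    .flatMap(_.toLowerCase.split("\\W+"))          // tokenize, like the Flink flatMap
    .map(w => (w, 1))                              // pair each word with a count of 1
    .groupBy(_._1)                                 // group by word (Flink: groupBy(0))
    .map { case (w, ps) => (w, ps.map(_._2).sum) } // sum the counts (Flink: sum(1))

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(-_._2).foreach(println)
}
```

Flink's groupBy(0).sum(1) addresses tuple fields by position, which is what the `_._1`/`_._2` accesses stand in for here.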
Scala-Flink> benv.execute("MyProgram")
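As a sketch, assuming a writable output location (the path below is purely illustrative), the result could be written with writeAsText before calling execute:

```
Scala-Flink> counts.writeAsText("/tmp/wordcount-result")
Scala-Flink> benv.execute("MyProgram")
```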
DataStream API
Similar to the batch program above, we can execute a streaming program through the DataStream API:
Scala-Flink> val textStreaming = senv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,")
Scala-Flink> val countsStreaming = textStreaming
.flatMap { _.toLowerCase.split("\\W+") }
.map { (_, 1) }.keyBy(0).sum(1)
Scala-Flink> countsStreaming.print()
Scala-Flink> senv.execute("Streaming Wordcount")
Note that in the streaming case, the print operation does not trigger execution directly.
The Flink Shell comes with command history and auto-completion.
Table API
The following example is a word count program using the Table API (in both streaming and batch):
// Stream
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = stenv.fromDataStream(
senv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> val split = new $Split
Scala-Flink> textSource.join(split('text) as 'word).
groupBy('word).select('word, 'word.count as 'count).
toRetractStream[(String, Long)].print
Scala-Flink> senv.execute("Table Wordcount")
// Batch
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = btenv.fromDataSet(
benv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> val split = new $Split
Scala-Flink> textSource.join(split('text) as 'word).
groupBy('word).select('word, 'word.count as 'count).
toDataSet[(String, Long)].print
Note that using $ as a prefix for the TableFunction class name works around a Scala issue with incorrectly generated inner class names.
SQL
The following example is a word count program written in SQL (in both streaming and batch):
// Stream
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = stenv.fromDataStream(
senv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> stenv.registerTable("text_source", textSource)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> stenv.registerFunction("split", new $Split)
Scala-Flink> val result = stenv.sqlQuery("""SELECT T.word, count(T.word) AS `count`
FROM text_source
JOIN LATERAL table(split(text)) AS T(word)
ON TRUE
GROUP BY T.word""")
Scala-Flink> result.toRetractStream[(String, Long)].print
Scala-Flink> senv.execute("SQL Wordcount")
// Batch
Scala-Flink> import org.apache.flink.table.functions.TableFunction
Scala-Flink> val textSource = btenv.fromDataSet(
benv.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,"),
'text)
Scala-Flink> btenv.registerTable("text_source", textSource)
Scala-Flink> class $Split extends TableFunction[String] {
def eval(s: String): Unit = {
s.toLowerCase.split("\\W+").foreach(collect)
}
}
Scala-Flink> btenv.registerFunction("split", new $Split)
Scala-Flink> val result = btenv.sqlQuery("""SELECT T.word, count(T.word) AS `count`
FROM text_source
JOIN LATERAL table(split(text)) AS T(word)
ON TRUE
GROUP BY T.word""")
Scala-Flink> result.toDataSet[(String, Long)].print
Adding external dependencies
It is possible to add external classpaths to the Scala Shell. These will be sent to the JobManager automatically together with your shell program when calling execute.
Use the parameter -a <path/to/jar.jar> or --addclasspath <path/to/jar.jar> to load additional classes.
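For example, assuming a hypothetical connector jar, a local shell with an extra classpath entry could be started like this (the jar path is illustrative):

```
bin/start-scala-shell.sh local --addclasspath /path/to/my-connector.jar
```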
Setup
To get an overview of the options the Scala Shell provides, use:
bin/start-scala-shell.sh --help
Local
To use the shell with a local Flink cluster, simply execute:
bin/start-scala-shell.sh local
Remote
To use it with a running cluster, start the Scala Shell with the keyword remote and supply the host and port of the JobManager:
bin/start-scala-shell.sh remote <hostname> <portnumber>
Yarn Scala Shell cluster
The shell can deploy a Flink cluster on YARN for its exclusive use. The number of YARN containers can be controlled by the parameter -n <arg>. The shell deploys a new Flink cluster on YARN and connects to it. You can also specify options for the YARN cluster, such as the memory for the JobManager, the name of the YARN application, and so on.
For example, to start a Scala Shell with a YARN cluster of two TaskManagers:
bin/start-scala-shell.sh yarn -n 2
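As a fuller sketch, the yarn mode accepts several of the options listed in the reference below in one invocation; all values here are illustrative:

```
bin/start-scala-shell.sh yarn -n 2 -s 4 -tm 2048 -nm scala-shell-demo
```

This requests two TaskManager containers with four slots and 2048 MB of memory each, and names the YARN application scala-shell-demo.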
For all other options, see the full reference at the bottom of this section.
Yarn Session
If you have previously deployed a Flink cluster using a Flink YARN session, the Scala Shell can connect to it with the following command:
bin/start-scala-shell.sh yarn
Full Reference
Flink Scala Shell
Usage: start-scala-shell.sh [local|remote|yarn] [options] <args>...
Command: local [options]
Starts Flink scala shell with a local Flink cluster
-a <path/to/jar> | --addclasspath <path/to/jar>
Specifies additional jars to be used in Flink
Command: remote [options] <host> <port>
Starts Flink scala shell connecting to a remote cluster
<host>
Remote host name as string
<port>
Remote port as integer
-a <path/to/jar> | --addclasspath <path/to/jar>
Specifies additional jars to be used in Flink
Command: yarn [options]
Starts Flink scala shell connecting to a yarn cluster
-n arg | --container arg
Number of YARN containers to allocate (= Number of TaskManagers)
-jm arg | --jobManagerMemory arg
Memory for JobManager container with optional unit (default: MB)
-nm <value> | --name <value>
Set a custom name for the application on YARN
-qu <arg> | --queue <arg>
Specifies YARN queue
-s <arg> | --slots <arg>
Number of slots per TaskManager
-tm <arg> | --taskManagerMemory <arg>
Memory per TaskManager container with optional unit (default: MB)
-a <path/to/jar> | --addclasspath <path/to/jar>
Specifies additional jars to be used in Flink
--configDir <value>
The configuration directory.
-h | --help
Prints this usage text