Flink API Overview
1. DataStream data sources
1.1 Socket source: socketTextStream
Receive data from a socket and count how many times each word appears over the last 5 seconds.
Step 1: start a socket service on node01
Run the following command on node01 to open the socket service:
nc -lk 9000
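Leave this nc session running: every line you type into it is pushed to the connected Flink job, so you can feed test sentences interactively while the job runs.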
Step 2: implement the code
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
object FlinkSource1 {
def main(args: Array[String]): Unit = {
//Get the program entry point: the execution environment
val streamExecution: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val socketText: DataStream[String] = streamExecution.socketTextStream("node01",9000)
//Note: this implicit-conversion import is required; without it the flatMap call below will not compile
import org.apache.flink.api.scala._
val result: DataStream[(String, Int)] = socketText.flatMap(x => x.split(" "))
.map(x => (x, 1))
.keyBy(0)
.timeWindow(Time.seconds(5), Time.seconds(5)) //5-second window; size equals slide, so it behaves as a tumbling window
.sum(1)
//Print the results
result.print().setParallelism(1)
//Run the job
streamExecution.execute()
}
}
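Because the window size and slide above are both 5 seconds, each record is counted exactly once per window. As a hedged sketch of the sliding variant (the 1-second slide is an illustrative choice, not part of the original example), the same pipeline can re-emit the count over the last 5 seconds every second:
//Sliding-window sketch; assumes the same socketText stream as above
val sliding: DataStream[(String, Int)] = socketText
.flatMap(x => x.split(" "))
.map(x => (x, 1))
.keyBy(0)
.timeWindow(Time.seconds(5), Time.seconds(1)) //5-second window, recomputed every second
.sum(1)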
1.2 File source: readTextFile
Read and process the data of all files under an HDFS path.
Step 1: add the Maven dependencies
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0-mr1-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0-cdh5.14.2</version>
</dependency>
</dependencies>
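The POM above only pulls in the Hadoop client libraries. The Scala examples in this section also assume the Flink Scala API is on the classpath; a minimal sketch of the extra dependency (the version and Scala suffix here are assumptions, match them to your cluster), placed inside the same <dependencies> block:
<!-- Assumed Flink dependency; adjust version and Scala suffix to your setup -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>1.9.1</version>
</dependency>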
Step 2: implement the code
object FlinkSource2 {
def main(args: Array[String]): Unit = {
val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
//Read data from the text files under the HDFS path
val hdfStream: DataStream[String] = executionEnvironment.readTextFile("hdfs://node01:8020/flink_input/")
val result: DataStream[(String, Int)] = hdfStream.flatMap(x => x.split(" "))
.map(x => (x, 1))
.keyBy(0)
.sum(1)
result.print().setParallelism(1)
executionEnvironment.execute("hdfsSource")
}
}
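Note that readTextFile scans the path once and the job finishes after all files have been read. To keep watching the directory for new files, Flink also offers readFile with FileProcessingMode.PROCESS_CONTINUOUSLY; a minimal sketch (the 10-second re-scan interval and object name are illustrative assumptions):
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._
object FlinkSource2Continuous {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val path = "hdfs://node01:8020/flink_input/"
//Re-scan the directory every 10 seconds (10000 ms) and emit data from newly added files
val monitored: DataStream[String] = env.readFile(new TextInputFormat(new Path(path)), path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10000L)
monitored.print().setParallelism(1)
env.execute("continuousHdfsSource")
}
}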
1.3 Collection source: fromElements
Code implementation:
// Scala: word count
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
object FlinkSource3 {
def main(args: Array[String]): Unit = {
val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val value: DataStream[String] = environment.fromElements[String]("hello world","spark flink")
val result2: DataStream[(String, Int)] = value.flatMap(x => x.split(" "))
.map(x => (x, 1))
.keyBy(0)
.sum(1)
result2.print().setParallelism(1)
environment.execute()
}
}
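fromElements takes individual values as varargs, while its sibling fromCollection accepts a java.util.Collection, as the Java example below shows. Both create a non-parallel source that runs with parallelism 1, so they are intended for testing rather than production workloads.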
// Java: add a prefix to each word
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingSourceFromCollection {
public static void main(String[] args) throws Exception {
//Step 1: get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Step 2: mock some input data
ArrayList<String> data = new ArrayList<>(Arrays.asList("hello world", "spark flink"));
//Step 3: build a stream from the collection, prefix each element, and print it
env.fromCollection(data).map(word -> "prefix_" + word).print().setParallelism(1);
//Step 4: run the job
env.execute("StreamingSourceFromCollection");
}
}
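With the sample data above (the two strings are illustrative), the job prints prefix_hello world and prefix_spark flink; setParallelism(1) on the print sink keeps the output in a single task, so the lines are not interleaved across subtasks.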