[Learning Big Data from Scratch] Study Notes on the Official Flink Tutorial (Part 1)

Learning resources

Basic Scala syntax

The Scala Book: https://docs.scala-lang.org/overviews/scala-book/

Scala Basics: https://docs.scala-lang.org/tour/basics.html

Scala data structures

  • Scala lists:
    • https://www.runoob.com/scala/scala-lists.html
    • https://docs.scala-lang.org/overviews/scala-book/list-class.html#inner-main
    • A Scala List is immutable!
    • Index with apply: xs(0) is shorthand for xs.apply(0)
  • Scala tuples: https://www.runoob.com/scala/scala-tuples.html
  • Scala Set:
    • https://docs.scala-lang.org/scala3/book/collections-classes.html#working-with-sets
    • A Set has no positional index; take(n) returns a new set with the first n elements in iteration order
    • Ordered (insertion-ordered) sets: https://www.cnblogs.com/zhaohadoopone/p/9534982.html
val s = scala.collection.mutable.LinkedHashSet[Tuple2[String, String]]()
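The notes above on List, apply, take, and LinkedHashSet can be tried out in one small sketch. The object name CollectionsDemo and the sample values are made up for illustration.

```scala
import scala.collection.mutable

object CollectionsDemo {
  // List is immutable: prepending returns a new list and leaves xs unchanged.
  val xs: List[Int] = List(1, 2, 3)
  val ys: List[Int] = 0 :: xs

  // xs(1) is sugar for xs.apply(1); indexing is zero-based.
  val second: Int = xs(1)

  // LinkedHashSet keeps insertion order and ignores duplicates.
  val pairs = mutable.LinkedHashSet[(String, String)]()
  pairs += (("b", "2"))
  pairs += (("a", "1"))
  pairs += (("b", "2")) // duplicate, ignored

  // take(n) returns a new set holding the first n elements in iteration order.
  val firstPair = pairs.take(1)

  def main(args: Array[String]): Unit = {
    println(xs)               // List(1, 2, 3)
    println(ys)               // List(0, 1, 2, 3)
    println(second)           // 2
    println(pairs.toList)     // List((b,2), (a,1))
    println(firstPair.toList) // List((b,2))
  }
}
```

Because a Set has no positional index, take(1) followed by head (or simply head) is the usual way to read "the first" element of an ordered set.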
Declaring variables
var x = 2 // mutable variable
x = x + 1 // correct
val y = 2 // immutable value (constant)
// y = y + 1 // wrong: reassignment to a val does not compile
Blocks
  • The last expression in a block is the value of the block.
println({
  val x = 1 + 1
  x + 1 // the last expression is the result of the whole block
}) // 3
Functions
  • Definition: val name = (param: ParamType) => result expression
val addOne = (x: Int) => x + 1
println(addOne(1)) // 2

The parameter list can be empty:

val getTheAnswer = () => 42
println(getTheAnswer()) // 42
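  • A function (or a block of code) can itself be passed as a parameter; here both arguments are by-name parameters:
def whileLoop(condition: => Boolean)(body: => Unit): Unit =
  if (condition) { body; whileLoop(condition)(body) }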
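To see how code is passed as an argument, here is a minimal sketch that calls a by-name whileLoop and also passes an ordinary function value. The names ByNameDemo, applyTwice, and countdown are hypothetical helpers added for illustration.

```scala
import scala.collection.mutable.ListBuffer

object ByNameDemo {
  // condition and body are by-name parameters: each is re-evaluated
  // every time it is referenced inside the method.
  def whileLoop(condition: => Boolean)(body: => Unit): Unit =
    if (condition) {
      body
      whileLoop(condition)(body)
    }

  // A function value passed as an ordinary argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Collects from, from - 1, ..., 1 using whileLoop.
  def countdown(from: Int): List[Int] = {
    var i = from
    val seen = ListBuffer[Int]()
    whileLoop(i > 0) {
      seen += i
      i -= 1
    }
    seen.toList
  }

  def main(args: Array[String]): Unit = {
    println(countdown(3))                     // List(3, 2, 1)
    println(applyTwice((x: Int) => x + 1, 1)) // 3
  }
}
```

The second parameter list is what lets the call site look like a built-in while loop with a braces-delimited body.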
Methods

def methodName(param: ParamType, …): ReturnType = body
The last expression of the body is the return value.

def add(x: Int, y: Int): Int = x + y
Traits (interfaces)

A trait can contain its own methods and can be extended and implemented, but it cannot be instantiated on its own.

  • Useful as generic types and with abstract methods.
trait Iterator[A] { def hasNext: Boolean; def next(): A }

Extending the trait Iterator[A] requires a type A and implementations of the methods hasNext and next.

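  • Inheritance is written as class ... extends ...:
class IntIterator(to: Int) extends Iterator[Int] {
  private var current = 0
  override def hasNext: Boolean = current < to
  // Return the current value and advance; return 0 once exhausted.
  override def next(): Int =
    if (hasNext) { val t = current; current += 1; t } else 0
}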
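A self-contained sketch of the trait and its implementation, plus a generic helper that works against the trait rather than the concrete class. The wrapper object TraitDemo and the helper drain are assumptions added for illustration.

```scala
import scala.collection.mutable.ListBuffer

object TraitDemo {
  // Declared inside this object, so it shadows Scala's built-in Iterator here.
  trait Iterator[A] {
    def hasNext: Boolean
    def next(): A
  }

  class IntIterator(to: Int) extends Iterator[Int] {
    private var current = 0
    override def hasNext: Boolean = current < to
    // Return the current value and advance; return 0 once exhausted.
    override def next(): Int =
      if (hasNext) { val t = current; current += 1; t } else 0
  }

  // Works with any implementation of the trait.
  def drain[A](it: Iterator[A]): List[A] = {
    val out = ListBuffer[A]()
    while (it.hasNext) out += it.next()
    out.toList
  }

  def main(args: Array[String]): Unit =
    println(drain(new IntIterator(3))) // List(0, 1, 2)
}
```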
Classes
class User
val user1 = new User // instantiates the class

Example code:

// class
class User
val user = new User

class Point(var x: Int, var y: Int) { // x and y are member variables of Point
  def move(dx: Int, dy: Int): Unit = {
    x = x + dx
    y = y + dy
  }
  }

  override def toString: String =
    s"($x, $y)"
}

val point1 = new Point(2, 3)
println(point1.x)  // prints 2
println(point1)

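Note: when Java and Scala code are mixed, default constructor parameter values are not available from the Java side; Java callers must pass every argument.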

Tuples

Similar to tuples in Python:

val ingredient = ("Sugar", 25)
println(ingredient._1) // Sugar 
println(ingredient._2) // 25
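Besides _1 and _2, a tuple can be taken apart by pattern matching, which often reads better. A small sketch; the object name TupleDemo and the helper describe are made up for illustration.

```scala
object TupleDemo {
  val ingredient: (String, Int) = ("Sugar", 25)

  // Destructuring binds both fields at once.
  val (name, quantity) = ingredient

  // Pattern matching on the tuple's shape, with a guard on the second field.
  def describe(item: (String, Int)): String = item match {
    case (n, q) if q > 20 => s"$n: large ($q)"
    case (n, q)           => s"$n: small ($q)"
  }

  def main(args: Array[String]): Unit = {
    println(name)                 // Sugar
    println(quantity)             // 25
    println(describe(ingredient)) // Sugar: large (25)
  }
}
```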
Multiple parameter lists
val numbers = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 
val res = numbers.foldLeft(0)((m, n) => m + n) 
println(res) // 55

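The first parameter list takes the initial value; the second takes a function that combines the accumulated value with each element of the List in turn.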
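A sketch of a few more foldLeft calls, plus a user-defined method with two parameter lists that is partially applied. The names FoldDemo, scale, and twice are hypothetical.

```scala
object FoldDemo {
  val numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  // Initial value 0, combined with each element by addition.
  val sum: Int = numbers.foldLeft(0)(_ + _)

  // The accumulator type (Long) may differ from the element type (Int).
  val product: Long = numbers.foldLeft(1L)((acc, n) => acc * n)

  // The accumulator can also be a String.
  val csv: String =
    numbers.foldLeft("")((acc, n) => if (acc.isEmpty) n.toString else acc + "," + n)

  // A method with two parameter lists; supplying only the first yields a function.
  def scale(factor: Int)(x: Int): Int = factor * x
  val twice: Int => Int = scale(2)

  def main(args: Array[String]): Unit = {
    println(sum)                        // 55
    println(product)                    // 3628800
    println(csv)                        // 1,2,3,4,5,6,7,8,9,10
    println(numbers.map(twice).take(3)) // List(2, 4, 6)
  }
}
```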

Flink Exercises (highly recommended)

Flink tutorial

Flink dataset

ride event

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
isStart        : Boolean   // TRUE for ride start events, FALSE for ride end events
eventTime      : Instant   // the timestamp for this event
startLon       : Float     // the longitude of the ride start location
startLat       : Float     // the latitude of the ride start location
endLon         : Float     // the longitude of the ride end location
endLat         : Float     // the latitude of the ride end location
passengerCnt   : Short     // number of passengers on the ride

fare event

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
startTime      : Instant   // the start time for this ride
paymentType    : String    // CASH or CARD
tip            : Float     // tip for this ride
tolls          : Float     // tolls for this ride
totalFare      : Float     // total fare collected
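For working with these records from Scala, the two schemas could be mirrored as case classes. This is only a sketch: the names TaxiEvents, TaxiRide, TaxiFare, and sameRide, and all sample values, are assumptions made here and are not taken from the training repository.

```scala
import java.time.Instant

object TaxiEvents {
  // One field per column of the ride event schema listed above.
  final case class TaxiRide(
    rideId: Long, taxiId: Long, driverId: Long,
    isStart: Boolean, eventTime: Instant,
    startLon: Float, startLat: Float,
    endLon: Float, endLat: Float,
    passengerCnt: Short
  )

  // One field per column of the fare event schema listed above.
  final case class TaxiFare(
    rideId: Long, taxiId: Long, driverId: Long,
    startTime: Instant, paymentType: String,
    tip: Float, tolls: Float, totalFare: Float
  )

  // Hypothetical sample records; the values are invented.
  private val t0 = Instant.parse("2020-01-01T12:00:00Z")
  val ride: TaxiRide = TaxiRide(1L, 100L, 200L, true, t0, -73.99f, 40.75f, -73.98f, 40.76f, 1.toShort)
  val fare: TaxiFare = TaxiFare(1L, 100L, 200L, t0, "CARD", 2.5f, 0.0f, 15.0f)

  // A ride event and a fare event belong together when they share a rideId.
  def sameRide(r: TaxiRide, f: TaxiFare): Boolean = r.rideId == f.rideId

  def main(args: Array[String]): Unit = {
    println(ride)
    println(fare)
    println(sameRide(ride, fare))
  }
}
```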

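Setting up the environment for the Flink Tutorial

  1. Install Flink:
  • Check the Java version: it must be 11 or later.
  • Check the Flink version: if there is no Web UI, the Flink version may be too old (1.4 has none; 1.9 worked when tested).
  • On Windows, the Cygwin command-line environment is recommended.
    [Screenshot: screen after installation]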

  2. Download the Flink Tutorial:
https://github.com/apache/flink-training/tree/release-1.15/

On Windows you may see the error 【$'\r': command not found】.
Fix it by changing the line endings (CRLF to LF):
https://blog.csdn.net/fangye945a/article/details/120660824
  3. Complete the Exercise.scala file and run the test file to see the results.
Scala can also call Java functions: https://docs.scala-lang.org/scala3/book/interacting-with-java.html

Flink Tutorial study notes

Stream processing

Tutorial link: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/overview/

  • Streams are data’s natural habitat.
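  • Batch processing: ingest the entire dataset before producing any results
  • Stream processing: the input may never end, and so you are forced to continuously process the data as it arrives.
    In Flink, a data stream is read from a source, transformed by operators, and finally flows into a sink.
  • A single transformation may consist of more than one operator.

Streams can be read from message queues or distributed logs such as Apache Kafka or Kinesis, but Flink can also consume bounded data sources. The same holds for output.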
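The source, operator, sink idea can be imitated with plain Scala iterators. This is only an analogy and does not use the Flink API; the names PipelineDemo, source, pipeline, and sink are made up. The point it shows is that results come out while the input is still unbounded.

```scala
object PipelineDemo {
  // "Source": an unbounded, lazily generated sequence 1, 2, 3, ...
  def source: Iterator[Int] = Iterator.from(1)

  // "Operators": a filter followed by a map, applied element by element.
  def pipeline(in: Iterator[Int]): Iterator[String] =
    in.filter(_ % 2 == 0).map(n => s"even:$n")

  // "Sink": consumes results; the limit only exists so the demo terminates.
  def sink(out: Iterator[String], limit: Int): List[String] =
    out.take(limit).toList

  def main(args: Array[String]): Unit = {
    // Unbounded input: results are produced without waiting for the input to end.
    println(sink(pipeline(source), 3))                // List(even:2, even:4, even:6)
    // Bounded input goes through the same pipeline unchanged.
    println(sink(pipeline(Iterator(1, 2, 3, 4)), 10)) // List(even:2, even:4)
  }
}
```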

Parallel dataflows

Flink programs are inherently parallel and distributed.

  • Each stream has one or more stream partitions.

  • Each operator has one or more operator subtasks.

    • The number of operator subtasks is the parallelism of that particular operator. Different operators of the same program may have different levels of parallelism.
  • One-to-one streams: for example, between the source and the map() operator.

  • Redistributing streams:

    • change the partitioning of the stream
    • introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the sink
  • Timely (real-time) stream processing: made possible by including timestamps in the data (event time).
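To make "redistributing by key" concrete, here is a plain-Scala sketch that routes events to subtasks by hashing the key. It is an assumption-laden simplification: hash modulo parallelism is not Flink's actual assignment scheme, and Event, subtaskFor, and redistribute are hypothetical names.

```scala
object KeyPartitionDemo {
  final case class Event(key: String, value: Int)

  // Simplified routing: hash of the key modulo the parallelism.
  // floorMod keeps the result non-negative even for negative hash codes.
  def subtaskFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Group the events by the subtask they are routed to.
  def redistribute(events: Seq[Event], parallelism: Int): Map[Int, Seq[Event]] =
    events.groupBy(e => subtaskFor(e.key, parallelism))

  // Invented sample events.
  val events: List[Event] = List(
    Event("taxi-1", 10), Event("taxi-2", 20),
    Event("taxi-1", 30), Event("taxi-3", 40)
  )

  def main(args: Array[String]): Unit =
    redistribute(events, 4).toSeq.sortBy(_._1).foreach {
      case (subtask, es) => println(s"subtask $subtask: $es")
    }
}
```

All events with the same key end up in the same subtask, which is what later allows per-key state; the relative order of events with different keys is no longer defined once they are spread over subtasks.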

Stateful stream processing

  • Flink’s operations can be stateful. This means that how one event is handled can depend on the accumulated effect of all the events that came before it.

  • A Flink application is run in parallel on a distributed cluster. The various parallel instances of a given operator will execute independently, in separate threads, and in general will be running on different machines.

  • In the tutorial's diagram, the third operator is stateful.

  • A fully-connected network shuffle is occurring between the second and third operators.

  • This is being done to partition the stream by some key, so that all of the events that need to be processed together, will be.
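What "stateful" means can be sketched in plain Scala as a running count per key: the output for an event depends on every earlier event with the same key. This does not use Flink's state API; KeyedStateDemo and runningCounts are hypothetical names.

```scala
import scala.collection.mutable

object KeyedStateDemo {
  // For each incoming key, emit (key, number of times that key has been seen so far).
  def runningCounts(keys: List[String]): List[(String, Int)] = {
    // Per-key state; keys not seen yet start at 0.
    val state = mutable.Map.empty[String, Int].withDefaultValue(0)
    keys.map { k =>
      state(k) += 1
      (k, state(k))
    }
  }

  def main(args: Array[String]): Unit =
    println(runningCounts(List("a", "b", "a", "a")))
    // List((a,1), (b,1), (a,2), (a,3))
}
```

Partitioning by key first guarantees that one operator instance sees every event for a given key, so its local map holds the complete state for that key.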

DataStream API

Tutorial link: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/datastream_api/

  • Types that can be streamed:
    • basic types, i.e., String, Long, Integer, Boolean, Array
    • composite types: Tuples, POJOs, and Scala case classes

Execution environment

  • Streaming applications need to use a StreamExecutionEnvironment.
  • When env.execute() is called the job graph is packaged up and sent to the JobManager, which parallelizes the job and distributes slices of it to the Task Managers for execution.
  • Each parallel slice of your job will be executed in a task slot.

Basic Stream functions

1. env.fromCollection: create a stream from a collection

List<Person> people = new ArrayList<Person>();

people.add(new Person("Fred", 35));
people.add(new Person("Wilma", 35));
people.add(new Person("Pebbles", 2));

DataStream<Person> flintstones = env.fromCollection(people);

2. env.socketTextStream / env.readTextFile: read from a socket or from a file

DataStream<String> lines = env.socketTextStream("localhost", 9999);
DataStream<String> fileLines = env.readTextFile("file:///path");

Summary: streams are mainly created through methods on env.

Streams can also be debugged locally, for example by setting breakpoints in an IDE.

ETL

Tutorial link: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/etl/
Flink’s Table API: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/overview/

  • map(): only suitable for one-to-one transformations
    • for each and every stream element coming in, map() will emit one transformed element.
  • flatMap(): for all other cases, since it can emit zero, one, or many elements for each incoming element.
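The difference between the two is easy to see on ordinary Scala collections, which use the same method names. This sketch runs on a List rather than a Flink DataStream; MapVsFlatMapDemo and the sample lines are made up.

```scala
object MapVsFlatMapDemo {
  val lines: List[String] = List("to be", "", "or not")

  // map: exactly one output element per input element.
  val lengths: List[Int] = lines.map(_.length)

  // flatMap: zero, one, or many output elements per input element.
  // The empty line contributes nothing; the others contribute two words each.
  val words: List[String] =
    lines.flatMap(line => line.split(" ").filter(_.nonEmpty).toList)

  def main(args: Array[String]): Unit = {
    println(lengths) // List(5, 0, 6)
    println(words)   // List(to, be, or, not)
  }
}
```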