myflink

This article covers ElasticSearch command-line operations (creating, deleting and querying indexes; adding, updating and deleting documents), then walks through Flink batch and streaming WordCount examples, covering sources, transformations, grouping and window operations. It also covers the Flink Java/Scala API, sources and sinks, window types, event time and watermarks, data cleaning, late-data handling, async I/O, real-time TopN, and Flink SQL. Finally it briefly introduces Flink CDC and CEP, plus HBase and ClickHouse basics.

1.ElasticSearch

1.1 ES commands:

1 a.html 清华大学,简称清华
2 b.html 清华大学高考分数线,……
3 c.html 清华研究生招生网
4 d.html 北京天安门
5 e.html hello kitty

mysearch: {"id":"1","filename":"a.html","content":"清华大学,简称清华"}

Indexes:

Create an index:
PUT mysearch
PUT mysearch2
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3, 
            "number_of_replicas" : 2 
        }
    }
}
​
Delete an index:
DELETE person
​
View index info:
GET mysearch2
Response:
{
  "mysearch2" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "creation_date" : "1681875342768",
        "number_of_shards" : "3",
        "number_of_replicas" : "2",
        "uuid" : "r3Q7_PsZQyGGljoCp6xxWg",
        "version" : {
          "created" : "6060299"
        },
        "provided_name" : "mysearch2"
      }
    }
  }
}

Documents:

Add / update documents:

{"id":"1","filename":"a.html","content":"清华大学,简称清华"}
{"id":"2","filename":"b.html","content":"清华大学高考分数线"}
{"id":"3","filename":"c.html","content":"清华研究生招生网"}
{"id":"4","filename":"d.html","content":"北京天安门"}
{"id":"5","filename":"e.html","content":"hello kitty"}

PUT mysearch/_doc/1
{"id":"1","filename":"a.html","content":"清华大学,简称清华"}
​
PUT mysearch/_doc/2
{"id":"2","filename":"b.html","content":"清华大学高考分数线"}
​
PUT mysearch/_doc/3
{"id":"3","filename":"c.html","content":"清华研究生招生网"}
​
PUT mysearch/_doc/4
{"id":"4","filename":"d.html","content":"北京天安门"}
​
PUT mysearch/_doc/5
{"id":"5","filename":"e.html","content":"hello kitty"}
​
POST mysearch/_doc/
{"id":"6","filename":"f.html","content":"hello snoopy"}
// POST (without an _id) lets ES auto-generate the document id; PUT requires an explicit id
​
Get a document by _id:
    GET mysearch/_doc/3
    Response:
    {
      "_index" : "mysearch",
      "_type" : "_doc",
      "_id" : "3",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "id" : "3",
        "filename" : "c.html",
        "content" : "清华研究生招生网"
      }
    }
​
​
    
Delete a document:
DELETE /mysearch/_doc/1
​
Search documents:
POST mysearch/_search
{
    "query" : {
        "match" : {"content":"清华大学"}
    }
}
​
Search with highlighting (the highlighted field must match the field being queried):
GET /_search
{
    "query" : {
        "match": { "content": "清华大学" }
    },
    "highlight" : {
      "pre_tags":"<font color='red'>",
      "post_tags": "</font>", 
        "fields" : {
            "content" : {}
        }
    }
}

Tokenization. How "hello kitty" is tokenized (inverted index: term -> documents):

hello  e.html, f.html
kitty  e.html

Tokenization of 清华大学 as single characters (default analyzer):

清   b.html,c.html
华   ……
大   ……
学   ……

Tokenization as words (with a Chinese analyzer):

清华     ……
大学     ……
清华大学 ……

1.2 ES Java API:

Coding structure: 1. create the connection 2. build the Request 3. submit the Request 4. process the result set
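A minimal Scala sketch of this structure, assuming the high-level REST client that also appears in the sink examples later (host hdp1:9200, index mysearch and the query field come from the examples above; the object name is illustrative):

import org.apache.http.HttpHost
import org.elasticsearch.action.search.{SearchRequest, SearchResponse}
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder

object EsSearchDemo {
  def main(args: Array[String]): Unit = {
    // 1. create the connection
    val client = new RestHighLevelClient(RestClient.builder(new HttpHost("hdp1", 9200, "http")))
    // 2. build the Request
    val request = new SearchRequest("mysearch")
    request.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("content", "清华大学")))
    // 3. submit the Request
    val response: SearchResponse = client.search(request, RequestOptions.DEFAULT)
    // 4. process the result set
    response.getHits.getHits.foreach(hit => println(hit.getSourceAsString))
    client.close()
  }
}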

2. Flink

Official docs: Apache Flink Documentation | Apache Flink

2.1 Flink wordcount

Batch WordCount: Overview | Apache Flink

import org.apache.flink.api.scala._
​
object WordCount {
  def main(args: Array[String]) {
​
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")
​
    val counts = text
      .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
​
    counts.print()
  }
}

Streaming WordCount: Overview | Apache Flink

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
​
object WindowWordCount {
  def main(args: Array[String]) {
​
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
​
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum(1)
​
    counts.print()
​
    env.execute("Window Stream WordCount")
  }
}

Processing steps:

1. Create the batch/stream execution environment

2. Read the data source (Kafka, MySQL, etc.)

3. Transform and analyze

4. Write the results out (HDFS, MySQL, Redis, etc.)

5. Submit for execution (deploy)

2.2 Packaging and deployment

A job is submitted as a jar package.

Standalone:

Overview | Apache Flink

Requires starting the Flink processes: bin/start-cluster.sh

StandaloneSessionClusterEntrypoint: manages all jobs

TaskManagerRunner: executes the operator tasks

Web UI: http://hdp1:8081/#/overview

Resource allocation unit: Slot

Submit a job: ./bin/flink run ./examples/streaming/TopSpeedWindowing.jar

>Note: if you have previously submitted jobs on YARN, going back to standalone will fail, because YARN's temporary properties file still exists and the client will try to run the job on YARN. Fix: delete the YARN temp file: rm -rf /tmp/.yarn-properties-root

Yarn:

YARN | Apache Flink

No Flink processes need to be started beforehand:

start-all.sh :

Web UI: http://hdp1:8088

Resource allocation unit: Container

Submit a job:

Session deployment mode:

First start a yarn-session (the broker): ./bin/yarn-session.sh --detached

Submit a job: ./bin/flink run ./examples/streaming/TopSpeedWindowing.jar

Application deployment mode:

Submit a job:

./bin/flink run-application -t yarn-application ./examples/streaming/TopSpeedWindowing.jar

Per-job deployment mode:

Submit a job:

./bin/flink run -t yarn-per-job --detached ./examples/streaming/TopSpeedWindowing.jar

Submit your own packaged job to the YARN session:

./bin/flink run --class com.bawei.wc2.WindowWordCount /root/myflink.jar

2.3 Connectors (source and sink)
2.3.1 source:

(1)kafka:Kafka | Apache Flink

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")
val stream = env
    .addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))

(2) Custom source (see the sketch below):
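A minimal sketch of a custom source, assuming a RichSourceFunction that emits one test record per second (the class name MySource is illustrative):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

class MySource extends RichSourceFunction[String] {
  @volatile private var running = true
  // open resources here (e.g. a database or MQ connection)
  override def open(parameters: Configuration): Unit = {}
  // keep emitting records until the source is cancelled
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (running) {
      ctx.collect("hello " + System.currentTimeMillis())
      Thread.sleep(1000)
    }
  }
  // called when the job is cancelled
  override def cancel(): Unit = running = false
  // release resources here
  override def close(): Unit = {}
}

//usage: val stream = env.addSource(new MySource)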

2.3.2 sink:

(1)kafka:Kafka | Apache Flink

val properties = new Properties
properties.setProperty("bootstrap.servers", "hdp1:9092")
val myProducer = new FlinkKafkaProducer[String]("test", new SimpleStringSchema(), properties)
stream.addSink(myProducer)

(2)hdfs:Streaming File Sink | Apache Flink

val input: DataStream[String] = ...

val sink: StreamingFileSink[String] = StreamingFileSink
    .forRowFormat(new Path(outputPath), new SimpleStringEncoder[String]("UTF-8"))
    .withRollingPolicy(
        DefaultRollingPolicy.builder()
            .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
            .withMaxPartSize(1024 * 1024 * 1024)
            .build())
    .build()

input.addSink(sink)

/*
Rolling policy explained: a part file is rolled when
- it contains at least 15 minutes worth of data,
- it hasn't received new records for the last 5 minutes, or
- the file size has reached 1 GB (after writing the last record).
*/

Custom sink:

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

//the data to write
val input: DataStream[T] = ...
//attach the custom sink
input.addSink(new MySink())

class MySink extends RichSinkFunction[T]{
    //open the database connection
    override def open(parameters: Configuration): Unit = {}
    //close the connection
    override def close(): Unit = {}
    //perform the insert for each record
    override def invoke(value: T, context: SinkFunction.Context): Unit = {}
}

(3)mysql: 3306

//create the connection
Class.forName("com.mysql.jdbc.Driver")
conn = DriverManager.getConnection("jdbc:mysql://hdp1:3306/mydb?characterEncoding=utf8", "root", "root")
ps = conn.prepareStatement("insert into t_wc values(?,?)")
	
//insert data
ps.setString(1,value._1)
ps.setInt(2,value._2)
ps.executeUpdate()

(4)redis: 6379

//create the connection
jedis = new Jedis("hdp1")
//insert data
jedis.set(value._1,value._2+"")

(5) es:9200

//create the connection
 client = new RestHighLevelClient(RestClient.builder(new HttpHost("hdp1", 9200, "http")))
//index a document
val request: IndexRequest = new IndexRequest("t_wc", "doc")
request.source(json, XContentType.JSON)
val indexResponse: IndexResponse = client.index(request, RequestOptions.DEFAULT)

(6)hbase:

(7)clickhouse:
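A sketch of a ClickHouse sink, reusing the custom-sink pattern above together with the ClickHouse JDBC driver shown in section 4.5 (the table name t_wc and the (word, count) record type are illustrative):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class ClickhouseSink extends RichSinkFunction[(String, Int)] {
  var conn: Connection = _
  var ps: PreparedStatement = _

  // open the JDBC connection once per parallel instance
  override def open(parameters: Configuration): Unit = {
    Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
    conn = DriverManager.getConnection("jdbc:clickhouse://hdp1:8123")
    ps = conn.prepareStatement("insert into t_wc values(?,?)")
  }

  // insert one record per invocation
  override def invoke(value: (String, Int), context: SinkFunction.Context): Unit = {
    ps.setString(1, value._1)
    ps.setInt(2, value._2)
    ps.executeUpdate()
  }

  // release the connection
  override def close(): Unit = {
    if (ps != null) ps.close()
    if (conn != null) conn.close()
  }
}

//usage: stream.addSink(new ClickhouseSink)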

2.4 Data cleaning, transformation, and keyed rolling-aggregation operators

1. Cleaning / transformation operators:

map

flatMap

filter:

//filter: drop records that are not valid JSON
stream2.filter(x => {
    var isJson = true //assume it parses (note: an empty line also parses successfully)
    try {
        //try to parse
        JSON.parseObject(x)
    } catch {
        case e: Exception => isJson = false
    }
    isJson && !x.isEmpty
})

2. Stream combination operators:

union: merges multiple streams; all streams must have exactly the same type

connect: connects two streams; the two streams may have different types

3. Keyed grouping and rolling aggregation:

keyBy ->

sum

max/maxBy/min/minBy

reduce

Understanding rolling aggregation:

001,手机,iphone12,1,1646038440
004,手机,iphone12,4,1646038443
005,手机,小米16,3,1646038443

Output:
sumDS1>>>> (001,手机,iphone12,1.0,1646038440000)
sumDS1>>>> (001,手机,iphone12,5.0,1646038440000)
sumDS1>>>> (001,手机,iphone12,8.0,1646038440000)
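A sketch of the rolling sum that produces this output, assuming the records have already been parsed into a mapDS of (id, category, product, amount: Double, ts: Long) tuples:

val sumDS1 = mapDS
  .keyBy(_._2)   // key by the category field ("手机")
  .sum(3)        // rolling sum on the amount field; an updated result is emitted for every incoming record
sumDS1.print("sumDS1>>>")

With a rolling aggregation the non-aggregated fields keep the values of the first record seen for that key, which is why the id and timestamp stay at 001 / 1646038440000 in the output above.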

Window aggregation:

The computation is triggered only when the window's end time is reached, and each firing computes only the data that falls inside that window.

2.5 Window types:

Windows | Apache Flink

Time windows (the important case):

Tumbling window:

input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>)

Sliding window:

input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)

Session window:

input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)

Global window: omitted.

Count windows (for awareness):

Tumbling count window:

Sliding count window:
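A minimal sketch of count windows, assuming a keyed stream and following the pseudocode style above:

// tumbling count window: fires once every 100 elements per key
input.keyBy(<key selector>).countWindow(100)
// sliding count window: window of 100 elements, sliding every 10 elements
input.keyBy(<key selector>).countWindow(100, 10)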

2.6 Window functions:

Window data transformation:

apply: gives access to the window's collection of elements, the window object (with its start and end time), and the grouping key.

Incremental window aggregation:

Windows | Apache Flink

reduce:

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce(new MyReduceFunction,new MyReduceWindowFunction)
//the output of MyReduceFunction becomes the input of MyReduceWindowFunction
//the final return type of reduce is determined by the return type of MyReduceWindowFunction
//incremental aggregation
class MyReduceFunction extends ReduceFunction[(String, Double)] {
  override def reduce(x: (String, Double), y: (String, Double)): (String, Double) = {
    (x._1,x._2+y._2)
  }
}
//transform and emit the window result
//type parameters: input, output, key, TimeWindow
//the input here is the output of MyReduceFunction
class MyReduceWindowFunction extends WindowFunction[(String, Double),String,String,TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[(String, Double)], out: Collector[String]): Unit = {
    // input holds the already-aggregated result, e.g.:
    //reduceDS>>>>> (手机,8.0)
    //reduceDS>>>>> (服装,2.0)
    //reduceDS>>>>> (食品,8.0)
    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    for(i <-input){
      out.collect(sdf.format(window.getStart)+","+sdf.format(window.getEnd)+","+key+","+i._2)
    }

  }
}

aggregate (omitted):

Full-window (all-at-once) aggregation:

process:

val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction())

class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
  def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]) = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}

2.7 Event time

Timely Stream Processing | Apache Flink

//assign the event-time field (ascending timestamps, in milliseconds)
val timeDS: DataStream[(String, String, String, Double, Long)] = mapDS.assignAscendingTimestamps(x => x._5)

Window start and end times:

Whether using processing time or event time, the moment a record arrives, the window it belongs to is already determined:

5-second tumbling window:

16:54:13 -> [16:54:10, 16:54:15)

20-second tumbling window:

16:54:34 -> [16:54:20, 16:54:40)

Windows of 5 seconds / 5 minutes: 0-5, 5-10, 10-15, 15-20, ……

1-hour window:

15:34:00 -> [15:00:00, 16:00:00)

1-day window (windows are aligned to the epoch, i.e. UTC, so in UTC+8 a day window runs from 08:00 to 08:00 local time):

2023-04-25 16:42:00 -> [2023-04-25 08:00:00, 2023-04-26 08:00:00)
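The alignment above follows from how Flink computes a window's start (see TimeWindow.getWindowStartWithOffset); a small sketch of the formula:

// start of the window that a timestamp falls into (offset defaults to 0, i.e. epoch/UTC-aligned)
def windowStart(timestamp: Long, windowSize: Long, offset: Long = 0L): Long =
  timestamp - (timestamp - offset + windowSize) % windowSize

// e.g. for a 5 s tumbling window, a record at 16:54:13 lands in [16:54:10, 16:54:15)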

2.8 watermark

Generating Watermarks | Apache Flink

Watermarks are used to deal with late data under event-time semantics.

A watermark delays when the window is triggered, which mitigates the late-data problem to a certain extent.

WatermarkStrategy
  .forBoundedOutOfOrderness[(Long, String)](Duration.ofSeconds(20))
  .withTimestampAssigner(new SerializableTimestampAssigner[(Long, String)] {
    override def extractTimestamp(element: (Long, String), recordTimestamp: Long): Long = element._1
  })

2.9 Collecting late data

Whether or not a watermark is used, as long as the time semantics is event time, a window may still receive late data.

//create a side-output tag for late data
val tag  = new OutputTag[(String, String, String, Double, Long)]("late-data")
//when windowing, route late records into the side output via sideOutputLateData
val mainDS = timeDS
    .keyBy(_._2)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .sideOutputLateData(tag)
	.process()/reduce()/apply()……
//mainDS is the DataStream returned by the main-stream computation
//extract the side output (late data) from the main stream
val sideDS = mainDS.getSideOutput(tag)

2.10 Joins

DataStream-to-DataStream joins, e.g. an order stream from Kafka joined with an order-detail stream from another Kafka topic.

Window Join / Window CoGroup: (supports both processing time and event time)

Overview | Apache Flink

dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply { ... }

Interval Join: (currently only supports event time)

Joining | Apache Flink

orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound
For example, with bounds [-2, 1]:
orangeElem.ts = 2
requires greenElem.ts to be in [0, 3]: 0, 1, 2, 3

Example with bounds [-2, 1]:
orange stream:
001,90,1,1646038442
green stream:
1,001,挂面,1646038440
2,001,口红,1646038441
3,001,皮鞋,1646038442
4,001,书包,1646038443
5,001,大米,1646038444
The orange record at 1646038442 matches green records whose timestamps lie in [1646038440, 1646038443].

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

...

val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream
    .keyBy(elem => /* select key */)
    .intervalJoin(greenStream.keyBy(elem => /* select key */))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction[Integer, Integer, String] {
        override def processElement(left: Integer, right: Integer, ctx: ProcessJoinFunction[Integer, Integer, String]#Context, out: Collector[String]): Unit = {
          out.collect(left + "," + right)
        }
    })

Joining a DataStream with a database, e.g. an order stream from Kafka joined with a province table stored in MySQL/Redis/etc.

Async I/O

Async I/O | Apache Flink

001,90,1,1646038440
002,89,3,1646038441
003,91,9,1646038442
004,10,8,1646038444
005,90,7,1646038445

After the join:
001,90,1,1646038440,北京,1,110000,CN-11,CN-BJ
002,89,3,1646038441
003,91,9,1646038442
004,10,8,1646038444
005,90,7,1646038445

1,北京,1,110000,CN-11,CN-BJ
key:"province:1"  value:"北京,1,110000,CN-11,CN-BJ"
key:"province:1"  value:"{'name':"北京",1,110000,CN-11,CN-BJ}"

1,pbgkadgis,99,嘉嘉,顾瑞凡
key:"user:1"  value:"1,pbgkadgis,99,嘉嘉,顾瑞凡"

//the input data to be enriched
val oiDS:DataStream[...] = ...

//call async I/O to do the enrichment
val resultStream: DataStream[((String, Double, String, Long),BaseProvince)] =
AsyncDataStream.unorderedWait(oiDS, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100)


//the async function: for each input record, query the related data from the database
//(1) create the database connection
//(2) run the query
//(3) return the joined result
class AsyncDatabaseRequest extends RichAsyncFunction[(String, Double, String, Long),((String, Double, String, Long),BaseProvince)] {

  var jedis:Jedis = null
  //open resources
  override def open(parameters: Configuration): Unit = {
    jedis = new Jedis("hdp1")
  }
  //close resources
  override def close(): Unit = jedis.close()

  override def asyncInvoke(input: (String, Double, String, Long), resultFuture: ResultFuture[((String, Double, String, Long), BaseProvince)]): Unit = {
    //input: IN
    //resultFuture:OUT
    //use the province id from the input record to fetch the province details from Redis
    val provincevalue: String = jedis.get("province:" + input._3) //province:1  {"area_code":"110000","name":"北京","region_id":"1","iso_3166_2":"CN-BJ","id":"1","iso_code":"CN-11"}
    //convert the province JSON string into a BaseProvince object
    val baseProvince = JSON.parseObject(provincevalue,new BaseProvince().getClass)
    //emit the enriched (wide) record
    resultFuture.complete(Iterable((input,baseProvince)))
  }

}

2.11 Process functions

Flink provides 8 Process Functions:

· ProcessFunction: stream-splitting (side output) example

· KeyedProcessFunction: alerting, TopN

· CoProcessFunction

· ProcessJoinFunction

· BroadcastProcessFunction

· KeyedBroadcastProcessFunction

· ProcessWindowFunction: full-window aggregation

· ProcessAllWindowFunction: full-window aggregation

Stream-splitting example:

//define side-output tags
val startTag = new OutputTag[String]("start")
val displaysTag = new OutputTag[String]("displays")
//call process() on the input stream to split it
val mainDS = inputDataStream.process(
	new ProcessFunction[/* input type */, String](context:...,collector:...){
        //emit to a side output
        context.output(displaysTag, i) 
        //emit to the main output
        collector.collect(i)
    }
)
//print the main stream
mainDS.print()
//extract a side output from the main stream by passing the matching tag
mainDS.getSideOutput(startTag).print()

Temperature alert example (see the sketch after the timer notes below):

 ValueState[T] stores a single value of type T.
//create the state handle
lazy val state: ValueState[T] = getRuntimeContext.getState(new ValueStateDescriptor[T]("wendu", classOf[T]))
//read the value
o get: ValueState.value()
//update the value
o set: ValueState.update(value: T)


 ListState[T] stores a list whose elements are of type T. Basic operations:
lazy val state: ListState[T] = getRuntimeContext.getListState(new ListStateDescriptor[T]("wendu", classOf[T]))
o ListState.add(value: T)
o ListState.addAll(values: java.util.List[T])
o ListState.get() returns Iterable[T]
o ListState.update(values: java.util.List[T])

 MapState[K, V] stores key-value pairs.
  lazy val state: MapState[String,Double] = getRuntimeContext.getMapState(new MapStateDescriptor[String,Double]("score", classOf[String], classOf[Double]))

o MapState.get(key: K)
o MapState.put(key: K, value: V)
o MapState.contains(key: K)
o MapState.remove(key: K)

 ReducingState[T]
 AggregatingState[I, O]

Timers:

//register an event-time timer for a future timestamp (e.g. 5:00)
ctx.timerService.registerEventTimeTimer(futureTimestamp)
//delete a timer: pass the same timestamp that was used when registering it (e.g. 5:00)
ctx.timerService.deleteEventTimeTimer(timestampUsedWhenRegistering)
//when the timer fires, the callback method onTimer() is invoked automatically
override def onTimer(...): Unit = {
    //alert logic
}
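Putting state and timers together, a sketch of the temperature-alert pattern, assuming input records of (sensorId, temperature) and a 10-second processing-time timer (names and the 10 s threshold are illustrative):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

class TempAlertFunction extends KeyedProcessFunction[String, (String, Double), String] {

  // last temperature seen for this key
  lazy val lastTemp: ValueState[Double] =
    getRuntimeContext.getState(new ValueStateDescriptor[Double]("wendu", classOf[Double]))
  // timestamp of the currently registered timer (0 = none)
  lazy val timerTs: ValueState[Long] =
    getRuntimeContext.getState(new ValueStateDescriptor[Long]("timer", classOf[Long]))

  override def processElement(value: (String, Double),
                              ctx: KeyedProcessFunction[String, (String, Double), String]#Context,
                              out: Collector[String]): Unit = {
    val prev = lastTemp.value()
    lastTemp.update(value._2)
    if (value._2 > prev && timerTs.value() == 0) {
      // temperature rose: register a timer 10 s in the future
      val ts = ctx.timerService().currentProcessingTime() + 10000
      ctx.timerService().registerProcessingTimeTimer(ts)
      timerTs.update(ts)
    } else if (value._2 <= prev && timerTs.value() != 0) {
      // temperature dropped: cancel the pending timer
      ctx.timerService().deleteProcessingTimeTimer(timerTs.value())
      timerTs.update(0L)
    }
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, (String, Double), String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    out.collect(s"warning: temperature of ${ctx.getCurrentKey} kept rising for 10 s")
    timerTs.update(0L)
  }
}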

TopN examples:

Window TopN:

Non-aggregated TopN, e.g.: for every 5 seconds, the top three exam scores of each student

.keyBy()
.window()
.process(new ProcessWindowFunction[.....]( elements: the window's elements ){
    	elements.sortBy(...).take(n)
})

TopN after aggregation (the important case): for every 5 seconds, the three students with the highest average score, plus their averages

//step 1: aggregate per window
.keyBy()
.window()
.process(new ProcessWindowFunction[....](elements: the window's elements){
    //aggregate the elements: average / sum / count / max / min, etc.
    out.collect((start, end, key, aggregatedValue))
})
//step 2: key by the window time
aggregatedDS
.keyBy(window start time or end time)
.process(new KeyedProcessFunction[key,in,out](){
    val listState = ... //used to collect the aggregated results of each window
    processElement(i: input record, context, out){
        listState.add(i)
        //register a timer for the window end time plus a small delay
        context.timerService.registerEventTimeTimer(windowEnd + n)
    }
    
    onTimer(){
        listState.asScala.toList.sortBy(...).take(3)
    }
})

Real-time (continuous) TopN:

TopN without aggregation, e.g.: in real time, each student's top three exam scores. Omitted.

TopN after aggregation, e.g.: in real time, the three students with the highest average score, plus their averages.

//step 1: route every aggregated record to a single key
val topn: DataStream[String] = avgDS
    .map(x => ("all", x))
    .keyBy(_._1)
    .process(new KeybyTopNProcess)
//step 2: compute the TopN in a KeyedProcessFunction (asScala requires import scala.collection.JavaConverters._)
class KeybyTopNProcess extends KeyedProcessFunction[String,(String,(String,Double)),String] {

  lazy val state: MapState[String,Double] = getRuntimeContext.getMapState(new MapStateDescriptor[String,Double]("score", classOf[String], classOf[Double]))

  override def processElement(i: (String, (String, Double)), context: KeyedProcessFunction[String, (String, (String, Double)), String]#Context, collector: Collector[String]): Unit = {
    state.put(i._2._1,i._2._2)
    val top3 = state.iterator().asScala.toList.map(x=>(x.getKey,x.getValue)).sortBy(-_._2).take(3).toString()
    //top>>>>>> List((zhangsan,7.5), (lisi,7.0), (zhaoliu,6.0))
    collector.collect(top3)
  }
}

2.12 Flink SQL

source -> sql -> sink

Flink SQL execution environment: Concepts & Common API | Apache Flink

Creating a unified batch/stream table environment:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

EnvironmentSettings settings = EnvironmentSettings
    .newInstance()
    .inStreamingMode()
    //.inBatchMode()
    .build();

TableEnvironment tEnv = TableEnvironment.create(settings);

tEnv.executeSql("""
	create table table_name(
		column_name column_type
		...
	)WITH(
		-- connector options for the data source
	)
""")

//connector options for WITH: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/filesystem/

Creating a streaming-only table environment:

If you need to analyze a stream with SQL, you can use the unified environment above, or the dedicated streaming environment below (recommended):

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

//with a streaming table environment created this way, tables can be defined in two ways
//option 1:
tEnv.executeSql("""
	create table table_name(
		column_name column_type
		...
	)WITH(
		-- connector options for the data source
	)
""")
//option 2:
val stream:DataStream = env.addSource()
tEnv.createTemporaryView("table_name", stream, $("field_name")....)

Reading data from Kafka and creating a table:

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

//option 1:
tEnv.executeSql("""
    CREATE TABLE KafkaTable (
      `user_id` BIGINT,
      `item_id` BIGINT,
      `behavior` STRING,
      `ts` TIMESTAMP(3) METADATA FROM 'timestamp'
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'user_behavior',
      'properties.bootstrap.servers' = 'localhost:9092',
      'properties.group.id' = 'testGroup',
      'scan.startup.mode' = 'earliest-offset',
      'format' = 'csv'
    )
""")
//option 2:
val stream:DataStream = env.addSource(new FlinkKafkaConsumer()...)
tEnv.createTemporaryView("table_name", stream, $("field_name")....)

Flink SQL Window

//step 1: extract the event time
//option 1:
val timeDS = DataStream.assignAscendingTimestamps(_._5)
tEnv.createTemporaryView(tableName, timeDS, ...., $("ts").rowtime)
//| ts | TIMESTAMP(3) *ROWTIME* |  2022-02-28 08:54:00.000

//option 2:
tEnv.executeSql( //TO_TIMESTAMP('2023-04-12 12:12:00')
    """
        |create table t_order(
        | ....
        | ts AS TO_TIMESTAMP_LTZ(times*1000,3),
        | WATERMARK FOR ts AS ts - INTERVAL '0' SECOND
        |)WITH(
        |'connector' = 'kafka',
        |  ...
        |)
        |""".stripMargin)

//step 2: define the window:
//https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/window-tvf/
//a 5-second tumbling window table over t_order
TABLE(TUMBLE(TABLE t_order ,DESCRIPTOR(ts),INTERVAL '5' SECOND))
//the windowed table has two extra columns: window_start and window_end
//note: when querying the windowed table, group by window_start and window_end.
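A sketch of querying the windowed table, assuming the t_order table and ts watermark defined above (the category and amount column names are illustrative):

val result = tEnv.sqlQuery(
  """
    |SELECT window_start, window_end, category, SUM(amount) AS total
    |FROM TABLE(TUMBLE(TABLE t_order, DESCRIPTOR(ts), INTERVAL '5' SECOND))
    |GROUP BY window_start, window_end, category
    |""".stripMargin)
result.execute().print()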

WindowTopN

Window Top-N | Apache Flink

//Note: only ROW_NUMBER() is supported, RANK() is not:   RANK() function is not supported on Window TopN currently, only ROW_NUMBER() is supported.
//Note: in rk<n, n must be greater than 1, i.e. Top1 alone is not allowed but Top2, Top3, Top4, …… are:  Illegal rank end 1, it should be bigger than 1!
//Note: write the complete TopN query before running it
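A sketch of a complete Window Top-N query following the notes above, assuming the same windowed aggregation (top 3 categories by total per 5-second window; ROW_NUMBER() only, and rk <= 3 satisfies the "bigger than 1" rule):

tEnv.executeSql(
  """
    |SELECT * FROM (
    |  SELECT *, ROW_NUMBER() OVER (
    |      PARTITION BY window_start, window_end ORDER BY total DESC) AS rk
    |  FROM (
    |    SELECT window_start, window_end, category, SUM(amount) AS total
    |    FROM TABLE(TUMBLE(TABLE t_order, DESCRIPTOR(ts), INTERVAL '5' SECOND))
    |    GROUP BY window_start, window_end, category
    |  )
    |) WHERE rk <= 3
    |""".stripMargin).print()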

Flink SQL sink:

Option 1: insert into a result table

tEnv.executeSql("""
	create table result_table(
		column_name column_type
	)with(
		-- connector for the sink destination
	)
""")

table.executeInsert("result_table")

Option 2: Table -> DataStream[Row] -> addSink

val ds: DataStream[(Boolean, Row)] = tEnv.toRetractStream[Row](table)
val ds2: DataStream[Row] = ds.filter(_._1).map(_._2)

ds2.map(x=>x.toString).addSink(....)

2.13 Flink CDC


GitHub - ververica/flink-cdc-connectors: CDC Connectors for Apache Flink®

SourceRecord{sourcePartition={server=mysql_binlog_source}, sourceOffset={ts_sec=1683169495, file=mysql-bin.000003, pos=8003731, row=1, server_id=1, event=2}} ConnectRecord{topic='mysql_binlog_source.mydb.t_a', kafkaPartition=null, key=Struct{id=1}, keySchema=Schema{mysql_binlog_source.mydb.t_a.Key:STRUCT}, value=Struct{before=Struct{id=1,name=zhangsan4,age=13},after=Struct{id=1,name=zhangsan5,age=13},source=Struct{version=1.4.1.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1683169495000,db=mydb,table=t_a,server_id=1,file=mysql-bin.000003,pos=8003853,row=0,thread=383},op=u,ts_ms=1683169495522}, valueSchema=Schema{mysql_binlog_source.mydb.t_a.Envelope:STRUCT}, timestamp=null, headers=ConnectHeaders(headers=)}

 1,zhangsan5,13
{"id":1,"name":"zhangsan5","age":13}
{"tb":"t_a","after":{"id":1,"name":"zhangsan5","age":13}}

2.14 Flink CEP

1. Input stream

2. Define the pattern (rule)

3. CEP.pattern(input, pattern)

4. Select the data that matches the pattern

5. Extract timed-out (partial) matches via a side output

A sketch of these steps follows below.
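A minimal sketch, assuming a hypothetical LoginEvent(userId, status, ts) stream and a rule of two consecutive failures within 10 seconds:

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class LoginEvent(userId: String, status: String, ts: Long)

// 1. input (loginStream: DataStream[LoginEvent] is assumed to exist)
// 2. define the rule: two consecutive "fail" events within 10 seconds
val pattern = Pattern.begin[LoginEvent]("first").where(_.status == "fail")
  .next("second").where(_.status == "fail")
  .within(Time.seconds(10))

// 3. apply the pattern to the keyed input stream
val patternStream = CEP.pattern(loginStream.keyBy(_.userId), pattern)

// 4. select the matching data (a timeout side output could be added via flatSelect with an OutputTag)
val alerts = patternStream.select(m => s"user ${m("first").head.userId} failed twice in 10s")
alerts.print()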

3. Hbase

3.1 Installation

Requires ZooKeeper and Hadoop.

1. Extract HBase

[root@hdp1 software]# tar -zxvf hbase-1.3.3-bin.tar.gz -C /opt/module/

2. cd conf and edit the configuration files

vim hbase-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_351
export HBASE_MANAGES_ZK=false

3.vim hbase-site.xml

<configuration>
	<property>     
		<name>hbase.rootdir</name>     
		<value>hdfs://hdp1:8020/hbase</value>   
	</property>
	<property>   
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>

   <!-- new since 0.98: earlier versions had no .port property and the default port was 60000 -->
	<property>
		<name>hbase.master.port</name>
		<value>16000</value>
	</property>

	<property>   
		<name>hbase.zookeeper.quorum</name>
	     <value>hdp1</value>
	</property>
	<property>   
		<name>hbase.zookeeper.property.dataDir</name>
	     <value>/opt/module/apache-zookeeper-3.5.7-bin/data</value>
	</property>
</configuration>

4.vim regionservers

hdp1

5. Create symlinks to the Hadoop config files (in conf):

ln -s /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml  core-site.xml
ln -s /opt/module/hadoop-3.1.3/etc/hadoop/hdfs-site.xml  hdfs-site.xml

3.2 Starting HBase

Start Hadoop and ZooKeeper first.

bin/start-hbase.sh

Check the processes:

[root@hdp1 hbase-1.3.3]# jps
2176 SecondaryNameNode
6209 HMaster
2612 NodeManager
6372 HRegionServer
3015 QuorumPeerMain
1801 NameNode
2473 ResourceManager
1949 DataNode
6589 Jps
3423 Kafka

3.3 Logging into the shell client

bin/hbase shell

Note:

Where HBase stores its data: in HDFS, at the location pointed to by hbase.rootdir in hbase-site.xml.

Where ZooKeeper stores its data: the location pointed to by dataDir in its config file.

Where Kafka stores its data: the location pointed to by log.dirs in server.properties.

Both HBase and Kafka ship with an embedded ZooKeeper, and both store a small amount of data in ZooKeeper.

3.4 Basic commands

Namespace commands (a namespace is similar to a MySQL database):

alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

list_namespace: list all namespaces
create_namespace: create a namespace
list_namespace_tables: list all tables in a namespace

DDL: table-structure operations

alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters

create: create a table
drop: delete a table
disable/enable: disable / enable a table
list: list all tables in the user namespaces
describe: show the table structure

create 'student','info','extra'
create 'myns:student','info','extra'

DML: table-data operations

append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

Insert / update:
001,zhangsan,13,pingpong
002,lisi,14

put 't1', 'r1', 'c1', 'value'
put 'student','001','info:name','zhangsan'
put 'student','001','info:age','14'
put 'student','001','info:hobby','pingpong'

put 'student','002','info:name','lisi'
put 'student','002','info:age','14'

Read:
scan 'student'
get 'student','001'

Delete:
delete
deleteall 'student','001'

3.5 Phoenix

1. Extract Phoenix

tar -zxvf phoenix-hbase-1.3-4.16.1-bin.tar.gz -C /opt/module

2. Copy the server jar from the extracted directory into hbase/lib

cp phoenix-server-hbase-1.3-4.16.1.jar /opt/module/hbase-1.3.3/lib/

3. Restart HBase

4. Access HBase through Phoenix

Command-line client:

[root@hdp1 bin]# ./sqlline.py hdp1:2181

API client:

Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val conn: Connection = DriverManager.getConnection("jdbc:phoenix:hdp1:2181")

If a table already exists in HBase, create the corresponding mapping table in Phoenix:

 create table "student"(id varchar primary key,
                     "info"."name" varchar, 
                     "info"."age" varchar,
                     "info"."hobby" varchar) 
                     column_encoded_bytes=0;
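A sketch of querying the mapped table through the JDBC connection above (lowercase table/column names must be double-quoted in Phoenix):

val ps = conn.prepareStatement("select ID, \"name\", \"age\", \"hobby\" from \"student\"")
val rs = ps.executeQuery()
while (rs.next()) {
  println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3) + "," + rs.getString(4))
}
rs.close(); ps.close(); conn.close()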

4. Clickhouse

4.1 Installation

1) Create the directory: mkdir /opt/software/clickhouse

2) Upload the rpm packages into that directory

3) Edit the system config: vim /etc/selinux/config

SELINUX=disabled

4) Install the dependency online: yum install -y libtool

5) cd /opt/software/clickhouse

rpm -ivh *.rpm

When the installation pauses at the password prompt, just press Enter.

Enter password for default user: press Enter for an empty password; if you do type a password, you will have to log in with it.

6) Edit the configuration file

cd /etc/clickhouse-server/

Make the config file writable:

chmod 644 config.xml

Edit the file:

vim config.xml

Uncomment line 156: <listen_host>::</listen_host>

7) Start

Option 1: start as a service

systemctl start clickhouse-server

If you don't want clickhouse-server to start on boot, disable it: systemctl disable clickhouse-server

Option 2: sudo -u clickhouse clickhouse-server --config-file=/etc/clickhouse-server/config.xml

8) Client access:

Command-line client: clickhouse-client -m

Web UI client: http://hdp1:8123/play

IDEA database plugin or API client:

4.2 Data types

Common data types:

Int32: equivalent to MySQL int

Int64: equivalent to MySQL bigint

Float64: equivalent to MySQL double

String

Date: yyyy-MM-dd

DateTime: yyyy-MM-dd HH:mm:ss

4.3 Table engines

create table Tb(
	column_name column_type
)engine=MergeTree()
order by column_name

4.4 Functions

Window (OVER) functions

--in version 21.x window functions are experimental and must be enabled
Set allow_experimental_window_functions = 1; -- no need to memorize this; if it isn't set, the error message will tell you
--the common window functions are supported, e.g.:
row_number()
rank()
dense_rank()

Date functions

select now(); --returns a DateTime
SELECT FROM_UNIXTIME(1669366430);--convert a seconds-level timestamp to DateTime
select toUnixTimestamp(now());--convert a DateTime to a seconds-level timestamp
select toUnixTimestamp('2022-11-25 16:50:45'); --convert a String to a seconds-level timestamp
select formatDateTime(FROM_UNIXTIME(1669366430),'%Y/%m/%d %H/%M/%S'); --format a date/time

4.5 JDBC
Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
val conn: Connection = DriverManager.getConnection("jdbc:clickhouse://hdp1:8123")
