myflink

This article covers ElasticSearch command-line operations (creating, deleting and querying indexes; adding, updating and deleting documents), then walks through Flink batch and streaming WordCount examples, covering sources, transformations, grouping and window operations. It also covers the Flink Java/Scala API, sources and sinks, window types, event time and watermarks, data cleaning, late-data handling, async I/O, real-time TopN, and Flink SQL. Finally it briefly introduces Flink CDC and CEP, plus HBase and ClickHouse basics.

1.ElasticSearch

1.1 ES commands:

1 a.html 清华大学,简称清华
2 b.html 清华大学高考分数线,……
3 c.html 清华研究生招生网
4 d.html 北京天安门
5 e.html hello kitty

mysearch: {"id":"1","filename":"a.html","content":"清华大学,简称清华"}

Indexes:

Create an index:
PUT mysearch
PUT mysearch2
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3, 
            "number_of_replicas" : 2 
        }
    }
}
​
Delete an index:
DELETE person
​
View index info:
GET mysearch2
Response:
{
  "mysearch2" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "creation_date" : "1681875342768",
        "number_of_shards" : "3",
        "number_of_replicas" : "2",
        "uuid" : "r3Q7_PsZQyGGljoCp6xxWg",
        "version" : {
          "created" : "6060299"
        },
        "provided_name" : "mysearch2"
      }
    }
  }
}

Documents:

Add / update documents:

{"id":"1","filename":"a.html","content":"清华大学,简称清华"}
{"id":"2","filename":"b.html","content":"清华大学高考分数线"}
{"id":"3","filename":"c.html","content":"清华研究生招生网"}
{"id":"4","filename":"d.html","content":"北京天安门"}
{"id":"5","filename":"e.html","content":"hello kitty"}

PUT mysearch/_doc/1
{"id":"1","filename":"a.html","content":"清华大学,简称清华"}
​
PUT mysearch/_doc/2
{"id":"2","filename":"b.html","content":"清华大学高考分数线"}
​
PUT mysearch/_doc/3
{"id":"3","filename":"c.html","content":"清华研究生招生网"}
​
PUT mysearch/_doc/4
{"id":"4","filename":"d.html","content":"北京天安门"}
​
PUT mysearch/_doc/5
{"id":"5","filename":"e.html","content":"hello kitty"}
​
POST mysearch/_doc/
{"id":"6","filename":"f.html","content":"hello snoopy"}
// POST (without an _id) lets ES auto-generate the document id; PUT requires an explicit id
​
Get a document by _id:
    GET mysearch/_doc/3
    Response:
    {
      "_index" : "mysearch",
      "_type" : "_doc",
      "_id" : "3",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "id" : "3",
        "filename" : "c.html",
        "content" : "清华研究生招生网"
      }
    }
​
​
    
Delete a document:
DELETE /mysearch/_doc/1
​
Search documents:
POST mysearch/_search
{
    "query" : {
        "match" : {"content":"清华大学"}
    }
}
​
Search with highlighting (the highlighted field must match the field being queried):
GET /_search
{
    "query" : {
        "match": { "content": "清华大学" }
    },
    "highlight" : {
      "pre_tags":"<font color='red'>",
      "post_tags": "</font>", 
        "fields" : {
            "content" : {}
        }
    }
}

Tokenization. How "hello kitty" is tokenized (inverted index: term -> documents):

hello  e.html, f.html
kitty  e.html

Tokenization of 清华大学 as single characters (default analyzer):

清   b.html,c.html
华   ……
大   ……
学   ……

Tokenization as words (with a Chinese analyzer):

清华     ……
大学     ……
清华大学 ……

1.2 ES Java API:

Coding structure: 1. create the connection 2. build the Request 3. submit the Request 4. process the result set
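A minimal Scala sketch of this structure, assuming the high-level REST client that also appears in the sink examples later (host hdp1:9200, index mysearch and the query field come from the examples above; the object name is illustrative):

import org.apache.http.HttpHost
import org.elasticsearch.action.search.{SearchRequest, SearchResponse}
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder

object EsSearchDemo {
  def main(args: Array[String]): Unit = {
    // 1. create the connection
    val client = new RestHighLevelClient(RestClient.builder(new HttpHost("hdp1", 9200, "http")))
    // 2. build the Request
    val request = new SearchRequest("mysearch")
    request.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("content", "清华大学")))
    // 3. submit the Request
    val response: SearchResponse = client.search(request, RequestOptions.DEFAULT)
    // 4. process the result set
    response.getHits.getHits.foreach(hit => println(hit.getSourceAsString))
    client.close()
  }
}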

2. Flink

Official docs: Apache Flink Documentation | Apache Flink

2.1 Flink wordcount

Batch WordCount: Overview | Apache Flink

import org.apache.flink.api.scala._
​
object WordCount {
  def main(args: Array[String]) {
​
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")
​
    val counts = text
      .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
​
    counts.print()
  }
}

Streaming WordCount: Overview | Apache Flink

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
​
object WindowWordCount {
  def main(args: Array[String]) {
​
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
​
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum(1)
​
    counts.print()
​
    env.execute("Window Stream WordCount")
  }
}

Processing steps:

1. Create the batch/stream execution environment

2. Read the data source (Kafka, MySQL, etc.)

3. Transform and analyze

4. Write the results out (HDFS, MySQL, Redis, etc.)

5. Submit for execution (deploy)

2.2 Packaging and deployment

A job is submitted as a jar package.

Standalone:

Overview | Apache Flink

Requires starting the Flink processes: bin/start-cluster.sh

StandaloneSessionClusterEntrypoint: manages all jobs

TaskManagerRunner: executes the operator tasks

Web UI: http://hdp1:8081/#/overview

Resource allocation unit: Slot

Submit a job: ./bin/flink run ./examples/streaming/TopSpeedWindowing.jar

>Note: if you have previously submitted jobs on YARN, going back to standalone will fail, because YARN's temporary properties file still exists and the client will try to run the job on YARN. Fix: delete the YARN temp file: rm -rf /tmp/.yarn-properties-root

Yarn:

YARN | Apache Flink

No Flink processes need to be started beforehand:

start-all.sh :

Web UI: http://hdp1:8088

Resource allocation unit: Container

Submit a job:

Session deployment mode:

First start a yarn-session (the broker): ./bin/yarn-session.sh --detached

Submit a job: ./bin/flink run ./examples/streaming/TopSpeedWindowing.jar

Application deployment mode:

Submit a job:

./bin/flink run-application -t yarn-application ./examples/streaming/TopSpeedWindowing.jar

Per-job deployment mode:

Submit a job:

./bin/flink run -t yarn-per-job --detached ./examples/streaming/TopSpeedWindowing.jar

Submit your own packaged job to the YARN session:

./bin/flink run --class com.bawei.wc2.WindowWordCount /root/myflink.jar

2.3 Connectors (source and sink)
2.3.1 source:

(1)kafka:Kafka | Apache Flink

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "test")
val stream = env
    .addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))

(2) Custom source (see the sketch below):
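A minimal sketch of a custom source, assuming a RichSourceFunction that emits one test record per second (the class name MySource is illustrative):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

class MySource extends RichSourceFunction[String] {
  @volatile private var running = true
  // open resources here (e.g. a database or MQ connection)
  override def open(parameters: Configuration): Unit = {}
  // keep emitting records until the source is cancelled
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (running) {
      ctx.collect("hello " + System.currentTimeMillis())
      Thread.sleep(1000)
    }
  }
  // called when the job is cancelled
  override def cancel(): Unit = running = false
  // release resources here
  override def close(): Unit = {}
}

//usage: val stream = env.addSource(new MySource)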

2.3.2 sink:

(1)kafka:Kafka | Apache Flink

val properties = new Properties
properties.setProperty("bootstrap.servers", "hdp1:9092")
val myProducer = new FlinkKafkaProducer[String]("test", new SimpleStringSchema(), properties)
stream.addSink(myProducer)

(2)hdfs:Streaming File Sink | Apache Flink

val input: DataStream[String] = ...

val sink: StreamingFileSink[String] = StreamingFileSink
    .forRowFormat(new Path(outputPath), new SimpleStringEncoder[String]("UTF-8"))
    .withRollingPolicy(
        DefaultRollingPolicy.builder()
            .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
            .withMaxPartSize(1024 * 1024 * 1024)
            .build())
    .build()

input.addSink(sink)

/*
Rolling policy explained: a part file is rolled when
- it contains at least 15 minutes worth of data,
- it hasn't received new records for the last 5 minutes, or
- the file size has reached 1 GB (after writing the last record).
*/

Custom sink:

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

//the data to write
val input: DataStream[T] = ...
//attach the custom sink
input.addSink(new MySink())

class MySink extends RichSinkFunction[T]{
    //open the database connection
    override def open(parameters: Configuration): Unit = {}
    //close the connection
    override def close(): Unit = {}
    //perform the insert for each record
    override def invoke(value: T, context: SinkFunction.Context): Unit = {}
}

(3)mysql: 3306

//create the connection
Class.forName("com.mysql.jdbc.Driver")
conn = DriverManager.getConnection("jdbc:mysql://hdp1:3306/mydb?characterEncoding=utf8", "root", "root")
ps = conn.prepareStatement("insert into t_wc values(?,?)")
	
//insert data
ps.setString(1,value._1)
ps.setInt(2,value._2)
ps.executeUpdate()

(4)redis: 6379

//create the connection
jedis = new Jedis("hdp1")
//insert data
jedis.set(value._1,value._2+"")

(5) es:9200

//create the connection
 client = new RestHighLevelClient(RestClient.builder(new HttpHost("hdp1", 9200, "http")))
//index a document
val request: IndexRequest = new IndexRequest("t_wc", "doc")
request.source(json, XContentType.JSON)
val indexResponse: IndexResponse = client.index(request, RequestOptions.DEFAULT)

(6)hbase:

(7)clickhouse:
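A sketch of a ClickHouse sink, reusing the custom-sink pattern above together with the ClickHouse JDBC driver shown in section 4.5 (the table name t_wc and the (word, count) record type are illustrative):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class ClickhouseSink extends RichSinkFunction[(String, Int)] {
  var conn: Connection = _
  var ps: PreparedStatement = _

  // open the JDBC connection once per parallel instance
  override def open(parameters: Configuration): Unit = {
    Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
    conn = DriverManager.getConnection("jdbc:clickhouse://hdp1:8123")
    ps = conn.prepareStatement("insert into t_wc values(?,?)")
  }

  // insert one record per invocation
  override def invoke(value: (String, Int), context: SinkFunction.Context): Unit = {
    ps.setString(1, value._1)
    ps.setInt(2, value._2)
    ps.executeUpdate()
  }

  // release the connection
  override def close(): Unit = {
    if (ps != null) ps.close()
    if (conn != null) conn.close()
  }
}

//usage: stream.addSink(new ClickhouseSink)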

2.4 Data cleaning, transformation, and keyed rolling-aggregation operators

1. Cleaning / transformation operators:

map

flatMap

filter:

//filter: drop records that are not valid JSON
stream2.filter(x => {
    var isJson = true //assume it parses (note: an empty line also parses successfully)
    try {
        //try to parse
        JSON.parseObject(x)
    } catch {
        case e: Exception => isJson = false
    }
    isJson && !x.isEmpty
})

2. Stream combination operators:

union: merges multiple streams; all streams must have exactly the same type

connect: connects two streams; the two streams may have different types

3. Keyed grouping and rolling aggregation:

keyBy ->

sum

max/maxBy/min/minBy

reduce

Understanding rolling aggregation:

001,手机,iphone12,1,1646038440
004,手机,iphone12,4,1646038443
005,手机,小米16,3,1646038443

Output:
sumDS1>>>> (001,手机,iphone12,1.0,1646038440000)
sumDS1>>>> (001,手机,iphone12,5.0,1646038440000)
sumDS1>>>> (001,手机,iphone12,8.0,1646038440000)
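A sketch of the rolling sum that produces this output, assuming the records have already been parsed into a mapDS of (id, category, product, amount: Double, ts: Long) tuples:

val sumDS1 = mapDS
  .keyBy(_._2)   // key by the category field ("手机")
  .sum(3)        // rolling sum on the amount field; an updated result is emitted for every incoming record
sumDS1.print("sumDS1>>>")

With a rolling aggregation the non-aggregated fields keep the values of the first record seen for that key, which is why the id and timestamp stay at 001 / 1646038440000 in the output above.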

Window aggregation:

The computation is triggered only when the window's end time is reached, and each firing computes only the data that falls inside that window.

2.5 Window types:

Windows | Apache Flink

Time windows (the important case):

Tumbling window:

input
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>)

Sliding window:

input
    .keyBy(<key selector>)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)

Session window:

input
    .keyBy(<key selector>)
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)

Global window: omitted.

Count windows (for awareness):

Tumbling count window:

Sliding count window:
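A minimal sketch of count windows, assuming a keyed stream and following the pseudocode style above:

// tumbling count window: fires once every 100 elements per key
input.keyBy(<key selector>).countWindow(100)
// sliding count window: window of 100 elements, sliding every 10 elements
input.keyBy(<key selector>).countWindow(100, 10)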

2.6 Window functions:

Window data transformation:

apply: gives access to the window's collection of elements, the window object (with its start and end time), and the grouping key.

Incremental window aggregation:

Windows | Apache Flink

reduce:

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce(new MyReduceFunction,new MyReduceWindowFunction)
//the output of MyReduceFunction becomes the input of MyReduceWindowFunction
//the final return type of reduce is determined by the return type of MyReduceWindowFunction
//incremental aggregation
class MyReduceFunction extends ReduceFunction[(String, Double)] {
  override def reduce(x: (String, Double), y: (String, Double)): (String, Double) = {
    (x._1,x._2+y._2)
  }
}
//transform and emit the window result
//type parameters: input, output, key, TimeWindow
//the input here is the output of MyReduceFunction
class MyReduceWindowFunction extends WindowFunction[(String, Double),String,String,TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[(String, Double)], out: Collector[String]): Unit = {
    // input holds the already-aggregated result, e.g.:
    //reduceDS>>>>> (手机,8.0)
    //reduceDS>>>>> (服装,2.0)
    //reduceDS>>>>> (食品,8.0)
    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    for(i <-input){
      out.collect(sdf.format(window.getStart)+","+sdf.format(window.getEnd)+","+key+","+i._2)
    }

  }
}

aggregate (omitted):

Full-window (all-at-once) aggregation:

process:

val input: DataStream[(String, Long)] = ...

input
  .keyBy(_._1)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .process(new MyProcessWindowFunction())

class MyProcessWindowFunction extends ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
  def process(key: String, context: Context, input: Iterable[(String, Long)], out: Collector[String]) = {
    var count = 0L
    for (in <- input) {
      count = count + 1
    }
    out.collect(s"Window ${context.window} count: $count")
  }
}

2.7 Event time

Timely Stream Processing | Apache Flink

//assign the event-time field (ascending timestamps, in milliseconds)
val timeDS: DataStream[(String, String, String, Double, Long)] = mapDS.assignAscendingTimestamps(x => x._5)

Window start and end times:

Whether using processing time or event time, the moment a record arrives, the window it belongs to is already determined:

5-second tumbling window:

16:54:13 -> [16:54:10, 16:54:15)

20-second tumbling window:

16:54:34 -> [16:54:20, 16:54:40)

Windows of 5 seconds / 5 minutes: 0-5, 5-10, 10-15, 15-20, ……

1-hour window:

15:34:00 -> [15:00:00, 16:00:00)

1-day window (windows are aligned to the epoch, i.e. UTC, so in UTC+8 a day window runs from 08:00 to 08:00 local time):

2023-04-25 16:42:00 -> [2023-04-25 08:00:00, 2023-04-26 08:00:00)
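The alignment above follows from how Flink computes a window's start (see TimeWindow.getWindowStartWithOffset); a small sketch of the formula:

// start of the window that a timestamp falls into (offset defaults to 0, i.e. epoch/UTC-aligned)
def windowStart(timestamp: Long, windowSize: Long, offset: Long = 0L): Long =
  timestamp - (timestamp - offset + windowSize) % windowSize

// e.g. for a 5 s tumbling window, a record at 16:54:13 lands in [16:54:10, 16:54:15)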

2.8 watermark

Generating Watermarks | Apache Flink

Watermarks are used to deal with late data under event-time semantics.

A watermark delays when the window is triggered, which mitigates the late-data problem to a certain extent.

WatermarkStrategy
  .forBoundedOutOfOrderness[(Long, String)](Duration.ofSeconds(20))
  .withTimestampAssigner(new SerializableTimestampAssigner[(Long, String)] {
    override def extractTimestamp(element: (Long, String), recordTimestamp: Long): Long = element._1
  })

2.9 Collecting late data

Whether or not a watermark is used, as long as the time semantics is event time, a window may still receive late data.

//create a side-output tag for late data
val tag  = new OutputTag[(String, String, String, Double, Long)]("late-data")
//when windowing, route late records into the side output via sideOutputLateData
val mainDS = timeDS
    .keyBy(_._2)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .sideOutputLateData(tag)
	.process()/reduce()/apply()……
//mainDS is the DataStream returned by the main-stream computation
//extract the side output (late data) from the main stream
val sideDS = mainDS.getSideOutput(tag)

2.10 Joins

DataStream-to-DataStream joins, e.g. an order stream from Kafka joined with an order-detail stream from another Kafka topic.

Window Join / Window CoGroup: (supports both processing time and event time)

Overview | Apache Flink

dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply { ... }

Interval Join: (currently only supports event time)

Joining | Apache Flink

orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound
For example, with bounds [-2, 1]:
orangeElem.ts = 2
requires greenElem.ts to be in [0, 3]: 0, 1, 2, 3

Example with bounds [-2, 1]:
orange stream:
001,90,1,1646038442
green stream:
1,001,挂面,1646038440
2,001,口红,1646038441
3,001,皮鞋,1646038442
4,001,书包,1646038443
5,001,大米,1646038444
The orange record at 1646038442 matches green records whose timestamps lie in [1646038440, 1646038443].

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

...

val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream
    .keyBy(elem => /* select key */)
    .intervalJoin(greenStream.keyBy(elem => /* select key */))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction[Integer, Integer, String] {
        override def processElement(left: Integer, right: Integer, ctx: ProcessJoinFunction[Integer, Integer, String]#Context, out: Collector[String]): Unit = {
          out.collect(left + "," + right)
        }
    })

Joining a DataStream with a database, e.g. an order stream from Kafka joined with a province table stored in MySQL/Redis/etc.

Async I/O

Async I/O | Apache Flink

001,90,1,1646038440
002,89,3,1646038441
003,91,9,1646038442
004,10,8,1646038444
005,90,7,1646038445

After the join:
001,90,1,1646038440,北京,1,110000,CN-11,CN-BJ
002,89,3,1646038441
003,91,9,1646038442
004,10,8,1646038444
005,90,7,1646038445

1,北京,1,110000,CN-11,CN-BJ
key:"province:1"  value:"北京,1,110000,CN-11,CN-BJ"
key:"province:1"  value:"{'name':"北京",1,110000,CN-11,CN-BJ}"

1,pbgkadgis,99,嘉嘉,顾瑞凡
key:"user:1"  value:"1,pbgkadgis,99,嘉嘉,顾瑞凡"

//the input data to be enriched
val oiDS:DataStream[...] = ...

//call async I/O to do the enrichment
val resultStream: DataStream[((String, Double, String, Long),BaseProvince)] =
AsyncDataStream.unorderedWait(oiDS, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100)


//the async function: for each input record, query the related data from the database
//(1) create the database connection
//(2) run the query
//(3) return the joined result
class AsyncDatabaseRequest extends RichAsyncFunction[(String, Double, String, Long),((String, Double, String, Long),BaseProvince)] {

  var jedis:Jedis = null
  //open resources
  override def open(parameters: Configuration): Unit = {
    jedis = new Jedis("hdp1")
  }
  //close resources
  override def close(): Unit = jedis.close()

  override def asyncInvoke(input: (String, Double, String, Long), resultFuture: ResultFuture[((String, Double, String, Long), BaseProvince)]): Unit = {
    //input: IN
    //resultFuture:OUT
    //use the province id from the input record to fetch the province details from Redis
    val provincevalue: String = jedis.get("province:" + input._3) //province:1  {"area_code":"110000","name":"北京","region_id":"1","iso_3166_2":"CN-BJ","id":"1","iso_code":"CN-11"}
    //convert the province JSON string into a BaseProvince object
    val baseProvince = JSON.parseObject(provincevalue,new BaseProvince().getClass)
    //emit the enriched (wide) record
    resultFuture.complete(Iterable((input,baseProvince)))
  }

}

2.11 Process functions

Flink provides 8 Process Functions:

· ProcessFunction: stream-splitting (side output) example

· KeyedProcessFunction: alerting, TopN

· CoProcessFunction

· ProcessJoinFunction

· BroadcastProcessFunction

· KeyedBroadcastProcessFunction

· ProcessWindowFunction: full-window aggregation

· ProcessAllWindowFunction: full-window aggregation

Stream-splitting example:

//define side-output tags
val startTag = new OutputTag[String]("start")
val displaysTag = new OutputTag[String]("displays")
//call process() on the input stream to split it
val mainDS = inputDataStream.process(
	new ProcessFunction[/* input type */, String](context:...,collector:...){
        //emit to a side output
        context.output(displaysTag, i) 
        //emit to the main output
        collector.collect(i)
    }
)
//print the main stream
mainDS.print()
//extract a side output from the main stream by passing the matching tag
mainDS.getSideOutput(startTag).print()

Temperature alert example (see the sketch after the timer notes below):

 ValueState[T] stores a single value of type T.
//create the state handle
lazy val state: ValueState[T] = getRuntimeContext.getState(new ValueStateDescriptor[T]("wendu", classOf[T]))
//read the value
o get: ValueState.value()
//update the value
o set: ValueState.update(value: T)


 ListState[T] stores a list whose elements are of type T. Basic operations:
lazy val state: ListState[T] = getRuntimeContext.getListState(new ListStateDescriptor[T]("wendu", classOf[T]))
o ListState.add(value: T)
o ListState.addAll(values: java.util.List[T])
o ListState.get() returns Iterable[T]
o ListState.update(values: java.util.List[T])

 MapState[K, V] stores key-value pairs.
  lazy val state: MapState[String,Double] = getRuntimeContext.getMapState(new MapStateDescriptor[String,Double]("score", classOf[String], classOf[Double]))

o MapState.get(key: K)
o MapState.put(key: K, value: V)
o MapState.contains(key: K)
o MapState.remove(key: K)

 ReducingState[T]
 AggregatingState[I, O]

Timers:

//register an event-time timer for a future timestamp (e.g. 5:00)
ctx.timerService.registerEventTimeTimer(futureTimestamp)
//delete a timer: pass the same timestamp that was used when registering it (e.g. 5:00)
ctx.timerService.deleteEventTimeTimer(timestampUsedWhenRegistering)
//when the timer fires, the callback method onTimer() is invoked automatically
override def onTimer(...): Unit = {
    //alert logic
}
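Putting state and timers together, a sketch of the temperature-alert pattern, assuming input records of (sensorId, temperature) and a 10-second processing-time timer (names and the 10 s threshold are illustrative):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

class TempAlertFunction extends KeyedProcessFunction[String, (String, Double), String] {

  // last temperature seen for this key
  lazy val lastTemp: ValueState[Double] =
    getRuntimeContext.getState(new ValueStateDescriptor[Double]("wendu", classOf[Double]))
  // timestamp of the currently registered timer (0 = none)
  lazy val timerTs: ValueState[Long] =
    getRuntimeContext.getState(new ValueStateDescriptor[Long]("timer", classOf[Long]))

  override def processElement(value: (String, Double),
                              ctx: KeyedProcessFunction[String, (String, Double), String]#Context,
                              out: Collector[String]): Unit = {
    val prev = lastTemp.value()
    lastTemp.update(value._2)
    if (value._2 > prev && timerTs.value() == 0) {
      // temperature rose: register a timer 10 s in the future
      val ts = ctx.timerService().currentProcessingTime() + 10000
      ctx.timerService().registerProcessingTimeTimer(ts)
      timerTs.update(ts)
    } else if (value._2 <= prev && timerTs.value() != 0) {
      // temperature dropped: cancel the pending timer
      ctx.timerService().deleteProcessingTimeTimer(timerTs.value())
      timerTs.update(0L)
    }
  }

  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, (String, Double), String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    out.collect(s"warning: temperature of ${ctx.getCurrentKey} kept rising for 10 s")
    timerTs.update(0L)
  }
}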

TopN examples:

Window TopN:

Non-aggregated TopN, e.g.: for every 5 seconds, the top three exam scores of each student

.keyBy()
.window()
.process(new ProcessWindowFunction[.....]( elements: the window's elements ){
    	elements.sortBy(...).take(n)
})

TopN after aggregation (the important case): for every 5 seconds, the three students with the highest average score, plus their averages

//step 1: aggregate per window
.keyBy()
.window()
.process(new ProcessWindowFunction[....](elements: the window's elements){
    //aggregate the elements: average / sum / count / max / min, etc.
    out.collect((start, end, key, aggregatedValue))
})
//step 2: key by the window time
aggregatedDS
.keyBy(window start time or end time)
.process(new KeyedProcessFunction[key,in,out](){
    val listState = ... //used to collect the aggregated results of each window
    processElement(i: input record, context, out){
        listState.add(i)
        //register a timer for the window end time plus a small delay
        context.timerService.registerEventTimeTimer(windowEnd + n)
    }
    
    onTimer(){
        listState.asScala.toList.sortBy(...).take(3)
    }
})

Real-time (continuous) TopN:

TopN without aggregation, e.g.: in real time, each student's top three exam scores. Omitted.

TopN after aggregation, e.g.: in real time, the three students with the highest average score, plus their averages.

//step 1: route every aggregated record to a single key
val topn: DataStream[String] = avgDS
    .map(x => ("all", x))
    .keyBy(_._1)
    .process(new KeybyTopNProcess)
//step 2: compute the TopN in a KeyedProcessFunction (asScala requires import scala.collection.JavaConverters._)
class KeybyTopNProcess extends KeyedProcessFunction[String,(String,(String,Double)),String] {

  lazy val state: MapState[String,Double] = getRuntimeContext.getMapState(new MapStateDescriptor[String,Double]("score", classOf[String], classOf[Double]))

  override def processElement(i: (String, (String, Double)), context: KeyedProcessFunction[String, (String, (String, Double)), String]#Context, collector: Collector[String]): Unit = {
    state.put(i._2._1,i._2._2)
    val top3 = state.iterator().asScala.toList.map(x=>(x.getKey,x.getValue)).sortBy(-_._2).take(3).toString()
    //top>>>>>> List((zhangsan,7.5), (lisi,7.0), (zhaoliu,6.0))
    collector.collect(top3)
  }
}

2.12 Flink SQL

source -> sql -> sink

Flink SQL execution environment: Concepts & Common API | Apache Flink

Creating a unified batch/stream table environment:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

EnvironmentSettings settings = EnvironmentSettings
    .newInstance()
    .inStreamingMode()
    //.inBatchMode()
    .build();

TableEnvironment tEnv = TableEnvironment.create(settings);

tEnv.executeSql("""
	create table table_name(
		column_name column_type
		...
	)WITH(
		-- connector options for the data source
	)
""")

//connector options for WITH: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/filesystem/

Creating a streaming-only table environment:

If you need to analyze a stream with SQL, you can use the unified environment above, or the dedicated streaming environment below (recommended):

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

//with a streaming table environment created this way, tables can be defined in two ways
//option 1:
tEnv.executeSql("""
	create table table_name(
		column_name column_type
		...
	)WITH(
		-- connector options for the data source
	)
""")
//option 2:
val stream:DataStream = env.addSource()
tEnv.createTemporaryView("table_name", stream, $("field_name")....)

Reading data from Kafka and creating a table:

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

//option 1:
tEnv.executeSql("""
    CREATE TABLE KafkaTable (
      `user_id` BIGINT,
      `item_id` BIGINT,
      `behavior` STRING,
      `ts` TIMESTAMP(3) METADATA FROM 'timestamp'
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'user_behavior',
      'properties.bootstrap.servers' = 'localhost:9092',
      'properties.group.id' = 'testGroup',
      'scan.startup.mode' = 'earliest-offset',
      'format' = 'csv'
    )
""")
//option 2:
val stream:DataStream = env.addSource(new FlinkKafkaConsumer()...)
tEnv.createTemporaryView("table_name", stream, $("field_name")....)

Flink SQL Window

//step 1: extract the event time
//option 1:
val timeDS = DataStream.assignAscendingTimestamps(_._5)
tEnv.createTemporaryView(tableName, timeDS, ...., $("ts").rowtime)
//| ts | TIMESTAMP(3) *ROWTIME* |  2022-02-28 08:54:00.000

//option 2:
tEnv.executeSql( //TO_TIMESTAMP('2023-04-12 12:12:00')
    """
        |create table t_order(
        | ....
        | ts AS TO_TIMESTAMP_LTZ(times*1000,3),
        | WATERMARK FOR ts AS ts - INTERVAL '0' SECOND
        |)WITH(
        |'connector' = 'kafka',
        |  ...
        |)
        |""".stripMargin)

//step 2: define the window:
//https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/window-tvf/
//a 5-second tumbling window table over t_order
TABLE(TUMBLE(TABLE t_order ,DESCRIPTOR(ts),INTERVAL '5' SECOND))
//the windowed table has two extra columns: window_start and window_end
//note: when querying the windowed table, group by window_start and window_end.
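A sketch of querying the windowed table, assuming the t_order table and ts watermark defined above (the category and amount column names are illustrative):

val result = tEnv.sqlQuery(
  """
    |SELECT window_start, window_end, category, SUM(amount) AS total
    |FROM TABLE(TUMBLE(TABLE t_order, DESCRIPTOR(ts), INTERVAL '5' SECOND))
    |GROUP BY window_start, window_end, category
    |""".stripMargin)
result.execute().print()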

WindowTopN

Window Top-N | Apache Flink

//Note: only ROW_NUMBER() is supported, RANK() is not:   RANK() function is not supported on Window TopN currently, only ROW_NUMBER() is supported.
//Note: in rk<n, n must be greater than 1, i.e. Top1 alone is not allowed but Top2, Top3, Top4, …… are:  Illegal rank end 1, it should be bigger than 1!
//Note: write the complete TopN query before running it
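A sketch of a complete Window Top-N query following the notes above, assuming the same windowed aggregation (top 3 categories by total per 5-second window; ROW_NUMBER() only, and rk <= 3 satisfies the "bigger than 1" rule):

tEnv.executeSql(
  """
    |SELECT * FROM (
    |  SELECT *, ROW_NUMBER() OVER (
    |      PARTITION BY window_start, window_end ORDER BY total DESC) AS rk
    |  FROM (
    |    SELECT window_start, window_end, category, SUM(amount) AS total
    |    FROM TABLE(TUMBLE(TABLE t_order, DESCRIPTOR(ts), INTERVAL '5' SECOND))
    |    GROUP BY window_start, window_end, category
    |  )
    |) WHERE rk <= 3
    |""".stripMargin).print()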

Flink SQL sink:

Option 1: insert into a result table

tEnv.executeSql("""
	create table result_table(
		column_name column_type
	)with(
		-- connector for the sink destination
	)
""")

table.executeInsert("result_table")

Option 2: Table -> DataStream[Row] -> addSink

val ds: DataStream[(Boolean, Row)] = tEnv.toRetractStream[Row](table)
val ds2: DataStream[Row] = ds.filter(_._1).map(_._2)

ds2.map(x=>x.toString).addSink(....)

2.13 Flink CDC


GitHub - ververica/flink-cdc-connectors: CDC Connectors for Apache Flink®

SourceRecord{sourcePartition={server=mysql_binlog_source}, sourceOffset={ts_sec=1683169495, file=mysql-bin.000003, pos=8003731, row=1, server_id=1, event=2}} ConnectRecord{topic='mysql_binlog_source.mydb.t_a', kafkaPartition=null, key=Struct{id=1}, keySchema=Schema{mysql_binlog_source.mydb.t_a.Key:STRUCT}, value=Struct{before=Struct{id=1,name=zhangsan4,age=13},after=Struct{id=1,name=zhangsan5,age=13},source=Struct{version=1.4.1.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1683169495000,db=mydb,table=t_a,server_id=1,file=mysql-bin.000003,pos=8003853,row=0,thread=383},op=u,ts_ms=1683169495522}, valueSchema=Schema{mysql_binlog_source.mydb.t_a.Envelope:STRUCT}, timestamp=null, headers=ConnectHeaders(headers=)}

 1,zhangsan5,13
{"id":1,"name":"zhangsan5","age":13}
{"tb":"t_a","after":{"id":1,"name":"zhangsan5","age":13}}

2.14 Flink CEP

1. Input stream

2. Define the pattern (rule)

3. CEP.pattern(input, pattern)

4. Select the data that matches the pattern

5. Extract timed-out (partial) matches via a side output

A sketch of these steps follows below.
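A minimal sketch, assuming a hypothetical LoginEvent(userId, status, ts) stream and a rule of two consecutive failures within 10 seconds:

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class LoginEvent(userId: String, status: String, ts: Long)

// 1. input (loginStream: DataStream[LoginEvent] is assumed to exist)
// 2. define the rule: two consecutive "fail" events within 10 seconds
val pattern = Pattern.begin[LoginEvent]("first").where(_.status == "fail")
  .next("second").where(_.status == "fail")
  .within(Time.seconds(10))

// 3. apply the pattern to the keyed input stream
val patternStream = CEP.pattern(loginStream.keyBy(_.userId), pattern)

// 4. select the matching data (a timeout side output could be added via flatSelect with an OutputTag)
val alerts = patternStream.select(m => s"user ${m("first").head.userId} failed twice in 10s")
alerts.print()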

3. Hbase

3.1 Installation

Requires ZooKeeper and Hadoop.

1. Extract HBase

[root@hdp1 software]# tar -zxvf hbase-1.3.3-bin.tar.gz -C /opt/module/

2. cd conf and edit the configuration files

vim hbase-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_351
export HBASE_MANAGES_ZK=false

3.vim hbase-site.xml

<configuration>
	<property>     
		<name>hbase.rootdir</name>     
		<value>hdfs://hdp1:8020/hbase</value>   
	</property>
	<property>   
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>

   <!-- new since 0.98: earlier versions had no .port property and the default port was 60000 -->
	<property>
		<name>hbase.master.port</name>
		<value>16000</value>
	</property>

	<property>   
		<name>hbase.zookeeper.quorum</name>
	     <value>hdp1</value>
	</property>
	<property>   
		<name>hbase.zookeeper.property.dataDir</name>
	     <value>/opt/module/apache-zookeeper-3.5.7-bin/data</value>
	</property>
</configuration>

4.vim regionservers

hdp1

5. Create symlinks to the Hadoop config files (in conf):

ln -s /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml  core-site.xml
ln -s /opt/module/hadoop-3.1.3/etc/hadoop/hdfs-site.xml  hdfs-site.xml

3.2 Starting HBase

Start Hadoop and ZooKeeper first.

bin/start-hbase.sh

Check the processes:

[root@hdp1 hbase-1.3.3]# jps
2176 SecondaryNameNode
6209 HMaster
2612 NodeManager
6372 HRegionServer
3015 QuorumPeerMain
1801 NameNode
2473 ResourceManager
1949 DataNode
6589 Jps
3423 Kafka

3.3 Logging into the shell client

bin/hbase shell

Note:

Where HBase stores its data: in HDFS, at the location pointed to by hbase.rootdir in hbase-site.xml.

Where ZooKeeper stores its data: the location pointed to by dataDir in its config file.

Where Kafka stores its data: the location pointed to by log.dirs in server.properties.

Both HBase and Kafka ship with an embedded ZooKeeper, and both store a small amount of data in ZooKeeper.

3.4 Basic commands

Namespace commands (a namespace is similar to a MySQL database):

alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

list_namespace: list all namespaces
create_namespace: create a namespace
list_namespace_tables: list all tables in a namespace

DDL: table-structure operations

alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters

create: create a table
drop: delete a table
disable/enable: disable / enable a table
list: list all tables in the user namespaces
describe: show the table structure

create 'student','info','extra'
create 'myns:student','info','extra'

DML: table-data operations

append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

Insert / update:
001,zhangsan,13,pingpong
002,lisi,14

put 't1', 'r1', 'c1', 'value'
put 'student','001','info:name','zhangsan'
put 'student','001','info:age','14'
put 'student','001','info:hobby','pingpong'

put 'student','002','info:name','lisi'
put 'student','002','info:age','14'

Read:
scan 'student'
get 'student','001'

Delete:
delete
deleteall 'student','001'

3.5 Phoenix

1. Extract Phoenix

tar -zxvf phoenix-hbase-1.3-4.16.1-bin.tar.gz -C /opt/module

2. Copy the server jar from the extracted directory into hbase/lib

cp phoenix-server-hbase-1.3-4.16.1.jar /opt/module/hbase-1.3.3/lib/

3. Restart HBase

4. Access HBase through Phoenix

Command-line client:

[root@hdp1 bin]# ./sqlline.py hdp1:2181

API client:

Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val conn: Connection = DriverManager.getConnection("jdbc:phoenix:hdp1:2181")

If a table already exists in HBase, create the corresponding mapping table in Phoenix:

 create table "student"(id varchar primary key,
                     "info"."name" varchar, 
                     "info"."age" varchar,
                     "info"."hobby" varchar) 
                     column_encoded_bytes=0;
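A sketch of querying the mapped table through the JDBC connection above (lowercase table/column names must be double-quoted in Phoenix):

val ps = conn.prepareStatement("select ID, \"name\", \"age\", \"hobby\" from \"student\"")
val rs = ps.executeQuery()
while (rs.next()) {
  println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3) + "," + rs.getString(4))
}
rs.close(); ps.close(); conn.close()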

4. Clickhouse

4.1 Installation

1) Create the directory: mkdir /opt/software/clickhouse

2) Upload the rpm packages into that directory

3) Edit the system config: vim /etc/selinux/config

SELINUX=disabled

4) Install the dependency online: yum install -y libtool

5) cd /opt/software/clickhouse

rpm -ivh *.rpm

When the installation pauses at the password prompt, just press Enter.

Enter password for default user: press Enter for an empty password; if you do type a password, you will have to log in with it.

6) Edit the configuration file

cd /etc/clickhouse-server/

Make the config file writable:

chmod 644 config.xml

Edit the file:

vim config.xml

Uncomment line 156: <listen_host>::</listen_host>

7) Start

Option 1: start as a service

systemctl start clickhouse-server

If you don't want clickhouse-server to start on boot, disable it: systemctl disable clickhouse-server

Option 2: sudo -u clickhouse clickhouse-server --config-file=/etc/clickhouse-server/config.xml

8) Client access:

Command-line client: clickhouse-client -m

Web UI client: http://hdp1:8123/play

IDEA database plugin or API client:

4.2 Data types

Common data types:

Int32: equivalent to MySQL int

Int64: equivalent to MySQL bigint

Float64: equivalent to MySQL double

String

Date: yyyy-MM-dd

DateTime: yyyy-MM-dd HH:mm:ss

4.3 Table engines

create table Tb(
	column_name column_type
)engine=MergeTree()
order by column_name

4.4 Functions

Window (OVER) functions

--in version 21.x window functions are experimental and must be enabled
Set allow_experimental_window_functions = 1; -- no need to memorize this; if it isn't set, the error message will tell you
--the common window functions are supported, e.g.:
row_number()
rank()
dense_rank()

Date functions

select now(); --returns a DateTime
SELECT FROM_UNIXTIME(1669366430);--convert a seconds-level timestamp to DateTime
select toUnixTimestamp(now());--convert a DateTime to a seconds-level timestamp
select toUnixTimestamp('2022-11-25 16:50:45'); --convert a String to a seconds-level timestamp
select formatDateTime(FROM_UNIXTIME(1669366430),'%Y/%m/%d %H/%M/%S'); --format a date/time

4.5 JDBC
Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
val conn: Connection = DriverManager.getConnection("jdbc:clickhouse://hdp1:8123")
