Background
For a project we had previously built a file-storage stack on Hadoop + YARN + Flink + HDFS + Hive. When the commercial Hadoop distributions (CDH and HDP) became paid products, we started looking at how to build a data lake without the Hadoop ecosystem. After researching what is available, the approach here replaces HDFS with modern object storage (S3 or OSS) and builds a real-time compute stack from Kubernetes + Kafka + Flink + Iceberg + Trino. Most tutorials online have many problems, so this write-up records a working setup for reference.
Prerequisites
A Kubernetes cluster and MinIO, already installed (steps omitted).
Installation
1. Kafka: Strimzi is the recommended way to stand up Kafka quickly
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
  kafka:
    config:
      default.replication.factor: 1
      inter.broker.protocol.version: '3.3'
      min.insync.replicas: 1
      offsets.topic.replication.factor: 1
      transaction.state.log.min.isr: 1
      transaction.state.log.replication.factor: 1
    listeners:
      - configuration:
          bootstrap:
            nodePort: 32410
          brokers:
            - broker: 0
              nodePort: 32420
            - broker: 1
              nodePort: 32421
            - broker: 2
              nodePort: 32422
        name: external
        port: 9094
        tls: false
        type: nodeport
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
    replicas: 3
    storage:
      type: jbod
      volumes:
        - class: ceph-kafka
          deleteClaim: false
          id: 0
          size: 100Gi
          type: persistent-claim
    version: 3.3.1
  zookeeper:
    replicas: 3
    storage:
      class: ceph-kafka
      deleteClaim: false
      size: 100Gi
      type: persistent-claim
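With the external nodeport listener above, clients outside the cluster can use any Kubernetes node IP plus the bootstrap nodePort (32410) as bootstrap.servers; Strimzi then advertises each broker on its own nodePort (32420-32422). A minimal sketch of building that address string; the node IPs below are placeholders for your own environment:

```python
# nodePort assignments taken from the Kafka CR above
BOOTSTRAP_NODE_PORT = 32410
BROKER_NODE_PORTS = {0: 32420, 1: 32421, 2: 32422}

def bootstrap_servers(node_ips, port=BOOTSTRAP_NODE_PORT):
    """bootstrap.servers for an external client: any node IP plus the
    bootstrap nodePort works; the brokers themselves are advertised on
    their per-broker nodePorts (32420-32422)."""
    return ",".join(f"{ip}:{port}" for ip in node_ips)

# Hypothetical Kubernetes node addresses:
print(bootstrap_servers(["10.0.0.11", "10.0.0.12"]))
# -> 10.0.0.11:32410,10.0.0.12:32410
```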
2. Building a Flink image with S3 support
2.1 Jars to download in advance
- aws-java-sdk-bundle-1.11.375.jar
- commons-cli-1.5.0.jar
- flink-s3-fs-hadoop-1.14.6.jar
- flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar
- hadoop-aws-3.2.2.jar
- guava-27.0-jre.jar
2.2 core-site.xml
These are Hadoop Core settings. No Hadoop installation is needed, but the S3 connection still goes through Hadoop's filesystem layer. Here we use the s3a scheme; s3 and s3p also work. For the differences between s3, s3a, and s3p, see the Flink documentation: Amazon S3 | Apache Flink.
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>
      com.amazonaws.auth.InstanceProfileCredentialsProvider,
      org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
      com.amazonaws.auth.EnvironmentVariableCredentialsProvider
    </value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value><your S3 endpoint></value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxxx</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.fast.upload</name>
    <value>true</value>
  </property>
</configuration>
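Before baking core-site.xml into the image, it can help to sanity-check that the required fs.s3a.* properties are present. A small stdlib-only sketch; the sample XML and its values are placeholders mirroring the config above:

```python
import xml.etree.ElementTree as ET

# Placeholder sample mirroring the core-site.xml above
SAMPLE = """<configuration>
  <property><name>fs.s3a.endpoint</name><value>http://minio:9000</value></property>
  <property><name>fs.s3a.access.key</name><value>xxxx</value></property>
  <property><name>fs.s3a.secret.key</name><value>xxxx</value></property>
  <property><name>fs.s3a.path.style.access</name><value>true</value></property>
</configuration>"""

def parse_hadoop_conf(xml_text):
    """Return {name: value} for every <property> in a Hadoop config file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

conf = parse_hadoop_conf(SAMPLE)
required = {"fs.s3a.endpoint", "fs.s3a.access.key", "fs.s3a.secret.key"}
missing = required - conf.keys()
assert not missing, f"missing S3A settings: {missing}"
print(conf["fs.s3a.endpoint"])  # -> http://minio:9000
```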
2.3 Dockerfile
FROM docker.io/flink:1.14.6-scala_2.11-java8
# Hadoop configuration
RUN mkdir -p /usr/local/hadoop/etc/hadoop/
ADD core-site.xml /usr/local/hadoop/etc/hadoop/
# MinIO credentials; an IAM identity token also works and is what the docs recommend
ENV AWS_ACCESS_KEY_ID xxxx
ENV AWS_SECRET_ACCESS_KEY xxxx
ENV AWS_DEFAULT_REGION cn-northwest-1
# These jars must be downloaded in advance (see section 2.1)
COPY flink/lib/aws-java-sdk-bundle-1.11.375.jar /opt/flink/lib/
COPY flink/lib/commons-cli-1.5.0.jar /opt/flink/lib/
COPY flink/lib/flink-s3-fs-hadoop-1.14.6.jar /opt/flink/lib/
COPY flink/lib/flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar /opt/flink/lib/
COPY flink/lib/guava-27.0-jre.jar /opt/flink/lib/
COPY flink/lib/hadoop-aws-3.2.2.jar /opt/flink/lib/
RUN chown -R flink:flink /opt/flink/lib/*.jar
# For the s3a scheme use the flink-s3-fs-hadoop plugin; for s3p use flink-s3-fs-presto
RUN cd /opt/flink && \
    mkdir ./plugins/s3-fs-hadoop && \
    cp ./opt/flink-s3-fs-hadoop-1.14.6.jar ./plugins/s3-fs-hadoop/
3. Starting Flink
3.1 Flink Session mode via Native Kubernetes (not recommended)
3.2 Recommended: install StreamX and start the Flink cluster from it
3.2.1 Installation
(dinky is a similar tool with a different feature focus.) Official docs: Apache StreamPark (incubating). StreamX is a Flink stream-processing management tool; a single-node install is enough, since actual execution still runs on the Flink cluster. Installation steps: omitted.
3.2.2 Configuring Flink
(1) Configure the Flink home
Note: also download a copy of Flink on the host where StreamX is installed, e.g. under /opt; this setting is mainly used for standalone mode.
cd /opt
wget https://archive.apache.org/dist/flink/flink-1.14.6/flink-1.14.6-bin-scala_2.11.tgz
tar -zxvf flink-1.14.6-bin-scala_2.11.tgz
(2) Configure and start the flink-cluster (Session mode), replacing section 3.1
Create the workspace:
kubectl create ns flink
kubectl create serviceaccount flink -n flink
kubectl create clusterrolebinding flink-role-bind --clusterrole=edit --serviceaccount=flink:flink
Start in Kubernetes Session mode; production systems can use Application mode instead.
The Dynamic Options below correspond to entries in flink-conf.yaml:
-Dkubernetes.flink.conf.dir=/opt/flink/conf
-Dfs.allowed-fallback-filesystems=s3
-Ds3a.access-key=填你的s3的accesskey
-Ds3a.secret-key=填你的s3的secretkey
-Ds3a.endpoint=填你的s3的endpoint
-Dstate.backend=filesystem
-Dstate.checkpoints.dir=s3a://flink/checkpoints/
-Dstate.backend.fs.checkpointdir=s3a://flink/checkpoints/
-Dstate.savepoints.dir=s3a://flink/savepoints/
-Dstate.backend.fs.savepoints=s3a://flink/savepoints/
-Dkubernetes.jobmanager.cpu=0.2
-Djobmanager.memory.process.size=1024m
-Dresourcemanager.taskmanager-timeout=3600000
-Dkubernetes.taskmanager.cpu=0.2
-Dtaskmanager.memory.process.size=1024m
-Ds3a.connection.ssl.enabled=false
-Ds3.aws.credentials.provider=com.amazonaws.auth.EnvironmentVariableCredentialsProvider
-Dfs.hdfs.hadoopconf=/usr/local/hadoop/etc/hadoop/
-Dfs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
-Ds3a.fast.upload=true
-Ds3a.path.style.access=true
-Dexecution.checkpointing.interval=5000
-Dexecution.checkpointing.mode=EXACTLY_ONCE
-Dexecution.checkpointing.timeout=600000
-Dexecution.checkpointing.min-pause=5000
-Dexecution.checkpointing.max-concurrent-checkpoints=1
-Dstate.checkpoints.num-retained=3
-Dexecution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION
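A long -D option list like the one above is easy to mistype; one way to keep it maintainable is to render it from a dict. A minimal sketch, using only a few of the options above:

```python
def dynamic_options(conf):
    """Render a Flink config dict as CLI dynamic options (-Dkey=value)."""
    return [f"-D{k}={v}" for k, v in conf.items()]

# A subset of the options listed above
conf = {
    "state.backend": "filesystem",
    "state.checkpoints.dir": "s3a://flink/checkpoints/",
    "execution.checkpointing.interval": 5000,
    "execution.checkpointing.mode": "EXACTLY_ONCE",
}
print(" ".join(dynamic_options(conf)))
```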
4. Installing Trino on Kubernetes
4.1 Metadata store
For building a custom metastore image, see:
https://github.com/joshuarobinson/trino-on-k8s (there are also Chinese walkthroughs of iceberg + trino on k8s covering the same setup)
4.1.1 Create the metadata-store PVC
maria_pvc.yaml (the PVC must live in the same namespace as the MariaDB Deployment that mounts it):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maria-pv-claim
  namespace: trino
spec:
  storageClassName: trino-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
4.1.2 MariaDB for metadata storage
maria_deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: metastore-db
  namespace: trino
spec:
  ports:
    - port: 13306
      targetPort: 3306
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
  namespace: trino
spec:
  selector:
    matchLabels:
      app: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mariadb
          image: "mariadb/server:latest"
          imagePullPolicy: IfNotPresent
          env:
            - name: MYSQL_ROOT_PASSWORD
              value: "123456"  # must be a quoted string, not a bare YAML number
          ports:
            - containerPort: 3306
              name: mysql
          volumeMounts:
            - name: mariadb-for-hive
              mountPath: /var/lib/mysql
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
      volumes:
        - name: mariadb-for-hive
          persistentVolumeClaim:
            claimName: maria-pv-claim
4.1.3 Metadata-store service
hive-initschema.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hive-initschema
  namespace: trino
spec:
  template:
    spec:
      containers:
        - name: hivemeta
          image: zaki297004707/metastore:v1.0.1
          command: ["/opt/hive-metastore/bin/schematool"]
          args: ["--verbose", "-initSchema", "-dbType", "mysql", "-userName", "root",
                 "-passWord", "123456", "-url",
                 "jdbc:mysql://metastore-db:13306/metastore_db?createDatabaseIfNotExist=true"]
      restartPolicy: Never
  backoffLimit: 4
metastore-cfg.yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: metastore-cfg
  namespace: trino
data:
  core-site.xml: |-
    <configuration>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>http://xxxx:9000</value>
      </property>
      <property>
        <name>hive.s3a.aws-access-key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>hive.s3a.aws-secret-key</name>
        <value>xxxxxxxxxxxxxx</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>xxx</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>xxxx</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
      </property>
      <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
      </property>
      <property>
        <name>fs.s3a.fast.upload</name>
        <value>true</value>
      </property>
    </configuration>
  metastore-site.xml: |-
    <configuration>
      <property>
        <name>metastore.task.threads.always</name>
        <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
      </property>
      <property>
        <name>metastore.expression.proxy</name>
        <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://metastore-db.trino.svc.cluster.local:13306/metastore_db</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
      </property>
      <property>
        <!-- must match MYSQL_ROOT_PASSWORD in maria_deployment.yaml -->
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
      </property>
      <property>
        <name>metastore.warehouse.dir</name>
        <value>s3a://trino/warehouse/</value>
      </property>
      <property>
        <name>metastore.thrift.port</name>
        <value>9083</value>
      </property>
    </configuration>
Create my-s3-keys (values under data must be base64-encoded):
apiVersion: v1
kind: Secret
metadata:
  name: my-s3-keys
  namespace: trino
type: Opaque
data:
  access-key: xxxxxxxxxx
  secret-key: xxxxxxxxxxxxxxxxxxx
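Values under data: in a Kubernetes Secret must be base64-encoded (alternatively, put plaintext under stringData:). A quick way to produce the encoded values; the sample key "minioadmin" is only an illustration:

```python
import base64

def to_secret_value(plaintext):
    """Encode a string the way `kubectl create secret` would for `data:`."""
    return base64.b64encode(plaintext.encode()).decode()

print(to_secret_value("minioadmin"))  # -> bWluaW9hZG1pbg==
```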
Create metastore.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: metastore
  namespace: trino
spec:
  ports:
    - port: 9083
  selector:
    app: metastore
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metastore
  namespace: trino
spec:
  selector:
    matchLabels:
      app: metastore
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: metastore
    spec:
      containers:
        - name: metastore
          image: zaki297004707/metastore:v1.0.1
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: secret-key
          ports:
            - containerPort: 9083
          volumeMounts:
            - name: metastore-cfg-vol
              mountPath: /opt/hive-metastore/conf/metastore-site.xml
              subPath: metastore-site.xml
            - name: metastore-cfg-vol
              mountPath: /opt/hadoop/etc/hadoop/core-site.xml
              subPath: core-site.xml
          command: ["/opt/hive-metastore/bin/start-metastore"]
          args: ["-p", "9083"]
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
          imagePullPolicy: Always
      volumes:
        - name: metastore-cfg-vol
          configMap:
            name: metastore-cfg
4.2 Trino
(1) Create the configuration
trino-cfgs.yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: trino-configs
  namespace: trino
data:
  jvm.config: |-
    -server
    -Xmx2G
    -XX:-UseBiasedLocking
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ExitOnOutOfMemoryError
    -XX:+UseGCOverheadLimit
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:ReservedCodeCacheSize=512M
    -Djdk.attach.allowAttachSelf=true
    -Djdk.nio.maxCachedBufferSize=2000000
  config.properties.coordinator: |-
    coordinator=true
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    query.max-memory=200GB
    query.max-memory-per-node=0.2GB
    query.max-total-memory-per-node=0.6GB
    query.max-stage-count=200
    task.writer-count=4
    discovery-server.enabled=true
    discovery.uri=http://trino-coordinator:8080
  config.properties.worker: |-
    coordinator=false
    http-server.http.port=8080
    query.max-memory=200GB
    query.max-memory-per-node=0.2GB
    query.max-total-memory-per-node=0.6GB
    query.max-stage-count=200
    task.writer-count=4
    discovery.uri=http://trino-coordinator:8080
  node.properties: |-
    node.environment=test
    spiller-spill-path=/tmp
    max-spill-per-node=4TB
    query-max-spill-per-node=1TB
  hive.properties: |-
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore:9083
    hive.allow-drop-table=true
    hive.max-partitions-per-scan=1000000
    hive.s3.endpoint=10.233.41.1:9001
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=false
    hive.s3.max-connections=100
  iceberg.properties: |-
    connector.name=iceberg
    hive.metastore.uri=thrift://metastore:9083
    hive.max-partitions-per-scan=1000000
    hive.s3.endpoint=10.233.41.1:9001
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=false
    hive.s3.max-connections=100
  mysql.properties: |-
    connector.name=mysql
    connection-url=jdbc:mysql://metastore-db.trino.svc.cluster.local:13306
    connection-user=root
    connection-password=123456
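One sanity check worth doing on the settings above: the per-node query memory limits must fit inside the JVM heap (-Xmx2G in jvm.config). A minimal sketch of a Trino-style data-size parser, assuming only the unit suffixes that appear in this config:

```python
# Unit suffixes used in Trino data-size properties
SUFFIXES = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def to_bytes(size):
    """Parse a Trino data size such as '0.2GB' or '200GB' into bytes."""
    # Try longer suffixes first so 'GB' is not mistaken for 'B'
    for suffix, factor in sorted(SUFFIXES.items(), key=lambda s: -len(s[0])):
        if size.endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognised data size: {size}")

heap = to_bytes("2GB")              # -Xmx2G in jvm.config
per_node = to_bytes("0.2GB")        # query.max-memory-per-node
total_per_node = to_bytes("0.6GB")  # query.max-total-memory-per-node
assert per_node <= total_per_node <= heap
```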
(2) Create the services
trino.yaml (the original snippet omitted the trino-coordinator Service and the Deployment header; they are required because discovery.uri points at http://trino-coordinator:8080):
---
apiVersion: v1
kind: Service
metadata:
  name: trino-coordinator
  namespace: trino
spec:
  ports:
    - port: 8080
  selector:
    app: trino-coordinator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trino-coordinator
  namespace: trino
spec:
  selector:
    matchLabels:
      app: trino-coordinator
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: trino-coordinator
    spec:
      containers:
        - name: trino
          image: trinodb/trino:361
          ports:
            - containerPort: 8080
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: secret-key
          volumeMounts:
            - name: trino-cfg-vol
              mountPath: /etc/trino/jvm.config
              subPath: jvm.config
            - name: trino-cfg-vol
              mountPath: /etc/trino/config.properties
              subPath: config.properties.coordinator
            - name: trino-cfg-vol
              mountPath: /etc/trino/node.properties
              subPath: node.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/hive.properties
              subPath: hive.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/iceberg.properties
              subPath: iceberg.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/mysql.properties
              subPath: mysql.properties
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
          imagePullPolicy: Always
      volumes:
        - name: trino-cfg-vol
          configMap:
            name: trino-configs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trino-worker
  namespace: trino
spec:
  serviceName: trino-worker
  replicas: 1
  selector:
    matchLabels:
      app: trino-worker
  template:
    metadata:
      labels:
        app: trino-worker
    spec:
      securityContext:
        fsGroup: 1000
      containers:
        - name: trino
          image: trinodb/trino:361
          ports:
            - containerPort: 8080
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: my-s3-keys
                  key: secret-key
          volumeMounts:
            - name: trino-cfg-vol
              mountPath: /etc/trino/jvm.config
              subPath: jvm.config
            - name: trino-cfg-vol
              mountPath: /etc/trino/config.properties
              subPath: config.properties.worker
            - name: trino-cfg-vol
              mountPath: /etc/trino/node.properties
              subPath: node.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/hive.properties
              subPath: hive.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/iceberg.properties
              subPath: iceberg.properties
            - name: trino-cfg-vol
              mountPath: /etc/trino/catalog/mysql.properties
              subPath: mysql.properties
            - name: trino-tmp-data
              mountPath: /tmp
          resources:
            requests:
              memory: "1G"
              cpu: 0.5
          imagePullPolicy: Always
      volumes:
        - name: trino-cfg-vol
          configMap:
            name: trino-configs
  volumeClaimTemplates:
    - metadata:
        name: trino-tmp-data
      spec:
        storageClassName: trino-storage
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 40Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: trino-cli
  namespace: trino
spec:
  containers:
    - name: trino-cli
      image: trinodb/trino:361
      command: ["tail", "-f", "/dev/null"]
      imagePullPolicy: Always
  restartPolicy: Always
4.3 Running on Kubernetes
Apply the manifests above with kubectl. Make sure the access-key and secret-key are configured (via the Secret above); otherwise the bucket permissions may be insufficient.
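As an alternative to the trino-cli pod, the coordinator can be smoke-tested over Trino's HTTP protocol: an initial POST to /v1/statement with an X-Trino-User header, then following nextUri links. A stdlib-only sketch; the host and user are assumptions for your environment (e.g. after port-forwarding the coordinator pod):

```python
import json
import urllib.request

def statement_request(host, sql, user="admin"):
    """Build the initial request of Trino's REST protocol."""
    return urllib.request.Request(
        f"http://{host}/v1/statement",
        data=sql.encode(),
        headers={"X-Trino-User": user},
        method="POST",
    )

def run_query(host, sql, user="admin"):
    """Submit sql and follow nextUri links until the query finishes."""
    rows = []
    resp = json.load(urllib.request.urlopen(statement_request(host, sql, user)))
    while True:
        rows.extend(resp.get("data", []))
        if "nextUri" not in resp:
            return rows
        nxt = urllib.request.Request(resp["nextUri"], headers={"X-Trino-User": user})
        resp = json.load(urllib.request.urlopen(nxt))

# Example (requires a reachable coordinator):
#   kubectl -n trino port-forward pod/<coordinator-pod> 8080:8080
#   run_query("localhost:8080", "SELECT 1")
```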
5. Example
5.1 Real-time stream ingestion with Flink SQL via StreamX
(1) Flink SQL
CREATE TABLE IF NOT EXISTS ods_log (
  `log` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'test-log',
  'properties.bootstrap.servers' = '<your Kafka bootstrap address>',
  'properties.group.id' = 'test',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'raw'
);

CREATE CATALOG iceberg WITH (
  'type' = 'iceberg',
  'warehouse' = 's3a://<>/warehouse/',
  'catalog-type' = 'hive',
  'uri' = 'thrift://xxxx:xxx'
);

CREATE DATABASE IF NOT EXISTS iceberg.test1;

CREATE TABLE IF NOT EXISTS iceberg.test1.ods_filebeat_log (
  `log` STRING
) WITH ('write.format.default' = 'ORC');

INSERT INTO iceberg.test1.ods_filebeat_log SELECT * FROM ods_log;
(2) Reference jars: omitted.
(3) Runtime UI screenshots: omitted.