SeaTunnel 2.3.2 Installation and Verification Guide (Hive 2.1.1 to Doris 2.1.1)

Contents

I. Installing SeaTunnel 2.3.2

II. Preparing Test Data

III. Running a Hive-to-Doris Sync with SeaTunnel


This guide walks you through installing and configuring SeaTunnel 2.3.2 and running a simple test job. Follow the steps below to complete the installation and verify that SeaTunnel works end to end.

I. Installing SeaTunnel 2.3.2

  1. Download the SeaTunnel 2.3.2 package

    Run the following command to download the SeaTunnel 2.3.2 tarball:

    wget https://archive.apache.org/dist/seatunnel/2.3.2/apache-seatunnel-2.3.2-bin.tar.gz
  2. Extract the archive

    Extract the downloaded file:

    tar -zxvf apache-seatunnel-2.3.2-bin.tar.gz
  3. Install the SeaTunnel connector plugins

    Run the following command to install the required connector plugins:

    sh bin/install-plugin.sh 2.3.2
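
    The command above downloads every connector by default, which can take a while. To fetch only the connectors this guide needs, you can trim config/plugin_config before running it; a minimal sketch, assuming the stock 2.3.2 list format:

    # Keep only the Hive and Doris connectors (back up the stock list first).
    cp config/plugin_config config/plugin_config.bak
    printf '%s\n' '--seatunnel-connectors--' 'connector-hive' 'connector-doris' '--end--' > config/plugin_config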
  4. Add the Hive dependencies

    Make sure the required Hive-related JARs (e.g., hive-exec-2.1.1.jar and seatunnel-hadoop3-3.1.4-uber-2.3.2-optional.jar) are copied into the lib/ directory.
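
    For example, a sketch with placeholder source paths (adjust to where Hive and the optional uber jar live on your host):

    # The source paths below are placeholders -- adjust to your environment.
    cp /opt/hive/lib/hive-exec-2.1.1.jar apache-seatunnel-2.3.2/lib/
    cp /tmp/seatunnel-hadoop3-3.1.4-uber-2.3.2-optional.jar apache-seatunnel-2.3.2/lib/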

  5. Configure the Spark environment

    Edit the config/seatunnel-env.sh file and set the SPARK_HOME environment variable.
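
    A minimal sketch of the relevant line in config/seatunnel-env.sh (the Spark install path is a placeholder):

    # Point SPARK_HOME at your Spark installation; /opt/spark is an assumption.
    SPARK_HOME=${SPARK_HOME:-/opt/spark}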

II. Preparing Test Data

  1. Create the source table

    In Hive, create the source table tmp_db.dw_orders to store the order data. The generator script in the next step emits a header row and a trailing dt column, so the table declares dt as its last column and skips the header line:

    drop table if exists tmp_db.dw_orders;
    CREATE TABLE tmp_db.dw_orders (
      order_id          BIGINT COMMENT 'Order ID',
      customer_id       BIGINT COMMENT 'Customer ID',
      product_id        BIGINT COMMENT 'Product ID',
      product_name      STRING COMMENT 'Product name',
      quantity          INT COMMENT 'Quantity',
      price             DECIMAL(10, 2) COMMENT 'Unit price',
      total_amount      DECIMAL(15, 2) COMMENT 'Order total',
      discount_amount   DECIMAL(15, 2) COMMENT 'Discount amount',
      shipping_fee      DECIMAL(10, 2) COMMENT 'Shipping fee',
      payment_method    STRING COMMENT 'Payment method',
      order_status      STRING COMMENT 'Order status',
      shipping_status   STRING COMMENT 'Shipping status',
      create_time       TIMESTAMP COMMENT 'Order creation time',
      update_time       TIMESTAMP COMMENT 'Order update time',
      dt                STRING COMMENT 'Partition date (yyyyMMdd), trailing CSV column'
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS textfile
    TBLPROPERTIES ('skip.header.line.count'='1');
  2. Generate order data

    Run the following Python script to generate mock order data and save it as a CSV file:

    import random
    import csv
    from datetime import datetime, timedelta

    def generate_order_data(start_date, days=10, orders_per_day=1000000):
        order_statuses = ['paid', 'pending_shipment', 'shipped', 'completed', 'pending_payment']
        shipping_statuses = ['pending_shipment', 'shipped', 'signed_for', 'in_transit']
        payment_methods = ['alipay', 'credit_card', 'wechat_pay', 'bank_transfer']

        product_names = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']

        data = []
        for day in range(days):
            current_date = start_date + timedelta(days=day)
            dt_partition = current_date.strftime('%Y%m%d')  # trailing dt column, e.g. 20240101
            for _ in range(orders_per_day):
                order_id = random.randint(100000000000, 999999999999)
                customer_id = random.randint(1000, 9999)
                product_id = random.randint(1, 5)
                product_name = random.choice(product_names)
                quantity = random.randint(1, 10)
                price = round(random.uniform(10, 500), 2)       # 2 decimals to match DECIMAL(10, 2)
                total_amount = round(quantity * price, 2)
                discount_amount = round(total_amount * random.uniform(0, 0.2), 2)
                shipping_fee = round(random.uniform(5, 30), 2)  # 2 decimals to match DECIMAL(10, 2)
                payment_method = random.choice(payment_methods)
                order_status = random.choice(order_statuses)
                shipping_status = random.choice(shipping_statuses)
                create_time = current_date.strftime('%Y-%m-%d %H:%M:%S')
                update_time = current_date.strftime('%Y-%m-%d %H:%M:%S')

                data.append([
                    order_id, customer_id, product_id, product_name, quantity, price,
                    total_amount, discount_amount, shipping_fee, payment_method,
                    order_status, shipping_status, create_time, update_time, dt_partition
                ])

        return data

    def write_to_csv(data, filename):
        header = [
            'order_id', 'customer_id', 'product_id', 'product_name', 'quantity', 'price',
            'total_amount', 'discount_amount', 'shipping_fee', 'payment_method',
            'order_status', 'shipping_status', 'create_time', 'update_time', 'dt'
        ]
        # newline='' keeps the csv module from emitting blank lines on some platforms
        with open(filename, mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(header)
            writer.writerows(data)

    # Start at 2024-01-01 so dt matches the dt=20240101 partition read in the
    # SeaTunnel config and the Doris partition ranges created below.
    start_date = datetime(2024, 1, 1)
    data = generate_order_data(start_date)
    write_to_csv(data, 'orders_data.csv')
    print("Data generation complete; saved to orders_data.csv")

    The script writes roughly 10 million rows (10 days × 1,000,000 orders per day) to orders_data.csv.
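
    To run the generator and sanity-check the output (the script filename is a placeholder):

    python3 generate_orders.py
    # Expect 10,000,001 lines: one header plus 10 days x 1,000,000 orders.
    wc -l orders_data.csv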

  3. Load the data into the Hive table

    Load the generated CSV into the tmp_db.dw_orders table:

    LOAD DATA LOCAL INPATH '/var/lib/hadoop-hdfs/gawyn/orders_data.csv' INTO TABLE tmp_db.dw_orders;
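
    You can confirm the load with a quick count; the header row is skipped thanks to the table's skip.header.line.count property:

    hive -e "SELECT count(*) FROM tmp_db.dw_orders;"
    # Should print 10000000 for the generator defaults above.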
  4. Create a Parquet-format table

    Use the following statement to create a partitioned Parquet-format target table, tmp_db.dw_orders_parquet:

    drop table if exists tmp_db.dw_orders_parquet;
    CREATE TABLE tmp_db.dw_orders_parquet (
      order_id          BIGINT COMMENT 'Order ID',
      customer_id       BIGINT COMMENT 'Customer ID',
      product_id        BIGINT COMMENT 'Product ID',
      product_name      STRING COMMENT 'Product name',
      quantity          INT COMMENT 'Quantity',
      price             DECIMAL(10, 2) COMMENT 'Unit price',
      total_amount      DECIMAL(15, 2) COMMENT 'Order total',
      discount_amount   DECIMAL(15, 2) COMMENT 'Discount amount',
      shipping_fee      DECIMAL(10, 2) COMMENT 'Shipping fee',
      payment_method    STRING COMMENT 'Payment method',
      order_status      STRING COMMENT 'Order status',
      shipping_status   STRING COMMENT 'Shipping status',
      create_time       TIMESTAMP COMMENT 'Order creation time',
      update_time       TIMESTAMP COMMENT 'Order update time'
    )
    PARTITIONED BY (dt STRING)
    STORED AS parquet;
  5. Load the data into the Parquet table

    Insert the data from tmp_db.dw_orders into the Parquet-format tmp_db.dw_orders_parquet table, using dynamic partitioning on dt:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    insert overwrite table tmp_db.dw_orders_parquet partition(dt) 
    select * 
    from tmp_db.dw_orders
    where order_id is not null;
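
    Check that the dynamic-partition insert created one partition per generated day:

    hive -e "SHOW PARTITIONS tmp_db.dw_orders_parquet;"
    # Expect dt=20240101 through dt=20240110 for the generator defaults.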
  6. Create the Doris target table

    In Doris, create the target table tmp_db.dw_orders that will receive the data migrated from Hive:

    drop table if exists tmp_db.dw_orders;
    CREATE TABLE tmp_db.dw_orders (
      dt                BIGINT COMMENT 'Partition date',
      order_id          BIGINT COMMENT 'Order ID',
      customer_id       BIGINT COMMENT 'Customer ID',
      product_id        BIGINT COMMENT 'Product ID',
      product_name      STRING COMMENT 'Product name',
      quantity          INT COMMENT 'Quantity',
      price             DECIMAL(10, 2) COMMENT 'Unit price',
      total_amount      DECIMAL(15, 2) COMMENT 'Order total',
      discount_amount   DECIMAL(15, 2) COMMENT 'Discount amount',
      shipping_fee      DECIMAL(10, 2) COMMENT 'Shipping fee',
      payment_method    STRING COMMENT 'Payment method',
      order_status      STRING COMMENT 'Order status',
      shipping_status   STRING COMMENT 'Shipping status',
      create_time       STRING COMMENT 'Order creation time',
      update_time       STRING COMMENT 'Order update time'
    )
    PARTITION BY RANGE (dt) (
      PARTITION p20240101 VALUES LESS THAN ("20240201"),
      PARTITION p20240201 VALUES LESS THAN ("20240301")
    )
    DISTRIBUTED BY HASH(order_id) BUCKETS 10
    PROPERTIES (
      "replication_num" = "3"
    );
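
    Doris DDL is issued over the MySQL protocol against the FE; a connection sketch (the host matches the fenodes address used below, and 9030 is the default FE query port -- adjust if yours differs):

    mysql -h 192.168.100.101 -P 9030 -uadmin -p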

    With both tables in place, we can now test SeaTunnel itself.

III. Running a Hive-to-Doris Sync with SeaTunnel

  1. Create the configuration file h2d.conf for the sync job:

    env {
     parallelism = 1
     job.mode = "BATCH"
     checkpoint.interval = 10000
     spark.app.name = "h2d"
     spark.sql.catalogImplementation = "hive"
     spark.executor.memory= "4g"
     spark.executor.instances = "1"
     spark.executor.memoryOverhead= "2g"
     spark.sql.files.maxPartitionBytes = 134217728
     spark.sql.files.openCostInBytes = 134217728
     spark.kryoserializer.buffer.max="256m"
     hive.exec.dynamic.partition.mode = "nonstrict"
     spark.dynamicAllocation.enabled="true"
     spark.shuffle.service.enabled="true"
    }
    
    source {
     Hive {
       result_table_name = "tmp_hivedata"
       table_name = "tmp_db.dw_orders_parquet"
       read_partitions = ["dt=20240101"]
       metastore_uri = "thrift://hadoop-test:9083"
       hdfs_site_path = "/etc/hadoop/conf/hdfs-site.xml"
     }
    }
    
    transform {
     Sql {
       source_table_name = "tmp_hivedata"
       result_table_name = "tmp_dorisdata"
       query = """
           select 
               dt               as dt,
               order_id         as order_id,
               customer_id      as customer_id,
               product_id       as product_id,
               product_name     as product_name,
               quantity         as quantity,
               price            as price,
               total_amount     as total_amount,
               discount_amount  as discount_amount,
               shipping_fee     as shipping_fee,
               payment_method   as payment_method,
               order_status     as order_status,
               shipping_status  as shipping_status,
               create_time      as create_time,
               update_time      as update_time 
           from tmp_hivedata 
           where order_id is not null 
       """
     }
    }
    
    sink {
       Doris {
           source_table_name = "tmp_dorisdata"
           fenodes = "192.168.100.101:8030"
            username = "admin"
           password = "gawyn"
           table.identifier = "tmp_db.dw_orders"
           batch_max_rows = 50000
           batch_max_bytes = 104857600
           batch_interval_ms = 30000
           sink.enable-2pc = "false"
           sink.label-prefix = "test_json"
           doris.config = {
               format="json"
               read_json_by_line="true"
           }
       }
    }
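
    Before submitting the job, you can sanity-check that the FE HTTP address in fenodes is reachable (credentials are the ones from the sink block; the exact response shape varies by Doris version):

    curl -u admin:gawyn http://192.168.100.101:8030/api/bootstrap
    # A small JSON payload indicates the FE is up.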
  2. Run the sync job

    • SeaTunnel (Zeta) engine

      • Start the cluster

        bin/seatunnel-cluster.sh -DJvmOption="-Xms4G -Xmx4G"
      • Submit the sync job

        bin/seatunnel.sh --config h2d.conf -e cluster
    • Spark engine

      • Submit the sync job

        bin/start-seatunnel-spark-2-connector-v2.sh \
        --master yarn \
        --deploy-mode client \
        --config h2d.conf

        With this configuration file in place, you can schedule and run the SeaTunnel job on either engine.
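
    Once the job finishes, compare row counts on both sides; the Hive count for the synced partition should match the Doris count (hosts, ports, and credentials as configured above):

    hive -e "SELECT count(*) FROM tmp_db.dw_orders_parquet WHERE dt='20240101';"
    mysql -h 192.168.100.101 -P 9030 -uadmin -pgawyn -e "SELECT count(*) FROM tmp_db.dw_orders WHERE dt = 20240101;"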
