This guide walks you through installing and configuring SeaTunnel 2.3.2 and running a simple test job. Follow the steps below to complete the installation and verify that SeaTunnel works.
1. Install SeaTunnel 2.3.2
- **Download the SeaTunnel 2.3.2 package**

  Run the following command to download the SeaTunnel 2.3.2 tarball:

  ```bash
  wget https://archive.apache.org/dist/seatunnel/2.3.2/apache-seatunnel-2.3.2-bin.tar.gz
  ```
- **Extract the archive**

  Extract the downloaded file:

  ```bash
  tar -zxvf apache-seatunnel-2.3.2-bin.tar.gz
  ```
- **Install the SeaTunnel connector plugins**

  Run the following command to install the required connector plugins:

  ```bash
  sh bin/install-plugin.sh 2.3.2
  ```
- **Add the Hive-related dependencies**

  Make sure the Hive-related JARs (for example `hive-exec-2.1.1.jar` and `seatunnel-hadoop3-3.1.4-uber-2.3.2-optional.jar`) are placed in the `lib/` directory (see the sketch after this list).
- **Configure the Spark environment**

  Edit the `config/seatunnel-env.sh` file and set the `SPARK_HOME` environment variable (also covered in the sketch after this list).
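The last two steps do not spell out exact commands, so here is a minimal sketch. The JAR source paths and the `/opt/spark` location are placeholders for illustration; substitute the actual paths in your environment.

```bash
# Minimal sketch: place the Hive-related JARs into lib/ and point SeaTunnel at Spark.
# All source paths below are placeholders; adjust them to your environment.
cd apache-seatunnel-2.3.2   # the extracted SeaTunnel home (name may differ)

cp /path/to/hive-exec-2.1.1.jar                             lib/
cp /path/to/seatunnel-hadoop3-3.1.4-uber-2.3.2-optional.jar lib/

# Set SPARK_HOME (alternatively, edit config/seatunnel-env.sh directly).
echo 'export SPARK_HOME=/opt/spark' >> config/seatunnel-env.sh
```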
2. Prepare Test Data
- **Create the source table**

  Create the source table `tmp_db.dw_orders` in Hive to store the order data. Note that the table carries a trailing `dt` date column, matching the `dt` field in the CSV generated in the next step so that the dynamic-partition insert later works with `SELECT *`:

  ```sql
  DROP TABLE IF EXISTS tmp_db.dw_orders;
  CREATE TABLE tmp_db.dw_orders (
      order_id        BIGINT         COMMENT 'order ID',
      customer_id     BIGINT         COMMENT 'customer ID',
      product_id      BIGINT         COMMENT 'product ID',
      product_name    STRING         COMMENT 'product name',
      quantity        INT            COMMENT 'quantity',
      price           DECIMAL(10, 2) COMMENT 'unit price',
      total_amount    DECIMAL(15, 2) COMMENT 'order total amount',
      discount_amount DECIMAL(15, 2) COMMENT 'discount amount',
      shipping_fee    DECIMAL(10, 2) COMMENT 'shipping fee',
      payment_method  STRING         COMMENT 'payment method',
      order_status    STRING         COMMENT 'order status',
      shipping_status STRING         COMMENT 'shipping status',
      create_time     TIMESTAMP      COMMENT 'order creation time',
      update_time     TIMESTAMP      COMMENT 'order update time',
      dt              STRING         COMMENT 'partition date (yyyyMMdd)'
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
  ```
- **Generate order data**

  Run the following Python script to generate simulated order data and save it as a CSV file:

  ```python
  import random
  import csv
  from datetime import datetime, timedelta


  def generate_order_data(start_date, days=10, orders_per_day=1000000):
      order_statuses = ['已付款', '待发货', '已发货', '已完成', '待付款']
      shipping_statuses = ['待发货', '已发货', '已签收', '派送中']
      payment_methods = ['支付宝', '信用卡', '微信支付', '银行转账']
      product_names = ['商品A', '商品B', '商品C', '商品D', '商品E']
      data = []
      for day in range(days):
          current_date = start_date + timedelta(days=day)
          dt_partition = current_date.strftime('%Y%m%d')
          for _ in range(orders_per_day):
              order_id = random.randint(100000000000, 999999999999)
              customer_id = random.randint(1000, 9999)
              product_id = random.randint(1, 5)
              product_name = random.choice(product_names)
              quantity = random.randint(1, 10)
              price = random.uniform(10, 500)
              total_amount = round(quantity * price, 2)
              discount_amount = round(total_amount * random.uniform(0, 0.2), 2)
              shipping_fee = random.uniform(5, 30)
              payment_method = random.choice(payment_methods)
              order_status = random.choice(order_statuses)
              shipping_status = random.choice(shipping_statuses)
              create_time = current_date.strftime('%Y-%m-%d %H:%M:%S')
              update_time = current_date.strftime('%Y-%m-%d %H:%M:%S')
              data.append([
                  order_id, customer_id, product_id, product_name, quantity,
                  price, total_amount, discount_amount, shipping_fee,
                  payment_method, order_status, shipping_status,
                  create_time, update_time, dt_partition
              ])
      return data


  def write_to_csv(data, filename):
      header = [
          'order_id', 'customer_id', 'product_id', 'product_name', 'quantity',
          'price', 'total_amount', 'discount_amount', 'shipping_fee',
          'payment_method', 'order_status', 'shipping_status',
          'create_time', 'update_time', 'dt'
      ]
      with open(filename, mode='w', newline='') as file:
          writer = csv.writer(file)
          writer.writerow(header)
          writer.writerows(data)


  # Start at 2024-01-01 so the generated dt values (20240101-20240110) match the
  # partition read by the SeaTunnel job and the Doris partitions defined below.
  start_date = datetime(2024, 1, 1)
  data = generate_order_data(start_date)
  write_to_csv(data, 'orders_data.csv')
  print("Data generation complete; saved to orders_data.csv")
  ```

  The data is saved as `orders_data.csv`.
- **Load the data into the Hive table**

  Load the generated CSV data into the `tmp_db.dw_orders` table:

  ```sql
  LOAD DATA LOCAL INPATH '/var/lib/hadoop-hdfs/gawyn/orders_data.csv' INTO TABLE tmp_db.dw_orders;
  ```
- **Create the Parquet-format table**

  Use the following statement to create the partitioned, Parquet-format target table `tmp_db.dw_orders_parquet`:

  ```sql
  DROP TABLE IF EXISTS tmp_db.dw_orders_parquet;
  CREATE TABLE tmp_db.dw_orders_parquet (
      order_id        BIGINT         COMMENT 'order ID',
      customer_id     BIGINT         COMMENT 'customer ID',
      product_id      BIGINT         COMMENT 'product ID',
      product_name    STRING         COMMENT 'product name',
      quantity        INT            COMMENT 'quantity',
      price           DECIMAL(10, 2) COMMENT 'unit price',
      total_amount    DECIMAL(15, 2) COMMENT 'order total amount',
      discount_amount DECIMAL(15, 2) COMMENT 'discount amount',
      shipping_fee    DECIMAL(10, 2) COMMENT 'shipping fee',
      payment_method  STRING         COMMENT 'payment method',
      order_status    STRING         COMMENT 'order status',
      shipping_status STRING         COMMENT 'shipping status',
      create_time     TIMESTAMP      COMMENT 'order creation time',
      update_time     TIMESTAMP      COMMENT 'order update time'
  )
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET;
  ```
- **Load the data into the Parquet table**

  Load the data from `tmp_db.dw_orders` into the Parquet-format `tmp_db.dw_orders_parquet` table. The `order_id IS NOT NULL` filter also drops the CSV header row, whose `order_id` parses as NULL. A verification sketch follows at the end of this section.

  ```sql
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;

  INSERT OVERWRITE TABLE tmp_db.dw_orders_parquet PARTITION (dt)
  SELECT * FROM tmp_db.dw_orders WHERE order_id IS NOT NULL;
  ```
- **Create the Doris target table**

  Create the target table `tmp_db.dw_orders` in Doris, which will receive the data migrated from Hive:

  ```sql
  DROP TABLE IF EXISTS tmp_db.dw_orders;
  CREATE TABLE tmp_db.dw_orders (
      dt              BIGINT         COMMENT 'partition date',
      order_id        BIGINT         COMMENT 'order ID',
      customer_id     BIGINT         COMMENT 'customer ID',
      product_id      BIGINT         COMMENT 'product ID',
      product_name    STRING         COMMENT 'product name',
      quantity        INT            COMMENT 'quantity',
      price           DECIMAL(10, 2) COMMENT 'unit price',
      total_amount    DECIMAL(15, 2) COMMENT 'order total amount',
      discount_amount DECIMAL(15, 2) COMMENT 'discount amount',
      shipping_fee    DECIMAL(10, 2) COMMENT 'shipping fee',
      payment_method  STRING         COMMENT 'payment method',
      order_status    STRING         COMMENT 'order status',
      shipping_status STRING         COMMENT 'shipping status',
      create_time     STRING         COMMENT 'order creation time',
      update_time     STRING         COMMENT 'order update time'
  )
  PARTITION BY RANGE (dt) (
      PARTITION p20240101 VALUES LESS THAN ("20240201"),
      PARTITION p20240201 VALUES LESS THAN ("20240301")
  )
  DISTRIBUTED BY HASH(order_id) BUCKETS 10
  PROPERTIES (
      "replication_num" = "3"
  );
  ```
With the test data in place, you can now move on to the SeaTunnel functional test.
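Before running the sync, it is worth confirming that the generated data actually landed in Hive. The following is a small sanity-check sketch using the `hive` CLI; the expected counts assume the default script parameters (10 days × 1,000,000 orders per day).

```bash
# Raw text table: expect roughly 10,000,000 data rows (plus the CSV header row,
# which loads with a NULL order_id and is filtered out during the Parquet insert).
hive -e "SELECT COUNT(*) FROM tmp_db.dw_orders;"

# Parquet table: the dynamic-partition insert should have created one partition per day.
hive -e "SHOW PARTITIONS tmp_db.dw_orders_parquet;"

# Spot-check the partition that the SeaTunnel job reads (read_partitions = ["dt=20240101"]).
hive -e "SELECT COUNT(*) FROM tmp_db.dw_orders_parquet WHERE dt = '20240101';"
```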
3. Use SeaTunnel to Sync Data from Hive to Doris
- **Create the configuration file**

  Create a configuration file named `h2d.conf` for the sync job:

  ```hocon
  env {
    parallelism = 1
    job.mode = "BATCH"
    checkpoint.interval = 10000
    spark.app.name = "h2d"
    spark.sql.catalogImplementation = "hive"
    spark.executor.memory = "4g"
    spark.executor.instances = "1"
    spark.executor.memoryOverhead = "2g"
    spark.sql.files.maxPartitionBytes = 134217728
    spark.sql.files.openCostInBytes = 134217728
    spark.kryoserializer.buffer.max = "256m"
    hive.exec.dynamic.partition.mode = "nonstrict"
    spark.dynamicAllocation.enabled = "true"
    spark.shuffle.service.enabled = "true"
  }

  source {
    Hive {
      result_table_name = "tmp_hivedata"
      table_name = "tmp_db.dw_orders_parquet"
      read_partitions = ["dt=20240101"]
      metastore_uri = "thrift://hadoop-test:9083"
      hdfs_site_path = "/etc/hadoop/conf/hdfs-site.xml"
    }
  }

  transform {
    Sql {
      source_table_name = "tmp_hivedata"
      result_table_name = "tmp_dorisdata"
      query = """
        select
          dt,
          order_id,
          customer_id,
          product_id,
          product_name,
          quantity,
          price,
          total_amount,
          discount_amount,
          shipping_fee,
          payment_method,
          order_status,
          shipping_status,
          create_time,
          update_time
        from tmp_hivedata
        where order_id is not null
      """
    }
  }

  sink {
    Doris {
      source_table_name = "tmp_dorisdata"
      fenodes = "192.168.100.101:8030"
      username = "admin"
      password = "gawyn"
      table.identifier = "tmp_db.dw_orders"
      batch_max_rows = 50000
      batch_max_bytes = 104857600
      batch_interval_ms = 30000
      sink.enable-2pc = "false"
      sink.label-prefix = "test_json"
      doris.config = {
        format = "json"
        read_json_by_line = "true"
      }
    }
  }
  ```
- **Run the job**

  - **SeaTunnel engine**

    Start the cluster:

    ```bash
    bin/seatunnel-cluster.sh -DJvmOption="-Xms4G -Xmx4G"
    ```

    Submit the sync job:

    ```bash
    bin/seatunnel.sh --config h2d.conf -e cluster
    ```

  - **Spark engine**

    Submit the sync job:

    ```bash
    bin/start-seatunnel-spark-2-connector-v2.sh \
      --master yarn \
      --deploy-mode client \
      --config h2d.conf
    ```
With this configuration file in place, you can schedule and run SeaTunnel jobs. A quick way to confirm the result is sketched below.
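Once the job completes, a row-count comparison between the source partition in Hive and the Doris target table confirms the sync. This is only a sketch: the FE host and credentials reuse the values from `h2d.conf`, and port 9030 is the default Doris FE MySQL-protocol query port, which may differ in your deployment.

```bash
# Rows SeaTunnel read from the Hive partition.
hive -e "SELECT COUNT(*) FROM tmp_db.dw_orders_parquet WHERE dt = '20240101';"

# Rows written to the Doris target table (Doris is queried over the MySQL protocol).
mysql -h 192.168.100.101 -P 9030 -uadmin -pgawyn \
      -e "SELECT COUNT(*) FROM tmp_db.dw_orders WHERE dt = 20240101;"
```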