Background
When running data jobs with PySpark, a table with too many dt partitions or too much data per run, or a company cluster that enforces memory and runtime limits, can cause the job to be killed before it finishes.
Solution
First, define the Spark script:
#!/usr/bin/python
# -*-coding:utf-8 -*-
import datetime
from pyspark.sql import SparkSession
import sys
# build a SparkSession with Hive support and dynamic partitioning enabled
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .getOrCreate()
# diff_day: how many days back from today, passed in by the shell script
diff_day = int(sys.argv[1])
dt = datetime.datetime.strftime(datetime.datetime.now() - datetime.timedelta(diff_day), "%Y%m%d")
# query only the single dt partition for this run
df = spark.sql("""select * from table where dt = '{dt}'""".format(dt=dt))
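The query above only reads a partition; the dynamic-partition settings on the SparkSession matter once the result is written back to a partitioned Hive table. A minimal sketch of that step, assuming a hypothetical output table result_table that has the same columns as the source table and is also partitioned by dt (since partition columns come last in select *, the dt value flows through as the dynamic partition):

# sketch, not part of the original script: write the day's result back
# into a dt-partitioned Hive table (result_table is a hypothetical name)
df.createOrReplaceTempView("tmp_result")
spark.sql("""
    insert overwrite table result_table partition (dt)
    select * from tmp_result
""")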
Then write a shell script that loops over the days and passes each one to the Spark script as an argument:
#!/bin/sh
for diff_day in 1 2 3 4 5 6 7 8 9 10;
do
echo ${diff_day}
# submit one day's job; output is appended to test.log
sudo -uxxx PYSPARK_PYTHON=./python_env/py37/bin/python3 /xxx/spark-2.3.3-bin-hadoop2.7/bin/spark-submit --master yarn-cluster --executor-memory 18G --conf spark.driver.maxResultSize=4g --num-executors 30 --executor-cores 4 --archives hdfs://xxx/envs/py37.zip#python_env --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python_env/py37/bin/python3 --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./python_env/py37/bin/python3 --conf spark.executorEnv.PYSPARK_PYTHON=./python_env/py37/bin/python3 --conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./python_env/py37/bin/python3 /xxx/test.py ${diff_day} >> /xxx/logs/test.log 2>&1
done
This processes the data one dt partition at a time, so each spark-submit stays within the cluster's memory and runtime limits.
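If some days have no data yet, each submit still pays the full startup cost before the query returns empty. An optional guard, sketched here under the assumption that the source table is partitioned only by dt, is to check the partition list at the top of test.py and exit early:

# optional guard inside test.py: skip days whose partition does not exist yet
existing = [r[0].split("=")[1] for r in spark.sql("show partitions table").collect()]
if dt not in existing:
    print("partition dt={} not found, nothing to do".format(dt))
    spark.stop()
    sys.exit(0)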