Python script: get_hive_count.py
def get_total_everyDay(from_table, hive_from):
    """Count rows per partition date for a Hive table and write the result as Parquet.

    Runs a group-by-count over ``part_date`` on the given Hive table and
    writes the (part_date, count) result to HDFS as a single Parquet file,
    so two clusters' counts can later be compared file-to-file.

    Args:
        from_table: Fully qualified Hive table name, e.g. ``"db.table"``.
            Assumes exactly one dot separates database and table — the
            table part (``split(".")[1]``) is used in the output file name.
        hive_from: Label for the source cluster/environment (e.g. ``"sh"``
            or ``"bj"``); used in the Spark app name and the output path.

    Side effects:
        Creates (or reuses) a SparkSession on YARN and overwrites
        ``/user/abc/count_compare/<hive_from>_count_<table>.parquet``.
    """
    # getOrCreate() reuses an existing session if one is already active.
    spark = (SparkSession.builder
             .master("yarn")
             .appName("get %s hive count" % hive_from)
             .enableHiveSupport()
             .getOrCreate())
    # Per-partition-date row counts, ordered so outputs are directly diffable.
    sql = "select part_date, count(*) as count from %s group by part_date order by part_date" % from_table
    # repartition(1): emit a single Parquet part file to simplify comparison.
    spark.sql(sql).repartition(1).write.parquet(
        "/user/abc/count_compare/%s_count_%s.parquet" % (
            hive_from, from_table.split(".")[1]),
        mode="overwrite")
    print("get %s finish-table hive count OK!" % from_table)
A second script, equal_compare.py, reads a Parquet result file for comparison:
#对结果文件进行比较,有part_date字段的表 def check_result_for_tableHavePartDate(sh_hive_count_path, bj_hive_count_path, from_table): print("%s******************start of comparison!******************" % from_table) spark = SparkSession.builder.