Some of our customer's complex analyses were implemented as PostgreSQL stored procedures. Because the data volume is large, they ran very slowly, so we decided to rewrite them in PySpark. That raised a problem: Spark SQL cannot execute multi-statement SQL, let alone stored procedures. Wrapping every statement in its own spark.sql() call makes the code bloated, and whenever the SQL changes the Python code has to change with it. Is there a simple way to execute a script containing multiple SQL statements, the way spark-sql does?
Suppose we have a script containing multiple SQL statements, named length_of_play.sql,
DROP VIEW IF EXISTS shakespeare;

-- Read all of Shakespeare's plays
CREATE TEMPORARY VIEW shakespeare
USING parquet
OPTIONS (path "data/shakespeare.gz.parquet");

-- Print the table schema and additional information to the console
DESCRIBE EXTENDED shakespeare;

-- Calculate the number of lines of each work and print to the console
SELECT play_name, count(line_id) AS lines
FROM shakespeare
GROUP BY play_name
ORDER BY lines DESC
LIMIT 20;

-- Save the result as one file in JSON Lines format
DROP TABLE IF EXISTS lengthOfPlay;   -- to overwrite, remove existing table
SET spark.sql.shuffle.partitions=1;  -- to make single output file
CREATE TABLE lengthOfPlay
USING json
LOCATION "length_of_play"
AS SELECT play_name, count(line_id) AS lines
   FROM shakespeare
   GROUP BY play_name
   ORDER BY lines DESC;
# The spark-sql CLI can execute multiple statements in one run
spark-sql -f length_of_play.sql
But this is not convenient to use from within PySpark. An alternative is to split the script into individual statements and execute them in a loop,
# Initialize the SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()
Before running the script, check which tables exist in Hive,
spark.sql("show tables").show()
# Result
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Next, execute the SQL script,
with open("length_of_play.sql") as file: # Use file to refer to the file object queries = file.read().rstrip().split(";") for query in queries: #print("~~~~") #print(query) if len(query.strip())>0: spark.sql(query)
Check the Hive tables again,
spark.sql("show tables").show()
# Result
+--------+------------+-----------+
|database|   tableName|isTemporary|
+--------+------------+-----------+
| default|lengthofplay|      false|
|        | shakespeare|       true|
+--------+------------+-----------+
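If you also want to check the file output itself, the JSON Lines data written under the LOCATION given in the script can be read back directly. A quick sketch, assuming the relative path "length_of_play" from the script above:

# Read back the JSON Lines output produced by the CREATE TABLE statement
result = spark.read.json("length_of_play")
result.show(5)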
From here, you only need to wrap the core code above in a small helper to handle multiple scripts conveniently.
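A minimal sketch of such a wrapper (the function name run_sql_script is just an illustrative choice, using the same naive split on ";" as above):

# Execute every non-empty statement of a ;-separated SQL script
def run_sql_script(spark, path):
    with open(path) as file:
        queries = file.read().rstrip().split(";")
    for query in queries:
        if len(query.strip()) > 0:
            spark.sql(query)

# Run one or more scripts conveniently
run_sql_script(spark, "length_of_play.sql")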
P.S.
To make the code more robust, I ultimately adopted the Python package sqlparse. The package has other uses as well, but those are topics for another time.
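For reference, here is a rough sketch of how sqlparse could slot into the helper above; sqlparse.split copes with semicolons inside string literals and comments far better than a plain str.split. The formatting options shown are my own assumptions, not necessarily the original implementation:

import sqlparse

# Split a SQL script with sqlparse and run each statement via spark.sql
def run_sql_script(spark, path):
    with open(path) as file:
        script = file.read()
    for statement in sqlparse.split(script):
        # Remove comments and the trailing semicolon before handing it to Spark
        statement = sqlparse.format(statement, strip_comments=True).strip().rstrip(";")
        if statement:
            spark.sql(statement)

run_sql_script(spark, "length_of_play.sql")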
Unless otherwise noted, the scripts and related resources can be found at
https://github.com/alitrack/pyspark
If anything is missing, leave me a comment and I will upload it.