Some of a customer's complex analyses were implemented as PostgreSQL stored procedures. Because the data volume is large, they ran very slowly, so we decided to rewrite them in PySpark. That raised a problem: Spark SQL cannot execute multi-statement SQL, let alone stored procedures. Wrapping every statement in its own spark.sql() call would bloat the code, and any change to the SQL would also require changing the Python code. Is there a simple way to execute a script containing multiple SQL statements, the way spark-sql does?
Suppose we have a script containing multiple SQL statements, named length_of_play.sql:
DROP VIEW IF EXISTS shakespeare;

-- Read all of Shakespeare's plays
CREATE TEMPORARY VIEW shakespeare
USING parquet
OPTIONS (path "data/shakespeare.gz.parquet");

-- Print the table schema and additional information to the console
DESCRIBE EXTENDED shakespeare;

-- Calculate the number of lines in each work and print to the console
SELECT play_name, count(line_id) AS lines
FROM shakespeare
GROUP BY play_name
ORDER BY lines DESC
LIMIT 20;

-- Save the result as one file in JSON Lines format
DROP
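One simple approach is to split such a script into individual statements in Python and feed each one to spark.sql() in turn. The sketch below is an assumption-laden minimal version: the helper names (split_sql_script, run_sql_file) are my own, and the splitter only strips `--` line comments and splits on semicolons, so it does not handle semicolons inside string literals (which is fine for scripts like length_of_play.sql above).

```python
import re


def split_sql_script(script: str):
    """Split a multi-statement SQL script into individual statements.

    A minimal sketch: removes '--' line comments, then splits on
    semicolons. Semicolons inside string literals are NOT handled.
    """
    # Drop '--' line comments up to the end of each line
    no_comments = re.sub(r"--[^\n]*", "", script)
    # Split on ';' and keep only non-empty statements
    return [s.strip() for s in no_comments.split(";") if s.strip()]


def run_sql_file(spark, path):
    """Execute every statement in a SQL file, printing each result
    to the console much like the spark-sql CLI does."""
    with open(path) as f:
        for stmt in split_sql_script(f.read()):
            spark.sql(stmt).show(truncate=False)


# Usage (assumes an existing SparkSession named `spark`):
# run_sql_file(spark, "length_of_play.sql")
```

Note that spark.sql() returns a DataFrame for every statement, so calling .show() on each one mimics the console output of spark-sql; for a production version you might only print results for SELECT/DESCRIBE statements.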