ss.sql(
  s"""
    insert overwrite table test.test1 partition(day='$Day', hour)
    select
      adx,
      geo_country as country,
      connectiontype as network,
      imptype as adType,
      devicetype as deviceType,
      platform as trafficType,
      lower(os) as os,
      concat(imp_w, "*", imp_h) as adSize,
      sum(case when log_type = 'bid' then 1 else 0 end) as bid_cnt,
      sum(case when log_type = 'nobid' then 1 else 0 end) as blacklist_bid_cnt,
      hour
    from test.test
    where day = '$Day'
    group by hour, adx, platform, geo_country, connectiontype, imptype, devicetype, lower(os), concat(imp_w, "*", imp_h)
  """
)
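(Side note: since hour is a dynamic partition column in this insert, Hive's dynamic-partition support has to be enabled before running it. A minimal sketch of the settings, assuming ss is a SparkSession built with enableHiveSupport(); with the static day key, strict mode may already be enough, so check your cluster defaults:)

// assumed setup: ss is a Hive-enabled SparkSession
ss.sql("SET hive.exec.dynamic.partition = true")
ss.sql("SET hive.exec.dynamic.partition.mode = nonstrict")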
After running the SQL above, the Spark UI looked like the screenshot below:
All of the executors were dead; only the driver node was still active, as shown:
There were more tasks to run after this ss.sql() call, but the job just hung there without progress. It stayed stuck for more than 30 minutes and I had no idea why.
Note that the statement above contains: insert overwrite table test.test1 partition(day='$Day', hour)
This statement writes the results of the select into table test1. The slow part happens after the Spark stages finish: the output files are copied into the table's location in the background, and Spark does not report that copy as a job, so the Spark UI shows no activity at all. Only after the copy completed, tens of minutes later, did the subsequent tasks show up in the Spark UI.
As the next screenshot showed: stages 0 and 1 finished in just 8 minutes, but it took a full 20+ minutes more before tasks 2 and 3 started running.
So it is best not to use insert overwrite. That raises the question: how do I get the effect of insert overwrite without it? Here is my method:
import org.apache.spark.storage.StorageLevel

// drop the table partition (day is a string column, so quote the value)
ss.sql(s"ALTER TABLE test.test1 DROP IF EXISTS PARTITION (day='$day')")
// run the select (querySql) and get a DataFrame
val stuDf = ss.sql(querySql).persist(StorageLevel.MEMORY_AND_DISK_SER)
// write the DataFrame's contents straight to HDFS as ORC; overwrite in case the path already exists
stuDf.write.format("orc").mode("overwrite").save(filePath)
// create the partition, pointing it at the files we just wrote
ss.sql(s"ALTER TABLE test.test1 ADD IF NOT EXISTS PARTITION (day='$day') LOCATION '$filePath'")
// release the cached DataFrame
stuDf.unpersist(true)
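Put together as one self-contained sketch (the table name, path layout, and the overwritePartition helper are my own illustration, not from the original code; it also assumes a table partitioned by day alone, unlike the day/hour table in the first example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InsertOverwriteWorkaround {
  // Emulate "insert overwrite ... partition(day=...)" without the slow
  // driver-side copy: drop the partition, write the files directly,
  // then register a fresh partition pointing at them.
  def overwritePartition(ss: SparkSession, querySql: String, day: String): Unit = {
    // assumed warehouse layout; use whatever path your table actually lives at
    val filePath = s"/user/hive/warehouse/test.db/test1/day=$day"
    ss.sql(s"ALTER TABLE test.test1 DROP IF EXISTS PARTITION (day='$day')")
    val df = ss.sql(querySql).persist(StorageLevel.MEMORY_AND_DISK_SER)
    df.write.format("orc").mode("overwrite").save(filePath)
    ss.sql(s"ALTER TABLE test.test1 ADD IF NOT EXISTS PARTITION (day='$day') LOCATION '$filePath'")
    df.unpersist(true)
  }

  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder()
      .appName("insert-overwrite-workaround")
      .enableHiveSupport()
      .getOrCreate()
    val day = args(0)
    // any select whose columns match test.test1's non-partition schema
    val querySql = s"select adx, count(*) as bid_cnt from test.test where day = '$day' group by adx"
    overwritePartition(ss, querySql, day)
    ss.stop()
  }
}

On newer Spark versions (2.3+), setting spark.sql.sources.partitionOverwriteMode=dynamic is also worth a look: it changes INSERT OVERWRITE to replace only the partitions actually being written, which avoids part of this pain.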