Spark DataFrame API operations


General form: val df3 = df1.join(df2, joinColumns, joinType)

1. The two tables share the same join column names


val df=a11.join(a22,Seq("receive_time","channel_code"))
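The Seq form also accepts a join type as a third argument, and it keeps only one copy of each key column in the result. A minimal sketch, assuming a live SparkSession named spark in a spark-shell session (the table data here is made up for illustration):

```scala
// Sketch for spark-shell; the DataFrames and values are illustrative.
import spark.implicits._

val orders = Seq(("2021-01-01", "web", 10), ("2021-01-02", "app", 5))
  .toDF("receive_time", "channel_code", "qty")
val clicks = Seq(("2021-01-01", "web", 99))
  .toDF("receive_time", "channel_code", "clicks")

// Same-name keys via Seq; the join type goes in the optional third argument.
// The result contains a single receive_time and a single channel_code column.
val joined = orders.join(clicks, Seq("receive_time", "channel_code"), "left_outer")
joined.show()
```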

2. The two tables have different join column names (note the three equals signs)

val h5_1 = h10_lev3.join(h10_lev2, h10_lev3("parentid_3") === h10_lev2("node2id"), "inner")
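One thing to watch with a column-expression join: both key columns survive in the result, so selecting "node2id" afterwards is ambiguous. A minimal sketch of dropping the duplicate, assuming the same h10_lev3/h10_lev2 DataFrames in a spark-shell session:

```scala
// Sketch, assuming spark-shell with h10_lev3 and h10_lev2 already defined.
// After an === join, the result contains BOTH parentid_3 and node2id;
// drop the right-side key to keep only one copy.
val h5_1 = h10_lev3
  .join(h10_lev2, h10_lev3("parentid_3") === h10_lev2("node2id"), "inner")
  .drop(h10_lev2("node2id"))
```

Passing the Column (h10_lev2("node2id")) rather than the bare string "node2id" tells drop exactly which side's column to remove.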

3. Common operations at a glance

import org.apache.spark.sql.SaveMode

val r88 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "test.ccc")
  .option("user", "xxx")
  .option("password", "xxx")
  .load()

import org.apache.spark.sql.functions.desc

r88.select("itemid", "dl_count_30")
  .groupBy("itemid")
  .sum("dl_count_30")
  .orderBy(desc("sum(dl_count_30)"))
  .withColumnRenamed("sum(dl_count_30)", "playcount18Q1")
  .limit(500000)
  .write.mode(SaveMode.Overwrite)
  .save("/home/gmd/tmp/playcount18Q1")
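The rename step can be avoided entirely by aliasing the aggregate up front with agg. A sketch, assuming the same r88 DataFrame in a spark-shell session:

```scala
import org.apache.spark.sql.functions.{sum, desc}

// Sketch, assuming r88 from the JDBC read above.
// Aliasing inside agg() gives the column its final name immediately,
// so there is no "sum(dl_count_30)" to rename or to reference in orderBy.
val top = r88
  .groupBy("itemid")
  .agg(sum("dl_count_30").as("playcount18Q1"))
  .orderBy(desc("playcount18Q1"))
  .limit(500000)
```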

r88.write.mode(SaveMode.Overwrite).format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "test.cms20190618")
  .option("user", "xx")
  .option("password", "xxx")
  .save()

4. Aggregation


import org.apache.spark.sql.functions.sum

val tempos2 = tempos.join(r91, Seq("relate_itemid"), "inner")
  .agg(sum("counts") as "ultimate_top_play_counts")
  .first()
  .getAs[Long]("ultimate_top_play_counts")

val tempos2_playcounts2 = tempos2.toDouble  // Long converts to Double directly; no string round-trip needed
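Appending ".0" and re-parsing works, but Scala's numeric types already provide the conversion. A small self-contained comparison (the value 123456L stands in for whatever getAs[Long] returned):

```scala
// Stand-in for the value extracted with getAs[Long].
val total: Long = 123456L

val viaString = (total + ".0").toDouble  // the string round-trip approach
val direct    = total.toDouble           // the built-in numeric conversion
// Both yield the same Double.
```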

5. When writing text output, the DataFrame must have exactly one column, and it must be of string type

import org.apache.spark.sql.SaveMode
a.filter("appid=2980").selectExpr("cast(userid as string)").write.mode(SaveMode.Overwrite).text("/home/gmd/userid2980.txt")
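If several columns need to land in the text file, they must first be collapsed into one string column, for example with concat_ws. A sketch, assuming the same DataFrame a as above (the tab separator and output path are illustrative):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{concat_ws, col}

// Sketch, assuming DataFrame `a` from above. text() accepts exactly one
// string column, so join userid and appid into a single tab-separated line.
a.filter("appid = 2980")
  .select(concat_ws("\t", col("userid"), col("appid")).as("line"))
  .write.mode(SaveMode.Overwrite)
  .text("/home/gmd/userid2980_tab.txt")
```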

 
