Comparing data processing in PySpark and Hive

Latest recommended article published 2023-03-28 20:06:52

一条水里的鱼

Category: Tool usage. Tags: pyspark, hive

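Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.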

Article link: https://blog.csdn.net/qq_40859560/article/details/106406075

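The PySpark lines below assume an existing DataFrame df and the import: from pyspark.sql import functions as F

Filter:

hive: select * from table where age>1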

spark: df=df.filter(df.age>1)

Filter out nulls:

select * from table where age is not null

df=df.filter(df.age.isNotNull())

Maximum of a column:

select max(age) max_age from table

df=df.agg(F.max(df.age).alias('max_age'))

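Count rows:

select count(1) count from table

df.agg(F.count(F.lit(1)).alias('count')) # returns a one-row DataFrame

df.count() # returns an int

Count distinct values:

select count(distinct name) count from table

df.agg(F.countDistinct(df.name).alias('count')) # distinct values of name

df.distinct().count() # distinct rows, returns an int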
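The counts in this section answer different questions, which is easy to miss when they sit side by side. The sketch below checks the SQL side on a toy table, using Python's built-in sqlite3 as a stand-in for Hive (an assumption: these plain aggregate queries behave the same in both). The table name people and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite table standing in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text, age integer)")
conn.executemany(
    "insert into people values (?, ?)",
    [("a", 1), ("a", 1), ("a", 2), ("b", 2)],
)

# All rows: corresponds to df.count()
total_rows = conn.execute("select count(1) from people").fetchone()[0]

# Distinct values of one column: corresponds to F.countDistinct(df.name)
distinct_names = conn.execute(
    "select count(distinct name) from people"
).fetchone()[0]

# Distinct whole rows: corresponds to df.distinct().count()
distinct_rows = conn.execute(
    "select count(1) from (select distinct name, age from people)"
).fetchone()[0]
```

On this data the three values are 4, 2, and 3, so the three expressions are not interchangeable.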

Select columns:

select c1,c2 from table / select * from table

df=df.select('c1','c2')/df=df.select('*')

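Remove duplicates:

select c1,c2,max(c3) c3 ... from table group by c1,c2

df=df.dropDuplicates(['c1','c2']) # keeps one row per (c1,c2), with no guarantee that it is the row with max(c3)

df=df.groupBy('c1','c2').agg(F.max('c3').alias('c3')) # direct equivalent of the group by query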
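The group by query does more than deduplicate: for each key it keeps the largest c3. A small check of that behavior, again with sqlite3 standing in for Hive (an assumption) and a hypothetical table t:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (c1 text, c2 text, c3 integer)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [("x", "y", 1), ("x", "y", 5), ("z", "w", 2)],
)

# One row per (c1, c2), carrying the maximum c3 for that key.
deduped = conn.execute(
    "select c1, c2, max(c3) as c3 from t group by c1, c2 order by c1"
).fetchall()
```

In PySpark, dropDuplicates on a subset of columns also returns one row per key but does not promise which row survives, so a groupBy with F.max is the closer match when the maximum matters.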

join:

select a.c1,a.c2,a.c3,b.c4 from table1 a left join table2 b on a.c1=b.c1 and a.c2=b.c2 where b.c4 is not null

df=df1.join(df2,[df1.c1==df2.c1,df1.c2==df2.c2],'left').select(df1['*'],df2.c4).filter(df2.c4.isNotNull())
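One detail worth checking in this join: filtering on b.c4 is not null after a left join discards the unmatched left rows, so the result is the same as an inner join with that filter. The sketch below verifies this on toy data, using sqlite3 as a stand-in for Hive (an assumption); the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table table1 (c1 integer, c2 integer, c3 text)")
conn.execute("create table table2 (c1 integer, c2 integer, c4 text)")
conn.executemany(
    "insert into table1 values (?, ?, ?)",
    [(1, 1, "p"), (2, 2, "q"), (3, 3, "r")],
)
conn.executemany(
    "insert into table2 values (?, ?, ?)",
    [(1, 1, "u"), (2, 2, None)],
)

join_sql = (
    "select a.c1, a.c2, a.c3, b.c4 from table1 a {kind} join table2 b "
    "on a.c1 = b.c1 and a.c2 = b.c2 {where} order by a.c1"
)

# Plain left join: every left row survives, unmatched ones get a null c4.
left_all = conn.execute(join_sql.format(kind="left", where="")).fetchall()

# Left join plus the null filter, as in the query above.
left_filtered = conn.execute(
    join_sql.format(kind="left", where="where b.c4 is not null")
).fetchall()

# Inner join with the same filter, for comparison.
inner_filtered = conn.execute(
    join_sql.format(kind="inner", where="where b.c4 is not null")
).fetchall()
```

If the unmatched left rows are needed, the null filter has to be dropped or moved into the join condition.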

limit:

select * from table limit 10

df=df.limit(10)

Sort:

select * from table order by age desc

df=df.sort(df.age.desc()) / df=df.orderBy('age',ascending=False)

Sort by multiple columns:

select * from table order by age desc,score asc

df=df.orderBy(['age','score'],ascending=[0,1])

Union:

select * from a union all select * from b

df=df1.union(df2)

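Group, aggregate, then sort:

select name,sum(score) score from table group by name order by score

ddf1=ddf.groupBy('name').agg(F.sum('score').alias('score')) # group and aggregate first

ddf1=ddf1.orderBy('score',ascending=True) # then sort

ddf1=ddf.groupBy('name').count().orderBy('count') # the author marked this as unverified, noting the result seemed to differ from select name,count(1) c from table group by name; the per-name counts should match, with the column named count instead of c and the rows additionally sorted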
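To sanity-check the grouped queries, the sketch below runs them on a toy table with sqlite3 standing in for Hive (an assumption) and cross-checks the per-name counts against a plain Python Counter. The sum is aliased as total rather than score here, purely to avoid any ambiguity with the underlying column in this stand-in; the table and rows are hypothetical.

```python
import sqlite3
from collections import Counter

rows = [("a", 3), ("b", 1), ("a", 4), ("c", 2)]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (name text, score integer)")
conn.executemany("insert into t values (?, ?)", rows)

# Group, sum, then sort by the aggregated value.
totals = conn.execute(
    "select name, sum(score) as total from t group by name order by total"
).fetchall()

# Rows per name; the extra sort key makes the order deterministic on ties.
counts = conn.execute(
    "select name, count(1) as c from t group by name order by c, name"
).fetchall()

# Independent check: counting names directly gives the same numbers.
expected_counts = Counter(name for name, _ in rows)
```

A grouped count is just the number of rows per key, so any difference seen between the two engines would have to come from naming or ordering, not from the counts themselves.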

in:

select * from table where name in ('a','ab')

df=df.filter(df.name.isin('a','ab'))

like:

select * from table where name like '%a%'

df=df.filter(df.name.like('%a%'))

if:

select if (score<10,0,100) score from table

df=df.select(F.when(df.score<10,0).otherwise(100).alias('score'))
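Hive's if(cond, x, y) and PySpark's F.when(cond, x).otherwise(y) both behave like a SQL case expression, including for null inputs: a null score makes the condition null, so the else branch is taken. The sketch below shows this with a case expression in sqlite3, used here as a stand-in for Hive (an assumption); the table and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (name text, score integer)")
conn.executemany(
    "insert into t values (?, ?)",
    [("a", 5), ("b", 50), ("c", None)],
)

# case when ... else ... end plays the role of if(score<10,0,100).
flagged = conn.execute(
    "select name, case when score < 10 then 0 else 100 end as score "
    "from t order by name"
).fetchall()
```

If null scores should stay null instead of becoming 100, the condition needs an explicit null check before the comparison.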

coalesce:

select * ,coalesce(name1,name2,'xx') from table

df=df.select('*',F.coalesce(df.name1,df.name2,F.lit('xx')))

concat:

select concat_ws('-',name1,name2) name from table

df=df.select(F.concat_ws('-',df.name1,df.name2).alias('name'))

length:

select col, length(col) len from table

df=df.select(df.col,F.length(df.col).alias('len'))
