hive、spark函数大全

莫待花无空折枝

已于 2023-11-15 12:32:54 修改

阅读量1.6k

点赞数

分类专栏：大数据文章标签： hive函数大全

于 2019-09-02 16:30:42 首次发布

本文链接：https://blog.csdn.net/yshysh8/article/details/100309801

版权

大数据专栏收录该内容

7 篇文章 0 订阅

订阅专栏

一、hive

1、日期（hive没有date类型）
1）、获取时间
current_date 2019-05-07
current_timestamp/now() 2019-05-07 15:20:49.247
2）、从日期时间中提取字段
year(''),month(''),day/dayofmonth,hour(''),minute(''),second(''),dayofweek(),dayofyear(),weekofyear()
trunc('2009-02-12', 'MM')截取某部分的日期，其他部分默认为01第二个参数 ["year", "yyyy", "yy", "mon", "month", "mm"]
3）、日期时间转换
unix_timestamp返回当前时间的unix时间戳：
unix_timestamp()当前时间，unix_timestamp('2016-04-08', 'yyyy-MM-dd')。
from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');时间戳对应时间，联合unix_timestamp使用可以把字符串转换成不同格式的日期。
to_date/date()返回时间字符串的日期部分。

4）、日期计算
months_between('1997-02-28 10:30:00', '1996-10-30')两个日期之间的月数
datediff('2009-07-31', '2009-07-30');两个日期之间的天数
add_months('2016-08-31', 1)日期加n月
last_day(date)当月的最后一天
date_add('2016-07-30', 1)
date_sub('2016-07-30', 1)
2、数值函数
round(3.1415926) 返回 BIGINT；round(double a, int d)返回DOUBLE
floor(double a)ceil(double a)
rand()取随机数函数
abs(double a) 绝对值
3、字符串函数

concat_ws(',','abc','def','gh')带分隔符字符串连接函数

upper('abSEd')
lower('abSEd')
trim(' abc ')
regexp_replace('foobar', 'oo|ar', '')正则表达式替换

from_json('{"k": "fruit", "v": "apple"}','k STRING, v STRING', map("",""))
json_tuple('{"name":"jack","server":"www.qq.com"}','server','name')
cast(1 as bigint)类型转换

substr(string A, int start, int len)
substr(string A, int start)

split(regexp_replace(msg,'\\\\x22','"'),' -')[1]

5、默认值coalesce(name,"")
解决join空值中的坑。
6、多行拼接concat_ws
concat_ws('_',collect_set(promotion_id)collect_set的作用是对promotion_id去重

7、解析list为多行

json_extract

select union_id
,get_json_object(tag1,'$.url') audit_content
from (
select union_id
,substr(get_json_object(audit_content,'$.images'),2,length(get_json_object(audit_content,'$.images'))-2) audit_content
from table
WHERE pt = "${bizdate}" and get_json_object(audit_content,'$.images')<>'[]'
) ta lateral view explode(split(audit_content, ',')) r1 as tag1 ;

explore(c_map) b AS map_key, map_value;

presto

select id,new_name from table

cross join unnest(split(name,',')) as tmp(new_name )

alter table tablename add columns(columnname string) ;

删除外部表，先将外部表改为内部表

alter table test_external set TBLPROPERTIES('EXTERNAL'='false');

alter table test_external change column etl_time etl_time string COMMENT '数据时间';

alter table app.example_orc partition (dt="20180505") concatenate;合并小文件

ALTER TABLE my_partition_test_table DROP IF EXISTS PARTITION (p_loctype='MHA');

GET_JSON_OBJECT(tag_cont,'$[*].tag')
alter table my_partition_test_table if not exists add partition (p_hour='2017113003', p_city='573', p_loctype='MHA');

查询除了当前这条数据的其他数据并合并

replace(replace(concat_ws(';',collect_set(concat(commentid,':',hot_score))over()),concat(';',commentid,':',hot_score),''),concat(commentid,':',hot_score,';'),'') as comment_list,

窗口中按指定条件排序：

1、将指定条件的数据查出来添加排序字段；

2、first_value(case when act_semantic='pv_out' then extra_info else null end) over(partition by device_id, sinfo['req_id'], sinfo['object_id'] order by case when act_semantic='pv_out' then log_time else 0 end desc) as extra_info,

split( regexp_replace(regexp_replace(GET_JSON_OBJECT(ext1, '$[*].accounts'), '\\[+\\[+\\{', ''), '\\}+\\]+\\]', ''), '\\}+\\]+\\,+\\[+\\{') )获取list里面所有accounts值

array_contains

array_distinct

cardinality presto查询数组/map大小

spark concat 连接两个数组

to_json map转json

str_to_map json 转map

MSCK REPAIR TABLE 外部表数据复原

解决hive外部表创建_success问题

"mapreduce.fileoutputcommitter.marksuccessfuljobs": false

数据治理

hadoop fs -du -s -h oss://bigdata/data_warehouse/xxxxxxxxx

hadoop fs -rm -r -skipTrash oss://bigdata/data_warehouse/xxxxxxxxx

hadoop fs -ls oss://bigdata/data_warehouse/xxxxxxxxx ｜wc -n 100

二、spark

SparkSession
val conf = new SparkConf().setAppName("SparkWithHive_Sparkv21")
val sparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
sql
spark.sql("") 返回DataFrame
DataFrame方法：
show(n)默认展示20条记录
collect()获取所有数据到数组返回Array对象，map(r=>())
collectAsList()获取所有数据到List
first, head, take, takeAsList：获取若干行记录
where($"")<=>filter($"")条件筛选
select("","")查询指定
distinct()返回一个不包含重复记录的DataFrame
withColumn("period",$"")新增一列，$""取列值，lit(1)给默认值。
withColumnRenamed("原名","新名")
$""的操作：
length($"")
$"".cast(String)
when($""=,)otherwise()
自定义方法
def findHentai(sex:String,dressing:String): String ={
if(sex =="boy" && dressing == "裙子") "变态" else "正常"
}
声明
var check_amt_udf = udf(findHentai _)
调用
check_amt_udf.apply($"",$"")

join
df1.join(df2):笛卡尔积
多个相同字段形式
df1.join(df2, Seq("id","name"))
不相同字段
df1.join(df2 , df1("id" ) === df2( "t1_id"))
指定join类型
df1.join(df2, Seq("id","name"),"left")

groupBy("").agg(
max("").alias(""),
collect_list("") as ""
)

三、spark定义

类 object
方法def f():Unit={}
变量var
常量val
spark 循环
df.foreachPartition(partition => {partition.foreach(line => {line.get(line.fieldIndex(dimensionColumn))})})
var index=Map("t1111dacu_rgst"->"1111dacu_rgst")
index+=("t1212dacu_rgst"->"1212dacu_rgst")
index.foreach{case (k,v) => {}}

四、优化

SparkSQL执行时参数优化 - 不会叫的喵· - 博客园

spark从hdfs里面获取数据分区数是不可设定的，如果后面有复杂的处理逻辑，需要使用宽依赖算子如distinct使stage拆分，再设置partition。或者使用分桶的方法。

spark优化：
spark.conf.set("spark.sql.shuffle.partitions",1000)
df.repartition(1000).persist()

Hive任务优化
1、   map一直等于0%，set hive.auto.convert.join = false转成reduce端的Common Join。
2、   reduce一直是99%，可能是reduce太少(默认是3亿)，SET hive.exec.reducers.bytes.per.reducer=1000000000;单个reduce人处理的数据量或set mapred.reduce.tasks=80;。
3、   hive.groupby.skewindata group by数据倾斜。hive.map.aggr=true 这个配置项代表是否在map端进行聚合
4、   set hive.optimize.skewjoin = true; join数据倾斜set hive.skewjoin.key =4*（count-reducer）
5、   join or 用union all GROUP BY这样替代
6、   Error: Java heap space 设置内存set mapreduce.map.memory.mb=4096; set mapreduce.map.java.opts=-Xmx3600m;

PySpark之列操作 - 简书

莫待花无空折枝

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
1
评论
hive、spark函数大全

一、hive1、日期（hive没有date类型）1）、获取时间current_date 2019-05-07current_timestamp/now() 2019-05-07 15:20:49.2472）、从日期时间中提取字段year(''),month(''),day/dayofmonth,hour(''),minute(''),second(''),dayofweek(),d...
复制链接

扫一扫