-》Custom functions (UDF)
1) Create a project and add the Hive dependency jars
2) Write the code; the class must extend UDF
3) Package it: export jar file
4) Upload the jar to a directory on Linux
5) Start Hive
6) add jar <jar path> //do not quote the path
add jar /root/lower.jar
7) Register the function in Hive
create temporary function <function name> as '<fully qualified class name>'
create temporary function lower as "com.alex.udf.func.lower";
OK
Time taken: 0.1 seconds
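The class packaged in step 2 can be sketched as follows (a sketch, assuming the classic `org.apache.hadoop.hive.ql.exec.UDF` base class and the hive-exec jar on the project classpath; the package and class names mirror `com.alex.udf.func.lower` from the transcript):

```java
package com.alex.udf.func;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal lowercase UDF: Hive calls evaluate() once per row.
public class lower extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null; // NULL in, NULL out (matches the NULL row in the query below)
        }
        return new Text(input.toString().toLowerCase());
    }
}
```

Compiling this requires the hive-exec dependency; it is not runnable standalone.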
hive (default)> use mongdb;
OK
Time taken: 0.031 seconds
hive (mongdb)> select * from student;
OK
student.id student.name
4 Tonny
1 Alex
2 Amy
3 Mia
NULL NULL
Time taken: 1.647 seconds, Fetched: 5 row(s)
hive (mongdb)> select name, lower(name) as lower_name from student;
OK
name lower_name
Tonny tonny
Alex alex
Amy amy
Mia mia
NULL NULL
Time taken: 0.31 seconds, Fetched: 5 row(s)
-》Compression:
1》Enable intermediate compression
set hive.exec.compress.intermediate;
set hive.exec.compress.intermediate = true;
2》Enable on the map side
hive (default)>set hive.exec.compress.intermediate;
hive.exec.compress.intermediate=false
hive (default)> set hive.exec.compress.intermediate=true;
hive (default)> set mapreduce.map.output.compress;
mapreduce.map.output.compress=false
hive (default)> set mapreduce.map.output.compress=true;
3》Enable on the reduce side
Enable compression of the final output:
set hive.exec.compress.output=true;
Enable compression of the final output data:
set mapreduce.output.fileoutputformat.compress=true;
Set the compression codec:
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
Set block-level compression:
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
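Putting the switches above together, a typical session looks like this (a sketch; Snappy is used as the codec as in the notes, and its availability depends on the cluster's native libraries):

```sql
-- Compress intermediate data passed between MapReduce stages
set hive.exec.compress.intermediate=true;
-- Compress map output
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress the final job output
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
```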
-》Set the storage format:
Add it when creating the table:
create table emp_num(time int, host string)
row format delimited
fields terminated by '\t'
stored as orc; //specify the storage format
TextFile / SequenceFile / ORC / Parquet
ORC file layout: Index Data / Row Data / Stripe Footer
Compression ratio:
orc》parquet》textfile
Query speed: orc > textfile (50s vs 54s)
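The storage format can also be combined with a table-level codec (a sketch; `orc.compress` is a standard ORC table property, and the table name `emp_orc` is made up for illustration):

```sql
-- Hypothetical table: ORC storage with Snappy compression set per table
create table emp_orc(time int, host string)
row format delimited
fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");
```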
-》Data-skew optimization:
Enable map-side aggregation:
set hive.map.aggr;
hive.map.aggr=true
a) Enable load balancing for skewed group-by keys:
set hive.groupby.skewindata;
hive.groupby.skewindata=false
b) Merge small files:
set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
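The checks above only show the current values; enabling the two group-by optimizations looks like this (a sketch; with `hive.groupby.skewindata=true`, Hive splits the group-by into two jobs, first spreading keys randomly across reducers to balance the load, then producing the final aggregation):

```sql
-- Partial aggregation in the mapper before the shuffle
set hive.map.aggr=true;
-- Two-stage group-by for skewed keys
set hive.groupby.skewindata=true;
select name, count(*) from student group by name;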
-》JVM reuse:
mapred-site.xml
mapreduce.job.jvm.numtasks=10~20
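In mapred-site.xml this is written as a property entry (a config-fragment sketch; the value 10 is one choice from the 10~20 range noted above):

```xml
<!-- Reuse each task JVM for multiple tasks instead of starting a new JVM per task -->
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
</property>
```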