Hive Basics for Beginners

暴风雨来临的前夕

Category: Big Data and Cloud Computing. Tags: hive basics, big data

Article link: https://blog.csdn.net/qq_40695037/article/details/81369174

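I'm not on a project yet, so I'm using the time to tidy up the Hive basics I picked up while teaching myself big data. It doubles as a review of what I've learned, and I'm sharing it here in case it helps others. If you spot a problem, please point it out.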

1. What is Hive

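Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying.
Metadata storage: Hive keeps its metadata in a relational database (the metastore). Embedded Derby is the default and MySQL is the usual choice for shared deployments; other databases such as PostgreSQL are supported as well. The metadata includes table names, each table's columns and partitions with their properties (for example, whether the table is external), and the directory where the table's data lives.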

2. Basic Hive features

Creating a partitioned table

create external table if not exists log(
    empno int,
    ename string,
    job string,
    deptno int
)
partitioned by (month String)
row format delimited
fields terminated by '\t';

Loading data

load data local inpath '/home/hadoop/data/log' into table log partition(month='201509');

Querying the data in one partition

select * from log where month = '201509';
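
As a small follow-up sketch to the partition example above: each partition value is stored as its own subdirectory (for example `month=201509`) under the table's directory, so filtering on `month` lets Hive skip the other partitions entirely. The second data file path below is hypothetical and only there for illustration.

```sql
-- List the partitions Hive currently knows about for the table.
show partitions log;

-- Load a second month into its own partition.
-- Assumption: '/home/hadoop/data/log_201510' is a hypothetical local file.
load data local inpath '/home/hadoop/data/log_201510' into table log partition(month='201510');

-- The partition column can be used like an ordinary column in queries.
select month, count(*) as row_count
from log
where month in ('201509', '201510')
group by month;
```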

Bucketed tables

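For each table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data. Bucketing is defined on a chosen column.
Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into.
Bucketing can make some queries more efficient (for example, joins between tables bucketed on the join key), and it makes sampling more efficient.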

create table bucked_user(
    id int,
    name string
)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t'
stored as textfile;

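One more point: bucketing has to be enabled before writing to a bucketed table: set hive.enforce.bucketing = true; (This setting applies to Hive 1.x and earlier. From Hive 2.0 the property was removed and bucketing is always enforced.)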
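
A minimal sketch of how the bucketed table might be filled and then sampled. It assumes a hypothetical staging table `user_src(id int, name string)` that is not part of the original post. A bucketed table should be populated with `insert ... select` rather than `load data`, because `load data` only moves files and does not hash rows into buckets.

```sql
-- Needed on older Hive versions, as noted in the text.
set hive.enforce.bucketing = true;

-- Assumption: user_src is a hypothetical staging table with the same columns.
-- Hive hashes id and writes each row to one of the 4 buckets.
insert overwrite table bucked_user
select id, name from user_src;

-- Sample one of the four buckets. Because the table is clustered by id,
-- Hive can read just that bucket instead of scanning the whole table.
select * from bucked_user tablesample(bucket 1 out of 4 on id);
```

After the insert, the table directory typically contains four files, one per bucket.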

Sorting with bucketing

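order by: global sort, performed by a single reducer
sort by: sorts within each reducer; the overall result is not globally sorted
distribute by: similar to the partitioner in MapReduce, it decides which reducer each row is sent to; usually combined with sort by
cluster by: shorthand for when the distribute by and sort by columns are the same (ascending order only). Note: distribute by must be written before sort by.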
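
The four clauses above can be sketched against the `log` table created earlier. The reducer count of 3 is an arbitrary value chosen for illustration.

```sql
-- Global ordering: a single reducer sorts every row.
select * from log order by empno desc;

-- Use several reducers so the difference with sort by is visible.
set mapreduce.job.reduces = 3;

-- Each reducer's output is sorted, but the combined result is not.
select * from log sort by empno desc;

-- Rows with the same deptno go to the same reducer,
-- then each reducer sorts its rows by empno.
select * from log distribute by deptno sort by empno desc;

-- Same column for distributing and sorting: cluster by is the shorthand.
select * from log cluster by deptno;
```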

Importing and exporting data in Hive

Importing data

Basic method

load data [local] inpath 'filepath' [overwrite] into
table tableName [partition (partcol1 = val1, partcol2 = val2)]

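[local]: load from the local file system; without it, the path is an HDFS path. [overwrite]: whether to overwrite the table's existing data.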

Loading with insert after creating the table

create table default.cli like emp;
insert into table default.cli 
select * from default.emp;
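
As a related sketch, Hive can also create and fill a table in one statement (create table as select). The target table name `emp_copy` is hypothetical.

```sql
-- Assumption: default.emp_copy is a hypothetical name that does not exist yet.
-- The new table takes its column names and types from the select list.
create table default.emp_copy as
select * from default.emp;
```

The `like` form above copies only the table definition, so it needs a separate insert, while this form copies definition and data together.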

When creating an external table, use location to specify where the data lives
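
A minimal sketch of this approach, assuming tab-delimited files already sit in an HDFS directory. The table name `emp_ext` and the path are hypothetical; the columns mirror the `log` table from earlier.

```sql
-- Assumption: '/user/hadoop/data/emp_ext' is a hypothetical HDFS directory
-- that already contains tab-delimited data files.
create external table if not exists emp_ext(
    empno int,
    ename string,
    job string,
    deptno int
)
row format delimited fields terminated by '\t'
location '/user/hadoop/data/emp_ext';

-- The files are queryable immediately; no load step is needed.
select * from emp_ext limit 10;
```

Dropping an external table removes only the metadata; the files under the location stay in place.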

Exporting data

Exporting to the local file system or HDFS

insert overwrite [local] directory  '/user/data'
select * from default.emp;
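
By default the exported files use Hive's control-character field delimiter (\001), which is awkward to read. A sketch of an export with an explicit delimiter follows; the clause is available in Hive 0.11 and later, and the output path is hypothetical.

```sql
-- Assumption: '/tmp/emp_export' is a hypothetical local directory.
-- Note that overwrite replaces whatever is already in that directory.
insert overwrite local directory '/tmp/emp_export'
row format delimited fields terminated by '\t'
select * from default.emp;
```

Leaving out `local` writes to an HDFS directory instead.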

Hive query syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
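
To make the template concrete, here is a small query against the `log` table defined earlier that uses several of the optional clauses.

```sql
-- Count employees per department in one partition,
-- and return at most 10 departments.
select deptno, count(*) as emp_count
from log
where month = '201509'
group by deptno
limit 10;
```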

Hive UDF programming

Steps: 1. Extend org.apache.hadoop.hive.ql.exec.UDF. 2. Implement an evaluate method; evaluate can be overloaded.

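import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // Hive calls evaluate() once per row; a null input yields a null output.
    public String evaluate(String field) {
        if (field == null) {
            return null;
        }
        return field.toLowerCase();
    }
}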

Note: the evaluate method of a UDF must declare a return type. It may return null, but the return type cannot be void.

Usage:
add jar 'filepath';
create temporary function my_low as 'fully qualified class name';
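
Put together, a session might look like the sketch below. The jar path and the package name `com.example.hive` are hypothetical; substitute wherever the compiled `ToLowerCase` class actually lives.

```sql
-- Assumption: hypothetical jar path on the machine running the Hive client.
add jar /home/hadoop/jars/hive-udf.jar;

-- Assumption: hypothetical package name for the ToLowerCase class.
create temporary function my_low as 'com.example.hive.ToLowerCase';

-- Call the function like any built-in.
select ename, my_low(ename) from log limit 5;
```

A temporary function exists only for the current session, so both statements have to be run again in each new session.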