Hive (database-related)

Latest recommended article published on 2024-06-14 15:00:00

2401_85192466

Article tags: hive, database, hadoop

Article link: https://blog.csdn.net/2401_85192466/article/details/139187925

location: specifies the storage path of the table's files

Getting started with creating tables:

use myhive;

create table stu(id int, name string);

insert into stu values(1, 'zhangsan'); -- insert a row

select * from stu;

Create a table and specify the field delimiter

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t';

Create a table and specify where its files are stored

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' location '/user/stu2';

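Create a table from a query result

create table stu3 as select * from stu2; -- the new table copies both the structure and the data

Create a table from the structure of an existing table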

create table stu4 like stu;

Show detailed information about a table

desc formatted stu2;

Drop a table

drop table stu2;

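Working with external tables

About external tables

An external table loads data from an HDFS path that is specified elsewhere, so Hive does not consider itself the sole owner of that data. When the Hive table is dropped, the data stays in HDFS and is not deleted.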
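
The behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: the table name teacher_ext and the directory /hivedatas/teacher_ext are hypothetical, and the dfs command assumes the statements are run from the Hive CLI.

```sql
-- Hypothetical external table mapped onto an existing HDFS directory.
create external table if not exists teacher_ext (t_id string, t_name string)
row format delimited fields terminated by '\t'
location '/hivedatas/teacher_ext';

-- Dropping an external table removes only its metadata from the metastore.
drop table teacher_ext;

-- The data files under the (assumed) location are still in HDFS.
dfs -ls /hivedatas/teacher_ext;
```

Running the same sequence against an internal (managed) table would delete the table directory along with the metadata.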

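When to use internal and external tables

Website logs are collected every day and flow into HDFS text files on a regular schedule. Heavy statistical analysis is done on top of an external table (the raw log table); the intermediate and result tables are stored as internal tables, and data enters them through SELECT + INSERT.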
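
A rough sketch of that pipeline might look like the following. All table names, column names, and the /logs/raw path are assumptions made for illustration; the original does not define them.

```sql
-- Raw logs land in HDFS; the external table only maps onto those files.
create external table if not exists raw_log (ip string, url string, visit_time string)
row format delimited fields terminated by '\t'
location '/logs/raw';

-- Result table managed by Hive (an internal table).
create table if not exists url_pv (url string, pv bigint);

-- SELECT + INSERT moves the aggregated data into the internal table.
insert overwrite table url_pv
select url, count(*) as pv
from raw_log
group by url;
```

Dropping url_pv later removes its data as well, while dropping raw_log leaves the collected log files untouched.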

Example

Create external tables for teachers and students, and load data into them

Create the teacher table

create external table teacher (t_id string, t_name string) row format delimited fields terminated by '\t';

Create the student table

create external table student(s_id string, s_name string, s_birth string) row format delimited fields terminated by '\t';

Load data

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Load data and overwrite the existing data

load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

Load data into a table from HDFS (the data must be uploaded to HDFS first)

cd /export/servers/hivedatas

hdfs dfs -mkdir -p /hivedatas

hdfs dfs -put teacher.csv /hivedatas/

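load data inpath '/hivedatas/teacher.csv' into table teacher;

Working with partitioned tables

One of the most common ideas in big data is divide and conquer: a large file is split into many small files, so each operation only has to touch a small one. Hive supports the same idea. Large data sets can be split by month or by day into smaller files that are stored in separate directories.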

Syntax for creating a partitioned table

create table score(s_id string, c_id string, s_score int) partitioned by(month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

create table score2(s_id string, c_id string, s_score int) partitioned by(year string, month string, day string) row format delimited fields terminated by '\t';

Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition(month='201806');
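
To make the effect of partitioning concrete, here is a short sketch that uses the score and score2 tables defined above. The partition values given for score2 are illustrative assumptions, not values from the original.

```sql
-- Each partition value becomes a subdirectory such as month=201806 under the
-- table directory. Filtering on the partition column lets Hive read only
-- that subdirectory instead of scanning the whole table.
select s_id, c_id, s_score from score where month = '201806';

-- For the three-level table score2, every partition column needs a value.
-- The values below are examples only.
load data local inpath '/export/servers/hivedatas/score.csv'
into table score2 partition(year = '2018', month = '06', day = '01');
```

The partition column does not exist inside the data file; its value comes from the directory name, which is why it can be used in a where clause like an ordinary column.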

Query several partitions together (using union all)

select * from score where month = '201806' union all select * from score where month = '201806';

Show partitions

show partitions score; -- score is the table name

Add a partition

alter table score add partition(month = '201805');

Drop a partition

alter table score drop partition(month = '201806');

Repair the table (build the mapping between the table and its data files)

msck repair table score4;
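
The text does not show how score4 is defined, so the following is only a sketch of a typical situation where the repair is needed. The table definition and the /scoredatas path are assumptions.

```sql
-- Assumed definition: an external partitioned table over an HDFS directory.
create external table score4 (s_id string, c_id string, s_score int)
partitioned by (month string)
row format delimited fields terminated by '\t'
location '/scoredatas';

-- Suppose files were uploaded outside Hive into /scoredatas/month=201806/.
-- The metastore has no record of that partition yet, so queries see no rows.
show partitions score4;

-- Scan the table location and register directories named month=<value>.
msck repair table score4;
show partitions score4;
```

Adding the partition by hand with alter table ... add partition achieves the same thing for a single directory; msck repair is convenient when many directories were added at once.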

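Working with bucketed tables

Bucketing splits the data into multiple files according to a specified column. It plays the same role as partitioning in MapReduce.

Enable bucketing in Hive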

set hive.enforce.bucketing=true

Set the number of reducers

set mapreduce.job.reduces=3

Create a bucketed table

create table course(c_id string, c_name string, t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

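Loading data into a bucketed table: hdfs dfs -put and load data do not distribute rows into buckets, so the data has to be loaded with insert overwrite.

Create a regular table, then use insert overwrite with a query to load its data into the bucketed table.

Create the regular table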

create table course_common(c_id string, c_name string, t_id string) row format delimited fields terminated by ‘\t’;

Load data into the regular table

load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

Load data into the bucketed table with insert overwrite

insert overwrite table course select * from course_common cluster by(c_id);
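
One way to check the result is to sample a single bucket. The sketch below assumes the course table created above with 3 buckets on c_id.

```sql
-- Rows are assigned to a bucket by hashing c_id modulo the bucket count,
-- so after the insert the table directory should hold three files.
-- Read only the first of the three buckets:
select * from course tablesample(bucket 1 out of 3 on c_id);
```

Because the sample is taken on the same column the table is clustered by, Hive can read just the matching bucket file rather than the whole table.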

Modifying table structure

Rename:

alter table old_table_name rename to new_table_name;

Rename table score4 to score5

alter table score4 rename to score5;

Add or modify column information

Show the table structure

desc score5;

Add columns

alter table score5 add columns(mycol string, mysco int);

Change a column

alter table score5 change column mysco mysconew int;

Drop the table

drop table score5;

Hive query syntax

SELECT

SELECT [ALL | DISTINCT] select_expr, select_expr,…

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list [HAVING condition]]

[CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY|ORDER BY col_list]]

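order by sorts the input globally, so it runs with a single reducer; with large inputs this can take a long time.
sort by is not a global sort; it sorts the data before it enters each reducer. If sort by is used with mapred.reduce.tasks > 1, each reducer's output is sorted, but the overall result is not guaranteed to be globally ordered.
distribute by (column) sends rows to different reducers according to the specified column, using hash distribution.
cluster by (column) does what distribute by does and also sorts by that column.

So when the distribute and sort columns are the same, cluster by = distribute by + sort by.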
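
The four clauses can be compared side by side on the score table from the partitioning section. This is an illustrative sketch; the reducer count of 3 is an arbitrary choice.

```sql
-- Global ordering: one reducer sorts the whole result.
select * from score order by s_score desc;

-- Per-reducer ordering: each of the 3 reducers emits sorted output,
-- but the combined result is not globally sorted.
set mapreduce.job.reduces=3;
select * from score sort by s_score desc;

-- Rows with the same s_id go to the same reducer,
-- and are sorted by score within each reducer.
select * from score distribute by s_id sort by s_score desc;

-- Same column for distribution and sorting: these two are equivalent.
select * from score distribute by s_id sort by s_id;
select * from score cluster by s_id;
```

Note that cluster by sorts in ascending order only; when a descending sort or a different sort column is needed, distribute by combined with sort by is the form to use.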