hive全操作

最新推荐文章于 2024-08-02 00:27:30 发布

Lone--Wolf

最新推荐文章于 2024-08-02 00:27:30 发布

阅读量304

点赞数 1

分类专栏：技术分享知识分享文章标签：数据库 hive

本文链接：https://blog.csdn.net/qq_39140816/article/details/105241390

版权

技术分享同时被 2 个专栏收录

4 篇文章 0 订阅

订阅专栏

知识分享

2 篇文章 0 订阅

订阅专栏

本文详细介绍Hive SQL的基础操作，包括数据库和表的创建、管理，数据类型介绍，建表语句详解，以及数据导入、查询、更新和删除等常见操作。此外，还介绍了Hive的高级特性如分区表、分桶表的创建与使用，以及数据导出、表结构复制等实用技巧。

摘要由CSDN通过智能技术生成

//登陆hive WBE页面
sh $HIVE_HOME/bin/hive --serviece hwi

1、建库
create database mydb;
create database if no exists mydb;
create database if no exists mydb location “/aa/bb”;

3、删除数据库
drop database mydb;
drop database if exists mydb;
drop database if exists mydb [restrict|cascade]; 非空库

4、先进入我们要操作的数据库/切换库
use mydb;
查看正在使用的库：select current_database();
查看库信息：desc database;

5、支持的数据类型
string int
tinyint smallint
bigint boolean
double float

6、添加建表语句

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name      -- 建(外部)表,(存在删除表)
  [(col_name data_type [COMMENT col_comment], ...)]     	-- 字段
  [COMMENT table_comment]                             			  -- 表描述
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]    -- 分区
  [CLUSTERED BY (col_name, col_name, ...)                         -- 桶
  [SORTED BY (col_name [ASC|DESC], ...)]				--排序
  [INTO num_buckets BUCKETS]						--分桶个数
  [ROW FORMAT row_format]   						 -- 指定列分隔符
  [STORED AS file_format]   							 -- 加载的数据类型,压缩
  [LOCATION hdfs_path]     							  -- 文件位置

//创建内部表（Managered_Table）
create table mingxing_mng(id int, name string, sex string, age int, department string) row format delimited fields terminated by ',';

//创建外部表（External_Table）
create external table mingxing_ext(id int, name string, sex string, age int, department string) row format delimited fields terminated by ',' location '/home/hadoop/hivedata';
注意：创建外部表的时候指定location的位置必须是目录，不能是单个文件

create external table user_info(
    id int,
    address array<string>,
    score map<string,int>,
    info struct<age:int, sex:string,hobby:string>
)
row format delimited fields terminated by '\t'    // 字段分隔符
collection items terminated by ','                // 集合之间分隔符
map keys terminated by ':'                        // key/value分隔符
lines terminated by '\n';                         // 行分隔符

跟内部表对比：
1、在创建表的时候指定关键字： external
2、一般来说，创建外部表，都需要指定一个外部路径

不管是创建外部表还是内部表，都可以指定数据存储目录
默认的目录：
hdfs://hadoop02:9000/user/hive/warehouse/myhive.db/student/student.txt

//创建分区表
create table mingxing_ptn(id int, name string, sex string, age int, department string) partitioned by (city string) row format delimited fields terminated by ',';
注意：partitioned里的字段不能是表中声明的字段,，必须是一个新字段

//创建分桶表
	create table mingxing_bck(id int, name string, sex string, age int, department string) clustered by(id) sorted by(age desc) into 4 buckets row format delimited fields terminated by ',';
	注意：clustered里的字段必须要是表字段中出现的字段
	分桶字段
	表字段
	分桶字段和排序字段可以不一样，分桶字段和排序字段都必须是表字段中的一部分
	分桶的原理和MapReduce的HashPartitioner的原理一致

7、删除表
drop table mingxing;
drop table if exists mingxing;

8、对表进行重命名
alter table mingxing rename to student;

9、对表的字段进行操作（增加add，删除drop，修改change，替换replace）
增加字段，cascade刷新元数据：
alter table mingxing add columns (province string comment ‘省份’) cascade;
alter table mingxing add columns (province string, salary bigint);

删除字段：
drop（不支持） XXXXX

修改字段：
alter table mingxing change age newage string;				// 修改字段的定义
alter table mingxing change age newage string after id;		// 修改字段的定义 + 顺序
alter table mingxing change age newage string first;			// 修改age字段为第一个字段

替换字段
alter table mingxing replace columns(id int, name string, sex string);		// 替换字段?

10、对表中的分区进行操作
增加分区：
alter table mingxing_ptn add partition(city=‘beijing’);
alter table mingxing_ptn add partition(city=‘beijing’) partition(city=‘tianjin’);

alter table mingxing_ptn add partition(city='beijing', email="abc@163.com");
alter table mingxing_ptn add partition(city='beijing', email="abc@163.com") partition(city='tianjin', email="cba@163.com");

删除分区：
alter table mingxing_ptn drop partition(city='beijing');
alter table mingxing_ptn drop partition(city='beijing'), partition(city='tianjin');

修改分区路径：
alter table mingxing_ptn partition(city="beijing") set location "/mingxing_beijing";

11、查询显示命令
查看库：show databases;
查看表：show tables;
查询建库的详细信息：show create database mydb;
查看建表结构：show create table mingxing_mng;
查看某库中的表：show tables in mydb;
查看以s开头的表：show tables like ‘s*’;
查看内置函数库：show functions;
查看分区：show partitions mingxing_ptn;
查看函数的详细手册：desc function extended concat;
查询库详细信息：desc database [extended] mydb;
查看表的字段：desc mingxing_mng;
查看表的详细信息：desc extended mingxing_mng;
查看表的格式化了之后的详细信息：desc formatted mingxing_mng;

方法1：查看表的字段信息
desc table_name;
方法2：查看表的字段信息及元数据存储路径
desc extended table_name;
方法3：查看表的字段信息及元数据存储路径
desc formatted table_name;

12、load方式导入数据
导入本地相对路径的数据
load data local inpath ‘./student.txt’ into table mingxing;
load data local inpath ‘./student.txt’ overwrite into table mingxing;
(覆盖导入)

导入本地绝对路径数据：
load data local inpath '/home/hadoop/hivedata/student.txt' into table mingxing;

导入HDFS上的简便路径数据：
load data inpath '/home/hadoop/hivedata/student.txt' into table mingxing;

导入HDFS上的全路径模式下的数据：
load data inpath 'hdfs://hadoop01:9000/home/hadoop/hivedata/student.txt' into table mingxing;

导入本地数据和导入HDFS上的数据的区别：
1、导入HDFS上的数据到hive表，表示截切，移动
2、导入本地数据，相当于复制或者上传

13、利用insert关键字往表中插入数据
单条数据插入：
insert into table mingxing values(001,‘huangbo’,‘male’,50,‘MA’);

单重插入模式： insert ... select ....
insert into table student select id,name,sex,age,department from mingxing;
注意：查询出的字段必须是student表中存在的字段

多重插入模式：
from mingxing
insert into table student1 select id,name,sex,age
insert into table student2 select id,department;

from mingxing2
insert into table student1 partition(department='MA') select id,name,sex ,age where department='MA'
insert into table student1 partition(department='CS') select id,name,sex ,age where department='CS'; 

静态分区插入：
需要手动的创建分区
alter table student add partition (city="zhengzhou")
load data local inpath '/root/hivedata/student.txt' into table student partition(city='zhengzhou');

动态分区插入： 
打开动态分区的开关：set hive.exec.dynamic.partition = true;
设置动态分区插入模式：set hive.exec.dynamic.partition.mode = nonstrict

create table student(name string, department string) partitioned by (id int) .....
insert into table student partition(id) select name,department,id from mingxing2;
student表字段：name,department， 分区字段是id
查询字段是：name,department,id，分区字段
注意：动态分区插入的分区字段必须是查询语句当中出现的字段中的最后一个

CTAS(create table ... as select ...)(直接把查询出来的结果存储到新建的一张表里)
create table student as select id,name,age,department from mingxing;
注意：自动新建的表中的字段和查询语句出现的字段的名称，类型，注释一模一样

限制：
1、不能创建外部表
2、不能创建分区表
3、不能创建分桶表

分桶插入：

创建分桶表：
create table mingxing(id int, name string, sex string, age int, department string)
clustered by(id) sorted by(age desc) into 4 buckets
row format delimited fields terminated by ',';

插入数据：
insert into table mingxing select id,name,sex,age,department from mingxing2
distribute by id sort by age desc;
注意：查询语句中的分桶信息必须和分桶表中的信息一致

14、like关键字使用：复制表结构
create table student like mingxing;

15、利用insert导出数据到本地或者hdfs
单模式导出数据到本地：
insert overwrite local directory ‘/root/outputdata’ select id,name,sex,age,department from mingxing;

多模式导出数据到本地：
from mingxing
insert overwrite local directory '/root/outputdata1' select id, name
insert overwrite local directory '/root/outputdata2' select id, name,age

简便路径模式导出到hdfs：
insert overwrite directory '/root/outputdata' select id,name,sex,age,department from mingxing;

全路径模式查询数据到hdfs：
insert overwrite directory 'hdfs://hadoop01:9000/root/outputdata1' select id,name,sex,age,department from mingxing;

local ：导出到本地目录
overwrite ：表示覆盖

16、清空数据库表中的数据
truncate table mingxing2;

17、select查询
order by : 全局排序
如果一个HQL语句当中设置了order by，那么最终在HQL语句执行过程中设置的
set mapreduce.job.reduces = 4 不起作用。！！
对输出做全局排序，只有一个reducer，会导致当输入规模较大时，消耗计算时间长。

sort by ：局部排序
一般来说，要搭配 分桶操作使用
distribute by id sort by age desc;

distribute by : 纯粹就是分桶
在使用distribute by的时候：要设置reduceTask的个数

通过set mapred.reduce.tasks=n来指定

cluster by ： 既分桶，也排序，默认倒序，不能指定排序规则。
cluster by age = distribute by age sort by age;
distribute by age sort by age,id != cluster by age sort by id

cluster by 和 sort by 不能同时使用

sql判断字段是否为中文、字母、数字
ascii (col)
数字：48 - 57
字母：65 - 123
中文：123 +
select * from table_name where ascii(game_server) > 123 //判断中文
select * from table_name where ascii(role_id) between 48 and 57 //判断数字

18、join查询
限制:
支持等值连接，不支持非等值连接
支持 and 操作，不支持 or
支持超过2个表的连接

经验：
	当出现多个表进行连接时，最好把小表放置在前面！！ 把大表放置在最后

join分类；
	inner join		内连接
	left outer join	左外连接
	right outer join	右外连接
	full outer join	全连接
	left semi join	左半连接

cross join 笛卡尔积
它是in、exists的高效实现

	select a.* from a left semi join b on (a.id = b.id)
	等价于：
	select a.* from a where a.id in (select b.id from b);

19、常用输入和压缩建表语句
Text 普通文本：

create table `cp_api_log`(`line` string)
partitioned by (`date` string)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as textfile
 
//-- 或 stored as 直接指定输入输出格式
//-- stored as 
//-- inputformat 'org.apache.hadoop.mapred.TextInputFormat'
//-- outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

ORC / parquet 列式存储：
create external table `cp_api_log`(`line` string)
partitioned by (`date` string)
row format delimited fields terminated by '\t'
stored as orc
//-- stored as parquet
location 'hdfs://hdfs://BigdataCluster/user/hive/warehouse/sm_data.db/ods/ods_cp_api'

ORC + Snappy：
create external table `cp_api_log`(`line` string)
partitioned by (`date` string)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="SNAPPY")
location 'hdfs://hdfs://BigdataCluster/user/hive/warehouse/sm_data.db/ods/ods_cp_api'

根据已有Orc表创建 Orc + Snappy 表：
create table 压缩表 like 原表 tblproperties("orc.compress"="snappy");
insert overwrite table 压缩表 select * from 原表; 

根据非压缩表创建 Orc + Snappy 表并导入数据：
create table 压缩表
stored as orc tblproperties("orc.compression"="snappy")
as select * from 原表;

Parquet  + Snappy：
create external table `cp_api_log`(`line` string)
partitioned by (`date` string)
row format delimited fields terminated by '\t'
stored as parquet tblproperties("parquet.compress"="SNAPPY")
location 'hdfs://hdfs://BigdataCluster/user/hive/warehouse/sm_data.db/ods/ods_cp_api'

JSON 类型输入 + Snappy压缩：
create external table `cp_api_log`(`line` string)
partitioned by (`date` string)
row format serde 'com.cloudera.hive.serde.JSONSerDe' 
stored as orc tblproperties ("orc.compress"="SNAPPY")
location 'hdfs://hdfs://BigdataCluster/user/hive/warehouse/sm_data.db/ods/ods_cp_api'

LZO：
create external table `cp_api_log`(`line` string)
partitioned by (`date` string)
stored as INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
location 'hdfs://hdfs://BigdataCluster/user/hive/warehouse/sm_data.db/ods/ods_cp_api'

20、内置函数：

一、关系运算：	二、数学运算：
等值比较: =	加法操作: +
等值比较:<=>	减法操作: –
小于比较: <	乘法操作: *
小于等于比较: <=	除法操作: /
区间比较	取余操作: %
空值判断:IS NULL	位与操作: &
非空判断:IS NOT NULL	位或操作:
LIKE 比较:LIKE	位异或操作: ^
JAVA的LIKE :RLIKE	位取反操作: ~
REGEXP 操作:REGEXP	四、复合类型构造函数：
三、逻辑运算：	map 结构:map(‘a’,1,‘b’,2) ==>{“a”:1,“b”;2}
逻辑与操作:AND &&	struct 结构
逻辑或操作:OR 、	named_struct 结构
逻辑非操作:NOT、!	array 结构
五、复合类型操作符：	create_union
获取array中的元素	六、数值计算函数：
获取map中的元素	四拾伍入函数: round
获取struct中的元素	向下取整函数: floor
七、集合操作函数	向上取整函数: ceil 、ceiling
map 类型大小：size	取随机数函数: rand
array 类型大小：size	自然指数函数: exp
数组是否包含元素:array_contains	以10为底对数函数:log10
获取 map 中所有 value 集合	以2为底对数函数:log2
数组排序	对数函数: log
八、类型转换函数	幂运算函数: pow
二进制转换：binary	幂运算函数: power
基础类型之间强制转换：cast(id as string)	开平方函数: sqrt
九、日期函数	二进制函数: bin
UNIX 时间戳转日期:from_unixtime	十六进制函数: hex
获取当前UNIX时间戳:unix_timestamp	反转十六进制函数: unhex
字符串转日期函数: to_date	进制转换函数: conv
日期转年函数: year	绝对值函数: abs
日期转月函数: month	正取余函数: pmod
日期转天函数: day	正弦函数: sin
日期转小时函数: hour	十、条件函数
日期转分钟函数: minute	If 函数: if(条件，结果，结果)
日期转秒函数: second	非空查找函数: COALESCE (null,“a”,“b”)
日期转周函数: weekofyr	条件判断函数：CASE when then end
日期比较函数: datediff	十一、字符串函数
日期增加函数: date_add	字符 ascii 码函数:ascii
日期减少函数: date_sub	base64 字符串
今天是2019年11月20日 date_sub(curdate(),interval 0 day) 表示 2019-11-20	字符串连接函数：concat
date_sub(curdate(),interval 1 day) 表示 2019-11-19	带分隔符字符串连接函数:concat_ws
date_sub(curdate(),interval -1 day) 表示 2019-11-21	数组转换成字符串的函数：concat_ws
date_sub(curdate(),interval 1 month) 表示 2019-10-20	小数格式化成字符串函数:format_number
date_sub(curdate(),interval -1 month) 表示 2019-12-20	字符串截取函数：substr,substring
date_sub(curdate(),interval 1 year) 表示 2018-11-20	字符串截取函数：substr,substring
date_sub(curdate(),interval -1 year) 表示 2020-11-20	字符串查找函数：instr
**十二、混合函数	字符串长度函数：length**
调用 Java 函数： java_method	字符串查找函数：locate
调用 Java 函数：reflect	字符串格式化函数：printf
字符串的 hash 值：hash	字符串转换成 map 函数：str_to_map
十三、XPath 解析 XML 函数	base64 解码函数：unbase64(string str)
xpath	字符串转大写函数：upper,ucase
xpath_string	字符串转小写函数：lower,lcase
xpath_boolean	去空格函数：trim
xpath_short, xpath_int,xpath_long	左边去空格函数：ltrim
xpath_float, xpath_double,xpath_number	正则表达式替换函数：regexp_replace(字段名,被替换,替换为)
十四、汇总统计函数（UDAF）	正则表达式解析函数：regexp_extract
个数统计函数: count	URL 解析函数：parse_url
总和统计函数: sum	json 解析函数：get_json_object
平均值统计函数: avg	空格字符串函数：space
最小值统计函数: min	重复字符串函数：repeat
最大值统计函数: max	左补足函数：lpad
非空集合总体变量函数: var_pop	右补足函数：rpad
非空集合样本变量函数: var_samp	分割字符串函数: split
总体标准偏离函数:stddev_pop	集合查找函数: find_in_set
样本标准偏离函数: stddev_samp	分词函数：sentences
中位数函数: percentile	分词后统计一起出现频次最高的 TOP-K
中位数函数: percentile	分词后统计与指定单词一起出现频次最高的TOP-K
近似中位数函数: percentile_approx	十五、表格生成函数 Table-Generating Functions (UDTF)
近似中位数函数: percentile_approx	数组拆分成多行：explode(array)
直方图: histogram_numeric	Map 拆分成多行：explode(map)
集合去重：collect_set	十六、分组排序函数
集合不去重：collect_list	row_number() over (partition by 字段a order by 计算项b desc ) rank
十七、开窗函数over(partition by…)	补充
over(partition by …)主要和聚合函数sum()、count()、max()、avg()等结合使用，实现分组聚合的功能	nvl(expr1, expr2)函数1、如果expr1为NULL，返回值为 expr2，否则返回expr1。2、适用于数字型、字符型和日期型，但是 expr1和expr2的数据类型必须为同类型。
from_utc_timestamp(‘1970-01-01 00:00:00’,“GMT+8”) 把UTC标准时间切换到北京时间