Hive Study Notes, Part 1: Basics

五花尾巴

Tags: hive

Article link: https://blog.csdn.net/NotMeYa/article/details/108852373

Hive queries data stored on HDFS, and it can be installed on any machine (as long as that machine can reach the Hadoop cluster).

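First, set a few basic parameters to make Hive more convenient to use:
1. Show the current database in the prompt: hive> set hive.cli.print.current.db=true;
2. Show column names in query results: hive> set hive.cli.print.header=true;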

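In the current user's home directory on Linux, create a .hiverc file and put the settings in it, so that they are applied every time the Hive CLI starts:
vi .hiverc
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;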
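
As a small sketch of the step above, the same settings can be appended from the shell instead of an editor. The HIVERC variable and the add_setting helper are hypothetical names introduced here; the helper only avoids writing an identical line twice.

```shell
#!/bin/bash
# Sketch: add the two CLI settings to ~/.hiverc without duplicating lines.
# HIVERC is a hypothetical variable; it defaults to the file in the home directory.
HIVERC="${HIVERC:-$HOME/.hiverc}"

# add_setting LINE: append LINE only if an identical line is not already present.
add_setting() {
    local line="$1"
    touch "$HIVERC"
    grep -qxF -- "$line" "$HIVERC" || printf '%s\n' "$line" >> "$HIVERC"
}

add_setting 'set hive.cli.print.header=true;'
add_setting 'set hive.cli.print.current.db=true;'
```

Running the script a second time leaves the file unchanged, which makes it safe to keep in a setup script.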

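With hive-1.2.1:
Start the HiveServer2 service in the background. If netstat -nltp shows a listener on port 10000, the service has started successfully.
Then start beeline and connect to the service:
bin/beeline -u jdbc:hive2://hadoop102:10000 -n root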
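
The notes do not show the start command itself. One common way, assuming the stock bin/hiveserver2 launcher that ships with Hive and the default port 10000, is sketched below. The has_listener helper and the log file name hiveserver2.log are hypothetical.

```shell
#!/bin/bash
# Sketch only: start HiveServer2 in the background (if the launcher exists)
# and check whether something is listening on port 10000.

# has_listener PORT: reads `netstat -nlt`-style output on stdin and succeeds
# if a TCP socket in LISTEN state is bound to PORT. (Hypothetical helper.)
has_listener() {
    awk -v suffix=":$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" &&
        substr($4, length($4) - length(suffix) + 1) == suffix { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Run from the Hive installation directory, like the beeline command above.
if [ -x bin/hiveserver2 ]; then
    nohup bin/hiveserver2 > hiveserver2.log 2>&1 &
fi

# HiveServer2 can take a while to come up, so this check may need repeating.
if netstat -nltp 2>/dev/null | has_listener 10000; then
    echo "port 10000 is listening"
else
    echo "port 10000 is not listening (yet)"
fi
```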

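Using Hive from scripts

e.g. hive -e "select count(1) from default.t_order"
When the statement finishes, control returns to the Linux shell automatically.

Write a script from such statements:
In the hivetest directory, vi etl.sh
#!/bin/bash
hive -e "create table t_count_sex(sex string,number int)"
hive -e "insert into table t_count_sex select sex,count(1) from default.t_order group by sex"
Run the shell script: sh etl.sh
Log in to Hive and run show tables to see the new table.
[If the SQL is complex, put it in a file such as test.hql and run hive -f test.hql]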
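
The file-based variant can be sketched as follows. The file name etl.hql is hypothetical, and default.t_order is assumed to have a sex column as in the script above. The script only runs Hive if a hive command is found on the PATH.

```shell
#!/bin/bash
# Sketch: keep the statements in a file and run them with hive -f.
HQL_FILE="etl.hql"   # hypothetical file name

cat > "$HQL_FILE" <<'EOF'
create table if not exists t_count_sex(sex string, number int);
insert into table t_count_sex
select sex, count(1) from default.t_order group by sex;
EOF

if command -v hive >/dev/null 2>&1; then
    hive -f "$HQL_FILE"
else
    echo "hive not found on PATH; wrote $HQL_FILE only" >&2
fi
```

Compared with several hive -e calls, this starts the Hive client once for all statements.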

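Basic Hive syntax

Create a database:    create database db_order;
After the database is created, a database directory db_order.db is created on HDFS.

Create a table:    use db_order;
                create table t_order(id string,create_time string,amount float,uid string);
After the table is created, a table directory is created inside the database directory.

A proper create-table statement:
create table t_order(id string,create_time string,amount float,uid string) row format delimited fields terminated by ',' ;
This declares that the fields in the table's data files are separated by a comma (','). Without this clause, Hive expects its default field delimiter (the non-printing character \001, Ctrl-A), so a comma-separated file would not be split into columns.

Drop a table:
drop table t_order;
For drop: Hive removes the table's information from the metastore and deletes the table directory from HDFS.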

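Managed (internal) and external tables

Managed table (MANAGED_TABLE): the table directory follows Hive's conventions and lives under Hive's warehouse directory /user/hive/warehouse.
External table (EXTERNAL_TABLE): the table directory is specified by the user who creates the table.
Differences between managed and external tables:
a. The directory of a managed table is inside Hive's warehouse directory; the directory of an external table is specified by the user.
b. When a managed table is dropped, Hive removes the metadata and deletes the table's data directory.
c. When an external table is dropped, Hive removes only the metadata; the data directory is left in place.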
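
To make the difference concrete, here is a hedged sketch of an external table declaration. The table name t_access_ext, its columns, the HDFS path /data/access, and the file name external_table.hql are all made up for illustration. The script writes the statements to a file and runs them only if a hive command is available.

```shell
#!/bin/bash
# Sketch: declare, inspect, and drop an external table.
HQL_FILE="external_table.hql"   # hypothetical file name

cat > "$HQL_FILE" <<'EOF'
-- The table directory is whatever LOCATION names, not a warehouse subdirectory.
create external table if not exists t_access_ext(ip string, url string, access_time string)
row format delimited fields terminated by ','
location '/data/access';

-- The detailed description reports the table type (EXTERNAL_TABLE).
describe formatted t_access_ext;

-- Removes only the metadata; the files under /data/access stay on HDFS.
drop table t_access_ext;
EOF

if command -v hive >/dev/null 2>&1; then
    hive -f "$HQL_FILE"
else
    echo "hive not found on PATH; wrote $HQL_FILE only" >&2
fi
```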

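Partitioned tables

Essence: partition subdirectories are created under the table directory for the data files, so that at query time the MR job only has to process the data in the relevant partition subdirectories, which narrows the range of data read.
For example:
create table t_nbiot_log(ip string,commit_time string) partitioned by (day string) row format delimited fields terminated by ',';

Open a client on the machine where the file is located:

beeline -u jdbc:hive2://hadoop102:10000 -n root
> show databases;
use nbiot;  -- the database
show tables;
load data local inpath '/root/hivetest/nbiot.log.15' into table t_nbiot_log partition(day='20200928');
[Note: a partition column must not be a column that already exists in the table definition]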
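
The load above places the file in a subdirectory named after the partition value. The sketch below only builds that path as a string, assuming the default warehouse root /user/hive/warehouse and the nbiot database from the session; partition_dir is a hypothetical helper, not a Hive command.

```shell
#!/bin/bash
# partition_dir DB TABLE KEY VALUE: print the HDFS directory Hive uses by default
# for one partition of a managed table. (Hypothetical helper, string-building only.)
partition_dir() {
    local warehouse="/user/hive/warehouse"   # assumed default warehouse root
    printf '%s/%s.db/%s/%s=%s\n' "$warehouse" "$1" "$2" "$3" "$4"
}

dir=$(partition_dir nbiot t_nbiot_log day 20200928)
echo "$dir"   # prints /user/hive/warehouse/nbiot.db/t_nbiot_log/day=20200928

# With a reachable cluster, the loaded file could be listed with:
#   hdfs dfs -ls "$dir"
# and a query that filters on the partition column reads only that directory:
#   select count(*) from t_nbiot_log where day='20200928';
```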

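CTAS table creation syntax

Create a table from an existing table:
1. create table t_user_2 like t_user;  The new table has the same structure but no data.
Create a table and insert data at the same time:
2. create table t_access_user as select ip,url from t_access;  The query result is inserted into the new table as it is created.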

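Data import and export

Method 1: use HDFS commands to put the file into the table directory by hand.

Method 2: in Hive's interactive shell, use a Hive command to load local data into the table directory [copy]
hive> load data local inpath '/root/order.data.2' into table t_order;

Method 3: use a Hive command to load a data file that is already on HDFS into the table directory [move]
hive> load data inpath '/access.log.2020-09-28.log' into table t_access partition (dt='20200928');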

Hive query syntax

Basic queries:
select * from t_access;
select count(*) from t_access;
select max(ip) from t_access;

Conditional queries:
select * from t_access where access_time<'2020-09-28 18:30:12';
select * from t_access where access_time<'2020-09-28 19:30:12' and ip>'192.168.00.0';

Join queries:
1. Inner join
e.g. select a.*,b.* from t_a a join t_b b on a.name=b.name;

2. Outer joins
a. Left outer join (left join)
select a.*,b.* from t_a a left outer join t_b b on a.name=b.name;
b. Right outer join (right join)
select a.*,b.* from t_a a right outer join t_b b on a.name=b.name;
c. Full outer join
select a.*,b.* from t_a a full outer join t_b b on a.name=b.name;
d. Left semi join (returns only the rows of t_a that have a match in t_b)
select a.* from t_a a left semi join t_b b on a.name = b.name;

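group by: grouping and aggregation

e.g. select ip,upper(url),access_time from t_nbiot_log;
The expression is evaluated row by row, once per input row.
e.g. select url,count(1) as cnts from t_nbiot_log group by url;
The aggregate is evaluated once per group of rows.
e.g. select ip,url,max(access_time) from t_nbiot_log group by ip,url;
For each user (ip) and page, the latest access time among all of that user's visits to the page.
e.g. select url,max(ip) from t_nbiot_log group by url;
For each URL, the largest ip among its visitors.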

Data types (Hive)

Numeric types:
TINYINT    integer from -128 to 127
SMALLINT    2-byte small integer
INT/INTEGER    integer
BIGINT
FLOAT
DOUBLE

Date/time types:
TIMESTAMP, DATE

String types: STRING   VARCHAR  CHAR

Miscellaneous types: BOOLEAN   BINARY

Complex types: array type, array<string>
e.g. create table t_movie (movie_name string,actors array<string>,first_show date) row format delimited fields terminated by ',' collection items terminated by ':';
Collection elements are separated by ':'.
Queries:
select movie_name,actors[0],first_show from t_movie;
select movie_name,actors,first_show from t_movie where array_contains(actors,'黄渤');  -- movies whose actors include 黄渤
size(actors)  -- number of elements in the array

map type:
map<string,string>
Delimiter clause: map keys terminated by ':';
Queries: family_members["father"]    the value stored under the key "father"
		   map_keys(family_members)    all keys
		   map_values(family_members)    all values
		   size(family_members)    number of key/value pairs
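
A fuller sketch of the map case, under assumptions: the table t_family, its columns, the '#' entry separator, the sample row, and the file names are all invented for illustration. The awk helper map_value mimics, very roughly, how the three delimiters split one line of the data file.

```shell
#!/bin/bash
# Sketch: a table with a map column, plus one row in the matching text format.
cat > family.hql <<'EOF'
create table if not exists t_family(id int, name string, family_members map<string,string>, age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

select name, family_members["father"], size(family_members) from t_family;
EOF

# Fields are split by ',', map entries by '#', and key from value by ':'.
printf '1,zhangsan,father:xiaoming#mother:xiaohuang,28\n' > family.dat

# map_value FILE KEY: print the value stored under KEY in the third field of each row.
map_value() {
    awk -F',' -v key="$2" '{
        n = split($3, entries, "#")
        for (i = 1; i <= n; i++) {
            split(entries[i], kv, ":")
            if (kv[1] == key) print kv[2]
        }
    }' "$1"
}

map_value family.dat father   # prints xiaoming
```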

struct type:
	struct: STRUCT<col_name:data_type,...>
	e.g. info struct<age:int,sex:string,addr:string>
	In queries: info.addr
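
A comparable sketch for struct, again with invented names (table t_person, files person.hql and person.dat, helper struct_field). In the delimited text format the members of a struct are separated by the collection items delimiter, here ':'.

```shell
#!/bin/bash
# Sketch: a table with a struct column, plus one row in the matching text format.
cat > person.hql <<'EOF'
create table if not exists t_person(id int, name string, info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';

select name, info.age, info.addr from t_person;
EOF

# The third field holds age, sex, addr in declaration order.
printf '1,zhangsan,18:male:beijing\n' > person.dat

# struct_field FILE INDEX: print the INDEX-th struct member (1-based) of each row.
struct_field() {
    awk -F',' -v idx="$2" '{ split($3, parts, ":"); print parts[idx] }' "$1"
}

struct_field person.dat 3   # prints beijing, the addr member
```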
