hive高级操作

最新推荐文章于 2024-05-31 08:48:28 发布

EsmeZhao

最新推荐文章于 2024-05-31 08:48:28 发布

阅读量4.3k

点赞数 5

分类专栏：数据仓库hive 文章标签： hive 数据仓库

本文链接：https://blog.csdn.net/m0_37658639/article/details/121149494

版权

数据仓库hive 专栏收录该内容

6 篇文章 5 订阅

订阅专栏

1.数据准备

创建数据库retail_db，在该数据库中创建多张表格

hive (default)>create database retail_db;
hive (default)>use retail_db;

（1）创建customers顾客表，并插入数据。

customer_id	顾客编号	int
customer_fname	顾客名字	string
customer_lname	顾客姓氏	string
customer_email	邮箱	string
customer_password	密码	string
customer_street	街道	string
customer_city	城市	string
customer_ state	州	string
customer_zipcode	邮编	string

hive(retail_db)> create table if not exists customers(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ("separatorChar"=",");
hive(retail_db)>load data local inpath ‘/export/data/retail/customers.csv’ into table customers;

（2）创建orders订单表，并插入数据。

order_id	订单编号	int
order_date	订单日期	date
order_customer_id	顾客编号	int
order_status	订单状态	string

hive(retail_db)>create table if not exists orders(order_id int, order_date date, order_customer_id int,order_status string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ("separatorChar"=",");
hive(retail_db)>load data local inpath ‘/export/data/retail/orders.csv’ into table orders;

（3）创建order_items订单明细表，并插入数据。

order_item_id	订单明细编号	int
order_item_order_id	订单编号	int
order_item_product_id	商品编号	int
order_item_quantity	数量	int
order_item_subtotal	总价	float
order_item_product_price	商品单价	float

hive(retail_db)>create table if not exists order_items (order_item_id int,order_item_order_id int, order_item_product_id int,  
order_item_quantity int,order_item_subtotal float,order_item_product_price float) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ("separatorChar"=",");
hive(retail_db)>load data local inpath ‘/export/data/retail/order_items.txt’ into table order_items;

（4）创建products商品表，并插入数据。

product_id	商品编号	int
product_category_id	商品分类编号	int
product_name	商品名字	string
product_description	商品描述	string
product_price float	商品单价	float
product_image	商品照片	string

hive(retail_db)>create table if not exists products(product_id int, product_category_id int,product_name string,product_description string, product_price float,product_image string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ("separatorChar"=",");
hive(retail_db)>load data local inpath ‘/export/data/retail/products.txt’ into table products;

（5）创建categories商品表，并插入数据。

category_id	商品分类编号	int
category_department_id	部门编号	int
category_name	商品分类名字	string

hive(retail_db)>create table if not exists categories(category_id int, category_department_id int,category_name string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ("separatorChar"=",");
hive(retail_db)>load data local inpath ‘/export/data/retail/categories.txt’ into table categories;

（6）创建departments部门表，并插入数据。

department_id	部门编号	int
department_name	部门名字	string

hive(retail_db)>create table if not exists departments(department_id int,department_name string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties("separatorChar"=",");
hive(retail_db)>load data local inpath ‘/export/data/retail/departments.txt’ into table departments;

2.查询

(1) 查询所在州为“NY”,所在城市为“New York”的用户

hive (retail_db)> select * from customers where customer_state = 'NY' and customer_city = 'New York';

(2) 查询出订单表中共有多少不同顾客下过单

hive (retail_db)> select distinct order_customer_id from orders;

(3) 返回商品列中的前5个商品

hive (retail_db)> select * from products limit 5;

3.连接

(1) 获取每个订单对应的购买商品的列表

hive (retail_db)> select order_id,order_item_product_id from orders join order_items on order_id= order_item_order_id;

(2) 获取没有下订单的所有顾客信息

hive (retail_db)> select customer_id,order_id from customers left join orders on customer_id = order_customer_id where order_customer_id is null;

4.排序

(1) order by 全局排序

统计order_items表中销量最多前10单

hive (retail_db)>  select * from order_items order by order_item_quantity desc limit 10;

(2) sort by 局部有序

查看商品分类表，结果按分类名称升序排列

hive (retail_db)> set mapred.reduce.tasks=2;
hive (retail_db)> select * from categories sort by category_name;

(3) distribute by 控制Map输出数据在Reducer中的划分

查看商品分类表，结果按分类名称升序排列，要求同一大类的商品分类在同一个Reducer中处理，distribute by必须在sort by之前

查看商品分类表，结果按分类名称升序排列,同一大类的商品在一个Reducer内

hive (retail_db)> set mapred.reduce.tasks=2;
hive (retail_db)> select * from categories distribute by category_department_id sort by category_name;

(4) cluster by 当sort by 与distribute by处理对象是同一个字段可以使用cluster by

5.分组

注意：使用分组，select后面字段只能跟着聚合函数和分组字段，否则报错。

统计订单明细表中销量排名前10的商品

hive (retail_db)> select order_item_product_id,sum(order_item_quantity) as total from order_items group by order_item_product_id order by total desc limit 10;

having 与 where 不同点

（1）where 针对表中的列发挥作用，查询数据；having 针对查询结果中的列发挥作用，

筛选数据。

（2）where 后面不能写分组函数，而 having 后面可以使用分组函数。

（3）having 只用于 group by 分组统计语句

统计订单明细表中销量不超过1000的商品的top10

hive (retail_db)> select order_item_product_id,sum(order_item_quantity) as total from order_items group by order_item_product_id having total <= 1000 order by total desc limit 10;

6.内置函数

(1) 查看函数

hive (retail_db)> show functions;
hive (retail_db)>desc function extended abs;

(2) 字符函数

返回值	函数	说明
string	concat(a, b,...)	它返回从a后串联b产生的字符串
string	concat_ws(sep,a,b,...)	与concat类似，但第一个参数为指定的连接符
string	concat_ws(sep,array<string>)	与concat_ws类似，但拼接的是指定array中的元素
array<>	split(str,pat)	按正则表达式pat来分割字符串str，并将分割后的结果以数组返回
map<string,string>	str_to_map(text[,deli1,deli2])	将字符串转换为map，第一个参数是需要转换的字符串，第二个参数是元素之间分隔符，默认”,”，第三个参数是键值分隔符默认“=”
int	instr(str,substr)	查找字符串str中子字符串substr出现的位置，如果查找失败返回0
string	substr(a,start[,len])	对字符串a从start位置开始截取长度为len的字符串并返回
int	locate(substr,str[,pos])	查找字符串str中pos位置后字符串substr第一次出现的位置
int	length(a)	返回字符串长度
string	lower(a)	将字符串中所有字母转换成小写
string	upper(a)	将字符串中所有字母转换成大写

hive (retail_db)> select customer_fname,customer_lname from customers limit 10;
hive (retail_db)> select concat(customer_fname, customer_lname) from customers limit 10;
hive (retail_db)> select concat_ws('-',customer_fname, customer_lname) from customers limit 10;
hive (retail_db)> select concat_ws('-',array('hadoop', 'hive'));
hive (retail_db)> select split(concat_ws('-',array('hadoop', 'hive') ), '-');
hive (retail_db)>select str_to_map('hadoop=1,hive=2', ',', '=');
hive (retail_db)> select instr('hadoop-hive','d');
hive (retail_db)> select substr(' hadoop-hive',1,6);
hive (retail_db)> select locate('h','hadoop-hive');
hive (retail_db)> select locate('h','hadoop-hive',5);
hive (retail_db)> select concat_ws(customer_fname, customer_lname),
length(concat_ws(customer_fname, customer_lname)) from customers limit 10;
hive (retail_db)> select lower(upper(concat_ws(customer_fname, customer_lname))) from customers limit 10;
hive (retail_db)> select upper(concat_ws(customer_fname, customer_lname)) from customers limit 10;

(3) 聚合函数

返回值	函数	说明
T	max(col)	返回col中最大值
T	min(col)	返回col中最小值
T	avg(col)	返回col的平均值
int	count(col)	返回col中数据个数
T	sum(col)	返回col的和
array<>	collect_set(col)	返回消除了重复元素的数组
array<>	collect_list(col)	返回允许存在重复元素的数组

统计2号顾客下过多少订单

hive (retail_db)> select count(order_id) from orders where order_customer_id =2;

统计502商品的销售额

hive (retail_db)> select sum(order_item_subtotal) from order_items where order_item_product_id = 502;

(4) 类型转换函数

返回值	函数	说明
binary	binary(string\|binary)	将输入的值转换成二进制
<type>	cast(expr as <type>)	将expr转换成type类型，转换失败返回NULL

hive (retail_db)> select bin(8);
hive (retail_db)>select product_price,cast(product_price as int) from products limit 5;

(5) 数学函数

返回值	函数	说明
double	round(a)	返回对a四舍五入的值，小数位为0
double	round(a,d)	返回对a四舍五入的值，并且保留d位小数
bigint	floor(a)	向下取整，如floor(3.84) = 3

hive (retail_db)> select order_item_subtotal,round(order_item_subtotal) from order_items limit 5;
hive (retail_db)> select order_item_subtotal,floor(order_item_subtotal) from order_items limit 5;
hive (retail_db)> select round(rand());

(6) 日期函数

返回值	函数	说明
string	from_unixtime(unixtime[,format])	将时间的秒值转换成format格式，如”yyyy-MM-dd hh:mm:ss”
bigint	unix_timestamp()	获取当前本地时区下的时间戳
bigint	unix_timestamp(date)	将”yyyy-MM-dd hh:mm:ss”格式的时间字符串转为时间戳
string	to_date(timestamp)	返回时间字符串的日期部分
int	year(date)	返回时间字符串的年份部分
int	month(date)	返回时间字符串的月份部分
int	day(date)	返回时间字符串的天
int	hour(date)	返回时间字符串的小时
int	minute(date)	返回时间字符串的分钟
int	second(date)	返回时间字符串的秒钟
int	datediff(startdate,enddate)	计算开始时间到结束时间的天数
string	date_add(startdate,days)	从startdate时间开始加上days天
string	date_sub(startdate,days)	从startdate时间开始减去days天

hive (retail_db)> select unix_timestamp();
hive (retail_db)> select unix_timestamp("2020-11-22 20:49:00");
hive (retail_db)> select from_unixtime(1606096140);
hive (retail_db)> select from_unixtime(1606096140,"yyyy-MM-dd");
hive (retail_db)> select to_date("2020-11-22 20:49:00");
hive (retail_db)> select year("2020-11-22 20:49:00");
hive (retail_db)> select datediff("2020-03-01","2020-01-01");
hive (retail_db)> select date_add("2020-11-22",10);
hive (retail_db)> select date_sub("2020-11-22",10);

(7) 条件函数

返回值	函数	说明
T	if(testCondition,valueTrue,valueFalseOrNull)	如果testCondition为True，返回valueTrue，否则返回valueFalse
Boolean	isnull(a)	如果a为Null返回True，否则返回False
boolean	isnotnull(a)	如果a不为Null返回True，否则返回False
T	nvl(value,defaultValue)	如果value为Null，返回defaultValue，否则返回 value
T	coalesce(v1,v2,…)	返回第一个非Null的值，如coalesce(NULL,1,2)=1
T	case a when b then c [when d then e] [else f] end	当a=b时返回c，当a=d时返回e，否则返回f
T	case when a then b [when c then d] [else e] end	当a=true返回b，当c=true返回d否则返回e

hive (retail_db)> select isnull(NULL); 
hive (retail_db)> select isnotnull(NULL);
hive (retail_db)> select isnull("NULL");
hive (retail_db)> select nvl(order_status,'bla') from orders limit 5;
hive (retail_db)> select coalesce(null,1,2);
hive (retail_db)> select count(*) from products where if(product_price > 200,true,false);
hive (retail_db)> select product_price, case when product_price<100 then 1 when product_price between 100 and 200 then 2 else 3 end as level from products;
hive (retail_db)> select level,count(*) from (select *, case when product_price<100 then 1 when product_price between 100 and 200 then 2 else 3 end as level from products) as a group by level;

(8) 集合函数

返回值	函数	说明
int	size(map<k,v>)	返回map中键值对个数
int	size(array<T>)	返回数组长度
array<K>	map_keys(map<k,v>)	返回map中所有key
array<V>	map_values(map<k,v>)	返回map中所有value
boolean	array_contains(array<T>，value)	查询array中是否包含value
array<T>	sort_array(array<T>)	对数组进行排序

hive (retail_db)>select size(map('a','b','c','d'));
hive (retail_db)>select map_keys(map('a','b','c','d'));
hive (retail_db)> select map_values(map('a','b','c','d'));
hive (retail_db)>select array_contains(array('a','b','c','d'),'b');
hive (retail_db)> select sort_array(array('a','f','b','e','c','d'));

(9) 表生成函数

返回值	函数	说明
N rows	explode(array<T>)	对array中每个元素生成一行且包含该元素
N rows	explode(map<k,v>)	对map中的每个键值对生成一行，其中一个字段是键，另一个是值
N rows	posexplode(array<T>)	类似于explode()，但额外返回一列包含了元素所在位置
N rows	stack(n,v1,…,vk)	将k列转换为n行，每行有k/n个字段，n必须是常数
tuple	json_tuple(jsonStr,k1,k2,…)	从JSON字符串中获取多个键并做一个元组返回
tuple	parse_url_tuple(url,p1,p2,…)	将array中的结构体元素提取成每一行

hive (retail_db)> select explode(array('Apple','Orange','Mongo'));
hive (retail_db)> select explode(map('A','Apple','O','Orange'));
hive (retail_db)> select posexplode(array('Apple','Orange','Mongo'));
hive (retail_db)> select stack(1,'a','b','c','d');
hive (retail_db)> select stack(2,'a','b','c','d');
hive (retail_db)> select stack(4,'a','b','c','d');
hive (retail_db)> select json_tuple('{"name":"Jason","age":"18"}','name','age');
hive (retail_db)> select parse_url('https://www.baidu.com','HOST');

EsmeZhao

关注

5
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
hive高级操作

1.数据准备创建数据库retail_db，在该数据库中创建多张表格hive (default)>create database retail_db;hive (default)>use retail_db;（1）创建customers顾客表，并插入数据。 customer_id 顾客编号 int customer_fname 顾客名字 ...
复制链接

扫一扫

专栏目录