What Hive does:
Hive lets users write SQL; it translates the SQL statements into MapReduce programs, submits them to a YARN cluster, and reads the data stored on HDFS for analysis.
Core:
Hive is built on top of the Hadoop framework.
Hive data warehouse (Warehouse):
- database
a namespace (an HDFS directory) for its tables
- table
each record of the table maps to one line of a data file
- record delimiter in the file, default newline \n
LINES TERMINATED BY '\n'
- field delimiter within a record, default \001 (Ctrl-A); here overridden to a comma
FIELDS TERMINATED BY ','
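The row-to-line mapping can be sketched in a few lines of Python: each file line is one record, and fields are split on the declared delimiter. (The sample line and its values are made up; the column names match the tb_second_price table defined later.)

```python
# One file line = one table row; fields are split on the declared delimiter.
line = "MeiLiYuan,2室1厅,89.5,PuDong,high/18,South,520.0,58100.0,1987年建\n"

FIELD_DELIM = ","                       # FIELDS TERMINATED BY ','
record = line.rstrip("\n")              # LINES TERMINATED BY '\n'
fields = record.split(FIELD_DELIM)

columns = ["village_name", "house_type", "house_area", "region", "house_floor",
           "direction", "total_price", "single_price", "build_date"]
row = dict(zip(columns, fields))
print(row["village_name"], row["build_date"])
```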
Hive 2.x execution engines:
- MapReduce
the original, most basic engine
- Spark
officially recommended
- Tez
from Hortonworks
When running MapReduce programs:
- local mode
Local Mode: all MapTasks and ReduceTasks run inside a single JVM
- cluster mode
each Task runs in its own JVM
To let Hive run jobs in local mode automatically:
set hive.exec.mode.local.auto = true ;
When the input data is smaller than 128MB and there are fewer than 5 MapTasks, the job is automatically run in local mode.
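The 128MB / 5-task thresholds are themselves configurable. A sketch of the relevant properties (the property names are standard Hive settings; the exact defaults can vary by version, so treat the values below as illustrative):

```sql
-- enable automatic local-mode execution
set hive.exec.mode.local.auto = true ;
-- maximum total input size for local mode (134217728 bytes = 128MB)
set hive.exec.mode.local.auto.inputbytes.max = 134217728 ;
-- maximum number of input files (roughly, MapTasks) for local mode
set hive.exec.mode.local.auto.input.files.max = 4 ;
```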
Example: Lianjia second-hand housing listings for nine first-tier cities.
At the end of every month, crawl the site once and store the results in a Hive table.
Second-hand housing analysis:
-1. compared with last month
did listings grow or shrink?
-2. compared with the same month last year
did listings grow or shrink?
Special tables in Hive:
Partitioned table: PARTITION
1. How does a partitioned table organize its data?
Regular table storage:
all of the table's data files sit directly under the table's HDFS directory
db_0509.db/2nd_sh_house/*.data
Partitioned table:
a subdirectory is created per partition value, and the data is split across those directories
db_0509.db/tb_second_price
BJ, April data:
db_0509.db/tb_second_price/month_str=04/city=bj/*.data
SH, April data:
db_0509.db/tb_second_price/month_str=04/city=sh/*.data
2. How does a partitioned table serve queries?
SELECT COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' AND city = 'bj' ;
Partitioned-table summary:
-1. the data is redistributed into separate directories
split by the partition columns
this is visible directly in the HDFS directory structure
-2. creating a partitioned table and loading data
-a. CREATE TABLE(...)
PARTITIONED BY (month_str STRING, city STRING)
each partition column becomes one directory level on HDFS
-b. LOAD DATA ...
PARTITION (month_str='04', city='bj')
for a partitioned table, the partition values must be specified when loading; otherwise the load fails
-3. queries read less data
preferably filter on the partition columns in the WHERE clause, e.g.
WHERE month_str = '04' AND city = 'bj'
In production, queries on partitioned tables are usually required to specify the partition columns; Hive has a property to enforce this:
set hive.mapred.mode = strict ;
<property>
<name>hive.mapred.mode</name>
<value>nonstrict</value>
<description>
The mode in which the Hive operations are being performed.
In strict mode, some risky queries are not allowed to run. They include:
Cartesian Product.
No partition being picked up for a query.
Comparing bigints and strings.
Comparing bigints and doubles.
Orderby without limit.
</description>
</property>
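The directory-per-partition layout is what makes pruning cheap: a WHERE predicate on partition columns is resolved against directory names before any data file is opened. A minimal sketch of the idea in Python (the paths are from the example above; the prune helper is illustrative, not a Hive internal):

```python
# Simulate Hive-style partition pruning over a directory-per-partition layout.
# Partition directories are named key=value, e.g. month_str=04/city=bj.

partitions = [
    "tb_second_price/month_str=03/city=sh",
    "tb_second_price/month_str=04/city=bj",
    "tb_second_price/month_str=04/city=sh",
]

def prune(dirs, **predicates):
    """Keep only the partition directories whose key=value path segments
    satisfy every predicate; all other directories are never scanned."""
    kept = []
    for d in dirs:
        kv = dict(seg.split("=", 1) for seg in d.split("/") if "=" in seg)
        if all(kv.get(k) == v for k, v in predicates.items()):
            kept.append(d)
    return kept

# WHERE month_str = '04' AND city = 'bj'  ->  only one directory is read
print(prune(partitions, month_str="04", city="bj"))
```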
The hard part:
how do you load data into a partitioned table?
Quick review: ways to load data into a Hive table:
-1. LOAD DATA
-a. from the local filesystem (copies the file, analogous to: bin/hdfs dfs -put *.data /...)
-b. from HDFS (moves the file, analogous to: bin/hdfs dfs -mv /.../*.data /...)
-2. put
write the data file directly into the table's HDFS storage directory
Distribution of listings by building age:
year_group buckets:
within 5 years / within 10 years / within 15 years / within 20 years / over 20 years
Approach:
derive the building age from build_date
e.g. '1987年建' (built in 1987): take the year with a substring
use second_house;
SELECT
t.year_group, COUNT(1) AS total
FROM(
SELECT
build_date,
CASE
WHEN (2018 - substring(build_date, 0, 4)) BETWEEN 0 AND 5 THEN '5年以内'
WHEN (2018 - substring(build_date, 0, 4)) BETWEEN 5 AND 10 THEN '10年以内'
WHEN (2018 - substring(build_date, 0, 4)) BETWEEN 10 AND 15 THEN '15年以内'
WHEN (2018 - substring(build_date, 0, 4)) BETWEEN 15 AND 20 THEN '20年以内'
ELSE '20年以上'
END AS year_group
FROM
2nd_sh_house
) t
GROUP BY
t.year_group
ORDER BY
total DESC ;
t.year_group total
20年以上 16056
15年以内 4506
10年以内 3883
20年以内 3117
5年以内 639
Time taken: 12.396 seconds, Fetched: 5 row(s)
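The CASE bucketing above can be mirrored outside Hive. Note that the BETWEEN ranges overlap at their boundaries (an age of exactly 5 matches both the first and second branch); like SQL CASE, the first matching branch wins. A small Python equivalent (bucket labels copied verbatim from the query):

```python
def year_group(build_year, current_year=2018):
    """Bucket a building by age; mirrors the CASE expression in the Hive query.
    Branches are tested in order, so boundary ages (5, 10, 15, 20) land in the
    earlier bucket, exactly as SQL CASE resolves the overlapping BETWEENs."""
    age = current_year - build_year
    if 0 <= age <= 5:
        return '5年以内'
    if 5 <= age <= 10:
        return '10年以内'
    if 10 <= age <= 15:
        return '15年以内'
    if 15 <= age <= 20:
        return '20年以内'
    return '20年以上'

print(year_group(2016))  # age 2  -> '5年以内'
print(year_group(2013))  # age 5  -> boundary, first branch wins: '5年以内'
print(year_group(1987))  # age 31 -> '20年以上'
```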
Viewing functions:
show functions;
desc function substring;
substring(str, pos[, len]) - returns the substring of str that starts at pos and is of length len, or substring(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len
Time taken: 0.058 seconds, Fetched: 1 row(s)
Partitioned table:
based on the Lianjia second-hand housing data
partition columns:
city and month
# partition by month and city
# PARTITIONED BY (month_str STRING COMMENT 'First Partition Month', city STRING COMMENT 'Second Partition city')
# table comment
# COMMENT 'LianJia ErShouFang Price'
CREATE TABLE tb_second_price(
village_name string,
house_type string,
house_area double,
region string,
house_floor string,
direction string,
total_price double,
single_price double,
build_date string
)
COMMENT 'LianJia ErShouFang Price'
PARTITIONED BY (month_str STRING COMMENT 'First Partition Month', city STRING COMMENT 'Second Partition city')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
When loading data into a partitioned table, the partition values must be specified.
-- Syntax:
-- LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
# load April data for BJ and for SH
[rdedu@bigdata-training01 datas]$ cp 2nd_house_price.csv 04bj_2nd_house_price.csv
[rdedu@bigdata-training01 datas]$ cp 2nd_house_price.csv 04sh_2nd_house_price.csv
LOAD DATA LOCAL INPATH '/opt/datas/04bj_2nd_house_price.csv' INTO TABLE tb_second_price PARTITION (month_str='04', city='bj') ;
LOAD DATA LOCAL INPATH '/opt/datas/04sh_2nd_house_price.csv' INTO TABLE tb_second_price PARTITION (month_str='04', city='sh');
Verify:
/user/hive/warehouse/second_house.db/tb_second_price/month_str=04
drwxr-xr-x rdedu supergroup 0 B 2019/9/7 10:42:25 PM 0 0 B city=bj
drwxr-xr-x rdedu supergroup 0 B 2019/9/7 10:42:37 PM 0 0 B city=sh
-- query the partitioned table
SELECT COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' AND city = 'bj' ;
SELECT COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' AND city = 'sh' ;
cnt
28201
Time taken: 35.127 seconds, Fetched: 1 row(s)
-- Requirement 1: compare Shanghai's April data with its March data
SELECT '03' AS mon, COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '03' AND city = 'sh'
UNION ALL
SELECT '04' AS mon, COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' AND city = 'sh' ;
-- Requirement 2: analyze all April data (every city)
SELECT '04' AS mon, COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' ;
-- Requirement 3: analyze all Shanghai data (every month)
SELECT '04' AS mon, COUNT(1) AS cnt FROM tb_second_price WHERE city = 'sh' ;
Load the April data for nj and hf into the partitioned table directly with the put command.
-- 1. create the matching partition directories
bin/hdfs dfs -mkdir -p /user/hive/warehouse/second_house.db/tb_second_price/month_str=04/city=nj ;
bin/hdfs dfs -mkdir -p /user/hive/warehouse/second_house.db/tb_second_price/month_str=04/city=hf ;
-- 2. upload each city's data into its partition directory
[rdedu@bigdata-training01 datas]$ cp 04bj_2nd_house_price.csv 04nj_2nd_house_price.csv
[rdedu@bigdata-training01 datas]$ cp 04bj_2nd_house_price.csv 04hf_2nd_house_price.csv
bin/hdfs dfs -put /opt/datas/04nj_2nd_house_price.csv /user/hive/warehouse/second_house.db/tb_second_price/month_str=04/city=nj ;
bin/hdfs dfs -put /opt/datas/04hf_2nd_house_price.csv /user/hive/warehouse/second_house.db/tb_second_price/month_str=04/city=hf ;
-- 3. register the partitions in the metastore
-- LOAD DATA registers the partition automatically, but data placed with put must be registered via ALTER TABLE ... ADD PARTITION
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
ALTER TABLE second_house.tb_second_price ADD IF NOT EXISTS PARTITION (month_str = '04', city = 'nj');
ALTER TABLE second_house.tb_second_price ADD IF NOT EXISTS PARTITION (month_str = '04', city = 'hf');
View the partitions of a partitioned table:
show partitions tb_second_price ;
partition
month_str=04/city=bj
month_str=04/city=hf
month_str=04/city=nj
month_str=04/city=sh
Time taken: 0.43 seconds, Fetched: 4 row(s)
SELECT '04' AS mon, COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' AND city = 'nj';
mon cnt
04 28201
Time taken: 32.846 seconds, Fetched: 1 row(s)
SELECT '04' AS mon, COUNT(1) AS cnt FROM tb_second_price WHERE month_str = '04' AND city = 'hf';
mon cnt
04 28201
Time taken: 32.846 seconds, Fetched: 1 row(s)
Using a partitioned table together with an external table
-- the table's data storage directory
bin/hdfs dfs -mkdir -p /user/rdedu/lianjia_2nd_house ;
-- external table + partitioned table
# an external table usually specifies its storage location: LOCATION '/user/rdedu/lianjia_2nd_house';
CREATE EXTERNAL TABLE second_house.lianjia_2nd_house(
village_name string,
house_type string,
house_area double,
region string,
house_floor string,
direction string,
total_price double,
single_price double,
build_date string
)
PARTITIONED BY (year_str STRING, month_str STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/rdedu/lianjia_2nd_house';
-- load the data via put
-- a. create the partition directory
bin/hdfs dfs -mkdir -p /user/rdedu/lianjia_2nd_house/year_str=2018/month_str=04/city=sh ;
-- b. upload the data
bin/hdfs dfs -put /opt/datas/04sh_2nd_house_price.csv /user/rdedu/lianjia_2nd_house/year_str=2018/month_str=04/city=sh ;
-- c. register the partition
ALTER TABLE second_house.lianjia_2nd_house ADD IF NOT EXISTS PARTITION (year_str = '2018', month_str = '04', city = 'sh') ;
show partitions second_house.lianjia_2nd_house;
partition
year_str=2018/month_str=04/city=sh
Time taken: 0.166 seconds, Fetched: 1 row(s)