Hadoop基础知识-05

内部表（create table table_name......），未被external关键字修饰的表就是内部表，内部表数据存储的位置由hive.metastore.warehouse.dir参数决定（默认/user/hive/warehouse）,删除内部表会直接删除元数据和存储的数据，因此内部表不适合与其他工具共享数据。

外部表（create external table table_name......location......）,被external关键字修饰的表是外部表，外部表之所以称之为外部表是因为表数据可以存在任何位置，通过location关键字指定。数据存储的不同也代表了这个表在理念上并不是Hive内部管理的，而是可以随意临时链接到外部数据上。在删除外部表时，仅仅删除元数据，不会删除表本身。

打个比方，内部表就是公司里的正式员工，已经签了协议长久使用的员工，外部表就是一些临时员工，忙碌的这段时间用一下。

（1）内部表操作

use myhive;

create table stu(

id int,

name string

);

insert into stu values(1,'zhangsan'),(2,'wangwu'); # 执行很慢因为在Hive中它会走

mapreduce程序。

select * from stu;

hadoop fs -ls /user/hive/warehouse/myhive.db/stu

hadoop fs -cat /user/hive/warehouse/myhive.db/stu/*

数据库中的表实质上是HDFS中的文件夹

可以看出在数据库中我们可以清晰的看出id列与name列，但是在HDFS中id列与name列紧紧的靠在了一起，好像没有分隔符，计算机是如何识别的呢？

在HDFS文件系统中并不是没有分隔符，只是分隔符隐藏了。默认的分隔符是“\001”(ASCII)，在某些文本编辑器中显示为SOH或^A，即上述stu表中只是分隔符隐藏起来了，实质上为：

1SOHzhangsan

2SOHwangwu

我们可以将000000_0文件下载下来看一下：

vim 000000_0

默认的分隔符是\001，当然我们可以自己指定分隔符

create table if not exists stu2(id int, name string)row format delimited fields terminated by '\t';

内部表删除后数据本身与元数据都会被删除

drop table stu;

（2）外部表操作

从外部表的特点可以看出外部表与数据之间是相互独立的，删除外部表只删除元数据，而数据本身并不会被删除。

可以先有表，然后把数据移动到表指定的location中

也可以先有数据，然后创建表通过location指向数据

数据准备：

第一种方式先有表：

create external table test_ext1(

id int,

name string

)row format delimited fields terminated by '\t' location '/tmp/test_ext1';

创建之前确保test_ext1文件夹不存在

select * from test_ext1; # 没结果

hadoop fs -put test_external.txt /tmp/test_ext1/

select * from test_ext1;

第二种方式先有数据

hadoop fs -mkdir /tmp/test_ext2

hadoop fs -put test_external.txt /tmp/test_ext2/

create external table test_ext2(

id int,

name string

)row format delimited fields terminated by '\t' location '/tmp/test_ext2';

select * from test_ext2;

删除外部表只删除元数据，数据本身不会被删除

（3）内部表与外部表的转换

Hive可以很简单的通过SQL语句转换内外部表

desc formatted test_ext2；： 查看表的类型......等详细信息

alter table 表名 set tblproperties('EXTERNAL'='TRUE'); : 内部表转外部表

alter table 表名 set tblproperties('EXTERNAL'='FALSE); ：外部表转内部表

注意：'EXTERNAL'='TRUE'与'EXTERNAL'='FALSE是固定语法，区分大小写。

4、数据加载与导出

（1）数据加载（从外部将数据加载到Hive内）

load语法：

在Hive创建一个空表test_load:

在linux系统和HDFS上创建一个search_log.txt文件：

load data local inpath '/test/search_log.txt' into table test_load;

load data inpath '/search_log.txt' overwrite into table test_load;

select * from test_load;

注意：将Linux系统上的文件加载到Hive后，linux系统上的文件还在，相当于把文件复制过去了。将HDFS系统上的文件加载到Hive后，原始文件不存在，相当于把文件移动过去了。通过load方法加载数据的速度是很快的。

insert select语法：

将select查询语句的结果插入到其他表，被select查询的表可以是内部表或内部表。

insert [overwrite | into] table 表名 select 字段,... from 表名;

（2）数据导出（将Hive表中的数据导出到linux系统下或hdfs系统下）

insert overwrite语法：

insert overwrite [local] directory 'path' select 字段.... from 表名；

例如：

最好directory ‘path’ 是一个空的目录，如果path里有数据，在数据导出时会将原来的数据全部删除。

hive命令方式导出数据到linux系统上

语法：hive [-f | -e] 执行SQL语句或脚本 > 文件路径

-e ：指定SQL语句

-f ：指定SQL脚本

hive -e "select * from myhive.test_load;" > /test/1.txt

hive -f /test/1.sql > /test/2.txt

5、分区表

与大数据分布式中分而治之的思想一样，在Hive中，可以将大的数据按照每天或每小时进行切分成多个文件。即将表分割成若干个分区，每个分区在HDFS中就是一个文件夹，每个分区中记录一部分数据。

构建分区表：

create table 表名(......) partitioned by (分区列列类型,......) row format delimited fields terminated by '分隔符'；

在创建表时，分区列是单独的一列。

select * from score;

6、分桶表

与分区一样，都是将大的数据进行划分，但与分区不同的是，分区是将表拆分到不同的子文件夹中进行存储，而分桶是将表拆分到固定数据的不同文件中进行存储。每个分桶在HDFS中就是一个文件。

set hive.enforce.bucketing=true; : 开启分桶自动优化（后续走MapReduce程序时，reduce task的个数将自动优化与分桶数一致）

创建分桶表：

create table course(c_id string, c_name string, t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

向分桶表中加载数据时，load data方法无法执行，只能通过insert select语句加载，因此比较好的方法是，先事先创建一个临时表，通过load data加载数据进临时表，然后通过insert select从临时表向分桶表插入数据。

create table course_common(c_id string, c_name string, t_id string) row format delimited fields terminated by '\t';

load data local inpath '文件' into table course_common;

insert overwrite table course select * from course_common;

问题：为什么不可以用load data语句向分桶表中加载数据呢？

回答：如果没有分桶设置，那么加载数据时只是简单的将数据复制或移动到对应的文件夹中。一旦有了分桶设置，例如分桶数为3，那么在加载数据时，就不单单是简单的复制移动操作了，需要将内容划分为3份，分别放入到3个分桶文件中。问题就在于如何将数据划分为3份，划分规则是什么？分桶表数据的划分是基于分桶列的值进行hash取模来决定，但是load data语句无法触发MapReduce程序。因此使用load data语法加载数据就无法执行hash计算，相当于无法完成数据的划分，所以向分桶表中加载数据不能使用load data。而insert select语句是走MapReduce程序的。

hash取模原理：

hash取模基于hash值的计算，同样的值经hash加密后的结果是一致的。例如'hadoop'被hash计算后的值假设为123456，那么无论计算多少次，'hadoop'的hash值都是123456。hash取模就是将hash加密的结果基于分桶数取模，假设分桶数为3，即将hash加密的结果以3取模，最终计算的结果只能取到0，1，2。

基于hash取模的特点，我们会发现未上传的数据c_id列是按顺序存放的，但加载到分桶表后c_id列随意排放，这就是因为在数据划分时顺序被打乱了。此外，c_id列相同的数据一定在一个分桶文件中。

原数据：

分桶表数据：

分桶表的性能提升：

首先与分区表一样，将大的数据划分成若干子部分，减少了被操作的数据量。同时，基于分桶列的特定操作，如：过滤、JOIN、分组，均可带来性能提升（因为分桶列相同的数据会被划分到同一个分桶文件下）。

比如现在要过滤hadoop，先把hadoop的hash值计算一下，对3取模，然后我们就能知道hadoop具体在哪个桶，最后去对应的分桶文件中寻找即可，避免了对所有文件的扫描。

同理JOIN与过滤都会带来性能提升。

7、修改表

# 重命名

alter table 旧表名 rename to 新表名;

# 修改表属性

alter table 表名 set tblproperties(......)；

alter table table_name set tblproperties('comment'=new_comment)； ：修改表注释

# 添加分区

alter table 表名 add partition(month='202008'); # 无数据

# 修改分区值

修改分区值实质上是对文件夹重命名，这里分两种情况当修改的表是外部表时，只修改元数据记录，HDFS中的文件夹不会改名。当修改的表时内部表时，HDFS中的文件夹会随之一起改名。

alter table 表名 partition(month='202008') rename to partition(month='202001');

# 删除分区

同理删除外部表的分区只会删除元数据，本体数据还在

alter table 表名 drop partition(month='202001');

# 添加列

alter table 表名 add columns(v1 int, v2 string);

# 修改列名

alter table 表名 change 旧列名新列名列类型； # 只可改名不能更改类型

# 删除表

drop table 表名;

# 清空表

truncate table 表名; : 只可以清空内部表，清空外部表会报错

8、复杂类型操作

（1）array类型（数组类型）

如上表有两列name、locations。locations列包含多个城市。name与locations之间制表符分隔，locations中的元素逗号分隔。显然locations是一个array类型。

create table test_array(

name string,

locations array<string>

)row format delimited fields terminated by '\t' collection items terminated by ',';

load data local inpath '/test/3.txt' into table test_array;

select * from test_array;

# 查找zhangsan，lisi第一次工作的地点

select name, locations[0] from test_array; # 序号从0开始

select name, size(locations) from test_array; : 统计zhangsan，lisi去过几个城市

# 查找谁在shanghai工作过

select * from test_array where array_contains(locations, 'shanghai');

（2）map类型（key-value类型）

例如上表中的members字段就是map类型，字段间使用分隔符',';map类型各元素之间用分隔符'#';key-value之间使用分隔符':'。

create table test_map(

id int,

name string,

member map<string,string>,

age int

)row format delimited fields terminated by ','

collection items terminated by '#'

map keys terminated by ':'；

# 查看成员中，每个人的father是谁

select id, name, member['father'], age from test_map;

# 取出map中的全部key，返回array类型

select map_keys(member) from test_map;

# 取出map中的全部value，返回array类型

select map_values(member) from test_map;

# 查看k-y对个数

select size(member) from test_map;

# 查看指定的数据是否包含在map中

select * from test_map where array_contains(map_keys(member), 'sister');

（3）struct类型（复合类型）

struct类型可以在一列中存入多个子列，每个子列可以是不同的数据类型。例如下表中info列：

create table test_struct(

id string,

info struct<name:string, age:int>

)row format delimited fields terminated by '#'

collection items terminated by ':';

select id, info.name from test_struct;

三、数据查询操作

（1）基本查询语句

可以看出在Hive中的查询语句基本于MySQL中一致，只有Cluster by...字段有区别

（2）RLIKE正则匹配

注意\s,\S等具有反斜杠的规则在书写时使用两个\\,即\\s，\\S

# 查找广东省数据

select * from orders where useraddress rlike '.*广东.*'; # .*表示任意字符类似于like中的%

# 查找用户地址为：xx省 xx市 xx区的数据

select * from orders where useraddress rlike '..省 ..市 ..区';

# 查找用户姓张，王，邓

select * from orders where username rlike '[张王邓]\\S+';

# 查找手机号码符合188****0***

select * from orders where userphone rlike '188\\d{4}0\\d{3}';

select * from orders where userphone rlike '188[0-9]{4}0[0-9]{3}';

select * from orders where userphone rlike '188\\S{4}0[0-9]{3}';

（3）union语句

union用于将多个select语句的结果进行联合，每个select语句返回的列的数量和名称必须相同，否则会出错。

语法：

select ...

union [all]

select...

course表

# union写在from中，写在子查询中

select t_id, count(*) from

(

select * from course where t_id = '周杰轮'

union

select * from course where t_id = '王力鸿'

) as u group by t_id;

# union用于insert select

创建一个空的course2表

insert into table course2

select * from course where t_id = '周杰轮'

union

select * from course where t_id = '王力鸿';

（4）表的随机采样tablesample函数

在大数据体系下，真实的企业环境很容易产生很大的表，比如体积达到TB级别的表，对这种表即使执行简单的SELECT语句都会非常慢，哪怕limit10查看10条数据，也会走MapReduce程序，是否不方便。Hive提供了快速抽样的语法，可以快速从大表中随机抽取一些数据供用户查看。

语法1：基于随机分桶抽样

假设tb1表十分庞大，有数千亿的数据，直接查询很慢，因此使用采样操作快速查找。首先将tb1表划分为y个桶，ON后面指定的是分桶规则，如果指定了列则会按照该列进行hash取模分桶，如果指定了rand()则会完全随机的分桶；然后从y个桶中取第x个桶（序号从1开始）的数据供用户查询。

select username, orderid, totalmoney from orders tablesample(bucket 3 out of 10 on username); # 每次查询都一样

select username, orderid, totalmoney from orders tablesample(bucket 3 out of 10 on rand()); # 每次查询都不一样

语法2：基于数据块抽样

该方法不是随机的抽取数据，只是按照数据顺序从前向后取。因此只要条件不变，每次抽样的结果都一致。

select * from orders tablesample(100 rows);

select * from orders tablesample(10 percent);

select * from orders tablesample(1K);

（5）虚拟列

虚拟列是Hive内置的可以在查询语句中使用的特殊标记，可以查询数据本身的详细参数

Hive目前有3个虚拟列

INPUT__FILE__NAME : 显示数据行所在的具体文件

BLOCK__OFFSET__INSIDE__FILE ：显示数据行所在文件的偏移量

ROW__OFFSET__INSIDE__BLOCK ：显示数据所在HDFS块的偏移量，需要设置

set hive.exec.rowoffset=true才可使用。

select *,INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__

INSIDE__BLOCK from orders;

select *, BLOCK__OFFSET__INSIDE__FILE from orders where BLOCK__OFFSET

__INSIDE__FILE > 1000;

select INPUT__FILE__NAME, count(*) from orders_bucket group by INPUT__FILE__

NAME; # orders_bucket是一个分桶表，将orders表按username字段划分为10个桶

四、Hive中常见的函数

Hive的函数分为两大类：内置函数和用户定义函数

show functions; : 查看当下可用的所有函数

describe function extended 函数名； ：查看函数使用方法

# 对array类型字段排序sort_array(array类型的字段)

select sort_array(array(2,3,5,7,89,3,2,5,7))； # [2,2,3,3,5,5,7,89]

select sort_array(array('shanxi','beijing','apple'))； # ["apple","beijing","shanxi"]

# 将给定的字符串转换为二进制binary()

select binary('hadoop')；

# 自由类型转换cast(x as y)将x转换为y类型

select cast(1 as double)；

current_timestamp() : 返回当前时间（年月日时分秒）

current_date() ：返回当前日期（年月日）

to_date(string timestamp) ：时间戳转日期

datediff(string a, string b) : 返回a到b之间的天数

coalesce(v1,v2......) : 返回第一个不是NULL的值，如果都是NULL则返回NULL

select COALESCE(null,'asdf','dfadf')； # asdf

select nvl(a,default)； : a有值则返回a本身，否则返回default

if(条件,条件满足时返回的值,条件不满足时返回的值)

select if(2>1,'yes','no')； # yes

nullif(a,b) : 如果a=b则返回NULL否则返回a

assert_true(表达式) ：表达式不成立报错，成立返回NULL

split(字符串，分隔符) ：将字符串进行分隔

select split('hadoop,database,hive,mapreduce',',') # 返回array类型

mask_hash(字符串) : 对字符串进行hash加密，结果是16进制字符串，非字符串加密得到

NULL

hash() ：hash加密，结果是数字

current_user() : 返回当前登录用户

current_database() : 返回当前选择的数据库

version() : 返回当前hive版本

md5() : 进行md5加密

五、案例-聊天数据统计分析

（1）需求分析

给定海量的聊天数据，从聊天数据中挖掘出潜在的有价值的信息。具体来说需要统计以下内容：

统计今日总消息量

统计今日每小时消息量、发送和接收消息用户数

统计今日各地区发送消息数据量

统计今日发送消息和接收消息的用户数

统计今日发送消息最多的Top10用户

统计今日接收消息最多的Top10用户

统计发送人的手机型号分布情况

统计发送人的设备操作系统分布情况

（2）数据准备：5万条 30万条（选择） 1000万条

字段与字段之间使用Hive默认分隔符‘\001’，共20个字段

create database db_msg; : 创建一个新数据库用于本次案例学习

hadoop fs -mkdir -p /chatdemo/data

hadoop fs -put /test/chat_data-30W.csv /chatdemo/data

load data inpath '/chatdemo/data/chat_data-30W.csv' into table tb_msg_source;

select * from tb_msg_source tablesample(10 rows);

select count(*) from tb_msg_source # 可以看出在面对海量数据时MapReduce程序还是

很快的

（3）数据清洗ETL(extract transform load, 抽取、转换、加载)

在数据采集过程中，难免由于设备故障、外界因素干扰，导致数据存在缺失，异常等问题，

因此需要对数据进行清洗（预处理）

问题1：数据缺失

select msg_time, sender_name, sender_gps from tb_msg_source where length(sender_gps) = 0 limit 10;

解决：利用where过滤

问题2：需求中需要统计每天和每小时的消息量，但数据中没有天和小时的字段，只有整体时间段，不好处理，如果天和小时有单独的字段会比较方便。

select msg_time from tb_msg_source limit 10;

解决：date，hour函数提取

问题3：需求中需要对经度和纬度构建地区的可视化地图，但是数据中gps信息是一个字段，不好处理。

select sender_gps from tb_msg_source limit 10;

解决：split函数分隔

解决完上述问题后将清洗后的数据保存到新表tb_msg_etl

insert into table tb_msg_etl

select *,date(msg_time),hour(msg_time),split(sender_gps,',')[0],split(sender_gps,',')[1]

from tb_msg_source

where length(sender_gps)>0;

（4）指标统计

为了方便后续可视化分析，在进行统计分析时需要将统计的结果保存到表中。

create table 表名 as select .......; : 可将select查询的结果直接保存到表中

统计今日总消息量

create table tb_rs_total_msg_cnt comment '每日消息总量' as
select msg_day, count(*) as total_msg_cnt from tb_msg_etl group by msg_day;

统计今日每小时消息量、发送和接收消息用户数

        create table tb_rs_hour_msg_cnt comment '每小时消息量趋势' as
        select msg_hour, count(*) as total_msg_cnt,
           count(distance sender_account) as sender_usr_cnt,
           count(distance receiver_account) as receiver_usr_cnt
        from tb_msg_etl group by msg_hour;

统计今日各地区发送消息数据量

        select
                sender_gps, count(*)
        from tb_msg_etl group by sender_gps; # 后续可视化分析不方便

        create table tb_rs_loc_cnt comment '每日各地区发送消息总量' as
        select
                msg_day, sender_lng, sender_lat, count(*) as total_msg_cnt
        from tb_msg_etl group by msg_day, sender_lng, sender_lat ;

统计今日发送消息和接收消息的用户数

        create table tb_rs_user_cnt comment '每日发送和接收消息的人数' as
        select msg_day,
                count(DISTINCT sender_account) as sender_user_cnt,
                count(DISTINCT receiver_account) as receiver_user_cnt
        from tb_msg_etl group by msg_day ;

统计今日发送消息最多的Top10用户

        create table tb_rs_s_user_top10 comment '发送消息最多的10个用户' as
        select msg_day, sender_account,
                count(*) as sender_msg_cnt
        from tb_msg_etl group by msg_day, sender_account
        order by sender_msg_cnt desc
        limit 10;

统计今日接收消息最多的Top10用户

        create table tb_rs_r_user_top10 comment '接收消息最多的10个用户' as
        select msg_day, receiver_account,
                count(*) as receiver_msg_cnt
        from tb_msg_etl group by msg_day, receiver_account
        order by receiver_msg_cnt desc
        limit 10;

统计发送人的手机型号分布情况

        create table tb_rs_sender_phone comment '发送人手机型号分布' as
        select sender_phonetype, count(*) as cnt
        from tb_msg_etl group by sender_phonetype;

统计发送人的设备操作系统分布情况

        create table tb_rs_sender_os comment '发送人的os分布' as
        select sender_os, count(*) as cnt
        from tb_msg_etl group by sender_os ;

（5）可视化展示

1）BI工具安装（FineBI，SuperSet，PowerBI，TableAu..........市面上有很多很多BI工具）

BI（Business Intelligence，商业智能）工具集很多很多技术于一体，可以实现对复杂数据的分析，统计，可视化等等功能。

下载FineBI商业智能软件 - 自助大数据分析工具