[Repost] HiveQL in Detail

weixin_33806509

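HiveQL is a SQL-like language. It is compatible with most SQL syntax but does not fully implement the SQL standard: in the Hive versions covered here, HiveQL does not support update operations, indexes, or transactions, and its subqueries and joins are limited. These restrictions follow from its reliance on the underlying Hadoop platform. On the other hand, it has features that standard SQL lacks, such as multi-table insert, create table as select, and integration with MapReduce scripts. This section introduces Hive's data types and the most common HiveQL operations.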

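I. Hive client commands

1. hive command-line options

-e: run a SQL statement given on the command line
-f: run the SQL statements in a file
-H, --help: print help
--hiveconf: set a configuration property (property=value)
-i: initialization file
-S, --silent: silent mode (suppresses log messages and prints only results)
-v, --verbose: verbose mode (echoes the executed SQL)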

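2. Interactive mode

hive> show tables; #list all tables
hive> show tables 'ad*'; #list tables whose names start with 'ad'
hive> set #the set command sets and displays variables
hive> set -v; #show all variables
hive> set hive.stats.atomic; #show the hive.stats.atomic variable
hive> set hive.stats.atomic=false; #set the hive.stats.atomic variable
hive> dfs -ls; #list files on HDFS (the current user's home directory)
hive> dfs -ls /user/hive/warehouse/; #list the Hive warehouse directory
hive> dfs -ls /user/hive/warehouse/ptest; #list the files of the ptest table
hive> source <filepath>; #run a Hive script file from inside the client
hive> quit; #leave the interactive shell
hive> exit; #leave the interactive shell
hive> reset; #reset the configuration to its default values
hive> !ls; #run a shell command from the Hive shell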

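II. Operators and functions

List functions:
hive> show functions;
List function names matching a regular expression:
hive> show functions 'xpath.*';
Show the description of a specific function:
hive> describe function xpath; (or the short form: desc function xpath;)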

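III. Column types
Hive supports primitive and complex data types. The primitive types are mainly numeric (INT, FLOAT, DOUBLE), boolean, and string types. There are three complex types: ARRAY, MAP, and STRUCT.

1. Primitive data types

TINYINT: 1 byte
SMALLINT: 2 bytes
INT: 4 bytes
BIGINT: 8 bytes
BOOLEAN: TRUE/FALSE
FLOAT: 4 bytes, single-precision floating point
DOUBLE: 8 bytes, double-precision floating point
STRING: character string

2. Complex data types

ARRAY: an ordered collection of elements of the same type
MAP: an unordered collection of key/value pairs
STRUCT: a group of named fields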
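
The three complex types are easier to picture with a concrete table. The following is a minimal sketch, not taken from the original article: the table name tb_complex_demo, its columns, and the chosen delimiters are all assumptions made for illustration.

```sql
-- Hypothetical table; names and delimiters are illustrative assumptions.
create table tb_complex_demo (
  id     int,
  tags   array<string>,
  scores map<string, int>,
  addr   struct<street:string, city:string>
)
row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':';

-- Array elements are indexed from 0, map values are looked up by key,
-- and struct fields are read with dot notation.
select id,
       tags[0]        as first_tag,
       scores['math'] as math_score,
       addr.city      as city
from tb_complex_demo;
```

With these delimiters, one line of a matching data file would hold the id, then a comma-separated list for tags, then key:value pairs for scores, then the two struct fields separated by a comma, with a tab between the four columns.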

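IV. Table types

Hive tables fall roughly into three kinds: regular (managed) tables, external tables, and partitioned tables.

1. Regular tables

Create a regular table whose fields are separated by tabs and which is stored as a text file:
CREATE TABLE test_1(id INT, name STRING, city STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;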

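Create a table
hive> create table tb_person(id int, name string);

Create a table with a partition column ds
hive> create table tb_stu(id int, name string) partitioned by(ds string);

Show partitions
hive> show partitions tb_stu;

Show all tables
hive> show tables;

Show tables matching a pattern
hive> show tables 'tb_*';

Add a column to a table
hive> alter table tb_person add columns (new_col int);

Add a column with a column comment
hive> alter table tb_stu add columns (new_col2 int comment 'a comment');

Rename a table
hive> alter table tb_stu rename to tb_stu2;

Drop a table (Hive can drop partitions, but not individual records or columns)
hive> drop table tb_stu;

For a managed table, drop removes both the metadata and the data files. For an external table, it removes only the metadata. To delete only the data and keep the table definition, delete the data files on HDFS:
hive> dfs -rmr /user/hive/warehouse/mutill1/*;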

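Data loading examples:

Load the local file /home/hadoop/ziliao/stu.txt into a table. The fields in the file must be separated by the table's field delimiter (a table created without a row format clause uses Ctrl-A, '\001', by default). stu.txt contains:
1 zhangsan
2 lisi
3 wangwu

Load the file into the table

hive> load data local inpath '/home/hadoop/ziliao/stu.txt' overwrite into table tb_person;

Load local data and specify the partition at the same time

hive> load data local inpath '/home/hadoop/ziliao/stu.txt' overwrite into table tb_stu partition (ds='2008-08-15');

Note: if the data to load is already on HDFS, omit the local keyword. Files loaded into a managed table can be seen under the warehouse directory "/user/hive/warehouse/<tablename>".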
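
The HDFS variant of the load statement can be sketched as follows. The path /user/hadoop/input/stu.txt is a hypothetical location chosen for illustration and does not come from the original article.

```sql
-- Assumption: /user/hadoop/input/stu.txt is a hypothetical file already on HDFS.
-- Without the local keyword the path is resolved on HDFS, and the file is
-- moved (not copied) into the table's directory.
load data inpath '/user/hadoop/input/stu.txt' into table tb_person;

-- Without overwrite, the file is added to whatever the table already holds.
select * from tb_person;
```

Because the file is moved, it no longer appears at its original HDFS path after the load. A local load, by contrast, copies the file and leaves the original in place.
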
View the data

hive> dfs -ls /user/hive/warehouse/tb_stu;
hive> dfs -ls /user/hive/warehouse/tb_person;

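2. External tables
The external keyword creates an external table and lets the user specify, at creation time, a path (location) that points to the actual data. When data is loaded into an internal (managed) table, Hive moves it into the warehouse directory. For an external table, Hive only records where the data lives and does not move it. When the table is dropped, an internal table's metadata and data are both deleted, whereas an external table loses only its metadata and the data is kept.

Example: create an external table:

create external table tb_record(col1 string, col2 string) row format delimited fields terminated by '\t' location '/user/hadoop/input';

The data of tb_record is then whatever is stored under /user/hadoop/input/* on HDFS.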
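
The difference between dropping a managed table and an external table can be checked directly. The following sketch reuses the tb_record table defined above; the exact layout of the describe formatted output varies between Hive versions, so treat the comments as what one would expect to see rather than as literal output.

```sql
-- describe formatted prints the table metadata; its "Table Type" entry
-- should read EXTERNAL_TABLE for tb_record (MANAGED_TABLE for a regular table).
describe formatted tb_record;

-- Dropping the external table removes only its metadata.
drop table tb_record;

-- The data files should still be listed under the table's location.
dfs -ls /user/hadoop/input;
```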

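3. Partitioned tables

Partitioning divides a table's data by the values of one or more partition columns, with each partition stored in its own directory. Creating partitions for frequently queried data means that a query restricted to a partition does not have to scan the whole table, which helps query performance considerably.

Create a partitioned table: create table log(ts bigint,line string) partitioned by(name string);

Insert into a partition: insert overwrite table log partition(name='xiapi') select id, name from userinfo where name = 'xiapi';

Show partitions: show partitions log;

Drop a partition: alter table log drop partition(name = 'xiapi');

(Drop a partition only if it exists: alter table xxx drop if exists partition(year=2012,month=10, day=1);)

Note: a partition normally has to exist before it can be used (load data and insert statements with an explicit partition clause create it if it is missing). Partition column values are turned into directory names in the storage path, so special characters in a value, such as '%', ':', '/', and '#', are escaped as '%' followed by the character's two-digit hexadecimal ASCII code.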
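
Creating a partition ahead of time, as the note above describes, can be sketched with alter table ... add partition. The example reuses the log table from this section and nothing else.

```sql
-- Create the partition explicitly; "if not exists" makes the statement safe to rerun.
alter table log add if not exists partition (name='xiapi');

-- The new partition appears in the listing and corresponds to a
-- name=xiapi directory under the table's path.
show partitions log;
```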

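V. SQL operations and buckets

1. Create tables

First create three test tables:

The userinfo table has two tab-separated columns that store the user's id and name.

The classinfo table has two tab-separated columns that store the teacher (teacher) and the class name (classname).

The choice table has two tab-separated columns that store the user's userid and the chosen class name classname (similar to a junction table).

Create the test tables:

hive> create table userinfo(id int,name string) row format delimited fields terminated by '\t';
hive> create table classinfo(teacher string,classname string) row format delimited fields terminated by '\t';
hive> create table choice(userid int,classname string) row format delimited fields terminated by '\t';

Note: '\t' stands for a tab character.
Show the tables just created:

hive> show tables;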

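2. Import data

Once the tables exist, data files can be loaded from the local file system or from HDFS. The sample data is as follows.

userinfo.txt (fields separated by a tab):

1 xiapi
2 xiaoxue
3 qingqing

classinfo.txt (fields separated by a tab):

jack math
sam china
lucy english

choice.txt (fields separated by a tab):

1 math
1 china
1 english
2 china
2 english
3 english

First create these three files under the local directory "/home/hadoop/ziliao" with the contents shown above.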

3. Load the data as follows:

hive> load data local inpath '/home/hadoop/ziliao/userinfo.txt' overwrite into table userinfo;
hive> load data local inpath '/home/hadoop/ziliao/classinfo.txt' overwrite into table classinfo;
hive> load data local inpath '/home/hadoop/ziliao/choice.txt' overwrite into table choice;

Query the table data

hive> select * from userinfo;
hive> select * from classinfo;
hive> select * from choice;

4. Partitions

a. Create a partitioned table
hive> create table ptest(userid int) partitioned by (name string) row format delimited fields terminated by '\t';
b. Prepare the data to load
xiapi.txt contains (fields separated by a tab):
1
c. Load the data
hive> load data local inpath '/home/hadoop/ziliao/xiapi.txt' overwrite into table ptest partition (name='xiapi');
d. View the partition directory
hive> dfs -ls /user/hive/warehouse/ptest/name=xiapi;
e. Query the partition
hive> select * from ptest where name='xiapi';
f. Show partitions
hive> show partitions ptest;
g. Insert data into a partition (each run overwrites the existing data):
hive> insert overwrite table ptest partition(name='xiapi') select id from userinfo where name='xiapi';
h. Drop the partition
hive> alter table ptest drop partition (name='xiapi');

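5. Buckets
A table or partition can be further organized into buckets: rows are assigned to buckets according to the hash of a chosen column, and each bucket is written by one reduce task. Before populating a bucketed table, set the "hive.enforce.bucketing" property to true so that Hive enforces the bucketing on insert (Hive 2.x and later always enforce it and no longer need this setting).

Bucketing a table works as follows:

hive> set hive.enforce.bucketing=true;
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=true
hive> create table btest2(id int, name string) clustered by(id) into 3 buckets row format delimited fields terminated by '\t';

Insert data into the buckets. The table is split into three buckets by user id, so the insert runs three reduce tasks and writes three files.
hive> insert overwrite table btest2 select * from userinfo;

List the table's directory in the warehouse. The three buckets correspond to three files.

hive> dfs -ls /user/hive/warehouse/btest2;

Hive hashes the value of the bucketing column and takes the remainder of dividing that hash by the number of buckets to choose the bucket. This spreads rows across the buckets, but it does not guarantee that every bucket receives data or that the buckets hold the same number of rows, as shown below.

hive> dfs -cat /user/hive/warehouse/btest2/*0_0;
hive> dfs -cat /user/hive/warehouse/btest2/*1_0;
hive> dfs -cat /user/hive/warehouse/btest2/*2_0;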
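
The hash-then-modulo rule can be approximated in a query with Hive's built-in hash() and pmod() functions. This is only a sketch: it assumes that the bucketing hash matches what hash() returns, which may not hold on newer Hive versions that use a different hash function for bucketing, so the computed number is not guaranteed to match the file a row actually lands in.

```sql
-- Sketch: hash() and pmod() are Hive built-ins; the constant 3 matches the
-- "into 3 buckets" clause of btest2. pmod keeps the remainder non-negative.
select id,
       name,
       pmod(hash(id), 3) as bucket_no
from userinfo;
```

Comparing bucket_no with the contents of the three files printed by the dfs -cat commands shows whether the assumption holds on a given installation.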

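For some queries, such as joins on the bucketing column, bucketing can be more efficient than partitioning alone, and it also makes it convenient to sample the full data set. The following statement samples by bucket.
hive> select * from btest2 tablesample(bucket 1 out of 3 on id);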
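
Two further variations on bucket sampling, written as a sketch against the btest2 table defined above. The first picks a different bucket, and the second samples on rand() instead of the bucketing column.

```sql
-- Read the second of the three buckets defined on id.
select * from btest2 tablesample(bucket 2 out of 3 on id);

-- Sample roughly one third of the rows at random rather than by the
-- bucketing column; this form has to scan the whole table.
select * from btest2 tablesample(bucket 1 out of 3 on rand());
```

Sampling on the column named in the clustered by clause lets Hive read only the matching bucket files, which is why the first form is usually the cheaper one.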

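VI. Multi-table insert

A multi-table insert reads the same source data once within a single statement and inserts it into several tables. Because the source data is scanned only once for all of the inserts, it is efficient. An example follows.

hive> create table mutill as select id,name from userinfo; #contains data
hive> create table mutil2 like mutill; #no data, table structure only
hive> from userinfo
      insert overwrite table mutill select id,name
      insert overwrite table mutil2 select count(distinct id),name group by name;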

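VII. Joins

A join combines the rows of two tables that match on a common data item. HiveQL has five kinds of join: inner join, left outer join, right outer join, full outer join, and semi join.

1. Inner join (equi-join)

An inner join uses a comparison operator to match the rows of two tables on the values of a column they share.

For example, retrieve all rows of userinfo and choice that have the same user id.

hive> select userinfo.*, choice.* from userinfo join choice on(userinfo.id=choice.userid);

2. Left join
The result of a left join contains all rows of the left table named in the "LEFT OUTER" clause, not only the rows that match on the join column. If a row of the left table has no match in the right table, all selected columns of the right table are null in that result row.

hive> select userinfo.*, choice.* from userinfo left outer join choice on(userinfo.id=choice.userid);

3. Right join

A right join is the mirror image of a left outer join and returns all rows of the right table. If a row of the right table has no match in the left table, nulls are returned for the left table's columns.

hive> select userinfo.*, choice.* from userinfo right outer join choice on(userinfo.id=choice.userid);

4. Full join
A full join returns all rows of both the left and the right table. When a row has no match in the other table, the other table's selected columns are null. When rows match, the result row contains the data values of both base tables.

hive> select userinfo.*, choice.* from userinfo full outer join choice on(userinfo.id=choice.userid);

5. Semi join
The semi join is a Hive extension. Early versions of Hive do not support IN subqueries, and left semi join is the alternative. Note that the right-hand table must not be referenced in the select list. It may appear only in the on clause.

hive> select userinfo.* from userinfo left semi join choice on (userinfo.id=choice.userid);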
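
For comparison, the IN form that the semi join stands in for looks like the following. This is a sketch that assumes Hive 0.13 or later, where IN subqueries are accepted in the where clause; on earlier versions only the left semi join form above is available.

```sql
-- Assumption: Hive 0.13 or later (IN subqueries allowed in the where clause).
-- Returns each userinfo row whose id appears at least once in choice.userid.
select userinfo.*
from userinfo
where userinfo.id in (select userid from choice);
```

Both forms return a userinfo row at most once, no matter how many matching rows choice contains, which is what distinguishes them from an inner join.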

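VIII. Subqueries
Standard SQL allows nested select clauses in many places. HiveQL's support for subqueries is much more limited: in early versions of Hive, a subquery may appear only in the from clause. The following statement nests a subquery in the from clause to find the teacher who teaches the most classes.

hive> select teacher, class_num
         from (select teacher, count(classname) as class_num from classinfo group by teacher) subq
         order by class_num desc limit 1;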
