2020-11-20

寞逍遥

Categories: Hive, Cloud Computing/Big Data

Article link: https://blog.csdn.net/chenfeng_sky/article/details/109841744

Common Hive Commands

Defining variables in Hive

Built-in namespaces

Hive has four built-in namespaces: hivevar, hiveconf, system, and env.

Setting a hivevar variable when launching Hive

The --define and --hivevar flags are equivalent.

hive --define key=value
hive --hivevar key=value

Displaying variables

set env:HOME
set hivevar:key
set key

Assigning values to variables

set key=value
set hivevar:key=value

Referencing variables in SQL statements

create table table_name(i int, ${hivevar:key} string)
create table table_name(i int, ${key} string)
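
As a minimal sketch of how these pieces fit together, the session below sets a variable in the hivevar namespace, displays it, and substitutes it into a DDL statement. The variable name col_name and the table name demo_vars are hypothetical, chosen only for illustration.

```sql
-- Hypothetical names: col_name (variable) and demo_vars (table).
set hivevar:col_name=label;

-- Display the current value of the variable.
set hivevar:col_name;

-- Hive substitutes ${hivevar:col_name} before the statement is parsed,
-- so this creates a table with the columns (i int, label string).
create table if not exists demo_vars (i int, ${hivevar:col_name} string);

-- Confirm the resulting column names.
describe demo_vars;
```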

Running a one-shot Hive command with hive -e

echo "one row" > /root/path/myfile 
hive -e "load data local inpath '/root/path/myfile' into table table_name"

Running shell commands from within Hive

hive> !pwd;

Running Hadoop dfs commands from within Hive

hive> dfs -ls;

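Converting between Unix timestamps and timestamp strings

Conversion with an explicit format

select from_unixtime(unix_timestamp('20180930', 'yyyyMMdd'), 'yyyyMMdd');

Called with no arguments, unix_timestamp() returns the current Unix timestamp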

hive> select unix_timestamp();
unix_timestamp(void) is deprecated. Use current_timestamp instead.
OK
1547438004

Converting the current Unix timestamp to a timestamp string (from_unixtime renders it in the system time zone)

hive> select from_unixtime(unix_timestamp());
unix_timestamp(void) is deprecated. Use current_timestamp instead.
OK
2019-01-14 13:27:12

The JDBC-compliant timestamp string format is

YYYY-MM-DD hh:mm:ss.fffffffff
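
A common use of this pair of functions is reformatting a date string: parse it with one pattern, then render it with another. The sketch below assumes only the two built-in functions shown above; the patterns follow Java's SimpleDateFormat conventions.

```sql
-- Parse '20180930' as yyyyMMdd, then render it as yyyy-MM-dd.
select from_unixtime(unix_timestamp('20180930', 'yyyyMMdd'), 'yyyy-MM-dd');
-- 2018-09-30

-- The reverse direction: a dashed date back to the compact form.
select from_unixtime(unix_timestamp('2018-09-30', 'yyyy-MM-dd'), 'yyyyMMdd');
-- 20180930
```

Because parsing and rendering both use the same time zone, the round trip does not shift the date.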

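Collection data types

struct: a record whose field names and types are declared in advance
map: a set of key-value pairs; unlike a struct, the keys are not declared in advance
array: an ordered sequence of elements of the same type

create table employees( 
name string, 
salary float, 
subordinates array<string>, 
deductions map<string, float>, 
address struct<street:string,city:string,state:string,zip:int> 
)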
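
To show how each collection type is read back, here is a short sketch against the employees table defined above. It assumes the table has been created and loaded; the map key 'Federal Taxes' is a hypothetical example key, not something defined in this article.

```sql
-- Assumes the employees table above exists and contains data.
select
  name,
  subordinates[0]             as first_subordinate,  -- array: zero-based index
  deductions['Federal Taxes'] as federal_tax,        -- map: lookup by key (hypothetical key)
  address.city                as city                -- struct: dot notation
from employees;
```

An array index that is out of range, or a map key that is absent, yields NULL rather than an error.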

Basic Hive database operations

Specifying the location of a database

create database database_name
location '/my/preferred/directory';

Viewing the location of a database

describe database database_name;

Attaching extended properties to a database

create database database_name
WITH DBPROPERTIES ('creator' = 'name', 'date' = '2019-01-09');

Viewing the extended properties

describe database extended database_name;

Modifying database properties

ALTER DATABASE database_name SET DBPROPERTIES ('edited-by' = 'tjm', 'date' = '2019-01-09');

Basic Hive table operations

Creating a table

create table if not exists employee(
name string comment 'Employee name',
salary float comment 'Employee salary',
subordinates array<string> comment 'Names of subordinates',
deductions map<string, float> comment 'Keys are deduction names, values are percentages',
address struct<street:string, city:string, state:string,zip:int> comment 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='name', 'date'='2019-01-09')

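Creating a table by copying another table's schema

create table if not exists employee2
LIKE employee

Viewing information about a table

Keywords: show, describe, formatted, extended

show tables in database_name  -- list all tables in the given database

describe employees  -- show the table's column definitions

describe extended employees  -- show extended information about the table

describe formatted employee2  -- the most complete and most readable output

show create table employee  -- show the DDL statement that creates the table

show tables 'empl.*'  -- list tables whose names match a pattern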

Creating an external table

The key property of an external table is that dropping it removes only the table definition; the underlying data files that the table was built on are left untouched.

Keyword: external

create external table if not exists stocks (
exchange string,
symbol string,
ymd string,
price_open float,
price_high float,
price_low float,
price_close float,
volume int,
price_adj_close float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
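
The drop behaviour is easiest to see when the table points at an explicit directory. The following is a minimal sketch; the table name web_logs, its columns, and the path /data/web_logs are all hypothetical.

```sql
-- Hypothetical table and path, for illustration only.
create external table if not exists web_logs (
  ip  string,
  url string
)
row format delimited fields terminated by ','
location '/data/web_logs';

-- Removes only the table definition from the metastore;
-- the files under /data/web_logs remain in place.
drop table if exists web_logs;
```

Re-running the create statement afterwards makes the same files queryable again, since the data never left the directory.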

Managing tables with partitions

Keyword: partitioned by

create table employees(
name string,
salary float,
subordinates array<string>,
deductions map<string, float>,
address struct<street:string, city:string, state:string, zip:int>
)
PARTITIONED BY (country string, state string);

Loading data into a partition

Keywords: load, local

load data local inpath '${env:HOME}/directory' into table employees
partition (country='US',state='CA')
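
Once a partition has been loaded, it can be inspected and queried on its own. A short sketch, assuming the partitioned employees table above and the US/CA partition loaded by the previous statement:

```sql
-- List the partitions Hive knows about for the table.
show partitions employees;

-- Filtering on the partition columns lets Hive read only
-- the matching partition directory instead of the whole table.
select name, salary
from employees
where country = 'US' and state = 'CA';
```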

Renaming a table

Keyword: alter

alter table table_name rename to new_table_name;

Adding, modifying, and dropping table partitions

Adding partitions

alter table table_name add if not exists 
partition(year=2011, month=1, day=1) location '/logs/2011/01/01' 
partition(year=2011, month=1, day=2) location '/logs/2011/01/02'
partition(year=2011, month=1, day=3) location '/logs/2011/01/03';

Changing a partition's location

alter table table_name partition(year=2011, month=12, day=2)
set location 'hdfs://ip_address/logs/2011/01/02';

Dropping a partition

alter table table_name drop if exists 
partition(year=2011, month=1, day=1);

Adding and changing columns

Changing a column

alter table employees
change column name new_name string comment 'change to new column name'
first  -- FIRST moves the column to the front; use AFTER other_column instead to place it right after that column

Adding columns

alter table employees add columns(
app_name string comment 'application name',
session_id bigint comment 'session_id name'
)

Updating table properties

alter table employees set tblproperties(
'creator'='name',
'date'='2019-01-10'
)

After the statement runs, existing properties with the same keys are overwritten by the new values.

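Importing data

Loading data into an existing table

The path may point to a directory or to a single file. If the target table is not partitioned, omit the partition clause. If the target partition directory does not exist yet, it is created before the data is copied. With the overwrite keyword, any data already in the target table or partition is deleted first; without it, the new files are simply added to the target directory and the existing data is kept.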

load data local inpath '${env:HOME}/directory' overwrite into table employees
partition (country='US',state='CA')

load data local inpath '${env:HOME}/directory' into table employees
partition (country='US',state='CA')

Inserting data into a table from a query

insert overwrite table employees
partition (country='US',state='CA')
select * from staged_employees se
where se.cnty='US' and se.st='CA';

insert into table employees
partition (country='US',state='CA')
select * from staged_employees se
where se.cnty='US' and se.st='CA';

Either overwrite or into can be used here; the two keywords have the same meaning as described for load data above.

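Populating several tables from a single scan

Separate statements, each of which scans the source table:

insert overwrite table sales select * from history where action='purchased';
insert overwrite table credits select * from history where action='returned';

A single statement that scans the source table only once:

from history 
insert overwrite table sales select * where action='purchased'
insert overwrite table credits select * where action='returned';

Instead of scanning the data once per insert, the from ... insert form reads the source once and performs all of the inserts in the same pass.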

Creating a table and loading it with a single query

Keywords: create ... as ...

create table table_name as
select no, age from use_table where age > 50;

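Exporting data

Using hadoop commands

hadoop fs -cp source_path target_path

This command is suitable for copying data within the same cluster. To move data between different servers with hadoop, download it on one side and upload it on the other:

hadoop fs -get file_path  # run on the source server
hadoop fs -put file_path  # run on the target server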

Exporting files to a directory with insert ... directory ...

insert overwrite local directory '/tmp/ca_employees'
select name, salary, address
from staged_employees se
where se.cnty='US' and se.st='CA';

Exporting to several directories in one statement

from staged_employees se
insert overwrite local directory '/tmp/or_employees'
select *  where se.cnty='US' and se.st='OR'
insert overwrite local directory '/tmp/ca_employees'
select *  where se.cnty='US' and se.st='CA';

Type conversion in Hive

Use the cast keyword to convert between data types.

hive> select cast(anum as float) from littlebigdata;
OK
NULL
2.0

This example converts integer values to floating-point values.
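
The NULL in the output above is worth a closer look: cast does not raise an error when a value cannot be converted, it returns NULL. A minimal sketch using string literals so that it does not depend on any table:

```sql
-- A string that holds a valid number converts as expected.
select cast('123' as int);
-- 123

-- A string that is not a number cannot be converted, so the result is NULL.
select cast('abc' as int);
-- NULL
```

Checking for unexpected NULLs after a cast is therefore a simple way to spot malformed input values.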

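Sampling queries in Hive

Random sampling

select * from table_name tablesample(bucket 3 out of 10 on rand()) s;

The 10 is the number of buckets the data is divided into, and the 3 selects the third of those buckets as the sample.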
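
Because rand() assigns rows to buckets differently on every run, the sample above changes each time. Bucketing on a column instead gives a repeatable sample. The sketch below assumes table_name has a column called id, which is a hypothetical name.

```sql
-- Hypothetical column: assumes table_name has a column named id.
-- Rows are hashed on id into 4 buckets and bucket 1 is returned,
-- so repeated runs over the same data return the same rows.
select * from table_name tablesample(bucket 1 out of 4 on id) s;
```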

Block sampling by percentage

select * from table_name tablesample(0.1 percent) s;

Note: this kind of sampling does not work with every file format.

Creating views in Hive

Views let you avoid deeply nested select statements and make queries easier to read.

Query without a view

select a.no, a.name from (select no, name from table_name) a limit 2;

Query using a view

Keyword: create view

create view table_view as select no, name from table_name;  -- create the view

select no, name from table_view limit 2;  -- query through the view

Viewing function descriptions in Hive

Listing all functions

hive> show functions;
OK
!
!=
$sum0
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
aes_decrypt
.
.
.

Viewing the description of a specific function

hive> describe function abs;
OK
abs(x) - returns the absolute value of x

Viewing a more detailed description with examples

hive> describe function extended abs;
OK
abs(x) - returns the absolute value of x
Example:
  > SELECT abs(0) FROM src LIMIT 1;
  0
  > SELECT abs(-5) FROM src LIMIT 1;
  5
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs
Function type:BUILTIN

Table-generating functions

hive> select array(1,2,3);
OK
[1,2,3]

hive> select explode(array(1,2,3));
OK
1
2
3
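
On real tables, explode is usually combined with lateral view so that other columns can be selected alongside the exploded values. A minimal sketch, assuming the employees table from earlier with its subordinates array<string> column; the aliases subs and sub are arbitrary.

```sql
-- Assumes the employees table defined earlier in this article.
-- Each employee row is repeated once per element of subordinates.
select name, sub
from employees
lateral view explode(subordinates) subs as sub;
```

Employees whose subordinates array is empty produce no rows in this form of the query.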

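Customizing Hive file and record formats

To change the storage format, replace the classes in the following clause with those of the desired format.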

STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

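Record format: SerDe

SerDe is short for serializer/deserializer. A SerDe encapsulates the logic that turns the unstructured bytes of a record into a record that Hive can work with.

Storing data in the ORC file format

ORC format: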

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
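
In Hive versions that ship with ORC support, the three class names above can usually be replaced by the stored as orc shorthand. The sketch below assumes such a version; the table name employees_orc is hypothetical and the source table is the employees table from earlier.

```sql
-- Hypothetical table name. STORED AS ORC stands in for the explicit
-- SerDe, INPUTFORMAT and OUTPUTFORMAT classes listed above.
create table if not exists employees_orc (
  name   string,
  salary float
)
stored as orc;

-- Copy data across; Hive writes the selected rows out in ORC format.
insert overwrite table employees_orc
select name, salary from employees;
```

Running describe formatted employees_orc afterwards shows which SerDe and input/output format classes the table actually uses.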

Reference: Programming Hive (《Hive编程指南》)
