Hive Study Notes

Study notes on O'Reilly's "Programming Hive".

I. Data Types
1. Basic types

tinyint、smallint、int、bigint、boolean、float、double、string、timestamp、binary

As in other databases, these type names are case-insensitive, and their implementations are all based on Java. BINARY is similar to VARBINARY in other databases, but unlike a BLOB: BINARY columns store raw bytes.

Type conversion
Hive performs implicit type conversion, but only in the widening direction, e.g. tinyint → smallint. For an explicit conversion, use cast:
select cast(s as int) from table_name;


2. Collection types
struct、map、array
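As a quick sketch of how these are indexed (some_table is a hypothetical table; array() and map() are Hive's built-in constructor functions):

```sql
-- arrays are 0-indexed; maps are looked up by key
SELECT array('a', 'b', 'c')[0],     -- 'a'
       map('k1', 1, 'k2', 2)['k2']  -- 2
FROM some_table LIMIT 1;
```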


3. Text file encoding
Hive supports CSV-style delimited text.
The delimiters below are Hive's defaults for records and fields.

Example:

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
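For illustration, one record encoded with these delimiters might look like the following (^A, ^B, ^C stand for \001, \002, \003; the values are invented):

```
John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^A1 Michigan Ave.^BChicago^BIL^B60600
```

Fields are split by ^A, items within a collection by ^B, and map keys from their values by ^C.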


II. DDL
If no database is specified, the default database is used.

1. Creating a database

create database database_name;


create database if not exists database_name;

The keyword schema can be used in place of database.

2. Inspecting databases
show databases;


describe database database_name;

A wildcard pattern is supported:
show databases like 'def*';

3. Storage location
By default, database directories are created under the directory given by the property hive.metastore.warehouse.dir. You can explicitly specify a different location:

create database database_name
location '/home/mclaren/databases';


4. Adding comments
create database database_name
comment 'the database_name i create';


Another way to annotate a database is to attach key-value properties to it; these are visible only through describe database extended database_name:
create database database_name
with dbproperties('create'='mclaren pan', 'date'='2013-12-01');


describe database extended database_name;

5. Switching the working database
use database_name;

6. Setting properties
Properties can be set from within the Hive CLI, e.g. to show the current database in the prompt:
set hive.cli.print.current.db=true;

7. Dropping a database
drop database if exists database_name;

8. Altering a database
alter database database_name set dbproperties ('hehe'='shit');

9. Creating tables
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

10. External tables
"External" here means the table's data is not stored in the warehouse directory; it can live anywhere in HDFS.
Example:

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Here the external location is /data/stocks.
To see whether a table is managed or external, use describe extended table_name.
Some properties of external tables:
1. Because the data is external, Hive does not own it: dropping the table removes the metadata but leaves the data in place.
2. Some HiveQL constructs cannot be used with external tables.

As with managed tables, the schema (but not the data) can be copied, for example:
create external table if not exists mydb.employee2
like mydb.employee
location '/home/mclaren/data';

11. Partitioned managed tables
Partitioning is mainly used to spread the load horizontally and to move the data users query most often closer to them, among other goals.
Each partition column's value becomes the name of a subdirectory, e.g.
create table table1 (
  col2 string,
  col3 int,
  col4 struct<subcol1:string, subcol2:string>
)
partitioned by (col1 string, subcol2 string);

Note: partition columns are declared only in the partitioned by clause; they must not also appear in the regular column list.

The following directory structure is created:
%WAREHOUSE%/table1
%WAREHOUSE%/table1/col1=**/subcol2=**
%WAREHOUSE%/table1/col1=**/subcol2=***
%WAREHOUSE%/table1/col1=***/subcol2=**
%WAREHOUSE%/table1/col1=***/subcol2=***
The benefit is query efficiency: a WHERE clause on the partition columns prunes the scan to the matching subdirectories. If no partition filter is given, every subdirectory is scanned; this situation is rare, but when it does happen it launches an enormous, resource-hungry MapReduce job. To prevent it, set:
set hive.mapred.mode=strict;

With this set, a query without a partition predicate fails with:
FAILED: Error in semantic analysis: No partition predicate found for
Alias "*" Table "****"
To undo the setting:
set hive.mapred.mode=nonstrict;

Viewing partitions:
show partitions table_name;
show partitions table_name partition(column='***');
describe extended table_name;

12. Partitioned external tables
create external table if not exists ext_part_tbl (
  col1 string,
  col2 string,
  col3 string
)
partitioned  by (col4 string, col5 string)
row format delimited fields terminated by '\t';
alter table ext_part_tbl add partition(col4='**', col5='**');

13. Customizing table storage formats
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

TEXTFILE means one record per line. Besides TEXTFILE, you can use SEQUENCEFILE or RCFILE, both of which are optimized for disk and network I/O. Beyond that, you can plug in custom input and output format classes:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='http://schema_provider/kst.avsc')
STORED AS
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';

14. Dropping tables
drop table if exists table_name;

15. Altering tables: adding partitions
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
...;

16. Changing columns
alter table table_name
change column old_col new_col string
comment 'rename column old_col to new_col'
after other_col;

This moves new_col to the position just after other_col. To move it to the first position, use first, which takes no column name:

alter table table_name
change column old_col new_col string
comment 'rename column old_col to new_col'
first;

17. Adding columns
alter table table_name add columns (
  col1 string comment 'first col',
  col2 string comment 'second col'
);

18. Removing columns
replace columns drops every column that is not listed in the new column set:
alter table table_name replace columns (
  col1 string comment 'first col',
  col2 string comment 'second col'
);

19. Altering table properties
alter table table_name set tblproperties (
  'prop1'='this is prop1'
);


20. Altering storage properties
alter table table_name
partition (col1=**,col2=*)
set fileformat sequencefile;


ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2');

Preventing a partition from being dropped:
alter table table_name
partition(col1=**, col2=**) enable no_drop;

Taking a partition offline so it cannot be queried:
alter table table_name
partition(col1=*, col2=*) enable offline;

III. DML
1. Loading data into a managed table
load data local inpath '${env:HOME}/california-employees'
overwrite into table employees
partition (country = 'US', state = 'CA');

This command first creates the partition directory, then copies the data into it.
Notes:
1. If the table is not partitioned, do not use the partition clause.
2. Without the local keyword, the source files are taken from the distributed filesystem rather than the local filesystem.
3. You cannot copy from one cluster to another this way.

2. Inserting data from a query
insert overwrite table employees
partition (country = 'US', state = 'OR')
select * from staged_employees se
where se.cnty = 'US' and se.st = 'OR';

from staged_employees se
insert overwrite table employees
partition (country = 'US', state = 'OR')
select * where se.cnty = 'US' and se.st = 'OR'
insert overwrite table employees
partition (country = 'US', state = 'CA')
select * where se.cnty = 'US' and se.st = 'CA';

3. Dynamic partition inserts
The form above needs one insert clause per partition. With dynamic partitioning, Hive derives the partition values from the trailing columns of the query:

insert overwrite table employees
partition (country = 'US', state)
select ..., se.cnty, se.st
from staged_employees se
where se.cnty = 'US';
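Dynamic partition inserts are governed by a few Hive properties; the feature is off by default, and "strict" mode requires at least one static partition value (as country is above). A typical setup:

```sql
set hive.exec.dynamic.partition=true;           -- enable dynamic partition inserts
set hive.exec.dynamic.partition.mode=nonstrict; -- allow all partition columns to be dynamic
set hive.exec.max.dynamic.partitions=1000;      -- safety cap on partitions created per statement
```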

4. Creating a table and loading it from a query

create table ca_employees
as select name, salary, address
from employees se
where se.state = 'CA';

5. Exporting data
Option 1:
If the files are already in the format you need, simply copy them with a hadoop command:

hadoop fs -cp source_path target_path

Option 2:
Write the result of a query to a local directory:

insert overwrite local directory '/home/mclaren/data'
select name, salary, address
from employees se
where se.state = 'CA';

Several directories can be written in one pass:

FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'IL';

6. Deleting data while keeping the table definition
This amounts to deleting the table's subdirectory under the warehouse directory. From the Hive CLI:

dfs -rmr /home/mclaren/datascope/hive/warehouse/t_order;


IV. HiveQL Queries
If a column is an ARRAY, use column[index] to fetch an element.
If a column is a MAP, use column[key] to fetch a value.
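Using the employees table defined earlier (subordinates is an ARRAY, deductions a MAP, address a STRUCT):

```sql
SELECT subordinates[0],              -- array elements are 0-indexed
       deductions['Federal Taxes'],  -- map values are fetched by key
       address.city                  -- struct fields use dot notation
FROM employees;
```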

Selecting columns with a regular expression
For example (note the backquotes; single quotes would select a string literal):
select symbol, `price.*` from stocks;

Computing with column values
Expressions in the SELECT list can combine column values with arithmetic operators and built-in functions.


Aggregation performance can be improved with:
set hive.map.aggr=true;
This makes Hive perform partial, map-side aggregation, at the cost of more memory in the map tasks.


Limiting results with LIMIT

SELECT upper(name), salary, deductions["Federal Taxes"],
 round(salary * (1 - deductions["Federal Taxes"])) FROM employees
 LIMIT 2;

Column aliases
SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
 round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
 FROM employees LIMIT 2;

Nested queries (note that the inner upper(name) must be aliased so the outer query can reference it):
FROM (
 SELECT upper(name) as name, salary, deductions["Federal Taxes"] as fed_taxes,
 round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
 FROM employees
 ) e
 SELECT e.name, e.salary_minus_fed_taxes
 WHERE e.salary_minus_fed_taxes > 70000;


CASE … WHEN … THEN
SELECT name, salary,
 CASE
 WHEN salary < 50000.0 THEN 'low'
 WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
 WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
 ELSE 'very high'
 END AS bracket FROM employees;

When MapReduce is triggered
Some Hive queries do not use MapReduce at all; this is so-called local mode. For example:
select * from employees;

A query can also skip MapReduce when the WHERE clause filters only on partition columns.

With the following setting:
set hive.exec.mode.local.auto=true;

Hive tries to run queries in local mode whenever possible,
and falls back to MapReduce otherwise.

The WHERE clause
Note: column aliases cannot be referenced in the WHERE clause.
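A minimal sketch (the bonus alias is invented for illustration): either repeat the defining expression in WHERE, or move the query into a nested FROM clause as shown earlier:

```sql
-- invalid: WHERE cannot see the alias
-- SELECT name, salary * 0.1 AS bonus FROM employees WHERE bonus > 5000;

-- valid: repeat the expression in the WHERE clause
SELECT name, salary * 0.1 AS bonus
FROM employees
WHERE salary * 0.1 > 5000;
```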


LIKE and RLIKE
RLIKE matches against a Java regular expression:

SELECT name, address.street
 FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';

GROUP BY
SELECT year(ymd), avg(price_close) FROM stocks
 WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
 GROUP BY year(ymd);

HAVING
SELECT year(ymd), avg(price_close) FROM stocks
 WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
 GROUP BY year(ymd)
 HAVING avg(price_close) > 50.0;

JOIN
SELECT a.ymd, a.price_close, b.price_close
 FROM stocks a JOIN stocks b ON a.ymd = b.ymd
 WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

SELECT a.ymd, a.price_close, b.price_close , c.price_close
 FROM stocks a JOIN stocks b ON a.ymd = b.ymd
 JOIN stocks c ON a.ymd = c.ymd
 WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';

Left outer join
SELECT s.ymd, s.symbol, s.price_close, d.dividend
 FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
 WHERE s.symbol = 'AAPL';

Right outer join
SELECT s.ymd, s.symbol, s.price_close, d.dividend
 FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
 WHERE s.symbol = 'AAPL';

Full outer join
SELECT s.ymd, s.symbol, s.price_close, d.dividend
 FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
 WHERE s.symbol = 'AAPL';

Left semi join (columns from the right-hand table may appear only in the ON clause, not in the SELECT list)
SELECT s.ymd, s.symbol, s.price_close
 FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;

