1、Introduction
What is Hive? Hive is a data warehouse tool that maps structured data files stored in Hadoop onto tables. It can be seen as a SQL-to-MapReduce translator, and it provides shell and JDBC/ODBC interfaces.
It offers a range of data warehouse functions: data ETL, data storage management, and the ability to query and analyze large datasets.
2、Hive Data Storage
Hive stores its data on top of Hadoop's HDFS. Each Hive table corresponds to a directory in the warehouse, and each partition of a table corresponds to a subdirectory under that table's directory; the actual data files live in those directories. Managed (internal) tables and external tables (which point to data that already exists in HDFS) differ in behavior:
For a managed table, loading data moves the actual data files into the data warehouse directory; when the table is dropped, its data and metadata are deleted together.
For an external table, loading data does not move the files into the warehouse; dropping the table deletes only the metadata, and the data itself is left untouched.
3、Installing Hive
Embedded mode: metadata is kept in the embedded Derby database; only one session can connect at a time.
Local mode: install MySQL locally and store the metadata in MySQL.
Remote mode: metadata is stored in a remote MySQL database.
Download: wget http://apache.fayea.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
Set environment variables:
Configuration files:
hive-env.sh
hive-site.xml
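As a sketch of the local/remote metastore mode described above, hive-site.xml points Hive at a MySQL database. The host, database name, user, and password below are placeholders for illustration, not values from this article:

```xml
<configuration>
  <!-- JDBC connection to the MySQL database that holds the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

The MySQL JDBC driver jar must also be placed on Hive's classpath (e.g. under lib/).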
Start Hive:
cd bin ; ./hive
Test: show tables;
4、Hive QL
4.1、Creating Tables
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[
[ROW FORMAT row_format] [STORED AS file_format]
| STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ] (Note: only available starting with 0.6.0)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] (Note: only available starting with 0.6.0)
[AS select_statement] (Note: this feature is only available starting with 0.5.0.)
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
LIKE existing_table_name
[LOCATION hdfs_path]
data_type
: primitive_type
| array_type
| map_type
| struct_type
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| STRING
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
row_format
: DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format
: SEQUENCEFILE
| TEXTFILE
| RCFILE (Note: only available starting with 0.6.0)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
For example, suppose we have a sample data file with three columns (id, name, address):
1 chen xxxxxxxxx1
2 xiaoming xxxxx2
3 zhangsan xxxxx3
Create the table:
create table myuser(id int, name string, address string) row format delimited fields terminated by '\005' stored as textfile
Commonly used field separators are ASCII code 5, ASCII code 7, and so on. In Hive, ASCII 5 is written as '\005', ASCII 7 as '\007', and so forth.
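As a quick sketch (the file path is just an example), such a data file can be generated with printf, which understands '\005' as an octal escape for the ASCII 5 byte:

```shell
# Write two rows whose three fields (id, name, address) are
# separated by the ASCII 5 byte, matching FIELDS TERMINATED BY '\005'
printf '1\005chen\005xxxxxxxxx1\n2\005xiaoming\005xxxxx2\n' > /tmp/hive_myuser.txt

# Inspect the raw bytes: the separator shows up as \005 between fields
od -c /tmp/hive_myuser.txt
```

The file can then be uploaded with hadoop fs -put before running the LOAD DATA statement below.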
Load the data: load data inpath '/input/hive_myuser.txt' overwrite into table myuser
External table (note that LOCATION should point to an HDFS directory, not a single file):
create external table myuser_external(id int,name string,address string) row format delimited fields terminated by '\005' stored as textfile location '/input'
Drop the table: drop table myuser;
4.2、Partitions
Partitioning in Hive is enabled with PARTITIONED BY at table-creation time. A partition column is not a column of the actual data files; the partition value is supplied when the data is loaded or inserted.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
create table myuser_partition(id int, name string, address string) partitioned by(dt string) row format delimited fields terminated by '\005' stored as textfile;
An external table can specify its LOCATION:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
4.3、Altering Tables (alter table)
4.3.1、Add partition
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
Eg:
ALTER TABLE c02_clickstat_fatdt1 ADD
PARTITION (dt='20101202') location '/user/hive/warehouse/c02_clickstat_fatdt1/part20101202'
PARTITION (dt='20101203') location '/user/hive/warehouse/c02_clickstat_fatdt1/part20101203';
4.3.2、Rename table
ALTER TABLE table_name RENAME TO new_table_name
This command lets the user rename a table. The location of the data and the partition names do not change. In other words, the old table name is not "released": changes made against the old table name will change the data of the new table.
4.3.3、Change Column
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
This command lets you change a column's name, data type, comment, position, or any combination of these.
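For example, applied to the myuser table created earlier, the following hypothetical change renames a column and adds a comment (the new column name user_name is made up for illustration):

```sql
-- Rename column `name` to `user_name`, keep the STRING type, add a comment
ALTER TABLE myuser CHANGE COLUMN name user_name STRING COMMENT 'renamed from name';
```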
4.3.4、Add/Replace Columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD adds new columns after the existing columns (but before the partition columns); REPLACE replaces all of the table's columns.
Eg:
hive> desc xi;
OK
id int
cont string
dw_ins_date string
Time taken: 0.061 seconds
hive> create table xibak like xi;
OK
Time taken: 0.157 seconds
hive> alter table xibak replace columns (ins_date string);
OK
Time taken: 0.109 seconds
hive> desc xibak;
OK
ins_date string
4.3.5、create view
CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
4.3.6、show
show tables; lists all tables
show partitions table_name; lists the partitions of a table
describe table_name; shows the structure of a table
View the contents of one partition:
SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
View a limited number of rows (as in Greenplum, with the LIMIT keyword):
SELECT a.foo FROM invites a limit 3;
View a table's partition definition:
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
4.6、Loading Data (Load)
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
A Load operation is a plain copy/move: the data file is moved into the corresponding Hive location without any transformation.
Import data from the local filesystem and append it to the table:
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE c02 PARTITION(date='2008-06-08', country='US')
Import data from HDFS and overwrite the existing partition:
LOAD DATA INPATH '/user/admin/SqlldrDat/CnClickstat/20101101/18/clickstat_gp_fatdt0/0' OVERWRITE INTO TABLE c02_clickstat_fatdt1 PARTITION (dt='20101201');
4.7、Insert
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
Export to a local directory:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Export to HDFS:
INSERT OVERWRITE DIRECTORY '/user/admin/SqlldrDat/CnClickstat/20101101/19/clickstat_gp_fatdt0/0' SELECT a.* FROM c02_clickstat_fatdt1 a WHERE dt='20101201';
A single source can be inserted into multiple target tables or directories at once; such a multi-target insert is written as one statement:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Tip:
Hive does not support inserting rows one at a time with INSERT statements, nor does it support UPDATE. Data is loaded into an existing table with LOAD, and once imported it cannot be modified: either drop the whole table, or create a new table and load new data.
4.8、Hive Select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
4.8.1、order by
colOrder: ( ASC | DESC )
orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy
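Applied to the myuser table above, a sketch of an ORDER BY query (ORDER BY enforces a total order over all rows, so it is handled by a single reducer):

```sql
-- Total ordering over all rows; handled by a single reducer
SELECT id, name FROM myuser ORDER BY name ASC, id DESC;
```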
4.8.2、Join
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
So you can write:
select a.*,b.* from myuser a join myuser b on (a.id = b.id);
but you cannot write this:
select a.*,b.* from myuser a, myuser b where a.id = b.id;
2、A join may involve more than two tables.
3、Joins are executed as MapReduce jobs; putting the largest table last in the join reduces the amount of data that must be buffered.
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
The first MapReduce job joins on a.key = b.key1; after it finishes, its result is cached and then joined with c in a second MapReduce job.
4、LEFT SEMI JOIN can replace IN/EXISTS.
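A sketch of rewriting an IN subquery as a LEFT SEMI JOIN, reusing the table names a and b from the grammar examples above:

```sql
-- Instead of: SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b)
-- write the semi join; columns of b may only appear in the ON clause:
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key);
```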
5、Hive UDF
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
Restrictions on UDTFs (table-generating functions such as explode):
1、No other columns may appear in the same SELECT:
SELECT pageid, explode(adid_list) AS myCol... is not supported
2、Nesting is not supported:
SELECT explode(explode(adid_list)) AS myCol... is not supported
3、GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported:
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
Eg: the UDTF explode transposes an array into rows.
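The single-column restriction above is usually worked around with LATERAL VIEW, which joins each input row to the rows produced by the UDTF. The table pageAds and its columns are hypothetical names for illustration:

```sql
-- Each (pageid, adid_list) row is expanded into one row per array element
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
```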
6、Hive Tips
1、Hadoop and Hive use UTF-8 encoding.
2、count(distinct)
The current version of Hive does not support multiple DISTINCTs in a single query.
3、Joins support only equality conditions.
4、DML supports only INSERT/LOAD; there is no UPDATE or DELETE.
5、HAVING is not supported.
6、NULL is meaningful in joins.
7、HAVING is not supported; if you need it, nest a subquery and filter with WHERE.
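For example, a HAVING-style filter rewritten with a subquery and WHERE, reusing the src table from the insert examples above:

```sql
-- Equivalent of: SELECT key, COUNT(*) FROM src GROUP BY key HAVING COUNT(*) > 10
SELECT t.key, t.cnt
FROM (SELECT key, COUNT(*) AS cnt FROM src GROUP BY key) t
WHERE t.cnt > 10;
```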
8、How to insert rows without deleting the existing data:
INSERT OVERWRITE TABLE xiaojun1
SELECT id, value FROM (
SELECT id, value FROM xiaojun1
UNION ALL
SELECT 4 AS id, 5 AS value FROM xiaojun1 limit 1
) u;
9、Type consistency
When inserting with INSERT OVERWRITE TABLE, the types of the selected expressions must be consistent with the table definition. For example, when initializing a STRING column to NULL:
NULL AS field_name // may be rejected: the column is defined as STRING, but a bare NULL has type void
CAST(NULL AS STRING) AS field_name // correct
Likewise, to initialize a BIGINT column to 0:
CAST(0 AS BIGINT) AS field_name