create database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Example:
hive> create database d1_hive;
OK
Time taken: 0.939 seconds
hive>
The syntax is much like MySQL's CREATE DATABASE. One thing to note: if you do not specify LOCATION, the database is stored on HDFS under /user/hive/warehouse/database_name.db; if you do specify it, the database is stored at the path you give.
drop database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Note that if the database you want to drop still contains tables, the drop is rejected by default. If you really must drop it, append CASCADE to the statement, but use it with great caution.
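For instance, dropping the non-empty d1_hive database created above might look like this (a sketch, assuming d1_hive still holds tables):

```sql
-- Fails by default because d1_hive still contains tables:
DROP DATABASE d1_hive;
-- Drops the database together with all of its tables -- use with caution:
DROP DATABASE IF EXISTS d1_hive CASCADE;
```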
create table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[LOCATION hdfs_path]
[ROW FORMAT row_format]
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables) <-- the new table's schema and data are determined by the columns of the SELECT
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name <-- copies only the table structure; no data is copied
[LOCATION hdfs_path];
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
Example:
hive> create table rating_json(json string);
OK
Time taken: 0.939 seconds
hive> select * from rating_json;
OK
Time taken: 1.207 seconds
Note that CREATE TABLE creates a managed (internal) table by default; specifying EXTERNAL creates an external table. The difference shows up on DROP: for a managed table, both the data on HDFS and the metadata are deleted; for an external table, only the metadata is deleted and the data is left in place.
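As a sketch, an external table pointing at an existing HDFS directory might be created like this (the table name, columns, and path below are made up for illustration):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS d1_hive.access_log (
  ip  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/access_log';
-- DROP TABLE access_log would remove only the metadata;
-- the files under /data/access_log stay untouched.
```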
alter table
ALTER TABLE table_name RENAME TO new_table_name;
drop table
DROP TABLE [IF EXISTS] table_name [PURGE]; -- (Note: PURGE available in Hive 0.14.0 and later)
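With PURGE, the data skips the trash directory and cannot be restored. A minimal sketch (table name hypothetical):

```sql
-- Data is deleted immediately, bypassing .Trash -- unrecoverable:
DROP TABLE IF EXISTS tmp_table PURGE;
```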
load
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Example:
hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/rating.json' OVERWRITE INTO TABLE rating_json;
Loading data to table d1_hive.rating_json
Table d1_hive.rating_json stats: [numFiles=1, numRows=0, totalSize=63602280, rawDataSize=0]
OK
Time taken: 1.796 seconds
hive>
LOCAL means the path is on the Linux server; without LOCAL, the path is on HDFS.
select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
SELECT works largely the same as in MySQL; the main thing to watch is the difference between ORDER BY / CLUSTER BY / DISTRIBUTE BY / SORT BY:
ORDER BY --> global sort; the resulting MR job has only one reducer, so use with caution on large data sets
SORT BY --> partial sort; only takes effect within each reducer; with a single reducer it behaves the same as ORDER BY
DISTRIBUTE BY --> distributes rows to reducers according to a rule; may cause data skew
CLUSTER BY --> DISTRIBUTE BY + SORT BY
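To illustrate the equivalence (table and column names below are hypothetical), these two queries produce the same result:

```sql
-- Send rows to reducers by deptno, then sort within each reducer:
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY deptno;
-- Shorthand for the above when the two column lists are identical:
SELECT * FROM emp CLUSTER BY deptno;
```

CLUSTER BY only sorts ascending; if you need a different sort order, spell out DISTRIBUTE BY plus SORT BY.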
export
INSERT OVERWRITE LOCAL DIRECTORY 'xxx'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM xxx;
or
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
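A minimal sketch of both export styles, reusing the rating_json table loaded earlier (the output paths are made up for illustration):

```sql
-- Write query results to a local directory, tab-separated:
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/data/rating_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM rating_json;

-- Export the table's data plus metadata to an HDFS path:
EXPORT TABLE rating_json TO '/tmp/rating_json_export';
```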
built-in functions
Hive ships with many built-in functions; it is enough to know how to look them up and check their usage.
# list all functions
show functions;
# describe a function, without examples
desc function function_name;
# describe a function, with examples
desc function extended function_name;
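For instance, to look up the built-in upper function:

```sql
desc function upper;           -- one-line description of upper()
desc function extended upper;  -- description plus usage examples
```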
word count in hive
Given a table wc with a single column str, containing the following data:
str
hello hello hello
world world
welcome
How do we write the HQL for the word count?
# the key idea is to explode each row into one word per row
select word, count(1) as num from
(
  select explode(split(str, ' ')) as word from wc
) a group by word;