[Hadoop] Hive Database and Table Basics

Copyright notice: this is an original post by the author; do not repost without permission. https://blog.csdn.net/wawa8899/article/details/80322019


Creating a database

hive> create database if not exists db1;
hive> create schema if not exists db2;

Dropping a database

hive> drop database db2;
hive> drop schema db1;

Creating a table

CREATE TABLE IF NOT EXISTS employee (
eid INT,
name STRING,
salary STRING,
job STRING,
year INT)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Loading data into a table

Prepare a data file, sample.txt:

[root@g12-1 ~]# cat /tmp/sample.txt 
1201	Gopal	45000	TechnicalManager	2013	
1202	Manisha	45000	ProofReader	2013
1203	Masthanvali	40000	TechnicalWriter	2014
1204	Kiran	40000	HrAdmin	2014
[root@g12-1 ~]#

Load the file into the table:

hive> LOAD DATA LOCAL INPATH '/tmp/sample.txt' OVERWRITE INTO TABLE employee;
Loading data to table db1.employee
Table db1.employee stats: [numFiles=1, numRows=0, totalSize=150, rawDataSize=0]
OK
Time taken: 0.354 seconds
hive> select * from employee;
OK
1201	Gopal	45000	TechnicalManager	2013
1202	Manisha	45000	ProofReader	2013
1203	Masthanvali	40000	TechnicalWriter	2014
1204	Kiran	40000	HrAdmin	2014
Time taken: 0.094 seconds, Fetched: 4 row(s)
hive>

HiveQL

SELECT...WHERE

hive> select * from employee where salary > 40000;

ORDER BY

hive> select * from employee order by eid;

GROUP BY

hive> select salary,count(salary) from employee group by salary;

SELECT...JOIN

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);


Partitioned tables

    In Hive, a database is a directory and a table is a directory; a partition is a subdirectory under the table directory.

    create table xxx(...) partitioned by (...);

    alter table xxx add partition (...);

    load data local inpath ... into table xxx partition (...);
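The placeholders above can be filled in concretely. A minimal sketch, assuming a hypothetical `logs` table partitioned by date (the table, columns, and file path are illustrative, not from the original article):

```sql
-- Each distinct value of the partition column dt becomes a
-- subdirectory under the table's directory in HDFS.
CREATE TABLE IF NOT EXISTS logs (
  ip STRING,
  msg STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Add a partition explicitly ...
ALTER TABLE logs ADD PARTITION (dt='2018-05-01');

-- ... or let LOAD DATA create it while loading a local file.
LOAD DATA LOCAL INPATH '/tmp/logs-2018-05-02.txt'
INTO TABLE logs PARTITION (dt='2018-05-02');
```

Note that the partition column `dt` is not declared in the column list; Hive treats it as a virtual column derived from the directory name.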

Bucketed tables

    create table xxx(...)  ... clustered by (fileName) into n buckets;    

    A bucketed table hash-partitions rows into a fixed number of data files, based on the hash of the bucketing column.
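A hedged example of the CLUSTERED BY syntax above, reusing the employee schema from this article (the table name `employee_bucketed` and the bucket count of 4 are arbitrary choices for illustration):

```sql
-- Rows are assigned to one of 4 files by hash(eid) % 4.
CREATE TABLE IF NOT EXISTS employee_bucketed (
  eid INT,
  name STRING,
  salary STRING,
  job STRING,
  year INT)
CLUSTERED BY (eid) INTO 4 BUCKETS
STORED AS TEXTFILE;

-- On older Hive versions, enforce bucketing before inserting:
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE employee_bucketed
SELECT * FROM employee;
```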


HiveQL tuning

1) EXPLAIN: inspect the execution plan

    explain extended select count(*) from employee;

    explain formatted select count(*) from employee;

2) Enable LIMIT optimization to avoid full-table scans, using the sampling mechanism

    select * from employee limit 1,2;

    Set hive.limit.optimize.enable=true

3) JOIN

    Use a map-side join hint (/*+ MAPJOIN(small_table) */) when one side of the join is small, or mark the largest table with /*+ STREAMTABLE(t) */ so it is streamed rather than buffered.

    Order the joined tables so that their size increases from left to right: Hive buffers the left-hand tables in memory and streams the rightmost (largest) one.
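A sketch of both hints applied to the CUSTOMERS/ORDERS join shown earlier, assuming ORDERS is the larger table:

```sql
-- MAPJOIN: replicate the small table (CUSTOMERS) to every mapper,
-- so the join runs map-side with no shuffle.
SELECT /*+ MAPJOIN(c) */ c.ID, c.NAME, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

-- STREAMTABLE: stream the large table (ORDERS) instead of buffering it.
SELECT /*+ STREAMTABLE(o) */ c.ID, c.NAME, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
```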

4) Local mode: run the whole job on a single machine

    Intended for small data sets.

    hive.exec.mode.local.auto=true     // default: false

    hive> set hive.exec.mode.local.auto=true;

    ...    

