Hive: importing, exporting, inserting, and loading data

wulicode

Category: Hive

Article link: https://blog.csdn.net/shujuwangzi/article/details/40143247

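Introduction

Hive's architecture has four main parts:

User interfaces: CLI, JDBC/ODBC, and WebUI

Metadata store: usually a relational database such as MySQL or Derby

Interpreter, compiler, optimizer, and executor

Hadoop: HDFS for storage and MapReduce for computation

There are three main user interfaces: the CLI, JDBC/ODBC, and the WebUI.

The CLI is the shell command line.

JDBC/ODBC is Hive's programmatic client interface; the JDBC (Java) side is used much like JDBC against a traditional database.

The WebUI gives access to Hive through a browser.

Hive keeps its metadata in a database (the metastore), most commonly MySQL or Derby. The metadata includes table names, each table's columns and partitions with their properties, table properties (such as whether the table is external), and the directory that holds the table's data.

The interpreter, compiler, and optimizer take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and generation of the query plan. The generated plan is stored in HDFS and is then executed by MapReduce.

Hive's data is stored in HDFS, and most queries are carried out by MapReduce. A plain select *, such as select * from table, does not generate a MapReduce job.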

Exporting data from the shell:

[root@cloud4 shell]# hive -e 'select * from t1' > test.txt

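Sorting:

order by (global sort)
order by sorts the whole input globally, so it runs with a single reducer (multiple reducers cannot guarantee a global order).
With only one reducer, a large input takes a long time to process.
With hive.mapred.mode=strict, an order by must be accompanied by a limit; the aim is to cap the amount of data the single reducer receives.
For example, with limit 100 and 50 map tasks, the reducer receives at most 100*50 = 5000 rows.
distribute by (similar to bucketing)
distribute by splits the rows across reducers, and therefore across reducer output files, according to the given columns.
sort by (similar to sorting within a bucket)
sort by is not a global sort; it orders the rows within each reducer.
So if you sort with sort by and set mapred.reduce.tasks>1, each reducer's output is sorted but the overall result is not.
cluster by
cluster by does what distribute by does and also sorts like sort by.
The sort is always ascending, however; you cannot specify asc or desc.

This is why cluster by is usually described as cluster by = distribute by + sort by (on the same columns).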
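
The difference between these clauses can be modelled in a few lines of plain Python. This is only a sketch of the semantics described above, not Hive code: the sample rows, the two-reducer setup, and the byte-sum partitioner are assumptions made for illustration (Hive uses its own hash function).

```python
# Plain-Python model of Hive's sorting clauses; illustrative only.
rows = [("b", 3), ("a", 9), ("c", 1), ("a", 4), ("b", 7), ("c", 8)]
NUM_REDUCERS = 2  # stands in for mapred.reduce.tasks


def partition_of(key):
    # Deterministic stand-in for Hive's hash partitioner (an assumption).
    return sum(key.encode()) % NUM_REDUCERS


def distribute_by(rows):
    """distribute by key: rows with the same key reach the same reducer."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for key, value in rows:
        reducers[partition_of(key)].append((key, value))
    return reducers


def cluster_by(rows):
    """cluster by key = distribute by key + sort by key (ascending only)."""
    return [sorted(part, key=lambda r: r[0]) for part in distribute_by(rows)]


def order_by_limit(map_splits, n):
    """order by value limit n: each map emits at most n rows, one reducer merges."""
    reducer_input = []
    for split in map_splits:
        reducer_input.extend(sorted(split, key=lambda r: r[1])[:n])
    top_n = sorted(reducer_input, key=lambda r: r[1])[:n]
    return len(reducer_input), top_n


clustered = cluster_by(rows)
input_size, top2 = order_by_limit([rows[:3], rows[3:]], 2)
print(clustered)
print(input_size, top2)
```

Each inner list in clustered is sorted by key, but concatenating the lists does not give a globally sorted result. In order_by_limit the single reducer sees at most n rows per map, which is the 100*50 arithmetic from the example above in miniature.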

1. Create a table

STORED AS TEXTFILE sets the storage format.

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' sets the field delimiter.

create table T_shifx2(id int,name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

2. Load data

overwrite replaces the data already in the table.

load data local inpath '/home/ffcs/hive-0.13.0-bin/examples/files/F_shifx1.txt' overwrite into table T_shifx2;

Loading data into a partition (the table t_p_employees is created in step 3):

load data local inpath '/home/ffcs/hive-0.13.0-bin/examples/files/F_p_employees1.txt' overwrite into table t_p_employees partition(country='China',state='lianbang');

3. Create a partitioned table

create table t_p_employees(name string,salary float,address STRUCT<city:string,state:string>)

partitioned by (country string,state string)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

COLLECTION ITEMS TERMINATED BY ':';

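Hive provides complex data types:
Structs: fields inside a struct are accessed with a dot (.). For example, if column c has type STRUCT<a:INT, b:INT>, field a is accessed as c.a.
Maps (key-value pairs): a value is accessed with ['key name']. For example, if map M contains a pair 'group' -> gid, the gid value is obtained with M['group'].
Arrays: all elements of an array have the same type, and indexes start at 0. For example, if array A holds ['a','b','c'], then A[1] is 'b'.

STRUCT<city:string,state:string>: a single column that holds several attributes

partitioned by (country string,state string): multiple partition columns

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',': the delimiter between columns

COLLECTION ITEMS TERMINATED BY ':': the delimiter between the attributes inside one column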
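
To make the two delimiters concrete, here is a small Python sketch of how one text row of t_p_employees maps onto the columns declared above. The sample row and the helper name parse_employee are hypothetical; the code only mimics the layout and is not how Hive's SerDe is implemented.

```python
def parse_employee(line, partition):
    """Split one text row the way the table's delimiters describe it."""
    # FIELDS TERMINATED BY ',' separates the three declared columns.
    name, salary, address = line.split(",")
    # COLLECTION ITEMS TERMINATED BY ':' separates the struct's fields.
    city, state = address.split(":")
    row = {
        "name": name,
        "salary": float(salary),
        "address": {"city": city, "state": state},
    }
    # Partition columns come from the partition spec, not from the file.
    row.update(partition)
    return row


row = parse_employee(
    "zhangsan,5000.5,Fuzhou:Fujian",  # hypothetical sample line
    {"country": "China", "state": "lianbang"},
)
print(row["address"]["city"])  # the Python counterpart of address.city
```

Note that the partition column state and the struct field address.state are different things: the first comes from the partition spec, the second from the data file.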

create table t_employees(id int,name string) partitioned by(dept string)

row format delimited

fields terminated by ','

stored as textfile

location '/user/hive/warehouse/mydb.db/examples';

Using Hive array, map, and struct:

http://www.cnblogs.com/end/archive/2013/01/17/2863884.html

4. Add a partition

alter table t_p_employees add partition(country='China',state='lianhe');

5. Drop a partition

ALTER TABLE page_view

DROP PARTITION (dt='2008-08-08', country='us');

List the partitions of a table: show partitions table_name;

6. Queries

(1) Top-N query (a stands for the column to rank by)

select * from p_emp order by a desc limit 2;

(2) Partition query

select * from p_emp where p_emp.state='fuzhou';

(3) Join query (inner join)

select p_emp.* from p_emp join p_emp2 on p_emp.a=p_emp2.a;

select * from emp join emp2 on emp.id=emp2.id;

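By default Hive streams the last table in the join and buffers the others, so put the largest table last to speed up the query. To stream a different table, name it in a STREAMTABLE hint:

select /*+ STREAMTABLE(emp) */ emp.id,emp.name,emp2.name from emp join emp2 on emp.id=emp2.id;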

Join query (outer join)

Left outer join

select emp.id,emp.name,emp2.name from emp left outer join emp2 on emp.id=emp2.id;

Full outer join

select emp.id,emp2.id,emp.name,emp2.name from emp full outer join emp2 on emp.id=emp2.id;

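Left semi join: an efficient way to keep the left-table rows that have a match in the right table. Only columns of the left table may appear in the select list.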

select emp.id,emp.name from emp left semi join emp2 on emp.id=emp2.id;
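
The difference between an inner join and a left semi join shows up when the right table has duplicate keys. The following Python sketch models both on small in-memory lists; the sample rows are made up for illustration and the functions only imitate the result sets, not Hive's execution.

```python
# Hypothetical sample data: (id, name)
emp = [(1, "ann"), (2, "bob"), (3, "cat")]
emp2 = [(1, "x"), (1, "y"), (3, "z")]


def inner_join(left, right):
    """One output row per matching (left, right) pair."""
    return [(lid, lname) for lid, lname in left for rid, _ in right if lid == rid]


def left_semi_join(left, right):
    """Each left row appears at most once, if any right row matches."""
    right_ids = {rid for rid, _ in right}
    return [(lid, lname) for lid, lname in left if lid in right_ids]


inner = inner_join(emp, emp2)
semi = left_semi_join(emp, emp2)
print(inner)
print(semi)
```

The inner join repeats the row for id 1 because emp2 contains that id twice, while the semi join returns it once, much like an IN or EXISTS subquery would.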

7. See where Hive stores a table's data:

hadoop fs -ls /user/hive/warehouse/mydb.db/t_e_emp/dept=IT_BG

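8. Export data

The target path is treated as a directory, even though it ends in .txt here; the result files are written inside it.

insert overwrite local directory '/home/ffcs/hive-0.13.0-bin/examples/files/F_shifx1.txt'

select * from emp;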

9. Insert data into a table from a query

insert overwrite table emp3 partition (dept='IT-BG',country='American') select * from emp4;

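emp3 has two columns, not counting its partition columns.

select * from emp4 must therefore return exactly two columns, counting emp4's own partition columns if it has any, because select * includes them.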

10. Permission management

View the system user name:

set system:user.name;

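11. Table-generating functions

(dual here is a one-row helper table that you create yourself; Hive does not ship one.)

select explode(array(1,2,3)) from dual;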

12. Base conversion

select conv(17,10,2) from dual;

13. Nesting functions

select rpad(reverse(cast(conv(fd0003,10,16) as string)),10,'0') from zte_1x_cdt_tbl limit 4 ;

select concat(rpad(reverse(cast(conv(fd0002,10,16) as string)),8,'0'),rpad(reverse(cast(conv(fd0003,10,16) as string)),7,'0')) as cur_subscriber_number from zte_1x_cdt_tbl ;

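lpad(called_number,4,'') keeps the leftmost four characters: values longer than four characters are truncated, and shorter values would be left-padded with the pad string (empty here, so check how your Hive version handles short inputs).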
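
The nested expressions above are easier to follow when each function is traced separately. The Python helpers below are a rough model of conv, reverse, rpad, and lpad for non-negative integers and non-empty pad strings; they are assumptions written for this illustration, not Hive's implementation, and the input value 4660 is made up.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def conv(value, from_base, to_base):
    """Model of conv(): convert a non-negative number between bases."""
    n = int(str(value), from_base)
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        n, remainder = divmod(n, to_base)
        out = DIGITS[remainder] + out
    return out


def _fill(pad, count):
    # Behaviour with an empty pad string is deliberately not modelled here.
    if not pad:
        raise ValueError("pad string must not be empty in this sketch")
    return (pad * count)[:count]


def rpad(s, length, pad):
    """Truncate to length, or pad on the right up to length."""
    return s[:length] if len(s) >= length else s + _fill(pad, length - len(s))


def lpad(s, length, pad):
    """Truncate to length, or pad on the left up to length."""
    return s[:length] if len(s) >= length else _fill(pad, length - len(s)) + s


binary = conv(17, 10, 2)
hex_digits = conv(4660, 10, 16)
# Same shape as rpad(reverse(cast(conv(x,10,16) as string)),10,'0')
nested = rpad(hex_digits[::-1], 10, "0")
print(binary, hex_digits, nested)
print(lpad("13912345678", 4, "0"), lpad("12", 4, "0"))
```

Reading the nested call from the inside out: the decimal value is converted to hexadecimal, the digit string is reversed, and the result is right-padded with zeros to a fixed width.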

14. Get the character for a hexadecimal value

unhex('30') = '0'

15. Get the UNIX timestamp

select unix_timestamp() from dual;

Get a formatted date (MM is the month; lowercase mm is the minute):

select from_unixtime(1405478263,'yyyy-MM-dd--HH:mm:ss') from dual;
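
As a cross-check of the timestamp above, the same conversion can be done in Python. This is only a sketch: the strftime directives %m and %M correspond to the Java pattern letters MM and mm, and the two time zones shown (UTC and UTC+8) are assumptions, since from_unixtime formats in the session's own time zone.

```python
from datetime import datetime, timedelta, timezone

ts = 1405478263
fmt = "%Y-%m-%d--%H:%M:%S"  # Python spelling of yyyy-MM-dd--HH:mm:ss

# Format the same instant in UTC and in UTC+8.
as_utc = datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)
as_utc8 = datetime.fromtimestamp(ts, tz=timezone(timedelta(hours=8))).strftime(fmt)

print(as_utc)
print(as_utc8)
```

With a lowercase m in the month position, the pattern would print the minute (37) where the month is expected.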

16. Round to a fixed number of decimal places

format_number(3.56777,3) returns 3.568

17. Powers and logarithms

format_number(10*log10(power(10,((-0.5)*fd0043)/10)/(1-power(10,((-0.5)*fd0043)/10))),4)

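18. Rows to columns (expand a delimited column into one row per value)

col3 holds comma-separated values, for example:

1,2,3

4,5,6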

select col1,col2,name

from test_jzl_20140701_test

lateral view explode(split(col3,',')) col3 as name;

a b 1

a b 2

a b 3

c d 4

c d 5

c d 6

19. Columns to rows (collapse grouped rows back into one delimited value)

select col1,col2,concat_ws(',',collect_set(col3))
from tmp_jiangzl_test
group by col1,col2;
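
Sections 18 and 19 are inverse operations, which a short Python sketch can show as a round trip. The functions below only imitate the result of lateral view explode(split(...)) and concat_ws(',', collect_set(...)) on in-memory tuples; the function names are hypothetical, and unlike this sketch, Hive's collect_set does not promise any particular element order.

```python
# Source rows: (col1, col2, col3)
rows = [("a", "b", "1,2,3"), ("c", "d", "4,5,6")]


def lateral_view_explode(rows):
    """One output row per comma-separated item in col3."""
    return [(c1, c2, item) for c1, c2, c3 in rows for item in c3.split(",")]


def collect_and_concat(rows):
    """Group by (col1, col2), drop duplicate values, join them with ','."""
    groups = {}
    for c1, c2, c3 in rows:
        values = groups.setdefault((c1, c2), [])
        if c3 not in values:  # collect_set keeps each value once
            values.append(c3)
    return [(c1, c2, ",".join(values)) for (c1, c2), values in groups.items()]


exploded = lateral_view_explode(rows)
merged = collect_and_concat(exploded)
print(exploded)
print(merged)
```

Because collect_set removes duplicates, the round trip only restores the original rows when the values within each group are distinct.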

20. Permission management

grant all on database mydb to user ffcs;

This grants user ffcs all privileges on the mydb database.
