Hadoop subproject: Hive

坚持到底cw

Category: hadoop学习整理 (Hadoop study notes). Tags: hadoop, hive, database

Article link: https://blog.csdn.net/chenwei825825/article/details/16985479

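1. Hive is a distributed data warehouse that manages data stored in HDFS and queries it with a SQL-like language. It is not inherently column-oriented; the on-disk layout depends on the file format chosen for each table.
2. Traditional databases enforce schema on write; Hive applies schema on read. Traditional databases support updates, transactions, and indexes, which Hive did not yet support at the time of writing.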
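To make the schema-on-read idea concrete, here is a minimal Python sketch (not Hive code). It assumes a hypothetical three-column schema and a tab delimiter; `SCHEMA` and `read_row` are names invented for this illustration. The raw lines are stored without any validation, and the schema is applied only when a row is read.

```python
# Hypothetical schema for illustration: (column name, parser) pairs.
SCHEMA = [("year", str), ("temp", int), ("quality", int)]


def read_row(line, delimiter="\t"):
    """Apply the schema to one raw line at read time.

    A field that is missing or cannot be parsed becomes None,
    mirroring how Hive yields NULL instead of rejecting the row.
    """
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, parse) in enumerate(SCHEMA):
        try:
            row[name] = parse(fields[i])
        except (IndexError, ValueError):
            row[name] = None
    return row


# "Loading" is just keeping the raw text; nothing is checked at this point.
raw_lines = ["1950\t0\t1", "1950\tabc\t1", "1949\t111"]

# The schema is applied only now, when the data is queried.
rows = [read_row(line) for line in raw_lines]
```

The trade-off this sketches: loading is cheap because it is only a file copy or move, while malformed fields surface later, at query time, as missing values.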
3. Type conversion: any integer type can be implicitly converted to a wider integer type.
Hive data types: tinyint, smallint, int, bigint, float, double, boolean, string, array, map, struct
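The widening rule for the integer types can be sketched in Python as follows. This is only an illustration of the rule, not Hive's implementation; the byte widths (1, 2, 4, and 8 bytes) are the standard sizes of these signed types, and `can_widen` and `value_range` are hypothetical helper names.

```python
# Storage width in bytes of each signed integer type.
INTEGER_BYTES = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}


def value_range(type_name):
    """Return the (min, max) values a signed integer type can hold."""
    bits = INTEGER_BYTES[type_name] * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def can_widen(src, dst):
    """True if every value of src fits in dst, so no data can be lost."""
    return INTEGER_BYTES[src] <= INTEGER_BYTES[dst]
```

For example, `can_widen("tinyint", "int")` holds, while `can_widen("bigint", "int")` does not, because narrowing could lose data.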
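4. Hive organizes data into tables, which gives structure to data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore, the central repository of Hive metadata.
5. By default, the metastore includes an embedded Derby database instance backed by local disk. Only one embedded Derby database can access the database files on disk at a time, so only one Hive session can be open per metastore. To support multiple sessions, use a standalone database instead, typically MySQL.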
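6. Creating tables:
>create table records (year STRING, temp INT, quality INT) row format delimited fields terminated by '\t';
(specifies tab as the field delimiter)
>create table ... as select ... from ... (the new table's columns are derived from the columns of the select)
>create table ... like ... (copies the schema of an existing table, without its data)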
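7. Loading data:
>load data local inpath 'input/sam.txt' overwrite into table records;
(overwrite deletes all files already in the table's directory; without it, the new file is added to the directory)
>insert overwrite table target [partition (dt='2010-01-01')] select col1, col2 from source;
(overwrite cannot be omitted here in the Hive version these notes describe; the contents of the target table are replaced by the result of the select)
>from table_user
insert overwrite table a1 select ...
insert overwrite table a2 select ...
...;
(multi-table insert: the from clause comes first, and a single scan of the source feeds every insert)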
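8. Managed table: created without external. Loading data moves it into the warehouse directory, and DROP deletes both the metadata and the data.
External table: created with external. The data is not moved into the warehouse directory, and DROP deletes only the metadata, leaving the data untouched. External tables can be used to export data from Hive for other programs to use.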
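The different DROP behavior can be imitated on a local file system. The sketch below is a toy model in Python, not Hive itself: `ToyCatalog` is a hypothetical class in which a dictionary stands in for the metastore and a temporary directory stands in for the warehouse.

```python
import shutil
import tempfile
from pathlib import Path


class ToyCatalog:
    """Toy stand-in for the metastore plus the warehouse directory."""

    def __init__(self, warehouse):
        self.warehouse = warehouse
        self.tables = {}  # table name -> (data directory, is_external)

    def create_managed(self, name):
        location = self.warehouse / name
        location.mkdir(parents=True)
        self.tables[name] = (location, False)
        return location

    def create_external(self, name, location):
        # The data stays where it already is; only metadata is recorded.
        self.tables[name] = (location, True)

    def load(self, name, src):
        # Loading moves the file into the table's directory.
        location, _ = self.tables[name]
        shutil.move(str(src), str(location / src.name))

    def drop(self, name):
        location, is_external = self.tables.pop(name)  # metadata always goes
        if not is_external:
            shutil.rmtree(location)  # managed: the data goes too


root = Path(tempfile.mkdtemp())
catalog = ToyCatalog(root / "warehouse")

# Managed table: the loaded file is moved into the warehouse.
managed_dir = catalog.create_managed("managed_t")
src = root / "sam.txt"
src.write_text("1950\t0\t1\n")
catalog.load("managed_t", src)
moved = (not src.exists()) and (managed_dir / "sam.txt").exists()

# External table: points at data that lives outside the warehouse.
external_dir = root / "external_data"
external_dir.mkdir()
(external_dir / "part.txt").write_text("1949\t111\t1\n")
catalog.create_external("external_t", external_dir)

catalog.drop("managed_t")
catalog.drop("external_t")

managed_data_survives = managed_dir.exists()
external_data_survives = (external_dir / "part.txt").exists()
```

After both drops, the managed table's directory is gone while the external data is still on disk, which is why external tables suit data shared with other programs.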
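9. Partitions:
Use partitioned by when creating the table. The columns it defines are full-fledged columns of the table, called partition columns, but the data files do not contain values for them; the values come from the partition directory names instead.
>create [external] table logs (ts bigint, line string) partitioned by (dt string, country string);
>load data local inpath 'input/sam.txt' into table logs partition (dt='2010-01-01', country='GB');
When loading data into a partitioned table, the partition values must be specified explicitly.
>show partitions logs;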
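Since the data files carry no partition values, those values live in the directory names. The following Python sketch shows that mapping in both directions. It assumes the default warehouse location `/user/hive/warehouse`; `partition_path` and `partition_values` are hypothetical helpers written for this illustration.

```python
def partition_path(table_dir, partition):
    """Build the directory for a partition: one col=value level per column."""
    parts = [f"{col}={val}" for col, val in partition.items()]
    return "/".join([table_dir] + parts)


def partition_values(path, table_dir):
    """Recover the partition column values from a data file's path."""
    relative = path[len(table_dir):].strip("/")
    values = {}
    for segment in relative.split("/"):
        if "=" in segment:  # skip the file name itself
            col, val = segment.split("=", 1)
            values[col] = val
    return values


# Assumed warehouse location for the logs table.
table_dir = "/user/hive/warehouse/logs"

path = partition_path(table_dir, {"dt": "2010-01-01", "country": "GB"})
recovered = partition_values(path + "/sam.txt", table_dir)
```

A query that filters on `dt` or `country` therefore only needs to read the matching directories.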
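10. Buckets: clustered by (id) into 5 buckets
Organizing a table or partition into buckets enables more efficient query processing and more efficient sampling.
>create table user (id int, name string) clustered by (id) sorted by (id ASC) into 4 buckets;
Hive assigns a row to a bucket by hashing the value and taking the remainder after dividing by the number of buckets.
>select * from user tablesample(bucket 1 out of 4 on id); (samples the table)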
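The hash-and-remainder rule can be sketched in Python. The sketch assumes that an int column hashes to its own value and that the hash is masked to a non-negative number before the remainder is taken; both points are assumptions about Hive's behavior, and `bucket_for` and `sample_bucket` are hypothetical names.

```python
def bucket_for(user_id, num_buckets):
    """Return the 0-based bucket index for an int id."""
    hash_code = user_id  # assumption: an int hashes to itself
    return (hash_code & 0x7FFFFFFF) % num_buckets


def sample_bucket(rows, bucket_number, num_buckets):
    """Rows picked by 'bucket x out of y on id', where x is 1-based."""
    return [
        row for row in rows
        if bucket_for(row["id"], num_buckets) == bucket_number - 1
    ]


users = [{"id": i, "name": f"user{i}"} for i in range(8)]

# Counterpart of: tablesample(bucket 1 out of 4 on id)
sample = sample_bucket(users, 1, 4)
sampled_ids = [row["id"] for row in sample]
```

Because bucket membership follows from the id alone, a sample of one bucket out of four only has to read roughly a quarter of the data.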
11. Altering tables:
>alter table source rename to target;
>alter table source add columns (col3 string);
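12. Sorting:
order by produces a totally ordered result, but it does so with a single reducer, which is inefficient for large data sets.
sort by produces one sorted file per reducer.
>select year, temp from source distribute by year sort by year ASC, temp DESC;
(distribute by year ensures that all rows for the same year end up in the same reducer partition)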
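A small Python simulation may help show how distribute by and sort by cooperate. It is a sketch only: `year % num_reducers` stands in for the real hash partitioner, and `distribute_and_sort` is a hypothetical function name.

```python
from collections import defaultdict


def distribute_and_sort(rows, num_reducers):
    """Simulate 'distribute by year sort by year ASC, temp DESC'."""
    reducers = defaultdict(list)
    for year, temp in rows:
        # distribute by year: equal years always reach the same reducer.
        # Toy partitioner, standing in for a hash of the year.
        reducers[year % num_reducers].append((year, temp))
    # sort by: each reducer sorts only the rows it received.
    return {
        reducer: sorted(part, key=lambda row: (row[0], -row[1]))
        for reducer, part in reducers.items()
    }


rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
output = distribute_and_sort(rows, 2)
```

Each reducer's output is sorted, but the outputs taken together are not globally ordered; only order by, with its single reducer, gives a total order.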
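13. Inner join:
>select sales.*, things.* from sales join things on (sales.id = things.id);
In the Hive version these notes describe, only equijoins are supported, and only one table may appear in the from clause; further tables are brought in with join.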
14. Outer joins: these can find rows in the joined tables that have no match.
Left/right outer join: left/right outer join ... on ...
Full outer join: full outer join ... on ...
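The way outer joins keep unmatched rows can be sketched in Python with rows held as `(id, value)` pairs. The sample data and the function names `left_outer_join` and `full_outer_join` are made up for this illustration; a right outer join is the left outer join with the two tables swapped.

```python
def left_outer_join(left, right):
    """Keep every left row; unmatched rows get None on the right side."""
    result = []
    for key, lval in left:
        matches = [rval for rkey, rval in right if rkey == key]
        if matches:
            result.extend((key, lval, rval) for rval in matches)
        else:
            result.append((key, lval, None))
    return result


def full_outer_join(left, right):
    """Left outer join plus the right rows that matched nothing."""
    result = left_outer_join(left, right)
    left_keys = {key for key, _ in left}
    result.extend(
        (key, None, rval) for key, rval in right if key not in left_keys
    )
    return result


sales = [(2, "Joe"), (4, "Hank"), (0, "Ali")]
things = [(2, "Tie"), (4, "Coat"), (1, "Scarf")]

left_result = left_outer_join(sales, things)
full_result = full_outer_join(sales, things)
```

In `left_result` the sale with id 0 appears with `None` for the missing item, and `full_result` additionally lists the item with id 1 that nobody bought.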
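15. Semi join: left semi join
>select * from things left semi join sales on (sales.id = things.id);
This is equivalent to the following statement:
>select * from things where things.id in (select id from sales);
In a left semi join query, the right-hand table may appear only in the on clause.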
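To see why a semi join behaves like an IN subquery and not like an ordinary join, compare the two in this Python sketch. The rows are `(id, value)` pairs with made-up sample data, and both function names are hypothetical.

```python
def left_semi_join(left, right):
    """Keep each left row at most once if its key exists on the right."""
    right_keys = {key for key, _ in right}
    return [row for row in left if row[0] in right_keys]


def inner_join_left_columns(left, right):
    """Inner join projected to the left columns: one row per match."""
    return [lrow for lrow in left for rkey, _ in right if rkey == lrow[0]]


things = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
sales = [(2, "Joe"), (4, "Hank"), (3, "Eve"), (2, "Hank")]

semi = left_semi_join(things, sales)
inner = inner_join_left_columns(things, sales)
```

The item with id 2 was sold twice, so the inner join returns it twice, while the semi join returns it once. Only the keys of the right-hand table are consulted, which matches the restriction that it may appear only in the on clause.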
