(4) Types of Tables in Hive

Most recently recommended article published on 2022-07-18 21:10:24

秦时盖聂


Category: hive

Article link: https://blog.csdn.net/qinshi965273101/article/details/84662524


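Internal (managed) table: the table comes first, then the data. Hive first creates the directory that belongs to the table, and the data files are then uploaded into that directory to become the table's data.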

create table people (col1 string, col2 string) row format delimited fields terminated by '\t';
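As a minimal sketch of the managed-table workflow, the statements below load a file into the table and then inspect it. The local path /opt/people.txt is a hypothetical example and not part of the original setup.

```sql
-- Hypothetical local file with two tab-separated columns.
load data local inpath '/opt/people.txt' into table people;

-- The output includes "Table Type: MANAGED_TABLE" and the table's Location
-- under the Hive warehouse directory.
describe formatted people;

-- Dropping a managed table removes the metadata AND the data directory
-- (the files may go to the HDFS trash if trash is enabled).
drop table people;
```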

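External table: the data comes first, then the table. The data files already exist on HDFS, and the table is created afterwards and pointed at them in order to manage the data.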

create external table people (col1 string, col2 string) row format delimited fields terminated by '\t' location '/people';
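The practical difference from a managed table shows up when the table is dropped. The sketch below assumes the external table above was created on /people and that the statements are run from the Hive CLI, where dfs commands are available.

```sql
-- The output includes "Table Type: EXTERNAL_TABLE" and a Location ending in /people.
describe formatted people;

-- Dropping an external table removes only the metadata.
drop table people;

-- The data files are still present on HDFS.
dfs -ls /people;
```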

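Partitioned table: the purpose is to improve query efficiency, because a query that filters on the partition columns only has to read the matching directories.

The example here uses two partition columns.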

create table phone (col1 string, col2 string) partitioned by (country string, size string) row format delimited fields terminated by '\t';

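The partition values are specified when the data is loaded. Under the phone directory Hive creates a country=china directory, and inside country=china it creates a size=large directory. With more partition columns the nesting continues in the same way.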

load data local inpath '/opt/test.txt' overwrite into table phone partition (country='china', size='large');
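To check the result of the load, the partitions and the directory layout can be inspected as sketched below. The warehouse path /user/hive/warehouse is the usual default and is an assumption here; adjust it to your installation.

```sql
-- Lists one line per partition, in the form country=china/size=large.
show partitions phone;

-- Filtering on the partition columns lets Hive read only the matching directory.
select col1, col2 from phone where country = 'china' and size = 'large';

-- Assumed default warehouse location; run from the Hive CLI.
dfs -ls /user/hive/warehouse/phone/country=china/size=large;
```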

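Bucketed table, purpose: data sampling, that is, splitting a large data set into several smaller pieces, each of which still preserves the characteristics of the source data.

Scenario: testing against a huge data set takes a long time, so a small portion of the data is sampled from it and used for testing instead.

How it works: Hive relies on the scattering property of a hash function. It computes a hash value from a chosen column of the source data and takes that value modulo the number of buckets. This spreads the source data fairly evenly across the buckets, and each bucket keeps the characteristics of the source data, so it can stand in for the full data set in tests.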
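The hash-and-modulo idea can be previewed with Hive's built-in hash() and pmod() functions. This is only an approximation under stated assumptions: it assumes a source table students(id int, name string), and the hash function Hive actually uses for bucketing depends on the Hive version, so the bucket numbers here may not match the real bucket files.

```sql
-- Assumes a table students(id int, name string).
-- pmod keeps the result non-negative even when the hash value is negative.
select pmod(hash(id), 2) as bucket_no, count(*) as row_count
from students
group by pmod(hash(id), 2);
```

If the row counts of the two groups are close, the column is a reasonable choice for bucketing.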

------------------------------------------------------------------------------------------------------------------------------------------------

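Hive does not enforce bucketing by default, so it has to be enabled manually: set hive.enforce.bucketing = true; (this applies to Hive 1.x; from Hive 2.0 the property was removed and bucketing is always enforced).

Suppose there is a source table students. We create a new bucketed table with 2 buckets, hashed on the id column.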

create table students_temp (id int, name string) clustered by (id) into 2 buckets row format delimited fields terminated by '\t';

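Insert the source table's data into the bucketed table. The statement is turned into map and reduce tasks, with 2 reducers. Each bucket corresponds to one file in the table's directory.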

insert into students_temp select * from students;
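After the insert finishes, the bucket layout can be checked as sketched below. The warehouse path is the usual default and is an assumption; the file names mentioned in the comment are what Hive typically produces, not a guarantee.

```sql
-- The output includes "Num Buckets: 2" and "Bucket Columns: [id]".
describe formatted students_temp;

-- Assumed default warehouse location; run from the Hive CLI.
-- With 2 buckets the directory is expected to contain two files
-- (typically named 000000_0 and 000001_0).
dfs -ls /user/hive/warehouse/students_temp;
```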

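Query the data of a single bucket. The statement below splits the data into two parts and takes the first one.

If the table is not bucketed, the sample is still returned, but the data has to be read in full and the sample is computed at query time.

If the number of parts differs from the table's bucket count, the data is likewise read and the sample is computed at query time.

If the number of parts equals the table's bucket count, Hive only needs to locate the file of the matching bucket, which is far more efficient.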

 select * from students_temp tablesample(bucket 1 out of 2 on id);
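A few variations on the same sampling clause, as a sketch against the students_temp table defined above:

```sql
-- The second of the two buckets.
select * from students_temp tablesample(bucket 2 out of 2 on id);

-- The part count (4) differs from the table's bucket count (2).
select * from students_temp tablesample(bucket 1 out of 4 on id);

-- Sampling on rand() instead of a column picks rows randomly,
-- so the result can change between runs.
select * from students_temp tablesample(bucket 1 out of 2 on rand());
```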
