Hive Practice (Part 1)

谁说大象不能跳舞

Article link: https://blog.csdn.net/jiahonhyu0609/article/details/88744725

1. Create the article table (as an internal table)

create table article(sentence string)
row format delimited fields terminated by '\n';

-- Load data from the local filesystem
load data local inpath '/home/wl/mapreduce_wordcount_python/The_Man_of_Property.txt' overwrite into table article;

2. Word count in Hive

Step 1: split each line into words on spaces
select explode(split(sentence, ' ')) as word from article limit 10;

Step 2: aggregate the words from the previous step
select word, count(*) from (select explode(split(sentence, ' ')) as word from article limit 10) t group by word limit 10;
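
As a cross-check of what the two queries compute, here is a minimal plain-Python sketch of the same logic. It is only a model, not Hive code: the `word_count` helper and the sample `lines` are hypothetical stand-ins for the article table. Splitting on a single space plays the role of split(sentence, ' ') followed by explode, and the counter plays the role of group by word with count(*).

```python
from collections import Counter

def word_count(lines):
    """Model of: select word, count(*) from (... explode(split(sentence, ' ')) ...) group by word."""
    counts = Counter()
    for sentence in lines:
        # str.split(' ') splits on every single space, so consecutive
        # spaces produce empty-string "words" that are counted too
        for word in sentence.split(' '):
            counts[word] += 1
    return dict(counts)

# Hypothetical rows standing in for the article table
lines = ["the man of property", "the man"]
print(word_count(lines))
```

Note that the inner limit 10 in the Hive query restricts the aggregation to 10 words; the sketch counts every word it is given.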

Hive architecture flow diagram
[Figure: image not included]
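3. A Hive table is essentially a set of files on Hadoop HDFS
When should a table be partitioned?
For example, with data that arrives continuously: 20 billion user-behavior records have to be queried out of 1 trillion, and new data is generated every day. Each day's data is placed in its own directory, for example:
dt=20180414
so a query only needs to read the specific directories:
select * from badou.news where dt='20180414';
For several days, use where dt in ('20180414','20180413') instead. If the list covers 20 or more days, generate it with a script and copy the generated list into the query before running it.
Partitioning also matters for joins: if the other table is not partitioned by dt, the join has to process a very large amount of data, which makes it complex and slow.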
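
The "generate it with a script" step can be sketched as follows. This is a minimal example under assumptions: the helper name `dt_in_clause` is hypothetical, and it assumes the partition column holds dates formatted as yyyyMMdd, as in dt=20180414.

```python
from datetime import date, timedelta

def dt_in_clause(end, days):
    """Build a dt in (...) filter for `days` consecutive days ending at `end`."""
    # Walk backwards from the end date, one day at a time
    values = [(end - timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"dt in ({quoted})"

# Two days, matching the example in the text
print("select * from badou.news where " + dt_in_clause(date(2018, 4, 14), 2) + ";")
```

For a 20-day window, call dt_in_clause(date(2018, 4, 14), 20) and paste the printed statement into the Hive CLI.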

4. set hive.cli.print.header=true;
With this setting, query results are printed with the column names.
select * from udata limit 10;
Create the udata table:

 CREATE TABLE udata (
   user_id INT,
   item_id INT,
   rating INT,
   timestamps STRING)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
   STORED AS TEXTFILE;
	
Load the data:
LOAD DATA LOCAL INPATH '/home/wl/hive/ml-100k/u.data'  OVERWRITE INTO TABLE udata;

Bucketing:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
Split user_id into 3 buckets (by taking the modulus, so the result can only be 0, 1, or 2):
196 % 3 = 1
186 % 3 = 0
22 % 3 = 1
23 % 3 = 2
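
The modulus rule can be checked with a small plain-Python sketch. It assumes, as the text does, that the bucket index of an integer user_id is simply user_id modulo the number of buckets; the helper names `bucket_of` and `assign_buckets` are hypothetical.

```python
from collections import defaultdict

def bucket_of(user_id, num_buckets):
    """Bucket index as described in the text: user_id modulo the bucket count."""
    return user_id % num_buckets

def assign_buckets(user_ids, num_buckets):
    """Group ids by bucket index, like rows landing in separate bucket files."""
    buckets = defaultdict(list)
    for uid in user_ids:
        buckets[bucket_of(uid, num_buckets)].append(uid)
    return dict(buckets)

# user_id values taken from the four sample rows
print(assign_buckets([196, 186, 22, 244], 3))
```

With these four rows, bucket 2 stays empty; an id such as 23 would land there.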

Set up 4 buckets

-- This is equivalent to using 4 reducers, so four files are generated

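First enable this setting:
set hive.enforce.bucketing = true;
Create the table:
create table bucket_user (id int) clustered by (id) into 4 buckets;
Insert the data (cast() converts the type):
insert overwrite table bucket_user select cast(user_id as int) from badou.udata;
Sampling query: the table has 4 buckets and the sample divides by 16, so 4/16 = 1/4, i.e. the query reads 1/4 of one bucket's data.
select * from badou.bucket_user tablesample(bucket 1 out of 16 on id) limit 10;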
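
To make the sampling arithmetic concrete, here is a minimal plain-Python model. It assumes that bucket x out of y on id keeps the rows whose id modulo y equals x - 1, consistent with the modulus-based bucketing described earlier; the `sample` helper and the 1600 made-up ids are hypothetical.

```python
def sample(ids, x, y):
    """Model of tablesample(bucket x out of y on id): keep ids with id % y == x - 1."""
    return [i for i in ids if i % y == x - 1]

ids = list(range(1600))                    # hypothetical id values
bucket0 = [i for i in ids if i % 4 == 0]   # first of the 4 buckets
picked = sample(ids, 1, 16)

# Every sampled id also belongs to the first bucket
print(all(i in bucket0 for i in picked))
# Share of the first bucket that is read
print(len(picked) / len(bucket0))
# Share of the whole table that is read
print(len(picked) / len(ids))
```

Under this model the sample covers a quarter of the first bucket, which is about one sixteenth of all rows.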