Implementing a WordCount Program with Hive

TURING.DT

Article link: https://blog.csdn.net/levy_cui/article/details/51142816

a. Create a database and switch to it, so that the tables below are created inside it, for example:
create database word;
use word;

b. Create the table:
create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';

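This assumes the data is stored on Hadoop (HDFS) under the path /home/hadoop/worddata. The directory holds plain text files of words, with content roughly like this: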

hello man
what are you doing now
my running
hello
kevin
hi man

Running the HQL above creates a table named word_data. Each row of the table is one line from those files, stored in the line column. Run select * from word_data; to see the data.

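c. Following the MapReduce approach, each line has to be split into individual words. This uses one of Hive's built-in table-generating functions (UDTF), explode(array). It takes an array as its argument and emits one output row per array element, so a single input row becomes multiple rows: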

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;

Contents of the words table:
OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man

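split is the string-splitting function. It works like Java's split, and its second argument is treated as a regular expression. Here it splits on a single space, so once the HQL statement has run, the words table holds one word per row.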
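
To make the row-to-rows behavior concrete, here is a minimal Python sketch that mimics what explode(split(line, " ")) does to the sample data. It is an illustration only, not Hive itself: the function name explode_split is made up for this example, and Python's str.split stands in for Hive's regex-based split, which is equivalent here only because the separator is a plain single space.

```python
# Sample lines copied from the article's example data.
lines = [
    "hello man",
    "what are you doing now",
    "my running",
    "hello",
    "kevin",
    "hi man",
]


def explode_split(rows, sep=" "):
    """Mimic SELECT explode(split(line, sep)): one output row per token.

    explode_split is a hypothetical helper for illustration only.
    """
    out = []
    for line in rows:
        # split() yields the array; extend() plays the role of explode().
        out.extend(line.split(sep))
    return out


words = explode_split(lines)
for word in words:
    print(word)
```

As in Hive, consecutive spaces in a line would produce empty-string tokens with this separator, so real data may need trimming or a pattern such as '\\s+' in the Hive query.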

d. That is most of the work. Because HQL supports group by, the final counting statement is:

select word, count(*) from word.words group by word;
Note: word.words means database_name.table_name. The word in group by word is the word string column defined by the create table words(word string) statement.

Result:
are     1
doing   1
hello   2
hi      1
kevin   1
man     2
my      1
now     1
running 1
what    1
you     1
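
The grouping step can be sketched the same way. The following Python snippet mimics select word, count(*) ... group by word on the 13 words from the words table. The function name word_count is hypothetical, and the explicit sort is added only to make the printed output easy to compare with the result above; Hive itself does not promise any row order unless the query has an order by clause.

```python
from collections import Counter

# The 13 rows of the words table from step c.
words = [
    "hello", "man", "what", "are", "you", "doing", "now",
    "my", "running", "hello", "kevin", "hi", "man",
]


def word_count(tokens):
    """Mimic SELECT word, count(*) ... GROUP BY word.

    word_count is a hypothetical helper; results are sorted by word
    purely for display.
    """
    return sorted(Counter(tokens).items())


for word, n in word_count(words):
    print(f"{word}\t{n}")
```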

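Summary: compared with writing a MapReduce job by hand, Hive is the simpler option. For more complex statistics, you can create intermediate tables or views.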
