1. When I started learning Hadoop, every word-count or sorting exercise meant writing a MapReduce program in Java, and even a simple word count took a long time to write. So I practiced some Hive syntax ahead of time and used it to do a word count.
2. First, build the data locally; the file contents are as follows:
[hadoop@master ~]$ cat count.txt
hello,world,welcome
hello,welcome
world,hello,hi
[hadoop@master ~]$
3. Next, create a table in Hive and load the local data into it:
hive> create table count( //create a table named count
> sentence string);
OK
Time taken: 0.092 seconds
hive> load data local inpath '/home/hadoop/count.txt' into table count; //load the local data into the Hive table
Loading data to table default.count
Table default.count stats: [numFiles=1, totalSize=49]
OK
Time taken: 0.465 seconds
hive> select * from count; //query the table contents
OK
hello,world,welcome
hello,welcome
world,hello,hi
Time taken: 0.09 seconds, Fetched: 3 row(s)
4. Following the MapReduce idea, first turn each line into an array:
hive> select split(sentence,',') from count;
OK
["hello","world","welcome"]
["hello","welcome"]
["world","hello","hi"]
Time taken: 0.141 seconds, Fetched: 3 row(s)
5. Then split each array apart so that every row holds exactly one word:
hive> select explode(split(sentence,',')) from count;
OK
hello
world
welcome
hello
welcome
world
hello
hi
Time taken: 0.081 seconds, Fetched: 8 row(s)
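The split-then-explode step above can be sketched in plain Python (not Hive itself, just an illustration of the same logic): split each line on commas, then flatten the arrays so each word lands on its own "row". The sample data mirrors count.txt from above.

```python
# Sample lines, mirroring count.txt
lines = ["hello,world,welcome", "hello,welcome", "world,hello,hi"]

# split(sentence, ',') turns each line into an array of words;
# explode() then flattens each array into one row per element.
words = [w for line in lines for w in line.split(",")]

print(words)
# One word per row, 8 rows in total, matching the Hive output above
```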
6. Combine these pieces into one word-count query, with sorting:
hive> select word, count(1) as c
> from (
> select explode(split(sentence,',')) as word from count
> ) t group by word
> order by c desc;
After the MapReduce job finishes, the result is:
hello 3
world 2
welcome 2
hi 1
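The full query can also be sketched in plain Python to show what the group by and order by are doing (again just an illustration of the logic, not what Hive runs internally): flatten the words, count each one, then sort by count descending.

```python
from collections import Counter

# Sample lines, mirroring count.txt
lines = ["hello,world,welcome", "hello,welcome", "world,hello,hi"]

# explode(split(sentence, ',')) -> one word per row,
# then group by word and count(1)
counts = Counter(w for line in lines for w in line.split(","))

# order by c desc
result = sorted(counts.items(), key=lambda kv: -kv[1])
for word, c in result:
    print(word, c)
```

This prints the same table as the Hive job: hello 3, then the remaining words in descending order of count.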