Hadoop: Using the Aggregate Package with Streaming


太阳的味道


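Hadoop ships with a package called Aggregate that bundles common aggregation functions, such as sums and averages. Each function has a name, and you only need to declare which one to use.

With Streaming, you use it by passing aggregate as the reducer and having the mapper script prefix each output record with the function name, in the form function:key<TAB>value.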

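It provides the following functions:

DoubleValueSum: sum of a sequence of double values

LongValueMax: maximum of a sequence of long values

LongValueMin: minimum of a sequence of long values

LongValueSum: sum of a sequence of long values

StringValueMax: alphabetically largest value in a sequence of strings

StringValueMin: alphabetically smallest value in a sequence of strings

UniqValueCount: number of unique values for each key

ValueHistogram: for each key, counts how often each value occurs and reports the number of unique values, then the minimum, median, maximum, mean, and standard deviation of those counts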
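To make the record format concrete, here is a minimal local sketch in Python of what happens to records shaped like function:key<TAB>value. It is only a simulation for illustration, not Hadoop's implementation; the name simulate_aggregate is hypothetical, and ValueHistogram is left out for brevity.

```python
from collections import defaultdict


def simulate_aggregate(lines):
    """Locally mimic the reducer side for a few aggregator names.

    Each input line is expected to look like 'Function:key<TAB>value'.
    Returns a dict mapping each key to its aggregated result.
    """
    groups = defaultdict(list)
    for line in lines:
        head, value = line.rstrip("\n").split("\t", 1)
        function, key = head.split(":", 1)
        groups[(function, key)].append(value)

    # Illustrative stand-ins for the named functions (an assumption,
    # not the real Hadoop classes).
    reducers = {
        "DoubleValueSum": lambda vs: sum(float(v) for v in vs),
        "LongValueMax": lambda vs: max(int(v) for v in vs),
        "LongValueMin": lambda vs: min(int(v) for v in vs),
        "LongValueSum": lambda vs: sum(int(v) for v in vs),
        "StringValueMax": max,
        "StringValueMin": min,
        "UniqValueCount": lambda vs: len(set(vs)),
    }
    return {
        key: reducers[function](values)
        for (function, key), values in sorted(groups.items())
    }
```

For example, two LongValueSum:1963 records with value 1 would collapse into a single entry for 1963 with the value 2.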

Example 1:

-----------------------------------------------------

Count the number of patents per year.

attributecount.php
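The listing for attributecount.php is not reproduced in the text. The following is a minimal sketch of what such a mapper might do, written in Python rather than PHP purely for illustration. It assumes comma-separated input and a column index given as the first command-line argument; the names map_lines and main are hypothetical.

```python
import sys


def map_lines(lines, index, function="LongValueSum"):
    """Yield one aggregate record per CSV line, keyed by the chosen column."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if index < len(fields):
            # Value 1 per row, so summing the values counts rows per key.
            yield f"{function}:{fields[index]}\t1"


def main(argv=None, stdin=None, stdout=None):
    """Read lines, emit aggregate records; defaults to the process streams."""
    argv = sys.argv if argv is None else argv
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    index = int(argv[1])  # column index from the command line, e.g. 1
    for record in map_lines(stdin, index):
        stdout.write(record + "\n")

# A script used as a streaming mapper would end with a call to main().
```

Because the header row is not skipped, a line like "GYEAR" 1 would be expected in the output, which matches the result shown for this example.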


[root@localhost bin]# ./hadoop jar ../contrib/streaming/hadoop-streaming-1.2.0.jar -input ./apat63_99.txt -output ./test -mapper 'php attributecount.php 1' -reducer aggregate -file attributecount.php

Result:

[root@localhost bin]# ./hadoop fs -cat ./test/part-00000 | head -n 5

"GYEAR" 1

1963 45679

1964 47375

1965 62857

As you can see, specifying aggregate as the reducer tells Hadoop to use this package, while the LongValueSum in the PHP script selects the function that processes the data.

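Example 2

-----------------------------------------------------

Count the number of countries that were granted patents each year, that is, the number of distinct countries among the patents granted in each year.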

uniquecount.php
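The listing for uniquecount.php is likewise not reproduced in the text. As a hedged sketch in Python (not the author's PHP), a mapper for this job could emit the year as the key and the country as the value, with both column indexes taken from the command line. The helper names map_pairs and distinct_per_key are hypothetical, and distinct_per_key is only a local check of what a unique count per key should produce.

```python
def map_pairs(lines, key_index, value_index, function="UniqValueCount"):
    """Yield 'function:key<TAB>value' records from comma-separated lines."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if max(key_index, value_index) < len(fields):
            yield f"{function}:{fields[key_index]}\t{fields[value_index]}"


def distinct_per_key(records):
    """Local check: count the distinct values seen for each key."""
    seen = {}
    for record in records:
        head, value = record.split("\t", 1)
        key = head.split(":", 1)[1]
        seen.setdefault(key, set()).add(value)
    return {key: len(values) for key, values in seen.items()}
```

With made-up rows where 1963 appears with "US", "US", and "BE", distinct_per_key would report 2 for 1963. Presumably the same mapper shape with function="ValueHistogram" would serve Example 3, though that is an assumption about the original script.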

[root@localhost bin]# ./hadoop jar ../contrib/streaming/hadoop-streaming-1.2.0.jar -input ./apat63_99.txt -output ./testunique -mapper 'php uniquecount.php 1 4' -reducer aggregate -file uniquecount.php

Result:

[root@localhost bin]# ./hadoop fs -cat ./testunique/part-00000 | head -n 5

"GYEAR" 1

1963 64

1964 58

1965 67

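Example 3

-----------------------------------------------------

Use ValueHistogram to compute the number of unique values, the minimum count, the median count, the maximum count, the mean count, and the standard deviation.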

valuehistogram.php

[root@localhost bin]# ./hadoop jar ../contrib/streaming/hadoop-streaming-1.2.0.jar -input ./apat63_99.txt -output ./testhistogram -mapper 'php valuehistogram.php 1 4' -reducer aggregate -file valuehistogram.php

Result:

[root@localhost bin]# ./hadoop fs -cat ./testhistogram/part-00000 | head -n 5

"GYEAR" 1 1 1 1 1.0 0.0

1963 64 1 5 37174 713.734375 4610.076525402627

1964 58 1 7 38410 816.8103448275862 4997.413601595352

1965 67 1 5 50331 938.1641791044776 6104.779230296307

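As you can see, the unique counts computed here match the values computed earlier with UniqValueCount.

Interpreting the results above:

The first column is the year; the second is the number of countries that received patents; the third, fourth, and fifth are the minimum, median, and maximum number of patents per country; the sixth is the mean and the seventh is the standard deviation.

In 1963, the country with the fewest patents received 1, and the country with the most received 37174. Sorted by patent count, the country in the middle received 5, which means about half of the countries received fewer than 5 patents.

In 1964, the mean number of patents per country was 816.8 with a standard deviation of 4997.4. The median (7) is far below the mean (816.8), which shows that the distribution is extremely uneven.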
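To see how the columns relate to each other, the statistics for one key can be reproduced locally. This is a minimal Python sketch under stated assumptions: the median is taken as the element at index n // 2 of the sorted counts, and the standard deviation is the population form (dividing by n). The name histogram_report is hypothetical, and the sketch is not Hadoop's own code.

```python
import math
from collections import Counter


def histogram_report(values):
    """Summarize how often each distinct value occurs for one key.

    Returns (unique_count, min, median, max, mean, std_dev) over the
    per-value occurrence counts, or (0,) for an empty input.
    """
    counts = sorted(Counter(values).values())
    n = len(counts)
    if n == 0:
        return (0,)
    mean = sum(counts) / n
    # Population standard deviation (assumption: divide by n, not n - 1).
    std_dev = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    return (n, counts[0], counts[n // 2], counts[-1], mean, std_dev)
```

For a key whose values are five "US", three "JP", and one "BE", the sorted counts are 1, 3, 5, so the report starts with 3 unique values, a minimum of 1, a median of 3, a maximum of 5, and a mean of 3.0. A single very large count would pull the mean far above the median.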