Hadoop MapReduce Programming: Computing a Maximum Value

Preface

Computing a maximum with MapReduce is really no different from Hadoop's built-in WordCount example: one Reducer keeps a running maximum while the other keeps a running sum, but the job structure is otherwise identical, and both are simple. A small sketch of that one-line difference follows; then we will work through a complete example.
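To make the comparison concrete, here is a tiny standalone sketch (illustrative only, not from the original article) of the same fold over one key's group of values, written once as a sum (WordCount) and once as a max (this article's job):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative contrast between WordCount's reduce step and a max reduce step:
// both are a single pass over one key's values; only the fold differs.
public class SumVsMax {

    static long sum(List<Long> values) {  // WordCount-style accumulation
        long s = 0L;
        for (long v : values) {
            s += v;
        }
        return s;
    }

    static long max(List<Long> values) {  // maximum-style reduction
        long m = Long.MIN_VALUE;
        for (long v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    public static void main(String[] args) {
        List<Long> group = Arrays.asList(3L, 9L, 4L);
        System.out.println("sum = " + sum(group)); // sum = 16
        System.out.println("max = " + max(group)); // max = 9
    }
}
```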


Test Data

The test sample was produced by a small simulation program of our own (a sketch of such a generator appears below, after the field list). A snippet of the input data looks like this:

```
SG 253654006139495 253654006164392 619850464
KG 253654006225166 253654006252433 743485698
UZ 253654006248058 253654006271941 570409379
TT 253654006282019 253654006286839 23236775
BE 253654006276984 253654006301435 597874033
BO 253654006293624 253654006315946 498265375
SR 253654006308428 253654006330442 484613339
SV 253654006320312 253654006345405 629640166
LV 253654006330384 253654006359891 870680704
FJ 253654006351709 253654006374468 517965666
```

The text data is stored one record per line, and each line holds four fields:

  1. country code
  2. start time
  3. end time
  4. random cost/weight estimate

Fields are separated by a single space. The result we want to compute is, for each country (identified by its country code), the maximum cost estimate.
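The original simulation program is not shown in the article, so the following is only a minimal sketch of what such a generator might look like; the class name CostDataGenerator, the country-code list, and the value ranges are all assumptions:

```java
import java.io.PrintWriter;
import java.util.Random;

// Hypothetical test-data generator: emits lines of the form
// "<country code> <start time> <end time> <cost>", space-separated.
public class CostDataGenerator {

    private static final String[] CODES = { "SG", "KG", "UZ", "TT", "BE", "BO" };

    public static void main(String[] args) throws Exception {
        int lines = Integer.parseInt(args[0]);         // e.g. 10000000
        PrintWriter out = new PrintWriter(args[1]);    // output file path
        Random random = new Random();
        long clock = 253654006139495L;                 // base taken from the sample data
        for (int i = 0; i < lines; i++) {
            String code = CODES[random.nextInt(CODES.length)];
            long start = clock;
            long end = start + random.nextInt(100000); // assumed duration range
            int cost = 1 + random.nextInt(999999999);  // assumed range [1, 10^9)
            out.println(code + " " + start + " " + end + " " + cost);
            clock += random.nextInt(20000);            // advance the pseudo clock
        }
        out.close();
    }
}
```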

Implementation

Since the job is simple, let's go straight to the code. It has the usual three parts: a Mapper, a Reducer, and a Driver. The Mapper implementation class is GlobalCostMapper:

```java
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GlobalCostMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final static LongWritable costValue = new LongWritable(0);
    private Text code = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // A line looks like: 'SG 253654006139495 253654006164392 619850464'
        String line = value.toString();
        String[] array = line.split("\\s");
        if (array.length == 4) {
            String countryCode = array[0];
            String strCost = array[3];
            long cost = 0L;
            try {
                cost = Long.parseLong(strCost);
            } catch (NumberFormatException e) {
                cost = 0L;
            }
            // Drop records whose cost is missing or unparsable.
            if (cost != 0) {
                code.set(countryCode);
                costValue.set(cost);
                context.write(code, costValue);
            }
        }
    }
}
```

The logic above is very simple: split each line on the space delimiter, pull out the fields, and emit a (country code, cost) key-value pair. The framework then groups the Mapper's output by key, so the Reducer receives, for each country code, the full list of cost values to merge and reduce; for example, every value emitted under key `SG` arrives as a single group. The Reducer implementation class is GlobalCostReducer:

```java
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GlobalCostReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
        long max = 0L; // costs are positive, so 0 is a safe lower bound
        Iterator<LongWritable> iter = values.iterator();
        while (iter.hasNext()) {
            LongWritable current = iter.next();
            if (current.get() > max) {
                max = current.get();
            }
        }
        context.write(key, new LongWritable(max));
    }
}
```

The code above finds the maximum cost estimate within one key's list of values; the logic is straightforward. As an optimization, this same Reducer can also be applied to the map output as a Combiner, cutting the volume of data transferred from the Mappers to the Reducers; it is wired in when the Job is configured. This is safe because max is associative and commutative, so taking partial maxima first cannot change the final answer. (A related alternative, in-mapper combining, is sketched below.)
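As a side note, the same savings can be had without a separate Combiner via the well-known in-mapper combining pattern: buffer the per-country maxima in memory and emit each pair once per map task in cleanup(). The sketch below is an illustration under my own assumptions (the class name and buffering choices are mine, not the article's):

```java
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical alternative mapper: keeps a running maximum per country code
// and emits each (code, max) pair only once, when the map task finishes.
public class InMapperCombiningCostMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final Map<String, Long> localMax = new HashMap<String, Long>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] array = value.toString().split("\\s");
        if (array.length == 4) {
            try {
                long cost = Long.parseLong(array[3]);
                Long seen = localMax.get(array[0]);
                if (seen == null || cost > seen) {
                    localMax.put(array[0], cost);
                }
            } catch (NumberFormatException e) {
                // skip malformed records
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush the buffered per-country maxima at the end of the map task.
        for (Map.Entry<String, Long> entry : localMax.entrySet()) {
            context.write(new Text(entry.getKey()),
                    new LongWritable(entry.getValue()));
        }
    }
}
```

Map-side memory use is bounded by the number of distinct country codes (a few hundred here), so buffering is safe. Back to the article's approach: next, how to configure and run a Job. The driver implementation class is GlobalMaxCostDriver: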

```java
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class GlobalMaxCostDriver {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: maxcost <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "max cost");

        job.setJarByClass(GlobalMaxCostDriver.class);
        job.setMapperClass(GlobalCostMapper.class);
        job.setCombinerClass(GlobalCostReducer.class); // reducer doubles as combiner
        job.setReducerClass(GlobalCostReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}
```
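One version note: `new Job(conf, "max cost")` matches the Hadoop 1.0.3 API used throughout this article; on Hadoop 2.x and later that constructor is deprecated, and the line would instead read:

```java
// Hadoop 2.x+ replacement for 'new Job(conf, "max cost")':
Job job = Job.getInstance(conf, "max cost");
```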

Running the Program

First, make sure the Hadoop cluster is up and running; in my setup the NameNode runs on host ubuntu3. The steps are as follows:

  • Compile the code (I do this directly with Maven) and package it into a jar file:

shirdrn@SYJ:~/programs/eclipse-jee-juno/workspace/kodz-all/kodz-hadoop/target/classes$ jar -cvf global-max-cost.jar -C ./ org

  • Copy the generated jar file to the NameNode host:

xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ scp shirdrn@172.0.8.212:~/programs/eclipse-jee-juno/workspace/kodz-all/kodz-hadoop/target/classes/global-max-cost.jar ./

global-max-cost.jar

  • Upload the data file to be processed to HDFS:

xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal /opt/stone/cloud/dataset/data_10m /user/xiaoxiang/datasets/cost/

  • Run the MapReduce job we wrote to compute the maxima (the output directory must not already exist):

xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop jar global-max-cost.jar org.shirdrn.kodz.inaction.hadoop.extremum.max.GlobalMaxCostDriver /user/xiaoxiang/datasets/cost /user/xiaoxiang/output/cost

The console output from the run looks roughly like this; note in the counters how effective the Combiner is, collapsing the 10,000,000 map output records into only 1,631 reduce input records:

```
13/03/22 16:30:16 INFO input.FileInputFormat: Total input paths to process : 1
13/03/22 16:30:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/22 16:30:16 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/22 16:30:16 INFO mapred.JobClient: Running job: job_201303111631_0004
13/03/22 16:30:17 INFO mapred.JobClient:  map 0% reduce 0%
13/03/22 16:30:33 INFO mapred.JobClient:  map 22% reduce 0%
13/03/22 16:30:36 INFO mapred.JobClient:  map 28% reduce 0%
13/03/22 16:30:45 INFO mapred.JobClient:  map 52% reduce 9%
13/03/22 16:30:48 INFO mapred.JobClient:  map 57% reduce 9%
13/03/22 16:30:57 INFO mapred.JobClient:  map 80% reduce 9%
13/03/22 16:31:00 INFO mapred.JobClient:  map 85% reduce 19%
13/03/22 16:31:10 INFO mapred.JobClient:  map 100% reduce 28%
13/03/22 16:31:19 INFO mapred.JobClient:  map 100% reduce 100%
13/03/22 16:31:24 INFO mapred.JobClient: Job complete: job_201303111631_0004
13/03/22 16:31:24 INFO mapred.JobClient: Counters: 29
13/03/22 16:31:24 INFO mapred.JobClient:   Job Counters
13/03/22 16:31:24 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/22 16:31:24 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=76773
13/03/22 16:31:24 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/22 16:31:24 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/22 16:31:24 INFO mapred.JobClient:     Launched map tasks=7
13/03/22 16:31:24 INFO mapred.JobClient:     Data-local map tasks=7
13/03/22 16:31:24 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=40497
13/03/22 16:31:24 INFO mapred.JobClient:   File Output Format Counters
13/03/22 16:31:24 INFO mapred.JobClient:     Bytes Written=3029
13/03/22 16:31:24 INFO mapred.JobClient:   FileSystemCounters
13/03/22 16:31:24 INFO mapred.JobClient:     FILE_BYTES_READ=142609
13/03/22 16:31:24 INFO mapred.JobClient:     HDFS_BYTES_READ=448913653
13/03/22 16:31:24 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=338151
13/03/22 16:31:24 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=3029
13/03/22 16:31:24 INFO mapred.JobClient:   File Input Format Counters
13/03/22 16:31:24 INFO mapred.JobClient:     Bytes Read=448912799
13/03/22 16:31:24 INFO mapred.JobClient:   Map-Reduce Framework
13/03/22 16:31:24 INFO mapred.JobClient:     Map output materialized bytes=21245
13/03/22 16:31:24 INFO mapred.JobClient:     Map input records=10000000
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce shuffle bytes=18210
13/03/22 16:31:24 INFO mapred.JobClient:     Spilled Records=12582
13/03/22 16:31:24 INFO mapred.JobClient:     Map output bytes=110000000
13/03/22 16:31:24 INFO mapred.JobClient:     CPU time spent (ms)=80320
13/03/22 16:31:24 INFO mapred.JobClient:     Total committed heap usage (bytes)=1535639552
13/03/22 16:31:24 INFO mapred.JobClient:     Combine input records=10009320
13/03/22 16:31:24 INFO mapred.JobClient:     SPLIT_RAW_BYTES=854
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce input records=1631
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce input groups=233
13/03/22 16:31:24 INFO mapred.JobClient:     Combine output records=10951
13/03/22 16:31:24 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1706708992
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce output records=233
13/03/22 16:31:24 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4316872704
13/03/22 16:31:24 INFO mapred.JobClient:     Map output records=10000000
```
  • Verify the job's output:
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -cat /user/xiaoxiang/output/cost/part-r-00000

```
AD     999974516
AE     999938630
AF     999996180
AG     999991085
AI     999989595
AL     999998489
AM     999976746
AO     999989628
AQ     999995031
AR     999953989
AS     999935982
AT     999999909
AU     999937089
AW     999965784
AZ     999996557
BA     999949773
BB     999987345
BD     999992272
BE     999925057
BF     999999220
BG     999971528
BH     999994900
BI     999978516
BJ     999977886
BM     999991925
BN     999986630
BO     999995482
BR     999989947
BS     999980931
BT     999977488
BW     999935985
BY     999998496
BZ     999975972
CA     999978275
CC     999968311
CD     999978139
CF     999995342
CG     999788112
CH     999997524
CI     999998864
CK     999968719
CL     999967083
CM     999998369
CN     999975367
CO     999999167
CR     999971685
CU     999976352
CV     999990543
CW     999987713
CX     999987579
CY     999982925
CZ     999993908
DE     999985416
DJ     999997438
DK     999963312
DM     999941706
DO     999945597
DZ     999973610
EC     999920447
EE     999949534
EG     999980522
ER     999980425
ES     999949155
ET     999987033
FI     999966243
FJ     999990686
FK     999966573
FM     999972146
FO     999988472
FR     999988342
GA     999982099
GB     999970658
GD     999996318
GE     999991970
GF     999982024
GH     999941039
GI     999995295
GL     999948726
GM     999967823
GN     999951804
GP     999904645
GQ     999988635
GR     999999672
GT     999972984
GU     999919056
GW     999962551
GY     999999881
HK     999970084
HN     999972628
HR     999986688
HT     999970913
HU     999997568
ID     999994762
IE     999996686
IL     999982184
IM     999987831
IN     999914991
IO     999968575
IQ     999990126
IR     999986780
IS     999973585
IT     999997239
JM     999982209
JO     999977276
JP     999983684
KE     999996012
KG     999991556
KH     999975644
KI     999994328
KM     999989895
KN     999991068
KP     999967939
KR     999992162
KW     999924295
KY     999977105
KZ     999992835
LA     999989151
LB     999963014
LC     999962233
LI     999986863
LK     999989876
LR     999897202
LS     999957706
LT     999999688
LU     999999823
LV     999945411
LY     999992365
MA     999922726
MC     999978886
MD     999996042
MG     999996602
MH     999989668
MK     999968900
ML     999990079
MM     999987977
MN     999969051
MO     999977975
MP     999995234
MQ     999913110
MR     999982303
MS     999974690
MT     999982604
MU     999988632
MV     999961206
MW     999991903
MX     999978066
MY     999995010
MZ     999981189
NA     999961177
NC     999961053
NE     999990091
NF     999989399
NG     999985037
NI     999965733
NL     999949789
NO     999993122
NP     999972410
NR     999956464
NU     999987046
NZ     999998214
OM     999967428
PA     999924435
PE     999981176
PF     999959978
PG     999987347
PH     999981534
PK     999954268
PL     999996619
PM     999998975
PR     999906386
PT     999993404
PW     999991278
PY     999985509
QA     999995061
RE     999952291
RO     999994148
RS     999999923
RU     999894985
RW     999980184
SA     999973822
SB     999972832
SC     999973271
SD     999963744
SE     999972256
SG     999977637
SH     999983638
SI     999980580
SK     999998152
SL     999999269
SM     999941188
SN     999990278
SO     999973175
SR     999975964
ST     999980447
SV     999999945
SX     999903445
SY     999988858
SZ     999992537
TC     999969540
TD     999999303
TG     999977640
TH     999968746
TJ     999983666
TK     999971131
TM     999958998
TN     999963035
TO     999947915
TP     999986796
TR     999995112
TT     999984435
TV     999971989
TW     999975092
TZ     999992734
UA     999970993
UG     999976267
UM     999998377
US     999912229
UY     999989662
UZ     999982762
VA     999975548
VC     999991495
VE     999997971
VG     999949690
VI     999990063
VN     999974393
VU     999953162
WF     999947666
WS     999970242
YE     999984650
YT     999994707
ZA     999998692
ZM     999973392
ZW     999928087
```
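As an optional cross-check (my addition, not in the original article), you can re-scan a local sample of the input in a single process and compare the result against part-r-00000. A minimal sketch, with the hypothetical class name LocalMaxCheck:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Local single-process spot check: recompute the per-country maximum from an
// input sample and print it in the same "<code>\t<max>" shape as the job output.
public class LocalMaxCheck {
    public static void main(String[] args) throws Exception {
        Map<String, Long> max = new HashMap<String, Long>();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] array = line.split("\\s");
            if (array.length != 4) continue;  // same filtering as the mapper
            long cost;
            try {
                cost = Long.parseLong(array[3]);
            } catch (NumberFormatException e) {
                continue;
            }
            Long seen = max.get(array[0]);
            if (seen == null || cost > seen) {
                max.put(array[0], cost);
            }
        }
        reader.close();
        for (Map.Entry<String, Long> entry : max.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Run it over a sample file and diff its printout against the corresponding entries in the job output to confirm they agree; for the full 10-million-record set the cluster is of course the right tool, and the point here is only to validate the logic.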

As the listing shows, the results are exactly what we expected. (Original author: 时延军; link: http://shiyanjun.cn)


