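Preface
Computing a maximum with MapReduce is hardly different from the WordCount program that ships with Hadoop. The only difference is in the Reducer, which takes a maximum in one case and accumulates a sum in the other. The two are essentially the same and fairly simple. Below we implement it with a concrete example.
Test data
We generated a simple set of test samples with our own simulation program. An excerpt of the input data looks like this: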
SG 253654006139495 253654006164392 619850464
KG 253654006225166 253654006252433 743485698
UZ 253654006248058 253654006271941 570409379
TT 253654006282019 253654006286839 23236775
BE 253654006276984 253654006301435 597874033
BO 253654006293624 253654006315946 498265375
SR 253654006308428 253654006330442 484613339
SV 253654006320312 253654006345405 629640166
LV 253654006330384 253654006359891 870680704
FJ 253654006351709 253654006374468 517965666
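The text data above is stored one record per line, and each line contains 4 fields:
- Country code
- Start time
- End time
- Random cost/weight estimate
Fields are separated by spaces. The goal is to compute, for each country (identified by its country code), the maximum cost estimate.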
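Before moving to MapReduce, the same computation can be sketched in plain Java on a handful of lines, which is handy as a local reference when checking the job's output. This is only an illustrative sketch: the class name LocalMaxCost and the method maxCostByCountry are made up here, and the third sample record is invented to show two records for the same country.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalMaxCost {

    // Computes the maximum cost per country code from lines of the form
    // "<code> <start> <end> <cost>". Lines that do not match are skipped.
    static Map<String, Long> maxCostByCountry(List<String> lines) {
        Map<String, Long> maxByCode = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 4) {
                continue; // not a 4-field record
            }
            long cost;
            try {
                cost = Long.parseLong(fields[3]);
            } catch (NumberFormatException e) {
                continue; // cost field is not a number
            }
            // Keep the larger of the stored value and the new cost.
            maxByCode.merge(fields[0], cost, Math::max);
        }
        return maxByCode;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "SG 253654006139495 253654006164392 619850464",
                "KG 253654006225166 253654006252433 743485698",
                "SG 253654006139495 253654006164392 100", // invented record
                "bad line");
        // prints {KG=743485698, SG=619850464}
        System.out.println(maxCostByCountry(sample));
    }
}
```

The MapReduce version distributes exactly this grouping: the map phase does the parsing, and the reduce phase does the per-key maximum.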
Implementation
Since the problem is simple, let's go straight to the code. It consists of three parts: Mapper, Reducer, and Driver. The Mapper implementation class is GlobalCostMapper:
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GlobalCostMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Reused across map() calls to avoid allocating a new object per record
    private final static LongWritable costValue = new LongWritable(0);
    private Text code = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // a line, such as 'SG 253654006139495 253654006164392 619850464'
        String line = value.toString();
        String[] array = line.split("\\s");
        if (array.length == 4) {
            String countryCode = array[0];
            String strCost = array[3];
            long cost = 0L;
            try {
                cost = Long.parseLong(strCost);
            } catch (NumberFormatException e) {
                cost = 0L;
            }
            // Records with an unparsable or zero cost are dropped
            if (cost != 0) {
                code.set(countryCode);
                costValue.set(cost);
                context.write(code, costValue);
            }
        }
    }
}
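The logic above is very simple: it splits each line on the whitespace separator to extract the fields, then emits a key-value pair of country code and cost. Next, the list of key-value pairs emitted by the Mapper has to be merged and reduced in the Reducer. The Reducer implementation class is GlobalCostReducer: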
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GlobalCostReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Starting from 0 assumes that all costs are positive
        long max = 0L;
        Iterator<LongWritable> iter = values.iterator();
        while (iter.hasNext()) {
            LongWritable current = iter.next();
            if (current.get() > max) {
                max = current.get();
            }
        }
        context.write(key, new LongWritable(max));
    }
}
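The code above computes the maximum cost estimate over the list of values for one key; the logic is simple. As an optimization, this same Reducer can be applied to the Map output as a Combiner, which reduces the amount of data transferred from Mapper to Reducer. This is specified when configuring the Job. Now let's see how to configure and run a Job. The implementation class is GlobalMaxCostDriver: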
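Reusing the Reducer as a Combiner is safe here because taking a maximum gives the same answer whether it is applied to all values at once or first to partial groups and then to the partial results. The plain-Java sketch below illustrates that property without Hadoop. The class name CombinerSketch and the way the values are split across two map tasks are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSketch {

    // Largest value in a group, the same kind of reduction the Reducer performs.
    static long max(List<Long> values) {
        long max = Long.MIN_VALUE;
        for (long v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical costs for one country code, split across two map tasks.
        List<Long> mapTask1 = Arrays.asList(619850464L, 23236775L, 597874033L);
        List<Long> mapTask2 = Arrays.asList(498265375L, 870680704L);

        // Without a combiner: the reducer sees all five values.
        long direct = max(Arrays.asList(
                619850464L, 23236775L, 597874033L, 498265375L, 870680704L));

        // With a combiner: each map task sends only its local maximum.
        long combined = max(Arrays.asList(max(mapTask1), max(mapTask2)));

        // prints 870680704 870680704
        System.out.println(direct + " " + combined);
    }
}
```

Not every reduction can be reused this way. An average, for example, cannot simply be averaged again over partial averages, so such a job would need a different Combiner or none at all.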
package org.shirdrn.kodz.inaction.hadoop.extremum.max;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class GlobalMaxCostDriver {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // Expect exactly two arguments: input path and output path
        if (otherArgs.length != 2) {
            System.err.println("Usage: maxcost <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "max cost");
        job.setJarByClass(GlobalMaxCostDriver.class);
        job.setMapperClass(GlobalCostMapper.class);
        // The Reducer doubles as the Combiner
        job.setCombinerClass(GlobalCostReducer.class);
        job.setReducerClass(GlobalCostReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}
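Running the program
First, make sure the Hadoop cluster is running normally. Here the NameNode is the host ubuntu3. The steps to run the program are as follows:
- Compile the code (I use Maven directly) and package it into a jar file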
shirdrn@SYJ:~/programs/eclipse-jee-juno/workspace/kodz-all/kodz-hadoop/target/classes$ jar -cvf global-max-cost.jar -C ./ org
- Copy the jar file generated above to the NameNode host
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ scp shirdrn@172.0.8.212:~/programs/eclipse-jee-juno/workspace/kodz-all/kodz-hadoop/target/classes/global-max-cost.jar ./
global-max-cost.jar
- Upload the data file to be processed
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal /opt/stone/cloud/dataset/data_10m /user/xiaoxiang/datasets/cost/
- Run the MapReduce job we wrote to compute the maximum
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop jar global-max-cost.jar org.shirdrn.kodz.inaction.hadoop.extremum.max.GlobalMaxCostDriver /user/xiaoxiang/datasets/cost /user/xiaoxiang/output/cost
The console output during the run looks roughly like this:
13/03/22 16:30:16 INFO input.FileInputFormat: Total input paths to process : 1
13/03/22 16:30:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/22 16:30:16 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/22 16:30:16 INFO mapred.JobClient: Running job: job_201303111631_0004
13/03/22 16:30:17 INFO mapred.JobClient:  map 0% reduce 0%
13/03/22 16:30:33 INFO mapred.JobClient:  map 22% reduce 0%
13/03/22 16:30:36 INFO mapred.JobClient:  map 28% reduce 0%
13/03/22 16:30:45 INFO mapred.JobClient:  map 52% reduce 9%
13/03/22 16:30:48 INFO mapred.JobClient:  map 57% reduce 9%
13/03/22 16:30:57 INFO mapred.JobClient:  map 80% reduce 9%
13/03/22 16:31:00 INFO mapred.JobClient:  map 85% reduce 19%
13/03/22 16:31:10 INFO mapred.JobClient:  map 100% reduce 28%
13/03/22 16:31:19 INFO mapred.JobClient:  map 100% reduce 100%
13/03/22 16:31:24 INFO mapred.JobClient: Job complete: job_201303111631_0004
13/03/22 16:31:24 INFO mapred.JobClient: Counters: 29
13/03/22 16:31:24 INFO mapred.JobClient:   Job Counters
13/03/22 16:31:24 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/22 16:31:24 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=76773
13/03/22 16:31:24 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/22 16:31:24 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/22 16:31:24 INFO mapred.JobClient:     Launched map tasks=7
13/03/22 16:31:24 INFO mapred.JobClient:     Data-local map tasks=7
13/03/22 16:31:24 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=40497
13/03/22 16:31:24 INFO mapred.JobClient:   File Output Format Counters
13/03/22 16:31:24 INFO mapred.JobClient:     Bytes Written=3029
13/03/22 16:31:24 INFO mapred.JobClient:   FileSystemCounters
13/03/22 16:31:24 INFO mapred.JobClient:     FILE_BYTES_READ=142609
13/03/22 16:31:24 INFO mapred.JobClient:     HDFS_BYTES_READ=448913653
13/03/22 16:31:24 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=338151
13/03/22 16:31:24 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=3029
13/03/22 16:31:24 INFO mapred.JobClient:   File Input Format Counters
13/03/22 16:31:24 INFO mapred.JobClient:     Bytes Read=448912799
13/03/22 16:31:24 INFO mapred.JobClient:   Map-Reduce Framework
13/03/22 16:31:24 INFO mapred.JobClient:     Map output materialized bytes=21245
13/03/22 16:31:24 INFO mapred.JobClient:     Map input records=10000000
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce shuffle bytes=18210
13/03/22 16:31:24 INFO mapred.JobClient:     Spilled Records=12582
13/03/22 16:31:24 INFO mapred.JobClient:     Map output bytes=110000000
13/03/22 16:31:24 INFO mapred.JobClient:     CPU time spent (ms)=80320
13/03/22 16:31:24 INFO mapred.JobClient:     Total committed heap usage (bytes)=1535639552
13/03/22 16:31:24 INFO mapred.JobClient:     Combine input records=10009320
13/03/22 16:31:24 INFO mapred.JobClient:     SPLIT_RAW_BYTES=854
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce input records=1631
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce input groups=233
13/03/22 16:31:24 INFO mapred.JobClient:     Combine output records=10951
13/03/22 16:31:24 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1706708992
13/03/22 16:31:24 INFO mapred.JobClient:     Reduce output records=233
13/03/22 16:31:24 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4316872704
13/03/22 16:31:24 INFO mapred.JobClient:     Map output records=10000000
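The counters in this log give a rough picture of what the Combiner saved. The sketch below just redoes the arithmetic on three counter values copied from the log. The class name CombinerEffect and its helper method are made up for illustration, and reading the result as "one record per country per map task" is an interpretation of the numbers, not something the log states directly.

```java
class CombinerEffect {

    // How many map output records collapse into one reduce input record.
    static double reductionFactor(long mapOutputRecords, long reduceInputRecords) {
        return (double) mapOutputRecords / reduceInputRecords;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 10_000_000L; // "Map output records" in the log
        long reduceInputRecords = 1_631L;    // "Reduce input records" in the log
        long reduceInputGroups = 233L;       // "Reduce input groups" in the log

        System.out.printf("reduction factor: %.0f%n",
                reductionFactor(mapOutputRecords, reduceInputRecords));

        // 1631 / 233 = 7 with no remainder, which equals "Launched map tasks=7".
        System.out.println("records per group: "
                + reduceInputRecords / reduceInputGroups);
    }
}
```

In other words, about 6,131 map output records were folded into each record that reached the Reducer, and the Reducer received 7 values per country code, which is consistent with each of the 7 map tasks sending one local maximum per country.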
- Verify the Job output
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -cat /user/xiaoxiang/output/cost/part-r-00000
AD 999974516
AE 999938630
AF 999996180
AG 999991085
AI 999989595
AL 999998489
AM 999976746
AO 999989628
AQ 999995031
AR 999953989
AS 999935982
AT 999999909
AU 999937089
AW 999965784
AZ 999996557
BA 999949773
BB 999987345
BD 999992272
BE 999925057
BF 999999220
BG 999971528
BH 999994900
BI 999978516
BJ 999977886
BM 999991925
BN 999986630
BO 999995482
BR 999989947
BS 999980931
BT 999977488
BW 999935985
BY 999998496
BZ 999975972
CA 999978275
CC 999968311
CD 999978139
CF 999995342
CG 999788112
CH 999997524
CI 999998864
CK 999968719
CL 999967083
CM 999998369
CN 999975367
CO 999999167
CR 999971685
CU 999976352
CV 999990543
CW 999987713
CX 999987579
CY 999982925
CZ 999993908
DE 999985416
DJ 999997438
DK 999963312
DM 999941706
DO 999945597
DZ 999973610
EC 999920447
EE 999949534
EG 999980522
ER 999980425
ES 999949155
ET 999987033
FI 999966243
FJ 999990686
FK 999966573
FM 999972146
FO 999988472
FR 999988342
GA 999982099
GB 999970658
GD 999996318
GE 999991970
GF 999982024
GH 999941039
GI 999995295
GL 999948726
GM 999967823
GN 999951804
GP 999904645
GQ 999988635
GR 999999672
GT 999972984
GU 999919056
GW 999962551
GY 999999881
HK 999970084
HN 999972628
HR 999986688
HT 999970913
HU 999997568
ID 999994762
IE 999996686
IL 999982184
IM 999987831
IN 999914991
IO 999968575
IQ 999990126
IR 999986780
IS 999973585
IT 999997239
JM 999982209
JO 999977276
JP 999983684
KE 999996012
KG 999991556
KH 999975644
KI 999994328
KM 999989895
KN 999991068
KP 999967939
KR 999992162
KW 999924295
KY 999977105
KZ 999992835
LA 999989151
LB 999963014
LC 999962233
LI 999986863
LK 999989876
LR 999897202
LS 999957706
LT 999999688
LU 999999823
LV 999945411
LY 999992365
MA 999922726
MC 999978886
MD 999996042
MG 999996602
MH 999989668
MK 999968900
ML 999990079
MM 999987977
MN 999969051
MO 999977975
MP 999995234
MQ 999913110
MR 999982303
MS 999974690
MT 999982604
MU 999988632
MV 999961206
MW 999991903
MX 999978066
MY 999995010
MZ 999981189
NA 999961177
NC 999961053
NE 999990091
NF 999989399
NG 999985037
NI 999965733
NL 999949789
NO 999993122
NP 999972410
NR 999956464
NU 999987046
NZ 999998214
OM 999967428
PA 999924435
PE 999981176
PF 999959978
PG 999987347
PH 999981534
PK 999954268
PL 999996619
PM 999998975
PR 999906386
PT 999993404
PW 999991278
PY 999985509
QA 999995061
RE 999952291
RO 999994148
RS 999999923
RU 999894985
RW 999980184
SA 999973822
SB 999972832
SC 999973271
SD 999963744
SE 999972256
SG 999977637
SH 999983638
SI 999980580
SK 999998152
SL 999999269
SM 999941188
SN 999990278
SO 999973175
SR 999975964
ST 999980447
SV 999999945
SX 999903445
SY 999988858
SZ 999992537
TC 999969540
TD 999999303
TG 999977640
TH 999968746
TJ 999983666
TK 999971131
TM 999958998
TN 999963035
TO 999947915
TP 999986796
TR 999995112
TT 999984435
TV 999971989
TW 999975092
TZ 999992734
UA 999970993
UG 999976267
UM 999998377
US 999912229
UY 999989662
UZ 999982762
VA 999975548
VC 999991495
VE 999997971
VG 999949690
VI 999990063
VN 999974393
VU 999953162
WF 999947666
WS 999970242
YE 999984650
YT 999994707
ZA 999998692
ZM 999973392
ZW 999928087
As we can see, the result is what we expected. (Original: 时延军 (include link: http://shiyanjun.cn))