MapReduce: WordCount Example and a First Look at the Source Code
1 Introduction to MapReduce
- MapReduce was developed from Google's paper of the same name.
- MapReduce is a distributed computing framework that splits a job into pieces and processes them in a divide-and-conquer fashion.
- MapReduce is a high-performance distributed computing framework for analyzing and processing massive data sets in parallel.
1.1 Advantages of MapReduce
- Well suited to offline/batch computation
- Much simpler to program against than implementing distributed computation yourself
- Highly scalable: machines can be added horizontally to increase compute capacity
- Handles very large data volumes
- Good fault tolerance
1.2 Disadvantages of MapReduce
- Each task runs as a separate process, and starting and stopping processes takes time, so overall execution is relatively slow
- Not suitable for real-time processing
2 MapReduce WordCount Example
2.1 Data
hadoop,spark,flink
hbase,hadoop,spark,flink
spark
2.2 Code
2.2.1 Maven
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0-cdh5.16.2</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
2.2.2 WcDriver
package com.xk.bigdata.hadoop.mapreduce.wc;
import com.xk.bigdata.hadoop.utils.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WcDriver {
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
IntWritable ONE = new IntWritable(1);
/**
* @param key : the byte offset of the current line in the file
* @param value : the data of the current line
* @param context : the MapReduce context; think of it as a buffer that the map output is written into
*/
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] splits = value.toString().split(",");
for (String word : splits) {
context.write(new Text(word), ONE);
}
}
}
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
/**
* After the shuffle, all values sharing the same key are grouped together, e.g. <hadoop,<1,1>>
* @param key : e.g. hadoop
* @param values : e.g. <1,1>
* @param context : the MapReduce context; the reduce output is written into it
*/
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable value : values) {
count += value.get();
}
context.write(key, new IntWritable(count));
}
}
public static void main(String[] args) throws Exception {
String input = "hdfs-basic/data/demo.txt";
String output = "hdfs-basic/out";
// 1 Create the MapReduce job
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// Delete the output path if it already exists
FileUtils.deleteFile(job.getConfiguration(), output);
// 2 Set the driver class
job.setJarByClass(WcDriver.class);
// 3 Set the Mapper and Reducer classes
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// 4 Set the map output KEY and VALUE types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// 5 Set the reduce output KEY and VALUE types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// 6 Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
// 7 Submit the job and wait for completion
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
2.2.3 FileUtils
package com.xk.bigdata.hadoop.utils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class FileUtils {
public static void deleteFile(Configuration conf, String output) throws Exception {
FileSystem fileSystem = FileSystem.get(conf);
Path outputPath = new Path(output);
if (fileSystem.exists(outputPath)) {
fileSystem.delete(outputPath, true);
}
}
}
2.3 Errors
2.3.1 The custom Mapper cannot be found
java.lang.NoSuchMethodException: com.xk.bigdata.hadoop.mapreduce.wc.WcDriver$MyMapper.<init>()
2.3.2 Solution
- Declare the custom Mapper and Reducer classes as static
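With that change, only the class declarations need to be touched; the map and reduce bodies from 2.2.2 stay the same:
public class WcDriver {

    // A static nested class has a no-argument constructor that the framework
    // can call via reflection, which is exactly what the NoSuchMethodException
    // above is complaining about.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // ... map() unchanged ...
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // ... reduce() unchanged ...
    }

    // main() unchanged
}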
2.4 Result
flink 2
hadoop 2
hbase 1
spark 3
2.5 Walking through the WC process
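Based on the sample data in 2.1 and the result in 2.4, the flow of records through the job can be sketched roughly as follows:
Map output    : (hadoop,1) (spark,1) (flink,1) (hbase,1) (hadoop,1) (spark,1) (flink,1) (spark,1)
After shuffle : <flink,(1,1)> <hadoop,(1,1)> <hbase,(1)> <spark,(1,1,1)>
Reduce output : (flink,2) (hadoop,2) (hbase,1) (spark,3)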
3 Running on YARN
3.1 Modify the code: take the input and output paths as external arguments
String input = "";
String output = "";
if (args.length == 2) {
input = args[0];
output = args[1];
} else {
throw new Exception("parameter error");
}
3.2 Package the jar
3.3 Run command on Linux
hadoop jar /home/work/lib/hdfs-basic-1.0.jar com.xk.bigdata.hadoop.mapreduce.wc.WcDriver /data/wc/demo.txt /data/wc/out
3.4 Check the output
[work@bigdatatest02 lib]$ hadoop fs -text /data/wc/out/part*
flink 2
hbase 1
spark 3
hadoop 2
4 A First Look at the MapReduce Source Code
4.1 Mapper
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;
/**
* Maps input key/value pairs to a set of intermediate key/value pairs.
*
* <p>Maps are the individual tasks which transform input records into a
* intermediate records. The transformed intermediate records need not be of
* the same type as the input records. A given input pair may map to zero or
* many output pairs.</p>
*
* <p>The Hadoop Map-Reduce framework spawns one map task for each
* {@link InputSplit} generated by the {@link InputFormat} for the job.
* <code>Mapper</code> implementations can access the {@link Configuration} for
* the job via the {@link JobContext#getConfiguration()}.
*
* <p>The framework first calls
* {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
* {@link #map(Object, Object, Context)}
* for each key/value pair in the <code>InputSplit</code>. Finally
* {@link #cleanup(Context)} is called.</p>
*
* <p>All intermediate values associated with a given output key are
* subsequently grouped by the framework, and passed to a {@link Reducer} to
* determine the final output. Users can control the sorting and grouping by
* specifying two key {@link RawComparator} classes.</p>
*
* <p>The <code>Mapper</code> outputs are partitioned per
* <code>Reducer</code>. Users can control which keys (and hence records) go to
* which <code>Reducer</code> by implementing a custom {@link Partitioner}.
*
* <p>Users can optionally specify a <code>combiner</code>, via
* {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
* intermediate outputs, which helps to cut down the amount of data transferred
* from the <code>Mapper</code> to the <code>Reducer</code>.
*
* <p>Applications can specify if and how the intermediate
* outputs are to be compressed and which {@link CompressionCodec}s are to be
* used via the <code>Configuration</code>.</p>
*
* <p>If the job has zero
* reduces then the output of the <code>Mapper</code> is directly written
* to the {@link OutputFormat} without sorting by keys.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class TokenCounterMapper
* extends Mapper<Object, Text, Text, IntWritable>{
*
* private final static IntWritable one = new IntWritable(1);
* private Text word = new Text();
*
* public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
* StringTokenizer itr = new StringTokenizer(value.toString());
* while (itr.hasMoreTokens()) {
* word.set(itr.nextToken());
* context.write(word, one);
* }
* }
* }
* </pre></blockquote></p>
*
* <p>Applications may override the {@link #run(Context)} method to exert
* greater control on map processing e.g. multi-threaded <code>Mapper</code>s
* etc.</p>
*
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
/**
* The <code>Context</code> passed on to the {@link Mapper} implementations.
*/
public abstract class Context
implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}
/**
* Called once at the beginning of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
/**
* Called once at the end of the task.
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
}
- Mapper takes four type parameters: Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
- KEYIN: the type of the key coming into the Mapper. For file input this is usually the byte offset of the line, so LongWritable is normally used
- VALUEIN: the type of the value coming into the Mapper. For file input this is the current line, so Text is normally used
- KEYOUT: the type of the key the Mapper emits. In the WC example the output key is the word, so Text is used
- VALUEOUT: the type of the value the Mapper emits. In the WC example the output value is the count for a word, so IntWritable or LongWritable can be used
- The run method
- Uses the template method pattern: setup runs first, then map is called for each record, and finally cleanup (a short setup example follows the code below)
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
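Because setup and cleanup are ordinary overridable hooks called from run, they are the natural place for per-task initialization. Below is a minimal sketch, with imports as in WcDriver above; the wc.separator property name is made up here for illustration, not something Hadoop defines:
public static class ConfigurableMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private String separator;

    @Override
    protected void setup(Context context) {
        // Runs once per map task, before the first map() call in run().
        separator = context.getConfiguration().get("wc.separator", ",");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split(separator)) {
            context.write(new Text(word), ONE);
        }
    }
}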
4.2 Reducer
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.task.annotation.Checkpointable;
import java.util.Iterator;
/**
* Reduces a set of intermediate values which share a key to a smaller set of
* values.
*
* <p><code>Reducer</code> implementations
* can access the {@link Configuration} for the job via the
* {@link JobContext#getConfiguration()} method.</p>
* <p><code>Reducer</code> has 3 primary phases:</p>
* <ol>
* <li>
*
* <h4 id="Shuffle">Shuffle</h4>
*
* <p>The <code>Reducer</code> copies the sorted output from each
* {@link Mapper} using HTTP across the network.</p>
* </li>
*
* <li>
* <h4 id="Sort">Sort</h4>
*
* <p>The framework merge sorts <code>Reducer</code> inputs by
* <code>key</code>s
* (since different <code>Mapper</code>s may have output the same key).</p>
*
* <p>The shuffle and sort phases occur simultaneously i.e. while outputs are
* being fetched they are merged.</p>
*
* <h5 id="SecondarySort">SecondarySort</h5>
*
* <p>To achieve a secondary sort on the values returned by the value
* iterator, the application should extend the key with the secondary
* key and define a grouping comparator. The keys will be sorted using the
* entire key, but will be grouped using the grouping comparator to decide
* which keys and values are sent in the same call to reduce.The grouping
* comparator is specified via
* {@link Job#setGroupingComparatorClass(Class)}. The sort order is
* controlled by
* {@link Job#setSortComparatorClass(Class)}.</p>
*
*
* For example, say that you want to find duplicate web pages and tag them
* all with the url of the "best" known example. You would set up the job
* like:
* <ul>
* <li>Map Input Key: url</li>
* <li>Map Input Value: document</li>
* <li>Map Output Key: document checksum, url pagerank</li>
* <li>Map Output Value: url</li>
* <li>Partitioner: by checksum</li>
* <li>OutputKeyComparator: by checksum and then decreasing pagerank</li>
* <li>OutputValueGroupingComparator: by checksum</li>
* </ul>
* </li>
*
* <li>
* <h4 id="Reduce">Reduce</h4>
*
* <p>In this phase the
* {@link #reduce(Object, Iterable, Context)}
* method is called for each <code><key, (collection of values)></code> in
* the sorted inputs.</p>
* <p>The output of the reduce task is typically written to a
* {@link RecordWriter} via
* {@link Context#write(Object, Object)}.</p>
* </li>
* </ol>
*
* <p>The output of the <code>Reducer</code> is <b>not re-sorted</b>.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class IntSumReducer<Key> extends Reducer<Key,IntWritable,
* Key,IntWritable> {
* private IntWritable result = new IntWritable();
*
* public void reduce(Key key, Iterable<IntWritable> values,
* Context context) throws IOException, InterruptedException {
* int sum = 0;
* for (IntWritable val : values) {
* sum += val.get();
* }
* result.set(sum);
* context.write(key, result);
* }
* }
* </pre></blockquote></p>
*
* @see Mapper
* @see Partitioner
*/
@Checkpointable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
/**
* The <code>Context</code> passed on to the {@link Reducer} implementations.
*/
public abstract class Context
implements ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}
/**
* Called once at the start of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* This method is called once for each key. Most applications will define
* their reduce class by overriding this method. The default implementation
* is an identity function.
*/
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
/**
* Called once at the end of the task.
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Advanced application writers can use the
* {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
* control how the reduce task works.
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKey()) {
reduce(context.getCurrentKey(), context.getValues(), context);
// If a back up store is used, reset it
Iterator<VALUEIN> iter = context.getValues().iterator();
if(iter instanceof ReduceContext.ValueIterator) {
((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
}
}
} finally {
cleanup(context);
}
}
}
- Reducer takes four type parameters: Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
- KEYIN: the type of the key coming into the Reducer, i.e. the Mapper's output key type
- VALUEIN: the type of the values coming into the Reducer, i.e. the Mapper's output value type
- KEYOUT: the type of the key the Reducer emits. In the WC example the output key is the word, so Text is used
- VALUEOUT: the type of the value the Reducer emits. In the WC example the output value is the count for a word, so IntWritable or LongWritable can be used
- The run method
- Uses the template method pattern: setup runs first, then reduce is called once per key, and finally cleanup (see the combiner note after the code below)
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKey()) {
reduce(context.getCurrentKey(), context.getValues(), context);
// If a back up store is used, reset it
Iterator<VALUEIN> iter = context.getValues().iterator();
if(iter instanceof ReduceContext.ValueIterator) {
((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
}
}
} finally {
cleanup(context);
}
}
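One practical point hinted at in the Mapper javadoc quoted earlier: because the WC reduce function is commutative and associative, the same Reducer class can usually be reused as a combiner, so that (word,1) pairs are pre-aggregated on the map side before being shuffled. An optional one-line addition to the driver in 2.2.2 could look like this:
// Optional: reuse MyReducer as a combiner to cut the volume of intermediate
// data transferred from the Mappers to the Reducers.
job.setCombinerClass(MyReducer.class);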
4.3 A First Look at the Job Submission Flow
waitForCompletion{
submit{
connect(){
Cluster.initialize(){
// Determine the runtime environment: LocalJobRunner or YARNRunner
}
}
submitter.submitJobInternal(Job.this, cluster){
//validate the jobs output specs
checkSpecs(job){
OutputFormat.checkOutputSpecs{
// Check that the output directory is set and does not already exist (which is why the driver deletes it up front)
}
}
int maps = writeSplits(job, submitJobDir){
// Determine the number of map tasks from the input splits
}
}
}
}
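As an aside, which runner Cluster.initialize() picks is controlled by the mapreduce.framework.name property, so the same driver can target the local runner or YARN through configuration alone. A minimal sketch, assuming the standard Hadoop 2.x property values:
Configuration conf = new Configuration();
// "local" selects the LocalJobRunner used for the IDE run in section 2,
// "yarn" selects the YARNRunner used for the cluster run in section 3.
conf.set("mapreduce.framework.name", "local");
Job job = Job.getInstance(conf);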