Getting Started with MapReduce Programs

1. Write a WordCount class
Implement two inner classes: one extends the Mapper generic class, the other extends the Reducer generic class.
Mapper and Reducer take four type parameters:
KEYIN, VALUEIN, KEYOUT, VALUEOUT
which are, respectively, the input key type, the input value type, the output key type, and the output value type.
Object stands for the byte offset of each input line within the file,
Text stands for a piece of text, and
IntWritable stands for an int.
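For WordCount, the four parameters map onto concrete types as in the following declaration-only sketch (the class names are illustrative; the full implementations appear in point 5):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// KEYIN = Object (line offset), VALUEIN = Text (the line),
// KEYOUT = Text (a word), VALUEOUT = IntWritable (a count)
class SketchWordCountMapper extends Mapper<Object, Text, Text, IntWritable> { }

// KEYIN = Text (a word), VALUEIN = IntWritable (a count),
// KEYOUT = Text (the word), VALUEOUT = IntWritable (the total count)
class SketchWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
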
2. If each input line is a piece of text made up of several words, then the input to map is
key = Object
value = Text
If the input file contains
hello world hello xiaomi
i love xiaomi i love world
then map reads two lines and gets, respectively,
map key 0 value hello world hello xiaomi
map key 25 value i love xiaomi i love world
In the map function, split value into words
and call context.write once per word, producing multiple output records.
The output format is
key = Text
value = IntWritable
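A minimal sketch of such a map function, written as a standalone mapper class (the class name SketchMapper is only for illustration, not the final code), could look like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SketchMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line, value is the line itself.
        for (String token : value.toString().split(" ")) {
            word.set(token);
            context.write(word, ONE); // emit one (word, 1) record per token
        }
    }
}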

3. For reduce, the key output by map becomes the key input to reduce, but the many values that map emits for the same key are grouped together into a single value collection for reduce.
So the reduce input format is
key = Text
value = multiple IntWritables (an Iterable<IntWritable>)
Summing these IntWritables gives the count for each word.
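A minimal sketch of the corresponding reduce function (again with an illustrative class name, SketchReducer) might be:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up all the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum)); // (word, total count)
    }
}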

4. At first the job failed with an error saying the input file format was wrong because it was not a sequence file. I looked this up: sequence files are typically used for HBase data.
My input file, however, is plain text, so the right setting is:
job.setInputFormatClass(TextInputFormat.class);
or simply leave it out, since text input is the default.
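For example, a small hypothetical helper (InputFormatConfig and usePlainTextInput are names made up for this sketch) that makes the choice explicit:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
    // Declare plain-text input explicitly; omitting this call has the same
    // effect, because TextInputFormat is the default input format.
    public static void usePlainTextInput(Job job) {
        job.setInputFormatClass(TextInputFormat.class);
    }
}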

After that, the following error appeared:
14/06/23 13:06:10 INFO mapred.JobClient: Task Id : attempt_201402111944_0012_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1024)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:690)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.luoyan.mapred.WordCount$WordCountMapper.map(WordCount.java:24)
at com.luoyan.mapred.WordCount$WordCountMapper.map(WordCount.java:20)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
This was actually because
job.setMapOutputValueClass() was set incorrectly;
it has to match the type parameters of the map function.
On a related note,
o.a.h.mapred and o.a.h.mapreduce are two incompatible sets of APIs; the latter is the newer one, and using the new API is recommended.
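Here is a sketch of a job configuration whose declared output types match Mapper<Object, Text, Text, IntWritable> and Reducer<Text, IntWritable, Text, IntWritable> (JobTypeConfig and configureOutputTypes are illustrative names):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class JobTypeConfig {
    public static void configureOutputTypes(Job job) {
        job.setMapOutputKeyClass(Text.class);          // Mapper KEYOUT
        job.setMapOutputValueClass(IntWritable.class); // Mapper VALUEOUT (the one that was wrong above)
        job.setOutputKeyClass(Text.class);             // Reducer KEYOUT
        job.setOutputValueClass(IntWritable.class);    // Reducer VALUEOUT
    }
}
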
5. So the final code is as follows:
WordCount.java
package com.luoyan.mapred;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

public class WordCount {

    // Mapper: split each input line into words and emit (word, 1) for each word.
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            System.out.println("map key " + key.toString() + " value " + value.toString());
            String[] tokens = value.toString().split(" ");
            for (String token : tokens) {
                context.write(new Text(token), new IntWritable(1));
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            String values_string = "";
            for (IntWritable value : values) {
                sum += value.get();
                values_string = values_string + " " + value.get();
            }

            System.out.println("reduce key " + key.toString() + " values [" + values_string + "]");
            context.write(new Text(key), new IntWritable(sum));
        }
    }

    private static void usage() {
        System.out.println("usage : command fromFile toFile reduceNum");
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
        GenericOptionsParser genericOptionsParser = new GenericOptionsParser(configuration, args);
        String[] otherArgs = genericOptionsParser.getRemainingArgs();
        configuration = genericOptionsParser.getConfiguration();
        System.out.println("otherArgs.length " + otherArgs.length);
        if (otherArgs.length < 3) {
            usage();
            return;
        }
        String fromFile = otherArgs[0];
        String toFile = otherArgs[1];
        int reduceNum = Integer.parseInt(otherArgs[2]);

        Job job = Job.getInstance(configuration, "word_count");
        job.setNumReduceTasks(reduceNum);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // These must match the Mapper/Reducer type parameters (see point 4).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Plain-text input; TextInputFormat is also the default, so this call could be omitted.
        //job.setInputFormatClass(SequenceFileInputFormat.class);
        //HDFSHelper.addInputPath(job, fromFile);
        job.setInputFormatClass(TextInputFormat.class);
        //job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(fromFile));

        // Delete the output directory if it already exists, otherwise the job fails.
        Path outputPath = new Path(toFile);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, new Path(toFile));

        job.waitForCompletion(true);
    }
}

pom.xml is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.luoyan.mapred</groupId>
    <artifactId>mapred</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>simpleMapReduce</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>

    <build>
        <!-- Resources needed by the build -->
        <resources>
            <resource>
                <directory>${project.basedir}/src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
            <!--
            <resource>
                <directory>${basedir}/src/main/resources</directory>
                <includes>
                    <include>hadoop-production-lg/core-site.xml</include>
                    <include>hadoop-production-lg/hdfs-site.xml</include>
                    <include>hadoop-production-lg/krb5.conf</include>
                </includes>
            </resource>
            -->
        </resources>

        <plugins>
            <!-- Bind the maven-assembly-plugin to the package phase; this will create
                 a jar file without the storm dependencies suitable for deployment to a cluster. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <fork>true</fork>
                    <verbose>true</verbose>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.basedir}/target/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <downloadJavadocs>true</downloadJavadocs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

On HDFS,
/user/test/a.txt contains:
hello world hello xiaomi
i love xiaomi i love world

Launch it from the command line as follows:
~/hadoop/bin/hadoop jar target/mapred-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.luoyan.mapred.WordCount /user/test/a.txt /user/test/b.txt 2
The output will then be written under /user/test/b.txt/ (one part-r-NNNNN file per reduce task).
