Getting Started with MapReduce Programs

1. Write a WordCount class
Implement two inner classes: one extends the Mapper generic class, the other extends the Reducer generic class.
Mapper and Reducer take four type parameters:
KEYIN, VALUEIN, KEYOUT, VALUEOUT
which are, respectively, the input key type, the input value type, the output key type, and the output value type.
Object stands for the byte offset of each input line within the file,
Text stands for a piece of text, and
IntWritable stands for an int.
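For WordCount, the four parameters map onto concrete types as in the following declaration-only sketch (the class names are illustrative; the full implementations appear in point 5):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// KEYIN = Object (line offset), VALUEIN = Text (the line),
// KEYOUT = Text (a word), VALUEOUT = IntWritable (a count)
class SketchWordCountMapper extends Mapper<Object, Text, Text, IntWritable> { }

// KEYIN = Text (a word), VALUEIN = IntWritable (a count),
// KEYOUT = Text (the word), VALUEOUT = IntWritable (the total count)
class SketchWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
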
2. If each input line is a piece of text made up of several words, then the input to map is
key = Object
value = Text
If the input file contains
hello world hello xiaomi
i love xiaomi i love world
then map reads two lines and gets, respectively,
map key 0 value hello world hello xiaomi
map key 25 value i love xiaomi i love world
In the map function, split value into words
and call context.write once per word, producing multiple output records.
The output format is
key = Text
value = IntWritable
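A minimal sketch of such a map function, written as a standalone mapper class (the class name SketchMapper is only for illustration, not the final code), could look like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SketchMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line, value is the line itself.
        for (String token : value.toString().split(" ")) {
            word.set(token);
            context.write(word, ONE); // emit one (word, 1) record per token
        }
    }
}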

3. For reduce, the key output by map becomes the key input to reduce, but the many values that map emits for the same key are grouped together into a single value collection for reduce.
So the reduce input format is
key = Text
value = multiple IntWritables (an Iterable<IntWritable>)
Summing these IntWritables gives the count for each word.
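A minimal sketch of the corresponding reduce function (again with an illustrative class name, SketchReducer) might be:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up all the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum)); // (word, total count)
    }
}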

4. At first the job failed with an error saying the input file format was wrong because it was not a sequence file. I looked this up: sequence files are typically used for HBase data.
My input file, however, is plain text, so the right setting is:
job.setInputFormatClass(TextInputFormat.class);
or simply leave it out, since text input is the default.
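For example, a small hypothetical helper (InputFormatConfig and usePlainTextInput are names made up for this sketch) that makes the choice explicit:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
    // Declare plain-text input explicitly; omitting this call has the same
    // effect, because TextInputFormat is the default input format.
    public static void usePlainTextInput(Job job) {
        job.setInputFormatClass(TextInputFormat.class);
    }
}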

After that, the following error appeared:
14/06/23 13:06:10 INFO mapred.JobClient: Task Id : attempt_201402111944_0012_m_000000_0, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1024)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:690)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.luoyan.mapred.WordCount$WordCountMapper.map(WordCount.java:24)
at com.luoyan.mapred.WordCount$WordCountMapper.map(WordCount.java:20)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
This was actually because
job.setMapOutputValueClass() was set incorrectly;
it has to match the type parameters of the map function.
On a related note,
o.a.h.mapred and o.a.h.mapreduce are two incompatible sets of APIs; the latter is the newer one, and using the new API is recommended.
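Here is a sketch of a job configuration whose declared output types match Mapper<Object, Text, Text, IntWritable> and Reducer<Text, IntWritable, Text, IntWritable> (JobTypeConfig and configureOutputTypes are illustrative names):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class JobTypeConfig {
    public static void configureOutputTypes(Job job) {
        job.setMapOutputKeyClass(Text.class);          // Mapper KEYOUT
        job.setMapOutputValueClass(IntWritable.class); // Mapper VALUEOUT (the one that was wrong above)
        job.setOutputKeyClass(Text.class);             // Reducer KEYOUT
        job.setOutputValueClass(IntWritable.class);    // Reducer VALUEOUT
    }
}
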
5. So the final code is as follows:
WordCount.java
package com.luoyan.mapred;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

public class WordCount {

    // Mapper: split each input line into words and emit (word, 1) for each word.
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            System.out.println("map key " + key.toString() + " value " + value.toString());
            String[] tokens = value.toString().split(" ");
            for (String token : tokens) {
                context.write(new Text(token), new IntWritable(1));
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            String values_string = "";
            for (IntWritable value : values) {
                sum += value.get();
                values_string = values_string + " " + value.get();
            }

            System.out.println("reduce key " + key.toString() + " values [" + values_string + "]");
            context.write(new Text(key), new IntWritable(sum));
        }
    }

    private static void usage() {
        System.out.println("usage : command fromFile toFile reduceNum");
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
        GenericOptionsParser genericOptionsParser = new GenericOptionsParser(configuration, args);
        String[] otherArgs = genericOptionsParser.getRemainingArgs();
        configuration = genericOptionsParser.getConfiguration();
        System.out.println("otherArgs.length " + otherArgs.length);
        if (otherArgs.length < 3) {
            usage();
            return;
        }
        String fromFile = otherArgs[0];
        String toFile = otherArgs[1];
        int reduceNum = Integer.parseInt(otherArgs[2]);

        Job job = Job.getInstance(configuration, "word_count");
        job.setNumReduceTasks(reduceNum);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // These must match the Mapper/Reducer type parameters (see point 4).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Plain-text input; TextInputFormat is also the default, so this call could be omitted.
        //job.setInputFormatClass(SequenceFileInputFormat.class);
        //HDFSHelper.addInputPath(job, fromFile);
        job.setInputFormatClass(TextInputFormat.class);
        //job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(fromFile));

        // Delete the output directory if it already exists, otherwise the job fails.
        Path outputPath = new Path(toFile);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, new Path(toFile));

        job.waitForCompletion(true);
    }
}

pom.xml is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.luoyan.mapred</groupId>
    <artifactId>mapred</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>simpleMapReduce</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>

    <build>
        <!-- Resources needed by the build -->
        <resources>
            <resource>
                <directory>${project.basedir}/src/main/resources</directory>
                <filtering>true</filtering>
            </resource>
            <!--
            <resource>
                <directory>${basedir}/src/main/resources</directory>
                <includes>
                    <include>hadoop-production-lg/core-site.xml</include>
                    <include>hadoop-production-lg/hdfs-site.xml</include>
                    <include>hadoop-production-lg/krb5.conf</include>
                </includes>
            </resource>
            -->
        </resources>

        <plugins>
            <!-- Bind the maven-assembly-plugin to the package phase; this will create
                 a jar file without the storm dependencies suitable for deployment to a cluster. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <fork>true</fork>
                    <verbose>true</verbose>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.basedir}/target/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <downloadJavadocs>true</downloadJavadocs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

On HDFS,
/user/test/a.txt contains:
hello world hello xiaomi
i love xiaomi i love world

Launch it from the command line as follows:
~/hadoop/bin/hadoop jar target/mapred-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.luoyan.mapred.WordCount /user/test/a.txt /user/test/b.txt 2
The output will then be written under /user/test/b.txt/ (one part-r-NNNNN file per reduce task).
