While filtering duplicate data with Hadoop, I ran into a problem: identical records were not removed; instead, everything came out sorted but still duplicated. I then wrote a word-count program and saw the same effect: values sharing the same key were not aggregated together. The screenshots below show the effect.
This is the original data:
This is the data after processing:
Each line was merely split into tokens; entries sharing the same key were never grouped together.
The original code is as follows:
package ccnu.eisr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataDeduplicationMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // line is the line of input text that MapReduce handed to this call
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            // emit the word as key and a count of 1 as value; this pair is sent to the reducer
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
package ccnu.eisr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DataDeduplicationReducer extends Reducer<Text, Text, Text, LongWritable> {
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable num : values) {
            count += num.get();
        }
        context.write(key, new LongWritable(count));
    }
}
package ccnu.eisr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataDeduplicationRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf, "dedup");
        // set the jar that contains this job's classes
        wcjob.setJarByClass(DataDeduplicationRunner.class);
        // the mapper class this job uses
        wcjob.setMapperClass(DataDeduplicationMapper.class);
        // the reducer class this job uses
        wcjob.setReducerClass(DataDeduplicationReducer.class);
        wcjob.setCombinerClass(DataDeduplicationReducer.class);
        // key/value types emitted by the mapper
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(LongWritable.class);
        // key/value types emitted by the reducer
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(LongWritable.class);
        // path of the raw input data to process
        FileInputFormat.setInputPaths(wcjob, "hdfs://127.0.0.1:9000/datadedup");
        // path where the results are written
        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://127.0.0.1:9000/output"));
        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
I tried many approaches and even began to suspect the machine itself; at worst I rebooted the server, yet the result was still wrong. After more than a day of struggling, inspiration finally struck. I changed the reducer's output and noticed the result did not change at all, which suggested that my reducer was never being executed and the default reducer was running instead. Since my reducer extends org.apache.hadoop.mapreduce.Reducer, something about the inheritance might be broken, so I added @Override to the reduce method. Sure enough, it failed to compile with a type-mismatch error. Looking closely, the generic parameters really were wrong: the class was declared Reducer<Text, Text, Text, LongWritable>, but the values it consumes are LongWritable, so my reduce method was an overload, not an override. My own carelessness caused the whole problem.
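This also explains why the output looked "split but never merged": when your own reduce is not picked up, Hadoop's base Reducer.reduce runs, and it is an identity pass-through that writes every (key, value) pair back out unchanged. The sketch below is plain Java with no Hadoop dependencies, and the class and method names in it are purely illustrative; it only mimics what the default reduce does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IdentitySketch {
    // Mimics the default reduce() of org.apache.hadoop.mapreduce.Reducer:
    // for each value of a key, emit (key, value) unchanged, one line per value.
    static List<String> defaultReduce(String key, List<Long> values) {
        List<String> out = new ArrayList<>();
        for (Long v : values) {
            out.add(key + "\t" + v);   // no summing, no deduplication
        }
        return out;
    }

    public static void main(String[] args) {
        // "hello" appeared 3 times; the identity reducer emits 3 separate
        // lines instead of the single summed line "hello\t3" we wanted.
        System.out.println(defaultReduce("hello", Arrays.asList(1L, 1L, 1L)));
    }
}
```

This matches the symptom in the screenshots exactly: every mapper output pair survives into the final result, merely grouped and sorted by key.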
The corrected code is as follows:
package ccnu.eisr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DataDeduplicationMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // line is the line of input text that MapReduce handed to this call
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            // emit the word as key and a count of 1 as value; this pair is sent to the reducer
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
package ccnu.eisr;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The second type parameter is now LongWritable, matching the values
// that reduce() actually consumes, so reduce() is a true override.
public class DataDeduplicationReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable num : values) {
            count += num.get();
        }
        context.write(key, new LongWritable(count));
    }
}
package ccnu.eisr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataDeduplicationRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf, "dedup");
        // set the jar that contains this job's classes
        wcjob.setJarByClass(DataDeduplicationRunner.class);
        // the mapper class this job uses
        wcjob.setMapperClass(DataDeduplicationMapper.class);
        // the reducer class this job uses
        wcjob.setReducerClass(DataDeduplicationReducer.class);
        wcjob.setCombinerClass(DataDeduplicationReducer.class);
        // key/value types emitted by the mapper
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(LongWritable.class);
        // key/value types emitted by the reducer
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(LongWritable.class);
        // path of the raw input data to process
        FileInputFormat.setInputPaths(wcjob, "hdfs://127.0.0.1:9000/datadedup");
        // path where the results are written
        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://127.0.0.1:9000/output"));
        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
To sum up: when your own code is not being executed, the likely cause is a wrong method name or wrong parameter types, which turns your intended override into a mere overload. To override the parent class's method, the method name and the parameter types must match exactly. It is best to always add @Override to methods you intend to override, so the compiler catches the mismatch immediately.
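The overload-vs-override trap can be reproduced without Hadoop at all. The minimal sketch below uses invented names (Base, Buggy, Fixed) purely for illustration: a generic base class whose framework calls handle() through a base-class reference, a subclass whose mismatched parameter type silently creates an overload, and a fixed subclass whose @Override-annotated method really is dispatched.

```java
class OverrideDemo {
    static class Base<V> {
        // A framework would call this method through a Base reference.
        protected String handle(V value) { return "base"; }
    }

    // Bug: declared Base<Integer> but wrote handle(Long) -- this is an
    // overload, so dynamic dispatch never reaches it. Adding @Override
    // here would fail to compile, exposing the mistake immediately.
    static class Buggy extends Base<Integer> {
        protected String handle(Long value) { return "buggy"; }
    }

    // Fix: the type parameter matches the method signature, so this
    // is a true override and @Override compiles cleanly.
    static class Fixed extends Base<Long> {
        @Override
        protected String handle(Long value) { return "fixed"; }
    }

    public static void main(String[] args) {
        Base<Integer> buggy = new Buggy();
        Base<Long> fixed = new Fixed();
        System.out.println(buggy.handle(1));   // prints "base": the overload is never called
        System.out.println(fixed.handle(1L));  // prints "fixed": the override is dispatched
    }
}
```

This is exactly what happened with the reducer: Hadoop called reduce() through the Reducer base class, and because the generic parameters did not match, my method was invisible to that call.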