9.1 Experiment Requirements
Suppose there is a 100 GB big table big.txt and a 1 MB small table small.txt. Using the MapReduce paradigm, write a program that counts how many times each word in the small table appears in the big table — the so-called "scan the big table, load the small table" approach. Since no actual 100 GB table (nor even a 1 MB one) is available, the experiment is run in simulation: a small amount of data stands in for big.txt, and an even smaller amount for small.txt.
9.2 Experiment
BigAndSmallTable.java
package lab9;

import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.LineReader;

public class BigAndSmallTable {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private static HashSet<String> smallTable = null;

        // Load the small table into a HashSet once per mapper, before map() is called
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            smallTable = new HashSet<String>();
            Path smallTablePath = new Path(context.getConfiguration().get("smallTableLocation"));
            FileSystem hdfs = smallTablePath.getFileSystem(context.getConfiguration());
            FSDataInputStream hdfsReader = hdfs.open(smallTablePath);
            Text line = new Text();
            LineReader lineReader = new LineReader(hdfsReader);
            while (lineReader.readLine(line) > 0) {
                String[] values = line.toString().split(" ");
                for (int i = 0; i < values.length; i++) {
                    smallTable.add(values[i]);
                    System.out.println(values[i]);
                }
            }
            lineReader.close();
            hdfsReader.close();
            System.out.println("setup ok");
        }

        // Scan the big table: emit (word, 1) only for words present in the small table
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] values = value.toString().split(" ");
            for (int i = 0; i < values.length; i++) {
                if (smallTable.contains(values[i])) {
                    context.write(new Text(values[i]), one);
                }
            }
        }
    }

    // Sum the 1s emitted for each matched word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("Usage: BigAndSmallTable <big-table input> <small-table file> <output dir>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // Pass the small-table path to every mapper through the job configuration
        conf.set("smallTableLocation", args[1]);
        Job job = Job.getInstance(conf, "BigAndSmallTable");
        job.setJarByClass(BigAndSmallTable.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Code analysis:
The small table is loaded into a HashSet in setup(); each Mapper then scans the big table and emits (word, 1) for every word that appears in the set, and the reducer sums the counts per word.
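The same load-small/scan-big logic can be sketched without Hadoop at all. In this minimal plain-Java sketch (the class and method names `MapSideJoinSketch` and `countMatches` are illustrative, not part of the lab code), the small table is a `HashSet` and each big-table line is scanned with an O(1) membership check per word:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MapSideJoinSketch {
    // Count occurrences in bigLines of words that appear in smallWords
    static Map<String, Integer> countMatches(Set<String> smallWords, List<String> bigLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : bigLines) {                 // "scan the big table"
            for (String word : line.split(" ")) {
                if (smallWords.contains(word)) {       // O(1) lookup in the loaded small table
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> small = new HashSet<>(Arrays.asList("eee", "sss", "555"));
        List<String> big = Arrays.asList(
                "aaa bbb ccc ddd eee fff ggg",
                "000 111 222 333 444 555 666 777 888 999",
                "ooo ppp qqq rrr sss ttt");
        // TreeMap sorts the keys for a deterministic printout
        System.out.println(new TreeMap<>(countMatches(small, big))); // {555=1, eee=1, sss=1}
    }
}
```

This is exactly what the mapper's setup()/map() pair does, minus the (word, 1) emission and the reducer's summation, which the in-memory map replaces here.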
Input file big.txt:
aaa bbb ccc ddd eee fff ggg
hhh iii jjj kkk lll mmm nnn
000 111 222 333 444 555 666 777 888 999
ooo ppp qqq rrr sss ttt
uuu vvv www xxx yyy zzz
Input file small.txt:
eee sss 555
Output:
555 1
eee 1
sss 1
The program takes three arguments — the big-table input path (args[0]), the small-table file (args[1]), and the output directory (args[2]) — and can be run from Eclipse in local mode without starting HDFS, since the paths then resolve to the local filesystem.