I. Development Environment
1. OS: Windows 7
2. IDE: Eclipse
3. Java: JDK 1.6
II. Required JARs
1. lucene-core-3.1.0.jar
2. paoding-analysis.jar
3. The paoding data dictionary (dic)
III. Cluster Environment
1. Nodes: 1 Master, 2 Slaves
2. OS: RedHat 6.2
3. JDK: JDK 1.6
4. Hadoop: Hadoop 1.1.2
5. Mahout: Mahout 0.6
6. Pig: Pig 0.11
IV. Data Preparation
1. Training (model) set: 18.7 MB, 8000+ files
2. Test set: 19.2 MB, 9000+ files
V. Development Steps
(I) Building the cbayes model
1. The model data consists of more than 8000 small files. Read with MapReduce's default FileInputFormat, each file becomes its own split, so at least 8000+ map tasks would be launched, which is extremely inefficient. To handle the small-file problem, we define a custom input format that extends CombineFileInputFormat, which packs multiple small files into a single split.
The custom CombineFileInputFormat and RecordReader are listed below:
1) The custom CombineFileInputFormat
package fileInputFormat;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MyFileInputFormat extends CombineFileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // Hand each file in the combined split to our custom RecordReader
        CombineFileRecordReader<Text, Text> recordReader = new CombineFileRecordReader<Text, Text>(
                (CombineFileSplit) split, context, MyFileRecordReader.class);
        // Return the custom RecordReader
        return recordReader;
    }

    // Each file must stay whole within a single split; one split may contain many files
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
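For reference, here is a minimal driver sketch (not part of the original project) showing how MyFileInputFormat might be wired into a job on Hadoop 1.x. The job name, input/output paths, and the use of the identity Mapper are illustrative assumptions; mapred.max.split.size is the Hadoop 1.x property that caps the byte size of each combined split:

package fileInputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: cap each combined split at 64 MB (Hadoop 1.x property name)
        conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);

        Job job = new Job(conf, "cbayes-combine-small-files");
        job.setJarByClass(CombineDriver.class);

        // The custom input format packs the 8000+ small files into a few splits,
        // so only a handful of map tasks are launched
        job.setInputFormatClass(MyFileInputFormat.class);

        // Identity mapper: passes each (key, value) pair through unchanged
        job.setMapperClass(Mapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the model-file directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}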
2) The custom RecordReader
package fileInputFormat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;
import org.apache.hadoop.util.ReflectionUtils;
public class MyFileRecordReader extends RecordReader<Text, Text> {

    private Text currentKey = new Text();   // current key
    private Text currentValue = new Text(); // current value
    private Configuration conf;             // job configuration
    private boolean processed;              // whether the current file has been read
    private CombineFileSplit split;         // the combined split being processed
    private int totalLength;                // number of files contained in the split
    private int index;                      // index of the current file within the split
    private float currentProgress = 0;      // current processing progress

    public MyFileRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index)
            throws IOException {
        super();
        this.split = split;
        // Index of the small-file block this reader handles within the CombineFileSplit
        this.index = index;
        this.conf = context.getConfiguration();
        this.totalLength = split.getPaths().length;
        this.processed = false;
    }
    @Override
    public void close() throws IOException {
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return currentValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (index >= 0 &&