1) Add the hadoop-streaming-2.6.5.jar dependency to the MapReduce project.
2) Key code in the main function:
JobConf jobconf = new JobConf(new Configuration(), MreMroParser.class);
jobconf.setJobName("xmlParser");
// Use the streaming XML record reader
jobconf.set("stream.recordreader.class", StreamXmlRecordReader.class.getName());
// The record begin tag is <bulkPmMrDataFile>
jobconf.set("stream.recordreader.begin", "<bulkPmMrDataFile>");
// The record end tag is </bulkPmMrDataFile>
jobconf.set("stream.recordreader.end", "</bulkPmMrDataFile>");
// Separate the key and value in the reduce output with a comma
jobconf.set("mapred.textoutputformat.ignoreseparator", "true");
jobconf.set("mapred.textoutputformat.separator", ",");
jobconf.setMapperClass(MreMroParserMapper.class);
jobconf.setReducerClass(xmlParserReducer.class);
// Set the input format (StreamInputFormat drives the streaming record reader)
jobconf.setInputFormat(StreamInputFormat.class);
jobconf.setOutputFormat(TextOutputFormat.class);
jobconf.setOutputKeyClass(Text.class);
jobconf.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(jobconf, new Path(args[0]), StreamInputFormat.class, MreMroParserMapper.class);
FileOutputFormat.setOutputPath(jobconf, new Path(args[1]));
JobClient.runJob(jobconf);
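With the configuration above, StreamXmlRecordReader scans the input split for the configured begin and end markers and hands each enclosed chunk to the mapper as one record. A minimal stand-alone sketch of that splitting behavior (plain Java, no Hadoop dependency; `splitRecords` is a hypothetical helper written for illustration, not the actual Hadoop implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitSketch {
    // Collect every substring from a begin marker through its matching end
    // marker (both inclusive), mimicking how stream.recordreader.begin/end
    // delimit one XML record per map() call.
    static List<String> splitRecords(String input, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = input.indexOf(begin, from);
            if (s < 0) break;
            int e = input.indexOf(end, s + begin.length());
            if (e < 0) break;
            records.add(input.substring(s, e + end.length()));
            from = e + end.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<bulkPmMrDataFile>a</bulkPmMrDataFile>"
                   + "<bulkPmMrDataFile>b</bulkPmMrDataFile>";
        List<String> records =
                splitRecords(xml, "<bulkPmMrDataFile>", "</bulkPmMrDataFile>");
        System.out.println(records.size()); // prints 2
    }
}
```

Note this sketch ignores split boundaries; the real record reader also has to handle records that straddle HDFS block splits.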
3) Core code of the Map class MreMroParserMapper:
public class MreMroParserMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
/*
 * With StreamInputFormat the whole XML record arrives in the key;
 * output records are written through the OutputCollector
 * (old org.apache.hadoop.mapred API, so there is no Context object).
 */
@Override
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String xmlContent = key.toString();
    System.out.println("'" + xmlContent + "'");
/* Feed xmlContent into a custom XML parsing function */
………………
I use dom4j:
Document document = DocumentHelper.parseText(xmlContent);
Element elementRoot = document.getRootElement();
Parsing returns a multi-record List, resultDatas.
………………
Output the parsed records one by one:
for (int i = 0; i < resultDatas.size(); i++) {
    String data = dataFormater.formatResultData(resultDatas.get(i));
    Text text = new Text();
    text.set(data);
    output.collect(new Text(resultDatas.get(i).getId()), text);
}
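The dom4j calls above (DocumentHelper.parseText followed by getRootElement) can be mirrored with the JDK's built-in DOM parser, which needs no extra jar. A sketch under that substitution (`rootName` is a hypothetical helper, not part of the original mapper):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomParseSketch {
    // Parse one XML record string and return its root element name,
    // mirroring DocumentHelper.parseText(xmlContent).getRootElement()
    // in dom4j, but using only the JDK.
    static String rootName(String xmlContent) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xmlContent.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        return root.getTagName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName("<bulkPmMrDataFile><mr/></bulkPmMrDataFile>"));
        // prints bulkPmMrDataFile
    }
}
```

dom4j remains a reasonable choice in the mapper itself; its Element navigation API is terser than raw W3C DOM when walking the per-record measurement elements.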