Experiment Content
- Implement an inverted index: count how many times each word appears in each of several files;
- Input: create a few files yourself, e.g. a.txt, b.txt, c.txt. Each file contains several lines of words separated by spaces. Upload these files to the /in directory on HDFS. For example, a.txt contains: hadoop google scau map hadoop reduce hive hello hbase
- Write a program that produces the inverted index for the words.
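For reference, the final output pairs each word with the files it appears in and the per-file counts. Assuming b.txt and c.txt also contain some of the same words (their exact contents are not fixed by the assignment), the result for the sample a.txt would look roughly like this (key and value are tab-separated, file order may vary):
hadoop	a.txt->2; b.txt->1;
hive	a.txt->1; c.txt->1;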
Implementation Process
Tools: IDEA, Xshell 6, Xftp.
Create a Maven project in IDEA:
Add the Hadoop dependencies to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.7</version>
    </dependency>
</dependencies>
Write the code:
Mapper class, implementing the map method:
package com.hadoop.invertedindex;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;

public class InvertedMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static Text keyInfo = new Text();
    private static final Text valueInfo = new Text("1");

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(" ");
        // Determine which file this split comes from
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        // Emit ("word->fileName", "1") for every word occurrence
        for (String field : fields) {
            keyInfo.set(field + "->" + fileName);
            context.write(keyInfo, valueInfo);
        }
    }
}
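For the sample line of a.txt given above, this mapper emits one ("word->fileName", "1") pair per word occurrence, roughly:
hadoop->a.txt	1
google->a.txt	1
scau->a.txt	1
map->a.txt	1
... (one pair per occurrence, so hadoop->a.txt is emitted twice for a.txt)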
Combiner class, implementing the reduce method:
package com.hadoop.invertedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class InvertedCombiner extends Reducer<Text, Text, Text, Text> {
    private static Text info = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Sum the counts for this "word->fileName" key
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // Re-split the key: new key = word, new value = "fileName->count"
        // (+ 2 skips the two-character "->" separator)
        int splitIndex = key.toString().indexOf("->");
        info.set(key.toString().substring(splitIndex + 2) + "->" + sum);
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, info);
    }
}
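Running on the output of each map task, the combiner sums the counts per "word->fileName" key and re-splits the key, so the pairs shown above are condensed to something like:
hadoop	a.txt->2
google	a.txt->1
hive	a.txt->1
The reducer then receives, for each word, the fileName->count entries from all files and joins them into the final lines shown in the experiment description.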
Reducer class, implementing the reduce method:
package com.hadoop.invertedindex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class InvertedReducer extends Reducer<Text, Text, Text, Text> {
    private static Text result = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Concatenate all "fileName->count" entries for this word
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append("; ");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}
Driver class InvertedIndex, implementing the main method:
package com.hadoop.invertedindex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class InvertedIndex {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the job and set the entry class
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(InvertedIndex.class);
        // Set the mapper class and its output key/value types
        job.setMapperClass(InvertedMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Set the combiner class
        job.setCombinerClass(InvertedCombiner.class);
        // Set the reducer class and the final output key/value types
        job.setReducerClass(InvertedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Input and output paths come from the command-line arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
Package the jar (double-click package in IDEA's Maven window):
After the jar is generated, use Xftp to transfer it to the Linux machine (destination: /usr/local/hadoop/share/hadoop/mapreduce).
On Linux:
Start the Hadoop distributed cluster:
./sbin/start-all.sh
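Optionally confirm that the daemons are up with jps; on a typical pseudo-distributed setup the list includes NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact set depends on the cluster layout):
jps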
Create the /invertedindexinput folder on HDFS:
hdfs dfs -mkdir /invertedindexinput
Create three text files a.txt, b.txt, c.txt and upload them to the /invertedindexinput folder (a.txt shown here; b.txt and c.txt follow the same steps, see below):
vim a.txt
hdfs dfs -put ./a.txt /invertedindexinput
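b.txt and c.txt are uploaded the same way, and the contents of the input folder can be checked afterwards:
hdfs dfs -put ./b.txt /invertedindexinput
hdfs dfs -put ./c.txt /invertedindexinput
hdfs dfs -ls /invertedindexinput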
Run the jar:
First change into the folder that contains the jar:
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar MapReduce-1.0-SNAPSHOT.jar com.hadoop.invertedindex.InvertedIndex /invertedindexinput /invertedindexoutput
com.hadoop.invertedindex.InvertedIndex is the fully qualified name of the driver (main) class;
/invertedindexinput is the input folder;
/invertedindexoutput is the output folder.
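Note that MapReduce will not start if the output folder already exists, so delete /invertedindexoutput before re-running the job. Once the job finishes, the result can be read from the part file written by the single reducer (default name part-r-00000):
hdfs dfs -rm -r /invertedindexoutput
hdfs dfs -cat /invertedindexoutput/part-r-00000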
Run results: