Experiment Objectives
1. Understand the MapReduce model
2. Understand the basic usage of the Hadoop Java API
3. Understand the basic workflow of data analysis
Experiment Environment
1. Linux Ubuntu 14.04
2. hadoop-2.6.0-cdh5.4.5
3. hadoop-2.6.0-eclipse-cdh5.4.5.jar
4. eclipse-java-juno-SR2-linux-gtk-x86_64
Experiment Content
1. A search engine handles a large volume of search requests every day.
2. Every time a user runs a search, the engine writes one log record describing that search.
3. The search log file, named solog, is in the /data/mydata/ directory. Each record is one line of tab-separated fields: timestamp, IP address, screen width, screen height, current search term, previous search term. For example:
- 20161220155843 192.168.1.179 1366 768 Sqoop HTML
- 20161220155911 192.168.1.179 1366 768 SparkR Sqoop
- 20161220155914 192.168.1.155 1600 900 hadoop
- 20161220155921 192.168.1.155 1600 900 hahahhaha hadoop
- 20161220155928 192.168.1.155 1600 900 sqoop hahahhaha
4. Write a MapReduce program that counts how many times each search term was searched (a WordCount; see the worked example after the sample results below).
5. Sample results:
- AI 2
- AR 2
- AWK 2
- Apache 2
- CSS 4
- Cassandra 2
- DataMining 7
- Docker 2
- ETL 4
- Echarts 6
- Flume 6
- HDFS 6
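To make the counting rule concrete, here is a small worked example. Assuming the five sample records in item 3 were the entire input, the program in the steps below counts every search-term field it encounters, that is, both the current term and, when present, the previous term of each record, and matching is case-sensitive. The output would then be:
- HTML 1
- SparkR 1
- Sqoop 2
- hadoop 2
- hahahhaha 2
- sqoop 1
Sqoop, for example, is counted twice: once as the current term of the first record and once as the previous term of the second; Sqoop and sqoop are tallied separately.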
Experiment Steps
1. Open Eclipse and create a new project.
Select Map/Reduce Project.
In the dialog that appears, enter mapreducedemo as the project name and set the project location to the /data/myjava directory.
You also need to specify the Hadoop path here: click "Configure Hadoop Install directory", and in the dialog that opens, select the directory where Hadoop is installed. When done, click Next.
Then click Finish to complete project creation.
2. Next, right-click the src directory under mapreducedemo and create a new package.
In the dialog, enter my.mr as the package name.
Under my.mr, create a new class for the WordCount job, named MyWordCount.
3. The search log file solog is already in the /data/mydata/ directory. Upload /data/mydata/solog to the /mydata/ directory on HDFS.
If the /mydata directory does not exist on HDFS yet, create it first:
- hadoop fs -mkdir /mydata
Upload the file:
- hadoop fs -put /data/mydata/solog /mydata/solog
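You can then list the directory to confirm the upload succeeded:
- hadoop fs -ls /mydata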
4. Now write the MapReduce code. Most MapReduce programs share the same basic structure, shown below.
- package my.mr;
- import java.io.IOException;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.Reducer;
- public class MyWordCount {
- public static void main(String[] args) {
- }
- public static class doMapper extends Mapper<Object, Text, Text, IntWritable>{
- @Override
- protected void map(Object key, Text value, Context context)
- throws IOException, InterruptedException {
- }
- }
- public static class doReducer extends Reducer<Text,IntWritable, Text, IntWritable>{
- @Override
- protected void reduce(Text key, Iterable<IntWritable> values, Context context)
- throws IOException, InterruptedException {
- }
- }
- }
The main method is the entry point of the MapReduce program; it configures the job and submits it for execution.
doMapper is the custom Map class.
doReducer is the custom Reduce class.
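The four type parameters on Mapper&lt;Object, Text, Text, IntWritable&gt; are the map input key type, input value type, output key type, and output value type. With the default TextInputFormat, the input key is the byte offset of each line within the file (declared loosely as Object here) and the input value is the line itself; this mapper will emit a search term (Text) paired with a count (IntWritable). The Reducer's first two type parameters must match the mapper's output types.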
The complete code is as follows:
- package my.mr;
- import java.io.IOException;
- import java.util.StringTokenizer;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.Reducer;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- public class MyWordCount {
- public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
- // Create the job and configure its name, jar, mapper, reducer, and output types
- Job job = Job.getInstance();
- job.setJobName("MyWordCount");
- job.setJarByClass(MyWordCount.class);
- job.setMapperClass(doMapper.class);
- job.setReducerClass(doReducer.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(IntWritable.class);
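- // Optional and not part of the original lab: because the reduce is a pure
- // sum (associative and commutative), the reducer class could also be set
- // as a combiner to shrink the data shuffled between map and reduce:
- // job.setCombinerClass(doReducer.class);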
- // Input log file on HDFS and the job's output directory
- Path in = new Path("hdfs://localhost:9000/mydata/solog");
- Path out = new Path("hdfs://localhost:9000/myout/1");
- FileInputFormat.addInputPath(job, in);
- FileOutputFormat.setOutputPath(job, out);
- // Submit the job, wait for it to finish, and exit 0 on success
- System.exit(job.waitForCompletion(true) ? 0 : 1);
- }
- public static class doMapper extends Mapper<Object, Text, Text, IntWritable>{
- // Reused across map() calls: the constant count 1 and the output word
- public static final IntWritable one = new IntWritable(1);
- public static Text word = new Text();
- @Override
- protected void map(Object key, Text value, Context context)
- throws IOException, InterruptedException {
- StringTokenizer tokenizer = new StringTokenizer(value.toString(), "\t");
- // Skip the first four tab-separated fields: timestamp, IP, screen width, screen height
- for (int i = 0; i < 4 && tokenizer.hasMoreTokens(); i++) {
- tokenizer.nextToken();
- }
- // Every remaining field is a search term (current, and previous if present);
- // emit each one with a count of 1
- while (tokenizer.hasMoreTokens()) {
- word.set(tokenizer.nextToken());
- context.write(word, one);
- }
- }
- }
- public static class doReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
- private IntWritable result = new IntWritable();
- @Override
- protected void reduce(Text key, Iterable<IntWritable> values, Context context)
- throws IOException, InterruptedException {
- // Sum all the 1s emitted for this search term
- int sum = 0;
- for (IntWritable value : values) {
- sum += value.get();
- }
- result.set(sum);
- context.write(key, result);
- }
- }
- }
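Note that FileOutputFormat refuses to write into a directory that already exists: rerunning the job with the same output path fails with a FileAlreadyExistsException. Before a rerun, either delete the old output or change the output path:
- hadoop fs -rm -r /myout/1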
5. In the MyWordCount class file, right-click and choose Run As => Run on Hadoop to submit the MapReduce job to the Hadoop cluster.
Once the job has finished, switch to the command line and inspect the change in the HDFS directory structure:
- hadoop fs -lsr /myout
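On Hadoop 2.x, the -lsr flag still works but is deprecated; the current equivalent is:
- hadoop fs -ls -R /myout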
View the output written to the /myout/1 directory on HDFS:
- hadoop fs -text /myout/1/*
You should see the concrete results: one search term and its total count per line, in the format shown in item 5 of the experiment content.