1. Objectives
- Master basic MapReduce programming through hands-on practice;
- Learn to solve common data-processing problems with MapReduce, including deduplicated counting and data sorting.
2. Experiment Platform
- Operating system: Linux
- Hadoop version: 3.2.2
3. Experiment Steps
The source file is fairly large, so only part of the data is used. The helper program below trims the file to roughly its first 100 records and keeps only the four columns the later jobs need:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class file {
    public static void main(String[] args) {
        File inFile = new File("/home/yuchen/user.csv");   // source CSV file
        File outFile = new File("/home/yuchen/file.csv");  // trimmed output file
        try {
            BufferedReader reader = new BufferedReader(new FileReader(inFile));
            BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
            String str;
            int num = 0;
            // Keep only the first ~100 records, extracting the phone number
            // (column 3), website domain (column 17), and uplink/downlink
            // traffic (columns 25/26).
            while ((str = reader.readLine()) != null && num++ <= 100) {
                String[] strSplit = str.split("\t");
                writer.write(strSplit[2] + "\t" + strSplit[16] + "\t" + strSplit[24] + "\t" + strSplit[25]);
                writer.newLine();
            }
            reader.close();
            writer.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
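Both MapReduce jobs below read their input from hdfs://localhost:9000/input2/file, so the trimmed file must first be uploaded to HDFS. Here is a minimal sketch using the Hadoop FileSystem API; the local path and HDFS destination are the same ones assumed above, and uploading with the hdfs dfs -put shell command would work equally well:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Upload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the local pseudo-distributed HDFS instance.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Copy the trimmed file into the input path the jobs read from.
        fs.copyFromLocalFile(new Path("/home/yuchen/file.csv"),
                             new Path("/input2/file"));
        fs.close();
    }
}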
(1) Deduplicated count of users visiting the same website.
Note: in the file userurl_20150911, fields are separated by "\t"; the user's phone number is in the 3rd column and the website's main domain is in the 17th column.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSort {
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            if (line != null && !line.equals("")) {
                // Emit (phone + "\t" + domain, 1); the shuffle groups
                // identical user-site pairs onto the same reduce call.
                String[] strSplit = line.split("\t");
                context.write(new Text(strSplit[0] + "\t" + strSplit[1]), new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Count the records sharing this user-site pair; the pair itself
            // is emitted only once, which performs the deduplication.
            int num = 0;
            for (IntWritable val : values) {
                num++;
            }
            context.write(key, new IntWritable(num));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        Job job = Job.getInstance(conf, "Merge");
        job.setJarByClass(MergeSort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/input2/file"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/output2"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
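Because the map output key concatenates the phone number and the domain, the shuffle delivers every record of one user-site pair to a single reduce call, so each distinct pair appears exactly once in the output together with its record count. As an illustration on hypothetical data, two input records beginning 13800000000\tbaidu.com would collapse into the single output line 13800000000 baidu.com 2. After packaging the class into a jar, the job can be launched with the hadoop jar command and the result inspected with hdfs dfs -cat /output2/part-r-00000.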
(2) For each record of the same user, sum the uplink and downlink traffic, then output the sums in sorted order.
I treat the different traffic values generated by the same user visiting the same website as distinct records, so the sorting is done directly in the reduce phase.
Note: uplink traffic is in the 25th column and downlink traffic is in the 26th column.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSort {
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            if (line != null && !line.equals("")) {
                // Emit (phone + "\t" + domain, uplink + downlink traffic).
                String[] strSplit = line.split("\t");
                context.write(new Text(strSplit[0] + "\t" + strSplit[1]),
                        new IntWritable(Integer.parseInt(strSplit[2]) + Integer.parseInt(strSplit[3])));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Collect the traffic sum of every record for this user-site pair.
            // A List avoids the overflow a fixed-size array would suffer when
            // a key carries more records than expected.
            List<Integer> traffic = new ArrayList<Integer>();
            for (IntWritable val : values) {
                traffic.add(val.get());
            }
            // Sort in descending order and emit the largest totals first.
            Collections.sort(traffic, Collections.reverseOrder());
            for (int t : traffic) {
                context.write(key, new IntWritable(t));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        Job job = Job.getInstance(conf, "Merge");
        job.setJarByClass(MergeSort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/input2/file"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/output3"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
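Two details are worth noting. The reduce phase stores the per-record sums in a List rather than a fixed-size array, since a user-site pair may produce an arbitrary number of records. And because the MapReduce framework only sorts keys between the map and reduce phases, the values themselves must be sorted explicitly inside reduce, here in descending order so the largest traffic totals are emitted first.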
4. Summary and Problems
1. What did you learn to use, and for what tasks?
2. What problems did you encounter during the experiment, and how were they solved?
3. What problems remain unsolved, and what might be causing them?
Notes:
1. Please submit your results to the Xuexitong platform;
2. Send the Word document directly; do not use compressed archives;
3. Name the file: StudentID-Name-ExperimentNumber.