【The Big Data Learning Path: Hadoop】

一、Configuring the IDEA development environment

How to configure IntelliJ IDEA to connect remotely to a Hadoop cluster. This article covers the following steps:

1. Download the required files

https://pan.baidu.com/s/1o1UJe1M2x4kIpWIRD1zNFw?pwd=6btq
hadoop-2.6.1.tar.gz
hadoop2.6.5-on-windows-winutils_X64.zip
jdk-8u231-windows-x64.exe
maven:https://topabu.lanzout.com/iDVGk0e1fr8b

2. Install the JDK

On Windows the JDK is a one-click installer: just keep clicking Next, then configure the environment variables.
Set JAVA_HOME and CLASSPATH, then add the JDK's bin directory to PATH.

On macOS and Linux, edit /etc/profile or ~/.bashrc:

vim  /etc/profile

vim  ~/.bashrc
#set java environment
export JAVA_HOME=/usr/local/src/jdk1.8.0_371
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$JAVA_HOME/bin:$PATH

When finished, run java -version to check that Java is installed correctly.

3. Install Maven

1. After unpacking Maven, configure a system variable: right-click This PC > Advanced system settings > Environment Variables. Variable name: M2_HOME, value: F:\soft\maven-3.3.9 (the Maven directory on your own disk). Also append %M2_HOME%\bin to Path so the mvn command can be run from the command line.

2. Change the local repository location by editing F:\soft\maven-3.3.9\conf\settings.xml. Create a mvn_repo folder under F:\soft, then replace F:\maven-repository in the file with that path, F:\soft\mvn_repo; save and close the xml file.
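For reference, after the change the relevant element in settings.xml should end up looking roughly like this (the path is the example directory above; adjust it to your own):

<localRepository>F:\soft\mvn_repo</localRepository>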

When finished, run mvn -version to check that Maven is installed correctly.

4. Install Hadoop

4.1 Download the Hadoop archive.
4.2 Unpack hadoop2.6.5-on-windows-winutils_X64.zip wherever you prefer to keep software; here I install it into the soft directory on the F: drive.
4.3 Copy everything from the bin directory of hadooponwindows-master into the bin directory of hadoop-2.6.1, choosing "replace the files in the destination".
4.4 Copy hadoop.dll from the bin directory of hadoop-2.6.1 into C:\Windows\System32.
4.5 Configure an environment variable: right-click This PC > Advanced system settings > Environment Variables. Variable name: HADOOP_HOME, value: F:\soft\hadoop-2.6.1.
4.6 Add it to Path: in the system variables select Path, click Edit > New, and enter %HADOOP_HOME%\bin.
4.7 Edit the Windows hosts file to map the cluster IPs. File path: C:\Windows\System32\drivers\etc\hosts (skip this step if it is already configured). Each line is a node's IP address, a space, and that node's hostname.
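For example (the addresses and hostnames below are placeholders; use your own cluster's values):

192.168.1.101 master
192.168.1.102 slave1
192.168.1.103 slave2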

To verify: open cmd and run hadoop version.

5. Configure the JDK and Maven in IDEA

Set the project JDK and point IDEA at your Maven installation in the Settings dialog; this is standard and not covered in detail here.

二、homework_day03

  • Assignment 1: the 10 example programs (the principles and reading the code are the most fundamental part)
    A Python implementation of the 10 examples: https://www.freesion.com/article/6967750533/
  • Assignment 2: how to sample 1% of the data with MR. (My own assignment; write it once.)
  • Assignment 3: MapReduce implementation: draw k (4) groups of data, each containing several records, then average each group to get the initial cluster centers. (Advanced)
  • Assignment 4: the last page of the Hadoop case-study slides; then, based on the clustering result, produce a regional top-10 popularity recommendation. (The hardest part of the project.) There is a bug that needs to be found.

Code practice: the 10 MR examples

To-do list:

  • 1. wordCount
  • 2. IP deduplication
  • 3. Group averages
  • 4. Max and min values
  • 5. Serialization: shopping-amount statistics
  • 6. Partitioning: three partitions by region
  • 7. Combiner: applying a combiner to wordCount
  • 8. Sorting and global sorting
  • 9. Multi-file merge
  • 10. Multi-stage MR
  • 11. Mutual-friend query

1. wordCount

Approach: the map phase splits each line into words; the framework sorts the pairs and distributes them to the reducers, where the totals are computed.

Code walkthrough: MyWordCountMapper.java

package hadoop_test.word_count_demo_01;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

/**
 * My own WordCountMapper.
 * Splits the input, after which the framework partitions and sorts the (word, 1) pairs.
 */
public class MyWordCountMapper extends Mapper<LongWritable,Text ,Text,LongWritable> {

    private static final String SPLIT_MARK = " ";
    private static final long DEFINE_LENGTH = 1;

    /**
     * Override the map method.
     * @param key the byte offset of the line within the file
     * @param value one line of the input text
     * @param context the job context, used to read HDFS data and emit output
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {

        //convert the Text value to a String
        String line = value.toString();

        //split the line on SPLIT_MARK into ["word1","word2","word3",.....]
        String[] words = line.split(SPLIT_MARK);
        //iterate over every word
        for (String word :
                words) {
            //emit (word, 1) to the reduce phase
            context.write(new Text(word),new LongWritable(DEFINE_LENGTH));
        }
    }
}

Code walkthrough: MyWordCountReducer.java

package hadoop_test.word_count_demo_01;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

/**
 * My own WordCountReducer.
 * Merges the (word, 1) pairs distributed by the map phase.
 */
public class MyWordCountReducer extends Reducer<Text, LongWritable, Text,LongWritable> {

    /**
     * Override the reduce method.
     * @param key a unique word
     * @param values an iterator over the counts emitted by the map phase
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {

        long count = 0L;
        //the values coming from the map phase look like [{word1,[1,1,1,1]},{word2,[1,1]},..... ]
        for (LongWritable value : values) {
            //get the primitive value
            long num = value.get();
            //accumulate
            count += num;
        }
        context.write(key,new LongWritable(count));
    }
}

The driver: WordCountDriver.java

package hadoop_test.word_count_demo_01;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import hadoop_test.Utils_hadoop;
public class WordCountDriver {
    //constant: HDFS input path
    private static final String READ_HDFS_PATH = "/hadoop_test/word_count/acticle.txt";
    //constant: HDFS output path
    private static final String WRITE_HDFS_PATH = "/hadoop_test/word_count/word_count_result";

    public static void main(String[] args) throws Exception {
//        hadoop jar study_demo.jar hadoop_test.word_count_demo_01.WordCountDriver
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);


        //program entry point: the driver class
        job.setJarByClass(WordCountDriver.class);
        //set the mapper class (the class itself; the framework creates the instances)

        job.setMapperClass(MyWordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setReducerClass(MyWordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
//       to use a combiner, register its implementation class here; its output key/value types need not be set
//        job.setCombinerClass(WordCountCombine.class);
//        input file
        FileInputFormat.setInputPaths(job, new Path(READ_HDFS_PATH));

        //write the result to HDFS (delete the output directory first if it already exists)
        if( Utils_hadoop.testExist(conf,WRITE_HDFS_PATH)){
            Utils_hadoop.rmDir(conf,WRITE_HDFS_PATH);}
        FileOutputFormat.setOutputPath(job, new Path(WRITE_HDFS_PATH));
        job.waitForCompletion(true);

    }

}

Summary: when writing a mapper, pay attention to the generic types KEYIN, KEYOUT, VALUEIN and VALUEOUT, which depend on the job at hand. In this example KEYIN is the byte offset of the current line and VALUEIN is the line of text itself; the input is processed line by line.

2. IP deduplication

Approach: there are N IP addresses to deduplicate; just emit each IP as the key and hand it to the reduce phase, which groups identical keys so each IP appears only once.

Implementation, in order: mapper, reducer, driver.

package hadoop_test.data_duplicate_demo_02;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class MyIPDeduplicationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        System.out.println("=====mapper=====key====="+key);
        System.out.println("=====mapper=====value====="+value);
        context.write(value, NullWritable.get());
    }
}
package hadoop_test.data_duplicate_demo_02;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyIPDeduplicationReducer extends Reducer<Text, NullWritable,Text,NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Reducer<Text, NullWritable, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        System.out.println("=====reducer=====key====="+key);
        context.write(key,NullWritable.get());
    }
}
package hadoop_test.data_duplicate_demo_02;


import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DupDriver {
/*
192.168.234.21
192.168.234.22
192.168.234.21
192.168.234.21
192.168.234.23
192.168.234.21
192.168.234.21
192.168.234.21
192.168.234.25
192.168.234.21
192.168.234.21
192.168.234.26
192.168.234.21
192.168.234.27
192.168.234.21
192.168.234.27
192.168.234.21
192.168.234.29
192.168.234.21
192.168.234.26
192.168.234.21
192.168.234.25
192.168.234.25
192.168.234.21
192.168.234.22
192.168.234.21
 */
    //constant: HDFS input path
    private static final String READ_HDFS_PATH = "/hadoop_test/dup/dup.txt";
    //constant: HDFS output path
    private static final String WRITE_HDFS_PATH = "/hadoop_test/dup/dup_result";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(DupDriver.class);

        job.setMapperClass(MyIPDeduplicationMapper.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(READ_HDFS_PATH));
        job.setReducerClass(MyIPDeduplicationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        //delete the output directory if it already exists, then set the output path
        if( Utils_hadoop.testExist(conf,WRITE_HDFS_PATH)){
            Utils_hadoop.rmDir(conf,WRITE_HDFS_PATH);}
        FileOutputFormat.setOutputPath(job, new Path(WRITE_HDFS_PATH));
        job.waitForCompletion(true);
    }
}

Summary: this example is not particularly difficult. The main point is the map output value type: there is no need to emit an integer 1 for counting, because the goal is only deduplication and the grouping done before reduce already achieves it, so the value type can simply be NullWritable.

3. Group averages

tom 69
tom 88
tom 78
jary 109
jary 90
jary 81
jary 35
rose 23
rose 100
rose 230

Approach: group by the name in the first column; in the reducer, accumulate the scores and divide by the count.

Code walkthrough: MyGroupAvgMapper.java

package hadoop_test.avg_demo_03;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyGroupAvgMapper extends Mapper<LongWritable,Text,Text,LongWritable> {
    private static final String SPLIT_MARK = " ";

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        //e.g. "tom 69" -- read one line of input
        String line = value.toString();

        //split the line
        String[] splitValue = line.split(SPLIT_MARK);
        //the first element becomes the key sent to the reducer
        String outKey = splitValue[0];

        //the second element becomes the value sent to the reducer
        long outValue = Long.parseLong(splitValue[1]);
        context.write(new Text(outKey), new LongWritable(outValue));
    }
}

Code walkthrough: MyGroupAvgReducer.java

package hadoop_test.avg_demo_03;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyGroupAvgReducer extends Reducer<Text, LongWritable,Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, DoubleWritable>.Context context) throws IOException, InterruptedException {

        //running total of the scores
        int totalScores = 0;

        //number of records in this group
        int num = 0;

        //iterate over the values from the map phase and accumulate them; they are all the scores of this group
        //[{tom:[69,88,78]},{jary:[109,90,81,35]},....]
        for (LongWritable tempScore :
                values) {
            num += 1;
            long score = tempScore.get();
            totalScores += score;
        }
        //compute the average (cast to double, otherwise integer division truncates the fraction)
        double avg = (double) totalScores / num;
        context.write(key,new DoubleWritable(avg));
    }
}

AvgDriver.java

package hadoop_test.avg_demo_03;

import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgDriver {
/*
tom 69
tom 88
tom 78
jary 109
jary 90
jary 81
jary 35
rose 23
rose 100
rose 230
 */
    //constant: HDFS input path
    private static final String READ_HDFS_PATH = "/hadoop_test/avg/avg.txt";
    //constant: HDFS output path
    private static final String WRITE_HDFS_PATH = "/hadoop_test/avg/r1";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(AvgDriver.class);
        job.setMapperClass(MyGroupAvgMapper.class);
        job.setMapOutputKeyClass(Text.class);
        //the mapper emits LongWritable values, so the map output value class must match
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(READ_HDFS_PATH));
        job.setReducerClass(MyGroupAvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        //delete the output directory if it already exists, then set the output path
        if( Utils_hadoop.testExist(conf,WRITE_HDFS_PATH)){
            Utils_hadoop.rmDir(conf,WRITE_HDFS_PATH);}
        FileOutputFormat.setOutputPath(job, new Path(WRITE_HDFS_PATH));
        job.waitForCompletion(true);
    }
}

Execution result:

jary	78.0
rose	117.0
tom		78.0

Summary: basically the same as wordCount, but one thing to note: I first declared the returned average as IntWritable and only corrected it after looking at the reference code; an average should be a floating-point number, so DoubleWritable is the right output type. Also note that totalScores / num on two ints is integer division and truncates the fraction (which is why the results above are whole numbers); cast one operand to double for the exact average.

4. Max and min values

The problem, roughly:
2329999919500515070000
9909999919500515120022
9909999919500515180011

The middle digits are the year and the last digits are the temperature; find the maximum and minimum for each year.
(PS: the full problem statement is in the code comments.)
Approach: in the map phase the output key should be the year (Text or IntWritable both work) and the output value should be the temperature as an IntWritable. The reducer then receives data shaped like 1950 [0001, 0002, ...] and computes the maximum and minimum there.

Code walkthrough: MyMinAndMaxMapper.java

package hadoop_test.min_max_demo_04;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Find the maximum and minimum temperature per year.
 * Suppose we need to process a batch of weather data in the following format:
 * stored as ASCII, one record per line, 24 characters per line (including signs).
 * todo the problem statement is wrong: the temperature is actually at characters 19, 20, 21, 22
 * Characters 9, 10, 11, 12 are the year and characters 20, 21, 22, 23 are the temperature; find the highest temperature of each year.
 *
 * 2329999919500515070000
 * 9909999919500515120022
 * 9909999919500515180011
 * 9509999919490324120111
 * 6509999919490324180078
 * 9909999919370515070001
 * 9909999919370515120002
 * 9909999919450515180001
 * 6509999919450324120002
 * 8509999919450324180078
 */
public class MyMinAndMaxMapper extends Mapper<LongWritable,Text,Text,IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        //read one line of input
        //String line = key.toString();
        String line = value.toString();
        String year = line.substring(8,12);//year
        String temperate = line.substring(18,21);//temperature
		//String temperate = line.substring(18,22);//temperature
		//I first wrote 22; because of the error in the problem statement this kept throwing StringIndexOutOfBoundsException. Once I found the problem I changed it to 21. The reference answer uses String temperate = line.substring(18);
		//which is another way to take everything from position 18 to the end of the line
		
        //emit to the reduce phase
        /*
        the output looks like:
        1945 0030
        1950 0040
        ...
         */
        context.write(new Text(year),new IntWritable(Integer.parseInt(temperate)));
    }

    public static void main(String[] args) {
        String line = "9909999919450515182345";
        String year = line.substring(8, 12);//year
        String temperate = line.substring(18,22);//temperature
        System.out.println(""+year+"----------"+Integer.parseInt(temperate));
    }
}

Problem encountered:

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 12

I checked the code and substring(8, 12) looked perfectly fine; debugging revealed:

String line = key.toString();

this line had been written as key.toString() instead of value.toString(). After correcting it the error was gone. 😭 So even the simple parts deserve close attention!!

The code comments explain the details.
Code walkthrough: MyMinAndMaxReducer.java

package hadoop_test.min_max_demo_04;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;

public class MyMinAndMaxReducer extends Reducer<Text, IntWritable, Text,Text> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, Text>.Context context) throws IOException, InterruptedException {
        /*
            the reducer receives:
            1945 [0030,0031,0032...]
            1950 [0001,0002,0030...]
            ...
         */
       /* (the approach from the reference answer)

        int max = 0;//if this value is smaller than the current temperature, there is a larger one
        int min = Integer.MAX_VALUE;//if this value is larger than the current temperature, there is a smaller one
        for (IntWritable temp :
                values) {
            int temperature = temp.get();//get the temperature as an int
            if (max < temperature){
                max = temperature;
            }
            if (min > temperature){
                min = temperature;
            }
            String outValue = min+"-"+max;//build the output string
            context.write(key,new Text(outValue));
        }*/
        ArrayList<Integer> sortList = new ArrayList<>();
        for (IntWritable temp :
                values) {
            int temperature = temp.get();//get the temperature as an int
            //add each value to the new list
            sortList.add(temperature);
        }

        //sort the list using the JDK's own sort, comparing the values against each other
        sortList.sort(Comparator.comparingInt(o -> o));
        //natural ordering is ascending, so the first element is the min and the last element is the max
        int min = sortList.get(0);
        int max = sortList.get(sortList.size()-1);
        String outValue = min+"-"+max;//build the output string
        context.write(key,new Text(outValue));
    }
}

Problem encountered:
My first idea was to pull the values out, add them to an ArrayList, sort it, and take the max and min. That works and is the approach used in the code above, but it ran into an out-of-bounds problem:

Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2

It turned out that

        int max = sortList.get(sortList.size());

this line used sortList.size() directly, which is one past the end because list indices start at 0; changing it to

        int max = sortList.get(sortList.size()-1);

removed the error. Now run MaxDriver.java:

package hadoop_test.min_max_demo_04;

import hadoop_test.ConstantPool;
import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class MaxDriver {

	/*
		Execution results:
		1937	2:1
		1945	78:1
		1949	111:78
		1950	22:0

		1937	1-2
		1945	1-78
		1949	78-111
		1950	0-22
	 */
	public static void main(String[] args) throws Exception {
		Configuration conf=new Configuration();
		Job job=Job.getInstance(conf);
		

		job.setJarByClass(MaxDriver.class);

		job.setMapperClass(MyMinAndMaxMapper.class);

		job.setReducerClass(MyMinAndMaxReducer.class);

		job.setMapOutputKeyClass(Text.class);

		job.setMapOutputValueClass(IntWritable.class);
		

		job.setOutputKeyClass(Text.class);

		job.setOutputValueClass(Text.class);
		

		FileInputFormat.setInputPaths(job,new Path(ConstantPool.MINANDMAX_READ_HDFS_PATH));

		//write the result to HDFS (delete the output directory first if it already exists)
		if( Utils_hadoop.testExist(conf, ConstantPool.MINANDMAX_WRITE_HDFS_PATH)){
			Utils_hadoop.rmDir(conf, ConstantPool.MINANDMAX_WRITE_HDFS_PATH);}
		FileOutputFormat.setOutputPath(job,new Path(ConstantPool.MINANDMAX_WRITE_HDFS_PATH));

		job.waitForCompletion(true);

	}
}

You may notice that a ConstantPool class appears from the fourth example onward: to keep the code in line with the Alibaba Java coding guidelines it makes sense to have a constants class that holds the HDFS paths and the separator characters used by all the examples. The ConstantPool as of this point is shown below; later versions differ only slightly and are not repeated.

package hadoop_test;

/**
 * Constant pool: defines the constants used across the examples.
 */
public abstract class ConstantPool {
    //special characters, numbers and lengths
    public static final String SPACE_SIGNAL = " ";
    public static final String TABS_SIGNAL = "\t";
    public static final String ENGLISH_COMMA_SIGNAL = ",";
    public static final String CHINESE_COMMA_SIGNAL = ",";
    public static final long DEFINE_LENGTH = 1;


    //constant input paths
    public static final String WORDCOUNT_READ_HDFS_PATH = "/hadoop_test/word_count/acticle.txt";
    public static final String DUP_READ_HDFS_PATH = "/hadoop_test/dup/dup.txt";
    public static final String AVG_READ_HDFS_PATH = "/hadoop_test/avg/avg.txt";
    public static final String MINANDMAX_READ_HDFS_PATH = "/hadoop_test/min_max/max.data";

    //constant output paths
    public static final String WORDCOUNT_WRITE_HDFS_PATH = "/hadoop_test/word_count/word_count_result";
    public static final String DUP_WRITE_HDFS_PATH = "/hadoop_test/dup/dup_result";
    public static final String AVG_WRITE_HDFS_PATH = "/hadoop_test/avg/r1";
    public static final String MINANDMAX_WRITE_HDFS_PATH = "/hadoop_test/min_max/result";


}

The execution result:

[root@master sbin]# hadoop fs -cat /hadoop_test/min_max/result/part-r-00000
23/06/04 05:15:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

1937	1-2
1945	1-78
1949	78-111
1950	0-22

5. Serialization: shopping-amount statistics

Example 5: total purchase amount per customer
	   phone      address  name    consum
	13877779999 	bj 		zs 		2145
	13766668888 	sh 		ls 		1028
	13766668888 	sh 		ls 		9987
	13877779999 	bj 		zs 		5678
	13544445555 	sz 		ww 		10577
	13877779999 	sh 		zs 		2145
	13766668888 	sh 		ls 		9987
	13877779999 	bj 		zs 		2184
	13766668888 	sh 		ls 		1524
	13766668888 	sh 		ls 		9844
	13877779999 	bj 		zs 		6554
	13544445555 	sz 		ww 		10584
	13877779999 	sh 		zs 		21454
	13766668888 	sh 		ls 		99747

Approach: at first I thought this was basically wordCount: split on spaces, index 2 is the name and index 3 is the amount spent.
So I wrote this straight away:

 @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] userValues = line.split(ConstantPool.SPACE_SIGNAL);//split on spaces into the four fields
        String name = userValues[2];
        String consume = userValues[3];
        context.write(new Text(name),new IntWritable(Integer.parseInt(consume)));
    }

 @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // zs [100,200,300..]
        int totalConsume = 0;
        for (IntWritable i :
                values) {
            int consume = i.get();
            totalConsume += consume;
        }
        context.write(key,new IntWritable(totalConsume));
    }

Execution result:

ls	132117
ww	21161
zs	40160

After checking the reference answer:
this example is not really about wordCount but about serialization. You need to create an entity class that implements Hadoop's serialization and deserialization methods, and refactor the code to use a FlowBean.
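The FlowBean class itself is not shown in these notes. Below is a minimal sketch of what it might look like — my own reconstruction, using the field names the mapper below relies on; the essential part is implementing org.apache.hadoop.io.Writable so Hadoop can serialize the bean between map and reduce:

package hadoop_test.avro_test_05.domain;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical reconstruction of FlowBean (the original class is not shown in these notes). */
public class FlowBean implements Writable {

    private String phone = "";
    private String add = "";
    private String name = "";
    private long consum;

    @Override
    public void write(DataOutput out) throws IOException {
        //serialization: the field order here must match readFields()
        out.writeUTF(phone);
        out.writeUTF(add);
        out.writeUTF(name);
        out.writeLong(consum);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        //deserialization: read the fields back in the same order
        phone = in.readUTF();
        add = in.readUTF();
        name = in.readUTF();
        consum = in.readLong();
    }

    public String getPhone() { return phone; }
    public void setPhone(String phone) { this.phone = phone; }
    public String getAdd() { return add; }
    public void setAdd(String add) { this.add = add; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public long getConsum() { return consum; }
    public void setConsum(long consum) { this.consum = consum; }

    @Override
    public String toString() {
        return "FlowBean [phone=" + phone + ",add=" + add + ",name=" + name + ",consum=" + consum + "]";
    }
}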
Code walkthrough: MyAvroMapper.java

package hadoop_test.avro_test_05.flow;

import hadoop_test.ConstantPool;
import hadoop_test.avro_test_05.domain.FlowBean;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/*
	   phone      address  name    consum
	13877779999 	bj 		zs 		2145
	13766668888 	sh 		ls 		1028
	13766668888 	sh 		ls 		9987
	13877779999 	bj 		zs 		5678
	13544445555 	sz 		ww 		10577
	13877779999 	sh 		zs 		2145
	13766668888 	sh 		ls 		9987
	13877779999 	bj 		zs 		2184
	13766668888 	sh 		ls 		1524
	13766668888 	sh 		ls 		9844
	13877779999 	bj 		zs 		6554
	13544445555 	sz 		ww 		10584
	13877779999 	sh 		zs 		21454
	13766668888 	sh 		ls 		99747

*/
public class MyAvroMapper extends Mapper<LongWritable,Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
       /* String line = value.toString();
        String[] userValues = line.split(ConstantPool.SPACE_SIGNAL);//split on spaces into the four fields
        String name = userValues[2];
        String consume = userValues[3];
        //create an object
        context.write(new Text(name),new IntWritable(Integer.parseInt(consume)));*/
        String line = value.toString();
        String[] userValues = line.split(ConstantPool.SPACE_SIGNAL);//split on spaces into the four fields
        String phone = userValues[0];
        String addr = userValues[1];
        String name = userValues[2];
        String consume = userValues[3];
        //create a FlowBean and fill in all the fields
        FlowBean flowBean = new FlowBean();
        flowBean.setPhone(phone);
        flowBean.setAdd(addr);
        flowBean.setName(name);
        flowBean.setConsum(Long.parseLong(consume));
        context.write(new Text(name),flowBean);
    }
}

MyAvroReducer.java

package hadoop_test.avro_test_05.flow;

import hadoop_test.avro_test_05.domain.FlowBean;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyAvroReducer extends Reducer<Text, FlowBean,Text,FlowBean> {
    /*@Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // zs [100,200,300..]
        int totalConsume = 0;
        for (IntWritable i :
                values) {
            int consume = i.get();
            totalConsume += consume;
        }
        context.write(key,new IntWritable(totalConsume));
    }*/

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        // zs [100,200,300..]
        long totalConsume = 0;
        FlowBean newFlowBean = new FlowBean();
        for (FlowBean oldFlowBean :
                values) {
            long consume = oldFlowBean.getConsum();
            totalConsume += consume;
            newFlowBean.setName(oldFlowBean.getName());
            newFlowBean.setPhone(oldFlowBean.getPhone());
            newFlowBean.setAdd(oldFlowBean.getAdd());
            //accumulate the total amount
            newFlowBean.setConsum(totalConsume);
        }
        context.write(key,newFlowBean);
    }
}

Output:

ls	FlowBean [phobe=13766668888,add=sh,name=ls,consum=132117]
ww	FlowBean [phobe=13544445555,add=sz,name=ww,consum=21161]
zs	FlowBean [phobe=13877779999,add=bj,name=zs,consum=40160]

Summary: the totals are correct, but something is off: the region and name get overwritten. If two records share a name but have different regions, the later record overwrites the earlier one's region; strictly speaking these are not the same person, yet the job treats them as one and merges them.
So I think the result should really be:
zs. FlowBean [phobe=13877779999,add=bj,name=zs,consum=xxx]
zs. FlowBean [phobe=13877779999,add=sh,name=zs,consum=xxx]
ls FlowBean [phobe=13766668888,add=sh,name=ls,consum=132117]
ww FlowBean [phobe=13544445555,add=sz,name=ww,consum=21161]

6. Partitioning: three partitions by region

Same data as the previous example, but now split into three partitions by region.
A partitioner needs to be added:
MyPartitioner.java

package hadoop_test.partition_test_06.flow;

import hadoop_test.partition_test_06.domain.FlowBean;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, FlowBean> {
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
        if ("bj".equals(flowBean.getAddr())){
            return 0;
        }else if ("sh".equals(flowBean.getAddr())){
            return 1;
        }else {
            return 2;
        }
    }
}
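The driver for this example is not shown in these notes. For the three partition files in the output below to be produced, it would also need to register the partitioner and request three reduce tasks; a sketch of the relevant lines, using the class name above:

// driver additions (sketch; the remaining driver setup is the same as in example 5)
job.setPartitionerClass(MyPartitioner.class);
// three reduce tasks -> three output files part-r-00000 .. part-r-00002
job.setNumReduceTasks(3);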

Output:

[root@master sbin]# hadoop fs -cat /hadoop_test/avro/result/part-r-00000
23/06/04 07:46:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
zs	FlowBean [phone=13877779999, addr=bj, name=zs, consum=16561]
[root@master sbin]# hadoop fs -cat /hadoop_test/avro/result/part-r-00001
23/06/04 07:46:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls	FlowBean [phone=13766668888, addr=sh, name=ls, consum=132117]
zs	FlowBean [phone=13877779999, addr=sh, name=zs, consum=23599]
[root@master sbin]# hadoop fs -cat /hadoop_test/avro/result/part-r-00002
23/06/04 07:46:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ww	FlowBean [phone=13544445555, addr=sz, name=ww, consum=21161]

Summary: the mapper and reducer are the same as before; the name is still the key and the FlowBean the value.

7. Combiner: applying a combiner to wordCount

Goal: apply a combiner to wordCount.
First, the combiner concept: my personal understanding is that it is a reduce that runs on the map side.
Approach: with a combiner, the map output key is still Text and the value is still 1; the combiner takes over the role of the reducer, so this wordCount needs no explicit reducer (the default identity reducer simply passes the combined counts through), and the counts are accumulated in the combiner.
The code, briefly:

@Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(ConstantPool.SPACE_SIGNAL);
        for (String word :
                words) {
            context.write(new Text(word), new LongWritable(ConstantPool.DEFINE_LENGTH));
        }
    }

The combiner is the same as the earlier reducer; since it is familiar by now, no further comments are added:

package hadoop_test.combiner_07;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyWordCountCombiner extends Reducer<Text,LongWritable,Text,LongWritable>{

	@Override
	protected void reduce(Text key, Iterable<LongWritable> values,
			Context context) throws IOException, InterruptedException {
		long result=0;
		for(LongWritable value:values){
			result=result+value.get();
		}
		context.write(key, new LongWritable(result));
	}
}

Note that the driver changes: setReducerClass is simply commented out, and instead another class must be registered with setCombinerClass; its output key/value types do not need to be set. The code:

package hadoop_test.combiner_07;

import hadoop_test.ConstantPool;
import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(MyWordCombinerMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        //register the combiner implementation class; its output key/value types need not be set
        job.setCombinerClass(MyWordCountCombiner.class);
        FileInputFormat.setInputPaths(job, new Path(ConstantPool.WORDCOUNT_READ_HDFS_PATH));

//        job.setReducerClass(wordReducer.class);
//        job.setOutputKeyClass(Text.class);
//        job.setOutputValueClass(LongWritable.class);
        if( Utils_hadoop.testExist(conf,ConstantPool.WORDCOUNT_WRITE_HDFS_PATH)){
            Utils_hadoop.rmDir(conf,ConstantPool.WORDCOUNT_WRITE_HDFS_PATH);}
        FileOutputFormat.setOutputPath(job, new Path(ConstantPool.WORDCOUNT_WRITE_HDFS_PATH));
        job.waitForCompletion(true);

    }

}

Partial execution result:

yourself"	1
yourself,	4
yourself,"	2
yourselves	1
yourselves,	1
yourselves.	2
yourselves."	1
youth	1
zigzagging	2
zombie,	2
zoo	4
zoo,"	1
zoo.	2
zoom	1
zoomed	1

The timestamps show the output files were just updated:

[root@master sbin]# hadoop fs -ls /hadoop_test/word_count/word_count_result/
23/06/04 09:07:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 liqinglin supergroup          0 2023-06-04 09:05 /hadoop_test/word_count/word_count_result/_SUCCESS
-rw-r--r--   3 liqinglin supergroup     117200 2023-06-04 09:05 /hadoop_test/word_count/word_count_result/part-r-00000

8. Sorting and global sorting

1⃣️ Sorting

Requirement: sort the movies by their rating.

movie1 72
movie2 83
movie3 67
movie4 79
movie5 84
movie6 68
movie7 79
movie8 56
movie9 69 
movie10 57
movie11 68

Approach: no reducer is needed; the map phase emits each record as a Movie key and the MapReduce sort phase orders them for output.

Code walkthrough: MySortMapper.java

package hadoop_test.sort_test_08.sort;

import hadoop_test.ConstantPool;
import hadoop_test.sort_test_08.domain.Movie;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MySortMapper extends Mapper<LongWritable, Text,Movie, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Movie, NullWritable>.Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] elements = line.split(ConstantPool.SPACE_SIGNAL);
        String movieName = elements[0];
        int hot = Integer.parseInt(elements[1]);
        //write straight into the Movie entity, which overrides compareTo to sort by hot in descending order
        Movie movie = new Movie();
        movie.setHot(hot);
        movie.setName(movieName);
        context.write(movie,NullWritable.get());
    }
}

Summary: the key point is realizing that compareTo can be overridden on the Movie class; at first I was still thinking of putting everything into an array and sorting it there.
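The Movie class is not included in these notes either. A minimal sketch of what it might look like (my reconstruction), implementing WritableComparable so it can be used as a map output key that sorts by hot in descending order, as the comment above describes:

package hadoop_test.sort_test_08.domain;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical reconstruction of the Movie key class (not shown in these notes). */
public class Movie implements WritableComparable<Movie> {

    private String name = "";
    private int hot;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getHot() { return hot; }
    public void setHot(int hot) { this.hot = hot; }

    @Override
    public int compareTo(Movie other) {
        //descending by hot: higher ratings sort first
        return Integer.compare(other.hot, this.hot);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(hot);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        hot = in.readInt();
    }

    @Override
    public String toString() {
        return name + " " + hot;
    }
}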

2⃣️ Global sorting

Requirement: use 3 reducers and produce three result files that are ordered as a whole: numbers with a different number of digits go to different files.

Source data:
93 239 231
23 22 213
613 232 614
213 3939 232
4546 565 613
231 231
2339 231
1613 5656 657
61313 4324 213
613 2 232 32

Target data:

2	1
22	1
23	1
32	1
93	1
213	3
231	4
232	4
239	1
322	1
1613	1
2339	1
3242	1
3613	1

Approach: partition by the number of digits: divide by 10 and use the magnitude to classify each number into one of 3 partitions. Because values can repeat, the map phase works like wordCount: emit 1 as the value and accumulate the counts in the reducer.
Code walkthrough: MyTotalSortMapper.java

package hadoop_test.sort_test_08.totalsort;

import hadoop_test.ConstantPool;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyTotalSortMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, IntWritable, IntWritable>.Context context) throws IOException, InterruptedException {
        String[] nums = value.toString().split(ConstantPool.SPACE_SIGNAL);
        for (String numString:
                nums) {
            int num = Integer.parseInt(numString);
            context.write(new IntWritable(num),new IntWritable(ConstantPool.DEFINE_INT_LENGTH));
        }
    }
}

MyTotalSortPartitioner.java

package hadoop_test.sort_test_08.totalsort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyTotalSortPartitioner extends Partitioner<IntWritable,IntWritable> {
    @Override
    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
        int num = key.get();
        if (num/10<10){
            //two digits or fewer
            return 0;
        }else if (num/10>=10 && num/10<100){
            //three digits
            return 1;
        }else {
            //four digits or more
            return 2;
        }
    }
}

MyTotalSortReducer.java

package hadoop_test.sort_test_08.totalsort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyTotalSortReducer extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Reducer<IntWritable, IntWritable, IntWritable, IntWritable>.Context context) throws IOException, InterruptedException {
        int totalCounts = 0;
        for (IntWritable i :
                values) {
            int count = i.get();
            totalCounts += count;
        }
        context.write(key,new IntWritable(totalCounts));
    }
}

Summary: the core logic is just a numeric wordCount plus partitioning into 3 partitions. The reference answer uses a regular expression for the partition test, while here I divide by 10; same idea. Finally the reducer accumulates the counts.

9. Multi-file merge

Task: compute each person's total score per subject across the three months.
Input data: chinese.txt

[root@master sbin]# hadoop fs -cat /hadoop_test/m_file_test/chinese.txt
23/06/05 10:44:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 lisi 89
2 lisi 73
3 lisi 67
1 zhangyang 49
2 zhangyang 83
3 zhangyang 27
1 lixiao 77
2 lixiao 66
3 lixiao 89

English.txt

[root@master sbin]# hadoop fs -cat /hadoop_test/m_file_test/english.txt
23/06/05 10:45:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 lisi 75
2 lisi 94
3 lisi 100
1 zhangyang 61
2 zhangyang 59
3 zhangyang 98
1 lixiao 25
2 lixiao 47
3 lixiao 48

Math.txt

[root@master sbin]# hadoop fs -cat /hadoop_test/m_file_test/math.txt
23/06/05 10:46:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1 lisi 75
2 lisi 94
3 lisi 100
1 zhangyang 61
2 zhangyang 59
3 zhangyang 98
1 lixiao 25
2 lixiao 47
3 lixiao 48

Approach: use the input split to learn which of the three files each record came from; after that it is read line by line, much like wordCount. Use the person as the key and a Score object carrying the subject score as the value, then accumulate the scores per subject in the reducer.

Code walkthrough: MyMutiFileMergeMapper.java

package hadoop_test.mutil_files_09.score;

import hadoop_test.ConstantPool;
import hadoop_test.mutil_files_09.domain.Score;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyMutiFileMergeMapper extends Mapper<LongWritable, Text,Text, Score> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Score>.Context context) throws IOException, InterruptedException {
        //there are several input files, so first get the input split to learn which file this record came from
        FileSplit split = (FileSplit) context.getInputSplit();

        //still reading line by line
        String line = value.toString();
        String[] elements = line.split(ConstantPool.SPACE_SIGNAL);
        String studentName = elements[1];
        int score = Integer.parseInt(elements[2]);
        //we need each person's score per subject,
        //so take the file name from the split, which tells us the subject
        String fileName = split.getPath().getName();
        //set the field that matches the subject
        Score studentScore = new Score();
        studentScore.setName(studentName);
        if (fileName.contains("english")){
            studentScore.setEnglish(score);
        }else if (fileName.contains("math")){
            studentScore.setMath(score);
        }else if (fileName.contains("chinese")){
            studentScore.setChinese(score);
        }
        context.write(new Text(studentName),studentScore);
    }
}

Code walkthrough: MyMutiFileMergeReducer.java

package hadoop_test.mutil_files_09.score;

import hadoop_test.mutil_files_09.domain.Score;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyMutiFileMergeReducer extends Reducer<Text, Score,Text,Score> {

    @Override
    protected void reduce(Text key, Iterable<Score> values, Reducer<Text, Score, Text, Score>.Context context) throws IOException, InterruptedException {
        int totalChinese = 0;
        int totalMath = 0;
        int totalEnglish = 0;
        //the input looks like zs [score for month 1, score for month 2, ...]
        for (Score tmpScore :
             values) {
            totalChinese += tmpScore.getChinese();
            totalMath += tmpScore.getMath();
            totalEnglish += tmpScore.getEnglish();
        }
        //build a new output value holding the three subject totals
        Score outValue = new Score();
        outValue.setChinese(totalChinese);
        outValue.setMath(totalMath);
        outValue.setEnglish(totalEnglish);
        context.write(key,outValue);
    }
}
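The Score bean itself is not included in these notes. A minimal sketch of what it might look like (my reconstruction, matching the setters used above and the toString format seen in the output below), again implementing Writable:

package hadoop_test.mutil_files_09.domain;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical reconstruction of the Score bean (not shown in these notes). */
public class Score implements Writable {

    private String name = "";
    private int chinese;
    private int math;
    private int english;

    @Override
    public void write(DataOutput out) throws IOException {
        //field order must match readFields()
        out.writeUTF(name);
        out.writeInt(chinese);
        out.writeInt(math);
        out.writeInt(english);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        chinese = in.readInt();
        math = in.readInt();
        english = in.readInt();
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getChinese() { return chinese; }
    public void setChinese(int chinese) { this.chinese = chinese; }
    public int getMath() { return math; }
    public void setMath(int math) { this.math = math; }
    public int getEnglish() { return english; }
    public void setEnglish(int english) { this.english = english; }

    @Override
    public String toString() {
        return "Score [name=" + name + ", chinese=" + chinese + ", english=" + english + ", math=" + math + "]";
    }
}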

Driver: ScoreDriver.java

package hadoop_test.mutil_files_09.score;

import hadoop_test.ConstantPool;
import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import hadoop_test.mutil_files_09.domain.Score;
import org.apache.log4j.BasicConfigurator;


public class ScoreDriver {
	public static void main(String[] args) throws Exception {
		BasicConfigurator.configure();
		Configuration conf=new Configuration();
		Job job=Job.getInstance(conf);

		job.setJarByClass(ScoreDriver.class);

		job.setMapperClass(MyMutiFileMergeMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Score.class);


		job.setReducerClass(MyMutiFileMergeReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Score.class);
		

		FileInputFormat.setInputPaths(job,new Path(ConstantPool.MUTIFILE_READ_HDFS_PATH));

		if( Utils_hadoop.testExist(conf,ConstantPool.MUTIFILE_WRITE_HDFS_PATH)){
			Utils_hadoop.rmDir(conf,ConstantPool.MUTIFILE_WRITE_HDFS_PATH);}

		FileOutputFormat.setOutputPath(job,new Path(ConstantPool.MUTIFILE_WRITE_HDFS_PATH));
		
		job.waitForCompletion(true);
	}
}

The first run failed with:

Caused by: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit;

A ClassCastException means the wrong class was imported; checking the FileSplit import and changing it to:

org.apache.hadoop.mapreduce.lib.input.FileSplit

made the job run. Execution result:

[root@master sbin]# hadoop fs -cat /hadoop_test/m_file_test/result/part-r-00000
23/06/05 11:47:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
lisi	Score [name=null, chinese=229, english=269, math=269]
lixiao	Score [name=null, chinese=232, english=120, math=120]
zhangyang	Score [name=null, chinese=159, english=218, math=218]

Notice name=null: I had forgotten to call setName in the reducer. After adding the following line:

outValue.setName(key.toString());

it runs correctly:

[root@master sbin]# hadoop fs -cat /hadoop_test/m_file_test/result/part-r-00000
23/06/05 11:50:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
lisi	Score [name=lisi, chinese=229, english=269, math=269]
lixiao	Score [name=lixiao, chinese=232, english=120, math=120]
zhangyang	Score [name=zhangyang, chinese=159, english=218, math=218]

Summary: the problem itself is not hard, yet I still made plenty of mistakes. The idea of using an object as the output value did not come naturally at first, and several small errors had to be fixed along the way; only practice makes using objects in MapReduce feel natural.

10. Multi-stage MR

Requirement: compute each company's profit for the quarter (total income minus total expenses, i.e. column 2 minus column 3) and sort the companies by profit. (Wrap the data in a Company class.)
Data:

1 apple 1520 100
2 apple 3421 254
3 apple 4500 364
1 huawei 3700 254
2 huawei 2700 354
3 huawei 5700 554
1 xiaomi 3521 254
2 xiaomi 3123 354
3 xiaomi 3412 554

Approach: since this is a multi-stage MR, the first job computes income minus expenses and the second job sorts the result, so the second job has only a mapper and no reducer. The first job's output key is the company name and its value is elements[2] - elements[3] wrapped in a Company; the second job's map output key is the Company itself (which defines the sort order) with a NullWritable value.
Code walkthrough: MyFirstMrMapper.java

package hadoop_test.mutil_mr_10.mr1;

import hadoop_test.ConstantPool;
import hadoop_test.mutil_mr_10.company.Company;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyFirstMrMapper extends Mapper<LongWritable, Text,Text, Company> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Company>.Context context) throws IOException, InterruptedException {
        //1 apple 1520 100
        String[] elements = value.toString().split(ConstantPool.SPACE_SIGNAL);
        //written as a single line; just add an extra constructor to the Company entity class
        context.write(new Text(elements[1]),new Company(elements[1],Integer.parseInt(elements[2])-Integer.parseInt(elements[3])));
    }
}

MyFirstMrReducer.java

package hadoop_test.mutil_mr_10.mr1;

import hadoop_test.mutil_mr_10.company.Company;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyFirstMrReducer extends Reducer<Text, Company,Text,Company> {
    @Override
    protected void reduce(Text key, Iterable<Company> values, Reducer<Text, Company, Text, Company>.Context context) throws IOException, InterruptedException {
        //the input looks like apple [Company1,Company2,Company3,...]
        Company company = new Company();
        for (Company co :
                values) {
            company.setProfit(company.getProfit()+co.getProfit());
        }
        company.setName(key.toString());
        context.write(key,company);
    }
}

Result of the first MR job:

apple	apple 8723
huawei	huawei 10938
xiaomi	xiaomi 8894

Quick note: the name shows up twice. On reflection, the Company entity already carries the name, so there is no need to also emit the name as the reducer output key; emitting just the Company is enough. Revised reducer logic:

package hadoop_test.mutil_mr_10.mr1;

import hadoop_test.mutil_mr_10.company.Company;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyFirstMrReducer extends Reducer<Text, Company,Company, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Company> values, Reducer<Text, Company, Company, NullWritable>.Context context) throws IOException, InterruptedException {
        //the input looks like apple [Company1,Company2,Company3,...]
        Company company = new Company();
        for (Company co :
                values) {
            company.setProfit(company.getProfit()+co.getProfit());
        }
        company.setName(key.toString());
        context.write(company,NullWritable.get());
    }
}

Result of the revised first job:

apple 8723
huawei 10938
xiaomi 8894

MySecondMrMapper.java

package hadoop_test.mutil_mr_10.mr2;

import hadoop_test.ConstantPool;
import hadoop_test.mutil_mr_10.company.Company;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MySecondMrMapper extends Mapper<LongWritable, Text,Company,NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Company, NullWritable>.Context context) throws IOException, InterruptedException {
        String[] elements = value.toString().split(ConstantPool.SPACE_SIGNAL);
        //the Company entity overrides compareTo, sorting in ascending order of profit
        Company company = new Company(elements[0],Integer.parseInt(elements[1]));
        context.write(company,NullWritable.get());
    }
}

Result of the second MR job:

[root@master sbin]# hadoop fs -cat /hadoop_test/muti_mr/result1/part-r-00000
23/06/06 11:11:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
apple 8723
xiaomi 8894
huawei 10938

Summary: two relatively simple MR jobs: the second one just reads the first job's output and emits it as a self-sorting key, and the first one is a simple accumulation.
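The two drivers are not shown in these notes. Below is a sketch of how the two jobs could be chained from a single driver class; the driver class name, the input path and the intermediate output path are my assumptions, while the final output path matches the listing above:

package hadoop_test.mutil_mr_10;

import hadoop_test.mutil_mr_10.company.Company;
import hadoop_test.mutil_mr_10.mr1.MyFirstMrMapper;
import hadoop_test.mutil_mr_10.mr1.MyFirstMrReducer;
import hadoop_test.mutil_mr_10.mr2.MySecondMrMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch of chaining the two jobs; the paths marked below are assumptions. */
public class ProfitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        //job 1: sum up each company's quarterly profit
        Job job1 = Job.getInstance(conf, "profit-sum");
        job1.setJarByClass(ProfitDriver.class);
        job1.setMapperClass(MyFirstMrMapper.class);
        job1.setReducerClass(MyFirstMrReducer.class);
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(Company.class);
        job1.setOutputKeyClass(Company.class);
        job1.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job1, new Path("/hadoop_test/muti_mr/profit.txt")); //assumed input path
        FileOutputFormat.setOutputPath(job1, new Path("/hadoop_test/muti_mr/result"));    //assumed intermediate path

        //job 2 runs only after job 1 has finished, reading job 1's output
        if (job1.waitForCompletion(true)) {
            Job job2 = Job.getInstance(conf, "profit-sort");
            job2.setJarByClass(ProfitDriver.class);
            job2.setMapperClass(MySecondMrMapper.class);
            job2.setNumReduceTasks(1); //a single reduce task keeps the output globally ordered
            job2.setMapOutputKeyClass(Company.class);
            job2.setMapOutputValueClass(NullWritable.class);
            job2.setOutputKeyClass(Company.class);
            job2.setOutputValueClass(NullWritable.class);
            FileInputFormat.setInputPaths(job2, new Path("/hadoop_test/muti_mr/result"));
            FileOutputFormat.setOutputPath(job2, new Path("/hadoop_test/muti_mr/result1")); //matches the listing above
            job2.waitForCompletion(true);
        }
    }
}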

Assignment 2: sampling 1% of the data with MR. (My own assignment; written out once.)

Approach: a mapper only; draw a random number per line and keep the line with a probability of 1%.
Code walkthrough: SamplingMapper.java

package hadoop_test.homework.samplingMR;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.Random;

public class SamplingMapper extends Mapper<LongWritable, Text,Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        //for a 1% sample, keep one value out of every 100
        Random random = new Random();
        int randomNum = random.nextInt(100);
        //nextInt(100) returns 0..99, so matching a single fixed value keeps a line with probability 1/100, i.e. 1%
        if (randomNum == 55){
            context.write(new Text(line), NullWritable.get());
        }
    }
}

The map output key here could also be an entity class, but each line is a simple string, so emitting the string directly is enough. The source has only 500 records, so the number of sampled lines can vary by a few between runs.
SamplingDriver.java

package hadoop_test.homework.samplingMR;


import hadoop_test.ConstantPool;
import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SamplingDriver {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //program entry point: the driver class
        job.setJarByClass(SamplingDriver.class);
        //set the mapper class (the class itself; the framework creates the instances)

        job.setMapperClass(SamplingMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

//        input file
        FileInputFormat.setInputPaths(job, new Path(ConstantPool.SAMPLING_READ_HDFS_PATH));

        if( Utils_hadoop.testExist(conf,ConstantPool.SAMPLING_WRITE_HDFS_PATH)){
            Utils_hadoop.rmDir(conf,ConstantPool.SAMPLING_WRITE_HDFS_PATH);}
        FileOutputFormat.setOutputPath(job, new Path(ConstantPool.SAMPLING_WRITE_HDFS_PATH));
        job.waitForCompletion(true);

    }
}

Execution result:

[root@master sbin]# hadoop fs -cat /hadoop_test/kmeans/sampling_result/part-r-00000
23/06/07 11:36:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
129,0.07810467465476313,0.035086550821436635,0.10014389508505363
165,0.5581375137631848,0.5611149534294716,0.5430448576160145
349,0.34434359364366407,0.29923101822132825,0.3542904620243115
383,0.3927649119423967,0.33808120477246356,0.30338018631826075
449,0.39016406254789027,0.3141077430960846,0.3687141672454513

[root@master sbin]# hadoop fs -cat /hadoop_test/kmeans/sampling_result/part-r-00000
23/06/07 11:37:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11,0.5483598064494786,0.5620634207706902,0.631409127357256
21,0.09710635332819276,0.12199420041262976,0.07594396396920762
381,0.11216857653153556,0.08255759703977189,0.09662167647547482
403,0.8208425124598153,0.7662913927455874,0.754585977824923
421,0.3455624309172101,0.3251024034449985,0.33090481763678936
433,0.6282458003031106,0.573718908890204,0.5997096630982724
65,0.8472924993766326,0.9142800468315485,0.7994335290433046
76,0.5997964273824741,0.5381577086398976,0.630918833446788

Summary: I ran the job twice; because the source data set is small, the two results differ. This is essentially how initial cluster centers will later be chosen at random for the clustering work; this example lays the groundwork for that.

Assignment 3: MapReduce implementation. Draw k (4) groups of data, each containing several records, then average each group to obtain the initial cluster centers. (Advanced)

Approach: similar to the previous assignment, but here the data is drawn into 4 groups and a mean is computed per group. So two MapReduce jobs are needed: the first map randomly keeps about 1/1000 of the records, the first reduce assigns each kept record to a group and prepends the group number, and the second job reads the grouped records and computes the averages.
On further thought, the grouping in the first job can be done in a combiner, which tidies up the code.
Mapper code walkthrough:

package hadoop_test.homework.randomFourGroupCenter;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.Random;

/**
 * Assignment 3: MapReduce implementation. Draw k (4) groups of data, each containing several records, then average each group to obtain the initial cluster centers. (Advanced)
 */
public class RandomFourGroupCenterMapper extends Mapper<LongWritable, Text,IntWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, IntWritable, Text>.Context context) throws IOException, InterruptedException {

        String line = value.toString();
        Random random = new Random();
        int randomNum = random.nextInt(5);
        //if an id repeats it is not added again; this could be refined by also checking whether the longitude/latitude repeat, since different ids can share the same coordinates
        if (randomNum == 1){
            context.write(new IntWritable(1),new Text(line));
        }
    }
}

Combiner code walkthrough:

package hadoop_test.homework.randomFourGroupCenter;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Random;

public class RandomFourGroupCenterCombiner extends Reducer<IntWritable, Text,IntWritable,Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Reducer<IntWritable, Text, IntWritable, Text>.Context context) throws IOException, InterruptedException {
        //the input looks like 1 [line1, line2, ....]
        Random random = new Random();
        //if an id repeats it is not added again; this could be refined by also checking whether the longitude/latitude repeat, since different ids can share the same coordinates
        //split into 4 groups
        for (Text text :
                values) {
            int randomNum = random.nextInt(4);
            String outValue = text.toString();
            if (randomNum == 1){

                outValue = "1,"+outValue;
            }else if (randomNum == 2){
                outValue = "2,"+outValue;
            }else if (randomNum == 3){
                outValue = "3,"+outValue;
            }else {
                outValue = "0,"+outValue;
            }
            context.write(new IntWritable(randomNum), new Text(outValue));
        }
    }
}

Execution result of just the combiner stage (I sampled about 100 records, so not everything is shown):

[root@master sbin]# hadoop fs -cat /hadoop_test/kmeans/group_result/part-r-00000
23/06/09 23:29:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2	2,500,0.628771643246236,0.608510619041765,0.5600686767849633
1	1,491,0.36805326068405414,0.2825814979914519,0.3319435714163698
0	0,486,0.4888675165647318,0.6361997201357025,0.589398070533413
2	2,471,0.7881827523793243,0.775235091269579,0.8400595559547913
2	2,459,0.36320529525841216,0.43707509026861396,0.28433249420421575
2	2,435,0.29446178836108194,0.2437057565841491,0.35190751209921683
0	0,433,0.6282458003031106,0.573718908890204,0.5997096630982724
3	3,426,0.36457281848428613,0.3309687014551464,0.33748480625841804
2	2,406,0.09617439342151708,0.08590318721344563,0.11924415611783382
3	3,401,0.6941980482073766,0.6179075325855481,0.6374617502330874
3	3,392,0.09045152526963907,0.08923917388938635,0.03857527690877679
3	3,389,0.6130474005203217,0.640628985315001,0.49245367699068365
...
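The second job, which averages each group, is not included in these notes. A minimal sketch of what its reducer might look like, assuming a trivial second-stage mapper that re-emits each line keyed by its group number, and that each value line has the form group,id,x,y,z as in the output above:

package hadoop_test.homework.randomFourGroupCenter;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Hypothetical second-stage reducer (not in the original notes): averages the three
 * feature columns of every record in a group to produce one initial cluster center.
 */
public class GroupMeanReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sumX = 0, sumY = 0, sumZ = 0;
        int count = 0;
        for (Text value : values) {
            //each line looks like: group,id,x,y,z
            String[] fields = value.toString().split(",");
            sumX += Double.parseDouble(fields[2]);
            sumY += Double.parseDouble(fields[3]);
            sumZ += Double.parseDouble(fields[4]);
            count++;
        }
        if (count > 0) {
            //emit the mean vector of this group as its cluster center
            String center = (sumX / count) + "," + (sumY / count) + "," + (sumZ / count);
            context.write(key, new Text(center));
        }
    }
}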

Assignment 4: the last page of the Hadoop case-study slides; then, based on the clustering result, produce a regional top-10 popularity recommendation. (The hardest part of the project.) There is a bug that needs to be found.

First, how the data is structured:

id: the row number in the data set, from 1 to 11376681.
target: whether the user clicked the video; 1 = clicked, 0 = not clicked.
timestamp: the timestamp at which the user clicked the video; NULL if not clicked.
deviceid: the user's device id.
newsid: the video id.
guid: the user's registration id.
pos: the recommendation slot of the video.
app_version: the app version.
device_vendor: the device manufacturer.
netmodel: the network type.
osversion: the operating system version.
lng: longitude.
lat: latitude.
device_version: the device model/version.
ts: the timestamp at which the video was exposed (shown) to the user.

And how it is stored in the file:

[root@master ~]# cd /usr/local/hadoop_test/homework2
[root@master homework2]# cat sample_train.csv | head -10
23,0,,9a887c7be5401571603f912eb3ba172f,2.79908E+18,b04dfb77636b0e593f54b08d9eda0c5f,0,2.1.5,HUAWEI,o,9,5e-324,5e-324,FLA-AL20,1.5734E+12
51,0,,9a887c7be5401571603f912eb3ba172f,6.98551E+18,b04dfb77636b0e593f54b08d9eda0c5f,5,2.1.5,HUAWEI,o,9,5e-324,5e-324,FLA-AL20,1.57331E+12
1. Count the number of items each user clicked, and each user's click-through rate (clicks / impressions)

Approach: the number of items clicked is the number of records with target=1, and the click rate is the target=1 count divided by all impressions (target=0 plus target=1). The map output key is the user id (guid) and the value is target; the reduce output key is the guid and the value is "click count_click rate".
Mapper code walkthrough:

package hadoop_test.homework.hits;

import hadoop_test.ConstantPool;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class UserHitsMapper extends Mapper<LongWritable, Text,Text,Text> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        String[] elements = value.toString().split(ConstantPool.ENGLISH_COMMA_SIGNAL);
        //user id
        String guid = elements[5];
        //clicked or not      1: yes  0: no
        String target = elements[1];
        context.write(new Text(guid),new Text(target));
    }
}

Reducer code walkthrough:

package hadoop_test.homework.hits;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class UserHitsReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        //the reducer receives e.g. user1 [1,1,0,1,1]
        int total = 0;
        int click = 0;
        for (Text t :
                values) {
            String target = t.toString();
            total += 1;
            if ("1".equals(target)){
                //the video was clicked
                click += 1;
            }
        }
        double hits = (double)click/total;
        String  outValue = String.format("%.2f",hits);
        context.write(key,new Text(total+"_"+outValue));
    }

    public static void main(String[] args) {
        //quick local sanity check of the click-rate formula
        int total = 1;
        int click = 0;
        double hits = (double)click/total;
        System.out.println(hits);
    }
}

Part of the execution result:

fff67299c66b5fc54dc8a7981d0f4c63	68_0.00
fff75b7680f66f1641775313a09ebd57	9_0.11
fff7afc5305fa531aa81681602edfb48	5_0.00
fff866213ce7c58de08155f143dcdccd	5_0.00
fffb9ec12b793333bd7cfcce560c9b07	3_0.33
fffda083760eb42351312e86f8d64b9c	4_0.00
fffe0b7bb92faf8f56e7853abfd89016	1_0.00
....

Summary: similar to the earlier averaging example, just in a different business setting; simple enough that it needs no further discussion.

2. Find the ids of the 10 videos each user watched most recently.

Approach: watching means clicking, so we want the 10 largest timestamps among target=1 records together with their newsid. In the map phase the output key is guid and the value carries the newsid and the click timestamp; in the reduce phase the output key is still guid, the values are sorted by timestamp in descending order, and the ids of the 10 most recent videos are kept as the output.
Mapper code walkthrough:

package hadoop_test.homework.recentWatch;

import hadoop_test.ConstantPool;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

import static hadoop_test.ConstantPool.STRING_ONE;
import static hadoop_test.Data_utils.numberFormat;

/**
 * @author liqinglin
 */
public class RecentWatchMapper extends Mapper<LongWritable, Text,Text,Text> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        String[] elements = value.toString().split(ConstantPool.ENGLISH_COMMA_SIGNAL);
        //user id
        String guid = elements[5];
        //whether the record was clicked      1: yes  0: no
        String target = elements[1];
        //video id
        String newsId = elements[4];
        //click timestamp of the video
        String timeStamp = elements[2];

        //convert scientific notation into a plain String
        timeStamp = numberFormat(timeStamp);
        newsId = numberFormat(newsId);

        if (STRING_ONE.equals(target)){
            context.write(new Text(guid),new Text(newsId+"_"+timeStamp));
        }
    }
}

Reducer code analysis:

package hadoop_test.homework.recentWatch;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

import static hadoop_test.Data_utils.numberFormat;

/**
 * @author liqinglin
 */
public class RecentWatchReducer extends Reducer<Text, Text, UserWatchNewsLog, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, UserWatchNewsLog, NullWritable>.Context context) throws IOException, InterruptedException {
        //the values received by reduce look like: user1 [newsId1_112332434,newsId2_123445454,newsId3_1123454565974.....]
        int count = 0;
        for (Text t :
                values) {
            count += 1;
            String[] newsIdAndTimeStamps = t.toString().split("_");
            String timeStampString = newsIdAndTimeStamps[1];
            String newsId = newsIdAndTimeStamps[0];

            //convert scientific notation into a plain String
            timeStampString = numberFormat(timeStampString);
            newsId = numberFormat(newsId);

            long timeStamp = Long.parseLong(timeStampString);
            String guid = key.toString();
            UserWatchNewsLog user = new UserWatchNewsLog(guid,newsId,timeStamp);
            context.write(user, NullWritable.get());
        }
    }
}

Execution results:

UserWatchNewsLog{guid='ffe95562f720835e85083e4c67975f32', timeStamp=1573300000000, newsId='4464960000000000000'}
UserWatchNewsLog{guid='fff3741aca586fa069c771947f1dffed', timeStamp=1573180000000, newsId='1195930000000000000'}
UserWatchNewsLog{guid='fff58a9eedd5b038cabe25fdb3d0ce9b', timeStamp=1573380000000, newsId='506065000000000000'}
UserWatchNewsLog{guid='fff75b7680f66f1641775313a09ebd57', timeStamp=1573390000000, newsId='3188910000000000000'}
UserWatchNewsLog{guid='fffb9ec12b793333bd7cfcce560c9b07', timeStamp=1573340000000, newsId='24548000000000000'}
....

Summary: the first attempt used an entity class to implement the sorting, but ran into the fact that compareTo returns an int while the timestamp is a long; on top of that, converting the scientific-notation values loses precision, so the result still has some flaws (a sketch of a workaround follows below).
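
For reference on the compareTo issue: compareTo (and a Comparator) only has to return an int, so long timestamps can still be compared with Long.compare. Below is a minimal sketch of a reducer that sorts each user's clicks by timestamp in descending order and keeps the 10 most recent; the class name RecentWatchTopTenReducer is hypothetical, and it assumes the newsId_timestamp value format emitted by the mapper above.

package hadoop_test.homework.recentWatch;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: per user, sort the clicked videos by timestamp (descending) and keep the 10 most recent.
 */
public class RecentWatchTopTenReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String[]> watched = new ArrayList<>();
        for (Text t : values) {
            //[0] = newsId, [1] = timestamp (already converted out of scientific notation by the mapper)
            watched.add(t.toString().split("_"));
        }
        //Long.compare handles long timestamps even though the comparator itself returns an int
        watched.sort((a, b) -> Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1])));
        //keep only the 10 most recent videos per user
        int limit = Math.min(10, watched.size());
        for (int i = 0; i < limit; i++) {
            context.write(key, new Text(watched.get(i)[0]));
        }
    }
}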

3. Count the number of videos exposed to each user.

Approach: we need the number of exposed videos, not the number of exposures, so the video ids have to be deduplicated. In the mapper the keyout is guid and the valueout is newsId.

Mapper analysis:

package hadoop_test.homework.expNewsNum;

import hadoop_test.ConstantPool;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class ExpNewsNumMapper extends Mapper<LongWritable, Text,Text,Text> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        String[] elements = value.toString().split(ConstantPool.ENGLISH_COMMA_SIGNAL);
        //user id
        String guid = elements[5];
        //video id
        String newsId = elements[4];
        context.write(new Text(guid),new Text(newsId));
    }
}

Reducer analysis:

package hadoop_test.homework.expNewsNum;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashSet;

/**
 * @author liqinglin
 */
public class ExpNewsNumReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        HashSet<String> set = new HashSet<>();
        //the values received by reduce look like: user1 [newsID1,newsID2,newsID3....]
        for (Text t :
                values) {
            String newsId = t.toString();
            set.add(newsId);
        }
        //iterate the HashSet, i.e. the deduplicated newsIds
        for (String newsId :
                set) {
            context.write(key,new Text(newsId));
        }
    }
}

Sampled partial execution results:

fff866213ce7c58de08155f143dcdccd	3.78391E+18
fff866213ce7c58de08155f143dcdccd	8.33254E+18
fff866213ce7c58de08155f143dcdccd	3.74766E+18
fff866213ce7c58de08155f143dcdccd	8.58659E+18
fff866213ce7c58de08155f143dcdccd	4.35661E+18
fffb9ec12b793333bd7cfcce560c9b07	2.4548E+16
fffb9ec12b793333bd7cfcce560c9b07	3.63266E+18
fffb9ec12b793333bd7cfcce560c9b07	2.07164E+18
fffda083760eb42351312e86f8d64b9c	5.72411E+18
fffda083760eb42351312e86f8d64b9c	5.73286E+18
fffda083760eb42351312e86f8d64b9c	4.96379E+18
fffda083760eb42351312e86f8d64b9c	5.34932E+18
fffe0b7bb92faf8f56e7853abfd89016	6.8087E+18
...
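
The task statement asks for the number of exposed videos per user, while the reducer above emits every distinct video id. If only the count is wanted, a small variant like the sketch below (the class name ExpNewsCountReducer is made up for illustration) could write the set size instead:

package hadoop_test.homework.expNewsNum;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashSet;

/**
 * Sketch of a variant that emits only the distinct-video count per user.
 */
public class ExpNewsCountReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> set = new HashSet<>();
        for (Text t : values) {
            set.add(t.toString());
        }
        //the set size is the number of distinct videos exposed to this user
        context.write(key, new Text(String.valueOf(set.size())));
    }
}
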
4. Find the device each user uses most often (watches the most videos on).

Approach: we want, for each user, the device with the most watched videos, so in the first step the map outkey is guid and the outvalue is the deviceid; the reduce then aggregates by deviceid and picks the device that appears most often.

Mapper code analysis:

package hadoop_test.homework.mostDevice;

import hadoop_test.ConstantPool;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class MostDeviceMapper extends Mapper<LongWritable, Text,Text,Text> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        String[] elements = value.toString().split(ConstantPool.ENGLISH_COMMA_SIGNAL);
        //user id
        String guid = elements[5];
        //device id
        String deviceId = elements[3];
        //video id
        String newsId = elements[4];
        context.write(new Text(guid),new Text( deviceId+"_"+newsId));
    }
}

Combiner code analysis:

package hadoop_test.homework.mostDevice;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashSet;

/**
 * @author liqinglin
 */
public class MostDeviceCombiner extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
        //the values received by the combiner look like: user1 [deviceId1_newsId1,deviceId2_newsId2,deviceId3_newsId3....]
        HashSet<String> set = new HashSet<>();
        int total = 0;
        for (Text t :
                values) {
            total += 1;
            String[] deviceIdAndNewsId = t.toString().split("_");
            String deviceId = deviceIdAndNewsId[0];
            set.add(deviceId);
        }
        for (String deviceId :
                set) {
            context.write(new Text(key),new Text(deviceId+"_"+total));
        }
    }
}

Partial combiner execution results:

fff3741aca586fa069c771947f1dffed	0156d612f6c44a5225d86b3d93573ad6_9
fff3a4cca566dc9c840672af66c1beb2	dea9ce182c6fdc1fadddd1cc045702cd_13
fff58a9eedd5b038cabe25fdb3d0ce9b	2a9a11da191bd95c17d373905d495a93_2
fff5b5ebb644ad73ecf931053a5b31ee	884cc8e526cb7743b937a3ce9045b1ec_1
fff67299c66b5fc54dc8a7981d0f4c63	afbd3c40d2d397b5327428f63e48dd6e_68
fff75b7680f66f1641775313a09ebd57	69ae0d83e5f59728bb0c20eb08ade50a_9
fff7afc5305fa531aa81681602edfb48	feacf51fa54a61327506c85fbe9c1fcc_5
fff866213ce7c58de08155f143dcdccd	5fda8c5ae615c844a2673afb3098cebc_5
fffb9ec12b793333bd7cfcce560c9b07	e49364babe10eb5e503ff547a7d71cb5_3
fffda083760eb42351312e86f8d64b9c	1f257188f4e159744bd5651028fb8dee_4
fffe0b7bb92faf8f56e7853abfd89016	43f6b791192ec7c231657e0da0ae4565_1
...

Reducer code analysis:

package hadoop_test.homework.mostDevice;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class MostDeviceReducer extends Reducer<Text, Text, Text, Device> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Device>.Context context) throws IOException, InterruptedException {
        //the values received by the reducer look like: user1  [device1_12,device2_22,device3_14....]
        Device newDevice = new Device();
        for (Text t : values) {
            String[] deviceIdAndCount = t.toString().split("_");
            String deviceId = deviceIdAndCount[0];
            int count = Integer.parseInt(deviceIdAndCount[1]);
            context.write(key,new Device(deviceId,count));
            if (newDevice.getCount() < count){
                newDevice.setDeviceId(deviceId);
                newDevice.setCount(count);
            }
        }
        context.write(key,newDevice);
    }
}

Execution results:

fffb9ec12b793333bd7cfcce560c9b07	Device{deviceId='e49364babe10eb5e503ff547a7d71cb5', count=3}
fffb9ec12b793333bd7cfcce560c9b07	Device{deviceId='e49364babe10eb5e503ff547a7d71cb5', count=3}
fffda083760eb42351312e86f8d64b9c	Device{deviceId='1f257188f4e159744bd5651028fb8dee', count=2}
fffda083760eb42351312e86f8d64b9c	Device{deviceId='1f257188f4e159744bd5651028fb8dee', count=2}
fffda083760eb42351312e86f8d64b9c	Device{deviceId='1f257188f4e159744bd5651028fb8dee', count=2}
fffe0b7bb92faf8f56e7853abfd89016	Device{deviceId='43f6b791192ec7c231657e0da0ae4565', count=1}
fffe0b7bb92faf8f56e7853abfd89016	Device{deviceId='43f6b791192ec7c231657e0da0ae4565', count=1}
...

There was always duplicate data in the output. The combiner output was correct, so how could the reduce possibly produce duplicates? After puzzling over it and carefully re-reading the code, I found an extra line inside the for loop:

context.write(key,new Device(deviceId,count));

I almost cried: more than an hour of debugging and this was the cause. The only conclusion is that you really have to write code carefully.
Execution results after removing that line:

fff67299c66b5fc54dc8a7981d0f4c63	Device{deviceId='afbd3c40d2d397b5327428f63e48dd6e', count=58}
fff75b7680f66f1641775313a09ebd57	Device{deviceId='69ae0d83e5f59728bb0c20eb08ade50a', count=7}
fff7afc5305fa531aa81681602edfb48	Device{deviceId='feacf51fa54a61327506c85fbe9c1fcc', count=4}
fff866213ce7c58de08155f143dcdccd	Device{deviceId='5fda8c5ae615c844a2673afb3098cebc', count=5}
fffb9ec12b793333bd7cfcce560c9b07	Device{deviceId='e49364babe10eb5e503ff547a7d71cb5', count=3}
fffda083760eb42351312e86f8d64b9c	Device{deviceId='1f257188f4e159744bd5651028fb8dee', count=2}
fffe0b7bb92faf8f56e7853abfd89016	Device{deviceId='43f6b791192ec7c231657e0da0ae4565', count=1}
...

Summary: the Device entity class is optional here; writing out just the deviceId would also satisfy the task, since it only asks for the device, and I simply added a count on top. The most important part is still the neat way the maximum is tracked.

5. Cluster users into regions based on their longitude and latitude.

Approach: run a clustering algorithm on the longitude/latitude values; it is essentially the same clustering covered in class, roughly as follows:

  • 1⃣️ First normalize or standardize the data; since longitude and latitude are on the same scale here, no normalization or standardization is needed.
  • 2⃣️ Then choose the initial cluster centers; they should preferably not be picked completely at random, otherwise many more center updates will be needed later. Ideally sample n groups of candidate centers and average them to form the initial centers.
  • 3⃣️ Then choose a distance measure: Euclidean distance, cosine distance, Jaccard distance or Manhattan distance (see the sketch after this list).
  • 4⃣️ Then iteratively update the cluster centers.
  • 5⃣️ Finally choose a stopping condition; usually there are two: exceeding a maximum number of iterations, or the distance falling below something like 0.01 (or 0.1), with the exact threshold depending on the situation.
  • 6⃣️ At this point the final cluster centers have been computed, and the last step assigns every original user record to a region/cluster.
  • 7⃣️ Summary: the point of clustering the raw user data into regions is to support region-based popularity recommendation later, i.e. it ties back to the business goal.
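
As a reference for step 3 above, here is a small illustrative sketch of the Euclidean, Manhattan and cosine measures; it is not part of the original homework classes, and the package and class name are made up.

package hadoop_test.homework_liqinglin.homework04;

/**
 * Illustrative distance/similarity measures for k-means style clustering.
 * For longitude/latitude pairs the vectors simply have length 2.
 */
public class DistanceSketch {

    /** Euclidean distance: square root of the sum of squared coordinate differences. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.pow(a[i] - b[i], 2);
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance: sum of absolute coordinate differences. */
    public static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Cosine similarity: dot product divided by the product of the vector norms. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}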

Initial analysis: the mapper that initializes the cluster centers should do a coarse-grained extraction; here a random number is used to sample roughly 1/1000 of the records, so the map outkey can simply be "1" and the outvalue is the longitude/latitude of each sampled record. The reduce stage then does the fine-grained extraction: given the desired number of centers, it randomly picks that many coordinates from the sample.
InitRandomCenter code analysis:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import hadoop_test.ConstantPool;
import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.*;

/**
 * @author liqinglin
 */
public class InitRandomCenter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs={"/hadoop_test/homework2/sample_train.csv","/hadoop_test/homework2/old_center","10"};

        if( Utils_hadoop.testExist(conf,otherArgs[1])){
            Utils_hadoop.rmDir(conf,otherArgs[1]);}

        conf.set("ClusterNum", otherArgs[2]);
        FileSystem fs = FileSystem.get(conf);
        Path centerPath = new Path(otherArgs[1]);
        fs.deleteOnExit(centerPath);

        Job job = new Job(conf, "InitRandomCenter");
        job.setJarByClass(InitRandomCenter.class);
        job.setMapperClass(InitMap.class);
        job.setReducerClass(InitReduce.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        //set the output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1);

        //set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        if( Utils_hadoop.testExist(conf, otherArgs[1])){
            Utils_hadoop.rmDir(conf,otherArgs[1]);}
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.waitForCompletion(true);
    }

    static class InitMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            String[] lines = value.toString().split(",");
            //extract longitude and latitude
            String values=lines[11]+","+lines[12];
            int rand=new Random().nextInt(1000);
            //sample roughly 1/1000 of the data: the probability of hitting any one fixed value in [0,1000) is 1/1000
            if (rand== ConstantPool.DEFINE_INT_LENGTH) {
                //output for the initial cluster-center file, e.g. kmeans,[1,3,2,4]
                context.write(new Text("1"), new Text(values));
            }
        }
    }

    static class InitReduce extends Reducer<Text, Text, Text, NullWritable> {
        int clusterNum;
        //avoid duplicate indices being picked as initial cluster centers
        HashSet<Integer> set = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            clusterNum = Integer.parseInt(conf.get("ClusterNum"));
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            //the values received look like: 1 [0.12322,0.1234,1.3456,1.3455,....]
            //since 1/1000 of the data was sampled, roughly a thousand lng/lat pairs remain here
            List<String> arr = new ArrayList<>();
            for (Text text:
                 values) {
                arr.add(text.toString());
            }
            while (set.size() < clusterNum){
                Random random = new Random();
                int randomNum = random.nextInt(arr.size());
                set.add(randomNum);
            }
            for (int index :
                    set) {
                String lngAndLat = arr.get(index);
                context.write(new Text(index+"\t"+lngAndLat),NullWritable.get());
            }
        }
    }
}

Init code execution results:

[root@master sbin]# hadoop fs -cat /hadoop_test/homework2/old_center/part-r-00000
23/06/12 02:33:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
240		112.907562,28.685405
711		111.129897,23.368204
200		5e-324,5e-324
714		109.584484,25.900079
459		122.968479,41.106497
909		5e-324,5e-324
575		106.500008,29.468042
239		108.716038,19.041799
175		114.074874,22.902675
543		107.051585,33.079186

Initial summary: looking at the raw data, it contains a lot of values in scientific notation, which is not handled further here. There is still plenty of room for optimization: for one, the longitude/latitude pairs could be deduplicated; for another, n groups of candidate centers could be sampled and averaged, which would give more accurate initial centers (a small sketch of the deduplication idea follows).
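
For the deduplication improvement mentioned above, one possible change (a sketch only, not what the code above does) is to collect the sampled coordinates into a set before the random selection in InitReduce; java.util.* is already imported there, so only the collection step changes:

            //collect the sampled coordinates into a LinkedHashSet first, so duplicate
            //lng,lat pairs cannot be picked as two different initial centers
            Set<String> distinct = new LinkedHashSet<>();
            for (Text text : values) {
                distinct.add(text.toString());
            }
            List<String> arr = new ArrayList<>(distinct);

The rest of the reduce method (the random index selection over arr) stays the same.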

Second-step approach: load the old cluster centers, then compute and compare the distance between each source record's longitude/latitude and every old center to decide which center the record belongs to. So the mapper's setup loads the old centers into a list; in map, the outkey is the old center's key and the outvalue is the record (its coordinates) assigned to that center.
Mapper code analysis:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @author liqinglin
 */
public class KmeansMapper extends Mapper<LongWritable, Text, Text, Text> {
    List<ArrayList<String>> centers;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        //read all old cluster centers from oldCenter and load them into the centers list
        centers= Util.getCenterFile(DataSource.old_center+"/part-r-00000");

    }
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //read every record from the data source
        String[] data = value.toString().split(",");

        //temporarily store the fields we need from the record
        List<String> tmp = new ArrayList<>();
        //index
        tmp.add(data[0]);
        //lng
        tmp.add(data[11]);
        //lat
        tmp.add(data[12]);

        //the cluster this record is assigned to; filled in below
        String outKey="" ;
        //tracks the minimum distance seen so far
        double minDist = Double.MAX_VALUE;
        //the outer loop iterates over all cluster centers
        for (int i = 0; i < centers.size(); i++) {
            double dist = 0;
            //position 0 of each center is the cluster_id
            //the inner loop computes the distance between this sample and one cluster center
            for(int j=1;j<centers.get(i).size();j++){
                //j starts at 1 because centers.get(i).get(0) is the cluster id and tmp.get(0) is the row index, neither takes part in the distance
                double sourceDist=Double.parseDouble(tmp.get(j));
                double centerDist=Double.parseDouble(centers.get(i).get(j));
                //accumulate the squared difference
                dist+=Math.pow(sourceDist-centerDist,2);
            }
            if (dist < minDist) {
                //if this center is closer than anything seen so far, remember the distance and use its cluster id as the outKey: the record belongs to that center
                outKey = centers.get(i).get(0);
                minDist = dist;
            }
        }
        String valueOut= data[0]+","+data[11]+","+data[12];
        context.write(new Text(outKey), new Text(valueOut));
    }
}

The combiner stage does a first pass: it sums the lat and lng values and counts how many source records belong to each cluster center, so that the reduce can later divide to obtain the new centers.
Combiner code analysis:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import hadoop_test.ConstantPool;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class KmeansCombiner extends Reducer<Text, Text, Text,Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException{
        //key is the cluster id, values are all samples in that cluster; with few clusters and a lot of data, values can be huge
        double[] latAndLngSum = new double[DataSource.feat_num];
        //counts how many records this mapper saw for this key
        long count =0;
        for(Text text:values){
            //each text is one raw record (potentially a very large number of them)
            String[] elements=text.toString().split(ConstantPool.ENGLISH_COMMA_SIGNAL);
            //accumulate the feature sums; there are only two features here, but the loop also works for more
            for (int i = 1; i <elements.length ; i++) {
                //starting from 1 leaves just longitude and latitude, which are summed up
                latAndLngSum[i-1]+=Double.parseDouble(elements[i]);
            }
            //count how many records belong to this cluster center
            count+=1;
        }

        StringBuilder result= new StringBuilder(key.toString() + "::");
        for (int i = 0; i <latAndLngSum.length ; i++) {
            result.append(latAndLngSum[i]).append(",");
        }

        //valueout  1::[123.32,343.42]::129  i.e. [cluster id, per-field sums, number of rows]
        context.write(key,new Text(result.substring(0,result.length() -1)+"::"+count));
    }
}

The reduce stage performs the division to compute the new cluster centers.
Reducer code analysis:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author liqinglin
 */
public class KmeansReducer extends Reducer<Text, Text, Text,Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException{
        long num=0;
        double[] re=new double[DataSource.feat_num];
        //value  [1::123.9,32.4::35,1::123.9,32.4::35]
        for(Text text:values){
            //text  1::123.9,32.4::35 -----> [1,"123.9,32.4",35]
            //i.e. each value has the form key::sum_lat,sum_lng::count
            String[] tmp=text.toString().split("::");
            num+=Integer.parseInt(tmp[2]);
            String[] features = tmp[1].split(",");
            //accumulate the sums; only two features here, but the loop works for more
            for (int i = 0; i <features.length ; i++) {
                re[i]+=Double.parseDouble(features[i]);
            }
        }
        StringBuilder result= new StringBuilder();
        for (int i = 0; i <re.length ; i++) {
            //divide by the count to obtain the new cluster-center coordinate
            result.append(re[i] / num).append(",");
        }

        //strip the trailing comma
        String outValue = result.substring(0,result.length() -1);
        context.write(key,new Text(outValue));
    }
}

A further mapper is needed to re-emit the source data with the assigned cluster center appended at the end. This map is essentially the same as the earlier one; only the output differs.
KmeansMapperPredict code analysis:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @author liqinglin
 */
public class KmeansMapperPredict extends Mapper<LongWritable, Text, Text, NullWritable> {
    List<ArrayList<String>> centers;
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        //read all old cluster centers from oldCenter and load them into the centers list
        centers= Util.getCenterFile(DataSource.old_center+"/part-r-00000");

    }
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //read every record from the data source
        String[] data = value.toString().split(",");

        //temporarily store the fields we need from the record
        List<String> tmp = new ArrayList<>();
        //index
        tmp.add(data[0]);
        //lng
        tmp.add(data[11]);
        //lat
        tmp.add(data[12]);


        //the cluster this record is assigned to; filled in below
        String outKey="" ;
        //tracks the minimum distance seen so far
        double minDist = Double.MAX_VALUE;
        //the outer loop iterates over all cluster centers
        for (int i = 0; i < centers.size(); i++) {
            double dist = 0;
            //position 0 of each center is the cluster_id
            //the inner loop computes the distance between this sample and one cluster center
            for(int j=1;j<centers.get(i).size();j++){
                //j starts at 1 because centers.get(i).get(0) is the cluster id and tmp.get(0) is the row index, neither takes part in the distance
                double sourceDist=Double.parseDouble(tmp.get(j));
                double centerDist=Double.parseDouble(centers.get(i).get(j));
                //accumulate the squared difference
                dist+=Math.pow(sourceDist-centerDist,2);
            }
            if (dist < minDist) {
                //if this center is closer than anything seen so far, remember the distance and use its cluster id as the outKey: the record belongs to that center
                outKey = centers.get(i).get(0);
                minDist = dist;
            }
        }
        String keyOut = value+","+outKey;
        context.write(new Text(keyOut), NullWritable.get());
    }
}

At this point it is worth recording the two shared utility classes. The first one holds the paths and other constants:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import org.apache.hadoop.fs.Path;

public class DataSource {
    public static final int K=10;//number of clusters
//HDFS path of the training data
    public static final String inputlocation="/hadoop_test/homework2/sample_train.csv";
//    path of the old cluster centers
    public static final String old_center="/hadoop_test/homework2/old_center";
    public static final String new_center="/hadoop_test/homework2/new_center";
    public static final String result_data="/hadoop_test/homework2/result_data";
//    number of iterations
    public static final int REPEAT=2;
//    convergence threshold
    public static final float threshold=(float)0.01;

    public static Path inputpath=new Path(inputlocation);
    public static Path oldCenter=new Path(old_center);
    public static Path newCenter=new Path(new_center);
//number of fields / features
    public static int feat_num=2;
}

The other one holds shared helper methods, e.g. the stopping condition, deleting the previous reduce output, and loading the cluster centers.

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import hadoop_test.ConstantPool;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Util {
    //number of cluster centers
    public static final int K=10;
    //read the cluster centers from HDFS; note that each line must include the key
    public static List<ArrayList<String>> getCenterFile(String inputPath){
        //return  [[1.1,1.2,1.3],[2.1,2.1,2.3],[3.1,1.4,1.2]]
        List<ArrayList<String>> centers=new ArrayList<ArrayList<String>>();
        Configuration conf=new Configuration();
        try{
            FileSystem fs=DataSource.oldCenter.getFileSystem(conf);
            Path path=new Path(inputPath);
            FSDataInputStream fsIn=fs.open(path);
            //read line by line into a Text, then convert to String
            Text lineText=new Text();
            String tmpStr=null;
            LineReader linereader=new LineReader(fsIn,conf);
            while(linereader.readLine(lineText)>0){
                ArrayList<String> oneCenter=new ArrayList<>();
                tmpStr=lineText.toString();
                //split the String and store the pieces in the list
                String[] tmp=tmpStr.replace("\t"," ").trim().replace(" ",",").split(",");

                for(int i=0;i<tmp.length;i++){
                    oneCenter.add(tmp[i]);
                }
                //add this center to the collection
                centers.add(oneCenter);
            }
            fsIn.close();
        }catch(IOException e){
            e.printStackTrace();
        }
        //return the list of centers

        return centers;
    }

    //decide whether to keep iterating: returns false when the stopping condition is met, true otherwise
    public static boolean isStop(String oldpath,String newpath,int repeats,float threshold)
            throws IOException{
        //threshold, e.g. 0.1
        //load the old and new center files, e.g.
        //old: [[1,123.5483598064494786,34.5620634207706902], ...]
        //new: [[1,123.5583598064494786,34.5720634207706902], ...]
        List<ArrayList<String>> oldcenters= Util.getCenterFile(oldpath);
        List<ArrayList<String>> newcenters= Util.getCenterFile(newpath);

        //accumulate the distance between the old and new centers
        float distance=0;
        for(int i=0;i<K-1;i++){
            //distance between one old center and the corresponding new center
            for(int j=1;j<oldcenters.get(i).size();j++){
                float tmp=Math.abs(Float.parseFloat(oldcenters.get(i).get(j))
                        -Float.parseFloat(newcenters.get(i).get(j)));
                distance+=Math.pow(tmp,2);
            }
        }
        /*
         * If the centers have converged (distance below the threshold) or the iteration
         * budget is exhausted, return false so the driver stops; otherwise promote the
         * new center file to be the old one and return true.
         */
        System.out.println(distance);
        System.out.println(repeats);
//        iteration has not stopped yet
        if(distance<threshold || DataSource.REPEAT<repeats) {
            return false;
        }
        //core step: delete this round's old cluster centers and let this round's new centers
        //replace them, serving as the initial centers of the next round

        //1. delete the old cluster-center file
        Util.deleteLastResult(oldpath);
        //2. move the new cluster-center file into the old cluster-center location
        Configuration conf=new Configuration();

        //path of the old cluster centers
        Path npath=new Path(DataSource.old_center);

        FileSystem fs=npath.getFileSystem(conf);
        //uses the local filesystem as an intermediate hop to pull the data from HDFS; somewhat redundant
        fs.moveToLocalFile(new Path(newpath), new Path(
                ConstantPool.TMP_SAVE_LOCAL_PATH));
        fs.delete(new Path(oldpath), true);//make sure the old path no longer exists before writing to it
        fs.moveFromLocalFile(new Path(ConstantPool.TMP_SAVE_LOCAL_PATH)
                ,new Path(oldpath));
        return true;

    //delete the result of the previous mapreduce run
    public static void deleteLastResult(String inputpath){
        Configuration conf=new Configuration();
        try{
            Path path=new Path(inputpath);
            FileSystem fs2= path.getFileSystem(conf);
            fs2.delete(new Path(inputpath),true);
        }catch(IOException e){
            e.printStackTrace();
        }
    }
}

Finally comes the driver class. Because the training has to iterate repeatedly over the data, there is a do...while loop with a stopping condition: either the specified number of iterations is reached, or the distance between the old and new cluster centers falls below the threshold.
Driver code analysis:

package hadoop_test.homework_liqinglin.homework04.homework04_05_clusterCenter;

import hadoop_test.Utils_hadoop;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KmeansDriver {
    // main entry point
    public static void main(String[] args) throws Exception {
        int repeats = 0;
        do {

            //a fresh conf is built for every round
            Configuration conf = new Configuration();

            // create the MapReduce job and set the driver class
            Job job = new Job(conf);

            // set the input/output paths (the output path needs an extra existence check)
            job.setJarByClass(KmeansDriver.class);
            //1. set the input path (the path of the data file)
            FileInputFormat.addInputPath(job, DataSource.inputpath);
            FileSystem fs = DataSource.newCenter.getFileSystem(conf);

            // set the output path (where the new centers are written)
            if (fs.exists(DataSource.newCenter)) {
                fs.delete(DataSource.newCenter, true);
            }

            FileOutputFormat.setOutputPath(job, DataSource.newCenter);
            // set the mapper, combiner and reducer classes for the job
            job.setMapperClass(KmeansMapper.class);
            job.setCombinerClass(KmeansCombiner.class);

            job.setReducerClass(KmeansReducer.class);
            // set the output key and value classes
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // launch the job and wait for it to finish
            job.waitForCompletion(true);
            repeats++;
        //stop once the iteration count reaches DataSource.REPEAT or isStop reports that the centers have converged
        } while (repeats < DataSource.REPEAT && (Util.isStop(DataSource.old_center + "/part-r-00000",
                DataSource.new_center + "/part-r-00000", repeats, DataSource.threshold)));

        //final clustering/assignment pass (done by the map alone)
        Configuration c_conf = new Configuration();
        // create the MapReduce job and set the driver class
        Job c_job = new Job(c_conf);
        // set the input/output paths (the output path needs an extra existence check)
        // only a mapper is set (with no reducer, the job output is the mapper output)
        c_job.setMapperClass(KmeansMapperPredict.class);

        // set the output key and value classes
        c_job.setOutputKeyClass(Text.class);
        c_job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(c_job, new Path(DataSource.inputlocation));

        if( Utils_hadoop.testExist(c_conf,DataSource.result_data)){
            Utils_hadoop.rmDir(c_conf,DataSource.result_data);}
        FileOutputFormat.setOutputPath(c_job, new Path(DataSource.result_data));
        c_job.waitForCompletion(true);
    }
}

Sampled final execution results:

9999944,0,,9b7f14319274911456c34fcfc4d65e02,5.31237E+18,34d5c803111b370049278fc451b7279d,1,2.1.5,Hisense,w,7.1.2,99.354058,25.219612,HLTEM800,1.57329E+12,162
9999947,0,,9b7f14319274911456c34fcfc4d65e02,8.45111E+18,34d5c803111b370049278fc451b7279d,2,2.1.5,Hisense,o,7.1.2,99.634656,24.856997,HLTEM800,1.57318E+12,162
999995,0,,ca738b1dc69585e84a61cb3038ac6edf,1.02651E+18,8176384ec7c48958461fe265ebc0fa1b,3,2.1.5,HUAWEI,o,8.1.0,109.038469,34.365967,JKM-AL00,1.5732E+12,564
9999960,1,1.57335E+12,fc1dd2b66596e167d30ff2d2ca9ac419,2.08931E+18,1e64681b6ec2618c4b16210e2df679bc,4,2.1.3,vivo,o,7.1.2,105.576038,30.549127,vivo Y66i,1.57335E+12,162
9999973,1,1.57335E+12,fc1dd2b66596e167d30ff2d2ca9ac419,6.74084E+18,1e64681b6ec2618c4b16210e2df679bc,4,2.1.3,vivo,o,7.1.2,105.576038,30.549127,vivo Y66i,1.57335E+12,162
9999986,0,,829eeca0b5ffd3148dca9c798bf100d0,5.4496E+18,e1a1650a0fb58e69ac6d8c7ed9d0cabe,0,2.1.5,HONOR,w,9,95.775268,40.572859,ARE-AL10,1.57326E+12,806
9999989,0,,d5e65f3911bca96c108ffcb222f81930,1.7953E+18,eae5fcf571e973d4703c99b8fe8078e3,2,2.1.5,Honor,w,4.4.2,120.304667,31.630557,CHM-TL00,1.57327E+12,93

Summary

This brings the Hadoop journey to an end; it took two weeks to understand and finish all of these exercises. Overall, what you need from Hadoop is this: given a requirement, be able to quickly write the MR job, decide on the outkey and outvalue, and really internalize the distributed mindset. Then there is the clustering algorithm: it is important to deeply understand the iteration loop below, i.e. how to pick the cluster centers, how to optimize them, how normalization and standardization are done, what a distance is, how to compute it, and which distance fits which kind of business.
(Figure: core iteration loop of the clustering algorithm)
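
For reference, the two rescaling methods mentioned in this summary are usually written as: min-max normalization x' = (x - min) / (max - min), and z-score standardization x' = (x - mean) / std, where min, max, mean and std are computed per feature.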
