Notes from Learning Hadoop (Part 2)

青鸟飞雪

Article link: https://blog.csdn.net/qq_40774413/article/details/109786742

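This article records in detail the process of learning Hadoop MapReduce, covering key topics such as the NLineInputFormat example, custom InputFormat, partitioning strategies, WritableComparable sorting, Combiner merging, and custom output formats, with the aim of helping readers understand how MapReduce works and how to apply it in practice.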

Hadoop Study Notes (2)

These are notes I took while following the 尚硅谷 video course on Bilibili. The video link is:

https://www.bilibili.com/video/BV1cW411r7c5

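7. MapReduce

7.2 MapReduce Serialization

7.2.4 NLineInputFormat Example

With NLineInputFormat, the input for each map task is no longer split by block. Splits are instead made every N lines, where N is a number you specify.
Number of splits = total number of lines in the file / N (if the division leaves a remainder, number of splits = quotient + 1).
The key-value pairs are the same as those produced by TextInputFormat: the key is the byte offset of the line and the value is the content of that line.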
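
As a quick check of the split-count rule above, here is a minimal sketch in plain Java with no Hadoop dependency. The class name SplitCountSketch and the method countSplits are made up for illustration and are not part of the Hadoop API.

```java
final class SplitCountSketch {

    // Number of splits when each split holds at most linesPerSplit lines:
    // the quotient, plus one extra split if the division leaves a remainder.
    static int countSplits(int totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerSplit must be > 0");
        }
        int quotient = totalLines / linesPerSplit;
        return (totalLines % linesPerSplit == 0) ? quotient : quotient + 1;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(18, 3)); // divides evenly: 6 splits
        System.out.println(countSplits(16, 3)); // remainder 1: 5 + 1 = 6 splits
        System.out.println(countSplits(15, 3)); // divides evenly: 5 splits
    }
}
```

With N = 3, any file of 16 to 18 lines gives 6 splits under this rule.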

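1. Requirement

Count word occurrences, with one split for every three lines of input.
(1) Input data: [image omitted]

(2) Expected result: 6 splits

2. Implementation steps

(1) Create the package xsl.com.mr.nLineInputFormat
(2) Create the NLineMapper class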

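import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();                  // output key (the word), reused across calls
    IntWritable v = new IntWritable(1);   // output value: a count of 1 per occurrence

    // Override the map method
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split the line into words on spaces
        String[] words = line.split(" ");
        // 3. Emit (word, 1) for every word
        for (String word : words) {
            k.set(word);        // set the output key
            context.write(k, v);
        }
    }
}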

(3) Create the NLineReducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NLineReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    IntWritable v = new IntWritable();   // output value: total count for the word

    // Override the reduce method
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum the counts emitted by the mappers for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // 2. Emit (word, total)
        v.set(sum);
        context.write(key, v);
    }
}

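(4) Create the NLineDriver class

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Hard-coded local paths for testing; adjust them to your environment
        args = new String[]{"E:\\Code\\hadoop_test\\input5", "E:\\Code\\hadoop_test\\output5"};
        Configuration conf = new Configuration();
        // 1. Get the Job object
        Job job = Job.getInstance(conf);
        // 2. Set the jar location
        job.setJarByClass(NLineDriver.class);
        // 3. Associate the Mapper and Reducer classes
        job.setMapperClass(NLineMapper.class);
        job.setReducerClass(NLineReducer.class);
        // 4. Set the key and value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Use NLineInputFormat to split the input
        job.setInputFormatClass(NLineInputFormat.class);
        // Three lines per split
        NLineInputFormat.setNumLinesPerSplit(job, 3);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        // 7. Submit the Job
        // job.submit();
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);    // 0 on success, 1 on failure
    }
}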

(5) Check the result:
[image omitted]

As shown, the input was divided into 6 splits.

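7.2.5 Custom InputFormat Example

Sometimes the InputFormat implementations that ship with Hadoop do not meet our needs, for example when merging a large number of small files. In that case we have to write our own implementation.
Steps for customizing:
(1) Define a class that extends FileInputFormat.
(2) Write a custom RecordReader that reads one whole file at a time and wraps it as a key-value pair.
(3) Use SequenceFileOutputFormat to write the merged output.

1. Requirement

Merge several small files into one SequenceFile (Hadoop's binary file format for storing key-value pairs). The SequenceFile holds many files, each stored as file path + name (key) and file contents (value).
(1) Input data:
[image omitted]
(2) Expected output: part-r-00000

2. Implementation steps

(1) Create a new package xsl.com.mr.customInputFormat
(2) In that package, define an InputFormat class named WholeFileInputFormat that extends FileInputFormat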

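public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Never split a file, so that each file is read as one whole record
    // (JobContext is org.apache.hadoop.mapreduce.JobContext, Path is org.apache.hadoop.fs.Path)
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        WholeRecordReader recordReader = new WholeRecordReader();
        recordReader.initialize(inputSplit, taskAttemptContext);
        return recordReader;
    }
}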

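(3) Define the custom RecordReader class WholeRecordReader

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeRecordReader extends RecordReader<Text, BytesWritable> {

    FileSplit split;                        // the split (one whole file)
    Configuration configuration;            // job configuration
    Text k = new Text();                    // key: file path + name
    BytesWritable v = new BytesWritable();  // value: file contents
    boolean isProgress = true;              // flag: true until the file has been read

    // Initialization
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        this.split = (FileSplit) inputSplit;
        // Get the configuration
        configuration = taskAttemptContext.getConfiguration();
    }

    // Core logic: produce exactly one key-value pair per file
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (isProgress) {
            // 1. Get the FileSystem object
            Path path = split.getPath();                        // path of the split
            FileSystem fs = path.getFileSystem(configuration);  // file system for that path
            // 2. Open the input stream
            FSDataInputStream fsin = fs.open(path);
            try {
                // 3. Read the whole file into a buffer
                byte[] buffer = new byte[(int) split.getLength()];
                IOUtils.readFully(fsin, buffer, 0, buffer.length);
                // 4. Wrap the value
                v.set(buffer, 0, buffer.length);
                // 5. Wrap the key
                k.set(path.toString());
            } finally {
                // 6. Release the stream even if reading fails
                IOUtils.closeStream(fsin);
            }
            isProgress = false;     // the file has been consumed
            return true;
        }
        return false;
    }

    // Get the current key
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return k;
    }

    // Get the current value
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return v;
    }

    // Report progress: 0 before the file has been read, 1 afterwards
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return isProgress ? 0 : 1;
    }

    @Override
    public void close() throws IOException {
        // Nothing to do: the stream is closed in nextKeyValue()
    }
}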

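(4) Write the SequenceFileMapper class

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SequenceFileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Pass the (path, contents) pair through unchanged
        context.write(key, value);
    }
}

(5) Write the SequenceFileReducer class

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SequenceFileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // Write out every value for this key
        for (BytesWritable value : values) {
            context.write(key, value);
        }
    }
}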

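(6) Write the SequenceFileDriver class

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Hard-coded local paths for testing; adjust them to your environment
        args = new String[]{"E:/Code/hadoop_test/input6", "E:/Code/hadoop_test/output6"};
        Configuration conf = new Configuration();
        // 1. Get the Job object
        Job job = Job.getInstance(conf);
        // 2. Set the jar location
        job.setJarByClass(SequenceFileDriver.class);
        // 3. Associate the Mapper and Reducer
        job.setMapperClass(SequenceFileMapper.class);
        job.setReducerClass(SequenceFileReducer.class);
        // 4. Set the key and value types of the Mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        // 5. Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Set the InputFormat
        job.setInputFormatClass(WholeFileInputFormat.class);
        // 8. Set the OutputFormat
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // 9. Submit the Job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}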

(7) Run the program and check the output file
[image omitted]

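7.3 Partitioning

Partitioning is used when the output has to be written to different files according to some condition. The default Partitioner takes the key's hashCode modulo the number of ReduceTasks, so with the default the user cannot control which key goes to which partition.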

public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
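
To see what this expression does, the following standalone sketch applies the same arithmetic to plain Java objects. HashPartitionSketch and partitionFor are illustrative names only. In a real job the key would be a Writable such as Text, whose hashCode() is not the same as String's, so the partition numbers printed here should not be expected to match those of a Hadoop job.

```java
final class HashPartitionSketch {

    // Same arithmetic as the default partitioner: clearing the sign bit keeps
    // a negative hashCode from producing a negative partition number.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] {"hadoop", "spark", "hive", "flink"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
        // An Integer's hashCode is its own value, so this key hashes to -7,
        // yet the result still falls inside [0, numReduceTasks).
        System.out.println("-7 -> partition " + partitionFor(Integer.valueOf(-7), numReduceTasks));
    }
}
```

The same key always lands in the same partition, but which partition that is depends only on the hash value, which is why a custom Partitioner is needed to group keys by a business rule.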

Steps for writing a custom Partitioner:
(1) Define a class that extends Partitioner and override the getPartition() method.

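public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
        // Custom logic goes here: return a partition number in [0, numPartitions).
        // Placeholder so the skeleton compiles; it sends every record to partition 0.
        return 0;
    }
}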

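(2) In the Job driver, register the custom Partitioner.

	// Specify the custom partitioner
	job.setPartitionerClass(ProvincePartitioner.class);

(3) After defining the custom Partitioner, set a matching number of ReduceTasks according to its logic.

	// Also specify the matching number of reduce tasks
	job.setNumReduceTasks(5);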
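
How the partition logic relates to the ReduceTask count can be sketched without Hadoop. The prefix-to-partition table below is purely hypothetical, since these notes do not give the actual mapping, and PrefixPartitionSketch and partitionFor are illustrative names. The point is only that a rule returning partition numbers 0 to 4 goes together with five ReduceTasks.

```java
final class PrefixPartitionSketch {

    static final int NUM_PARTITIONS = 5;  // matches job.setNumReduceTasks(5)

    // Hypothetical rule: the first three digits of the phone number pick the
    // partition; anything unrecognized falls into the last partition.
    static int partitionFor(String phone) {
        String prefix = (phone != null && phone.length() >= 3) ? phone.substring(0, 3) : "";
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;
        }
    }

    public static void main(String[] args) {
        // Made-up phone numbers for illustration
        String[] phones = {"13612345678", "13712345678", "13812345678", "13912345678", "15512345678"};
        for (String phone : phones) {
            int partition = partitionFor(phone);
            // Every result must be a valid partition index for five ReduceTasks
            if (partition < 0 || partition >= NUM_PARTITIONS) {
                throw new IllegalStateException("partition out of range: " + partition);
            }
            System.out.println(phone + " -> partition " + partition);
        }
    }
}
```

As I understand it, if the number of ReduceTasks is greater than 1 but smaller than the number of partitions the rule can return, the job fails with an "Illegal partition" error. With exactly one ReduceTask the Partitioner is effectively bypassed and a single output file results, and with more ReduceTasks than partitions the extra output files are simply empty.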

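Example:

1. Requirement

Write the statistics to different files (that is, different partitions) according to the province each phone number belongs to.
(1) Input data
[image omitted]
File contents: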

1	15561654675	192.168.100.1	www.sdfs.com	2481	56515	200
2 	15352545656	192.168.100.2	www.gaadfa.com	2515	51522	200
3	16526505466	192.168.100.3					123		56448	404
4	63156156156	192.168.100.4	www.wewfsd
