11. Hadoop Series: MapReduce Framework — Custom Serialization and Data Input/Output

1. Custom Serialization

This article walks through custom serialization by computing each phone number's upstream traffic, downstream traffic, and total traffic.

1.1 Create the phone.txt file

1 13764368888 196.168.0.11 1116 854 200
2 13764368888 196.168.0.11 1136 834 200
3 13764368888 196.168.0.11 1146 824 200
4 13764368888 196.168.0.11 1116 804 200

1.2 Write the FlowBean serialization class

// Implementing the Writable interface is all a custom serializable class needs
public class FlowBean implements Writable {

    private long upFlow; // upstream traffic
    private long downFlow; // downstream traffic
    private long totalFlow; // total traffic

    // Hadoop instantiates the bean via reflection, so a public no-arg constructor is required
    public FlowBean() {}

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(totalFlow);
    }

    // Fields must be read in exactly the order write() wrote them
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.totalFlow = in.readLong();
    }

    @Override
    public String toString() {
        return "FlowBean{" +
                "upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", totalFlow=" + totalFlow +
                '}';
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getTotalFlow() {
        return totalFlow;
    }

    public void setTotalFlow() {
        this.totalFlow = this.upFlow + this.downFlow;
    }
}
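The write/readFields pair above must be symmetric. That round trip can be checked outside Hadoop with plain java.io, since DataOutput/DataInput come from java.io anyway; the class below is a hypothetical stand-in for FlowBean (no Hadoop dependency), not the Writable implementation itself:

```java
import java.io.*;

// A minimal stand-in for FlowBean that serializes three longs
// exactly the way FlowBean.write()/readFields() do, using only java.io.
public class FlowBeanRoundTrip {

    // Serialize (up, down, total) in write() order: three longs, 8 bytes each.
    public static byte[] serialize(long up, long down, long total) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(up);
            out.writeLong(down);
            out.writeLong(total);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize in the same field order, as readFields() must.
    public static long[] deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return new long[] { in.readLong(), in.readLong(), in.readLong() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(1116, 854, 1970);
        long[] fields = deserialize(data);
        // prints 1116 854 1970
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]);
    }
}
```

A serialized bean is a compact 24 bytes with no class metadata attached, which is a key reason Hadoop prefers Writable over java.io.Serializable for shuffle traffic.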

1.3 Write the FlowMapper class

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    private Text keyOut = new Text();
    private FlowBean valueOut = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        // Columns: id phone ip upFlow downFlow status, separated by single spaces
        String line = value.toString();
        String[] split = line.split(" ");
        String phone = split[1];
        String up = split[3];
        String down = split[4];

        keyOut.set(phone);
        valueOut.setUpFlow(Long.parseLong(up));
        valueOut.setDownFlow(Long.parseLong(down));
        valueOut.setTotalFlow();

        context.write(keyOut, valueOut);
    }
}

1.4 Write the FlowReduce class

public class FlowReduce extends Reducer<Text, FlowBean, Text, FlowBean> {

    private FlowBean valueOut = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        long totalUp = 0;
        long totalDown = 0;

        for (FlowBean flowBean : values) {
            totalUp += flowBean.getUpFlow();
            totalDown += flowBean.getDownFlow();
        }

        valueOut.setUpFlow(totalUp);
        valueOut.setDownFlow(totalDown);
        valueOut.setTotalFlow();

        context.write(key, valueOut);
    }
}

1.5 Write the FlowDriver class

public class FlowDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "flow");
        job.setJarByClass(FlowDriver.class);
        job.setMapperClass(FlowMapper.class);
        // Summing flows is associative, so the reducer can double as the combiner
        job.setCombinerClass(FlowReduce.class);
        job.setReducerClass(FlowReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

1.6 Run with program arguments

E:\Java\blogCode\hadoop\src\main\resources\phone.txt E:\Java\blogCode\hadoop\src\main\resources\phone_ret.txt

The part-r-00000 file in the output directory contains the following:

13764368888	FlowBean{upFlow=4514, downFlow=3316, totalFlow=7830}
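As a sanity check, the totals in that line can be recomputed from phone.txt without running Hadoop at all; this plain-Java sketch repeats the mapper's column parsing and the reducer's summation:

```java
// Recomputes upFlow/downFlow/totalFlow for the four phone.txt records,
// mirroring FlowMapper's column indices and FlowReduce's summation.
public class FlowCheck {

    // Each record: id phone ip upFlow downFlow status (space separated).
    static final String[] LINES = {
        "1 13764368888 196.168.0.11 1116 854 200",
        "2 13764368888 196.168.0.11 1136 834 200",
        "3 13764368888 196.168.0.11 1146 824 200",
        "4 13764368888 196.168.0.11 1116 804 200",
    };

    public static long[] totals(String[] lines) {
        long up = 0, down = 0;
        for (String line : lines) {
            String[] split = line.split(" ");
            up += Long.parseLong(split[3]);   // same index as FlowMapper
            down += Long.parseLong(split[4]);
        }
        return new long[] { up, down, up + down };
    }

    public static void main(String[] args) {
        long[] t = totals(LINES);
        // prints upFlow=4514, downFlow=3316, totalFlow=7830
        System.out.println("upFlow=" + t[0] + ", downFlow=" + t[1] + ", totalFlow=" + t[2]);
    }
}
```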

Welcome to follow the WeChat official account 算法小生 to get in touch with me.

2. InputFormat Data Input

Let's first take a brief look at how InputFormat reads input data.

2.1 Data Blocks vs. Input Splits

Data block: a Block is HDFS's physical division of the data, 128 MB by default. The block is the unit in which HDFS stores data.
Input split: a split divides the input only logically; nothing is physically re-stored. The split is the unit of input for a MapReduce computation, and each split launches one MapTask.

2.2 How Splits Determine MapTask Parallelism

  1. The Map-phase parallelism of a job is decided by the number of splits the client computes when submitting the job
  2. Each split is processed by one parallel MapTask instance
  3. By default, split size = block size
  4. Splitting is performed file by file, not across the dataset as a whole
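Rule 3 comes from FileInputFormat's split-size formula, splitSize = max(minSize, min(maxSize, blockSize)). A small sketch of that formula (the class and method names here are ours, not Hadoop's):

```java
// Mirrors the split-size formula used by Hadoop's FileInputFormat:
// splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitSize {

    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Defaults: minSize = 1, maxSize = Long.MAX_VALUE -> split size equals block size.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // Raising minSize above the block size enlarges splits (fewer MapTasks).
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
        // Lowering maxSize below the block size shrinks splits (more MapTasks).
        System.out.println(computeSplitSize(128 * mb, 1, 64 * mb) / mb); // 64
    }
}
```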

2.3 TextInputFormat

TextInputFormat is the default FileInputFormat implementation. It reads records line by line: the key is the byte offset of the line within the file (a LongWritable), and the value is the content of the line (a Text).
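Those byte-offset keys can be illustrated with a plain-Java sketch that walks a small file's content the way TextInputFormat does, assuming ASCII text and '\n' line terminators (the class is ours, for illustration only):

```java
import java.util.*;

public class LineOffsets {

    // Returns the byte offset of each line start, assuming ASCII text and '\n' terminators.
    public static List<Long> offsets(String content) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            result.add(offset);
            offset += line.length() + 1; // +1 skips past the '\n'
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "shenjian.online\nbitcoin.org\nshenjian.online\n";
        // keys: 0, 16, 28 -- "shenjian.online\n" is 16 bytes, "bitcoin.org\n" is 12
        System.out.println(offsets(content));
    }
}
```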

2.4 CombineTextInputFormat

2.4.1 Use Case

TextInputFormat splits per file: no matter how small a file is, it becomes at least one split and is handed to its own MapTask, so a large number of small files spawns a large number of MapTasks and hurts performance.
CombineTextInputFormat targets exactly this small-file scenario: it logically packs multiple small files into one split, so a single MapTask can process them together.

2.4.2 Driver Configuration

job.setInputFormatClass(CombineTextInputFormat.class); // note: CombineTextInputFormat, not the abstract CombineFileInputFormat
// Set the virtual-storage split maximum to 4 MB; tune it to the sizes of your small files
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

2.4.3 Split Mechanism

Split generation has two phases: virtual storage, then splitting.

1) Suppose there are four small files a, b, c, d (read in dictionary order) of sizes 1.7 MB, 5.1 MB, 3.4 MB, and 6.8 MB.
2) Virtual storage yields 6 blocks of sizes 1.7 MB, (2.55 MB, 2.55 MB), 3.4 MB, (3.4 MB, 3.4 MB): a and c are below 4 MB, so each becomes a single block; b and d are above 4 MB but below 2 × 4 MB, so each is halved into two equal blocks to keep sizes even. (If there were a file e of 8.2 MB, then since 8.2 > 2 × 4 it would first carve off a 4 MB block and split the remaining 4.2 MB evenly as before.)
3) Splitting then checks whether each virtual block is at least 4 MB: if so, it forms a split on its own; otherwise it is merged with the following virtual block(s) into one split. Files a, b, c, d therefore end up as 3 splits of sizes (1.7 + 2.55) MB, (2.55 + 3.4) MB, and (3.4 + 3.4) MB.
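The two phases above can be simulated in a few lines of plain Java. To keep the arithmetic exact, sizes below are expressed in units of 0.01 MB (so 4 MB is 400 and 5.1 MB is 510); the class and method names are ours, not Hadoop's:

```java
import java.util.*;

// Simulates CombineTextInputFormat's virtual-storage and split phases.
// Sizes are in units of 0.01 MB so the halving stays exact (5.1 MB -> 255 + 255).
public class CombineSplitSim {

    // Phase 1: virtual storage. maxSize corresponds to setMaxInputSplitSize.
    public static List<Long> virtualBlocks(long[] fileSizes, long maxSize) {
        List<Long> blocks = new ArrayList<>();
        for (long size : fileSizes) {
            while (size > 2 * maxSize) {  // e.g. an 8.2 MB file first sheds a 4 MB block
                blocks.add(maxSize);
                size -= maxSize;
            }
            if (size > maxSize) {         // between maxSize and 2*maxSize: halve evenly
                blocks.add(size / 2);
                blocks.add(size - size / 2);
            } else {
                blocks.add(size);
            }
        }
        return blocks;
    }

    // Phase 2: splitting. Blocks below maxSize are merged with the following ones.
    public static List<Long> splits(List<Long> blocks, long maxSize) {
        List<Long> result = new ArrayList<>();
        long current = 0;
        for (long block : blocks) {
            current += block;
            if (current >= maxSize) {
                result.add(current);
                current = 0;
            }
        }
        if (current > 0) result.add(current); // leftover tail becomes the last split
        return result;
    }

    public static void main(String[] args) {
        long[] files = { 170, 510, 340, 680 }; // a=1.7M, b=5.1M, c=3.4M, d=6.8M
        List<Long> blocks = virtualBlocks(files, 400);
        System.out.println(blocks); // [170, 255, 255, 340, 340, 340]
        System.out.println(splits(blocks, 400)); // [425, 595, 680] -> 4.25M, 5.95M, 6.8M
    }
}
```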

3. OutputFormat Data Output

3.1 Requirement

We want to export the records in the log file site.txt that access shenjian.online into shenjian.log, and accesses to all other domains into other.log.

site.txt contains the following:

shenjian.online
bitcoin.org
shenjian.online
bitcoin.org
bitcoin.org

3.2 Create the LogMapper and LogReducer classes

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}
public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Reducer<Text, NullWritable, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Iterate over the values so duplicate records with the same key are not lost
        for (NullWritable value : values) {
            context.write(key, NullWritable.get());
        }
    }
}

3.3 Create LogOutputFormat and LogRecordWriter

public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {

    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        return new LogRecordWriter(job);
    }
}
public class LogRecordWriter extends RecordWriter<Text, NullWritable> {

    private FSDataOutputStream dataOutputStreamOne;
    private FSDataOutputStream dataOutputStreamTwo;

    public LogRecordWriter(TaskAttemptContext job) {
        try {
            FileSystem fileSystem = FileSystem.get(job.getConfiguration());
            dataOutputStreamOne = fileSystem.create(new Path("hadoop/src/main/resources/site/shenjian.log"));
            dataOutputStreamTwo = fileSystem.create(new Path("hadoop/src/main/resources/site/other.log"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write(Text key, NullWritable value) throws IOException, InterruptedException {
        String log = key.toString();
        if (log.contains("shenjian.online")) {
            dataOutputStreamOne.writeBytes(log + "\n");
        } else {
            dataOutputStreamTwo.writeBytes(log + "\n");
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        IOUtils.closeStream(dataOutputStreamOne);
        IOUtils.closeStream(dataOutputStreamTwo);
    }
}
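The routing decision inside write() can be exercised on the site.txt lines without HDFS at all; this plain-Java sketch (our own illustration, not part of the job) collects each line into one of two in-memory buckets instead of two output streams:

```java
import java.util.*;

// Mirrors LogRecordWriter.write()'s branching: lines containing
// "shenjian.online" go to one bucket, everything else to the other.
public class LogRouting {

    public static Map<String, List<String>> route(String[] lines) {
        Map<String, List<String>> buckets = new HashMap<>();
        buckets.put("shenjian.log", new ArrayList<>());
        buckets.put("other.log", new ArrayList<>());
        for (String line : lines) {
            String target = line.contains("shenjian.online") ? "shenjian.log" : "other.log";
            buckets.get(target).add(line);
        }
        return buckets;
    }

    public static void main(String[] args) {
        String[] site = { "shenjian.online", "bitcoin.org", "shenjian.online",
                          "bitcoin.org", "bitcoin.org" };
        Map<String, List<String>> out = route(site);
        System.out.println("shenjian.log: " + out.get("shenjian.log").size()); // 2
        System.out.println("other.log: " + out.get("other.log").size());       // 3
    }
}
```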

3.4 Create the LogDriver class

public class LogDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log output format");
        job.setJarByClass(LogDriver.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(LogOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The result is as shown in the figure. GOOD

