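1. Custom serialization
This article learns custom serialization through one job: totaling the upstream traffic, downstream traffic, and total traffic of each phone number.
1.1 Create the phone.txt file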
1 13764368888 196.168.0.11 1116 854 200
2 13764368888 196.168.0.11 1136 834 200
3 13764368888 196.168.0.11 1146 824 200
4 13764368888 196.168.0.11 1116 804 200
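1.2 Write the FlowBean serialization class
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Implementing the Writable interface is all a custom serializable class needs
public class FlowBean implements Writable {
    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long totalFlow; // total traffic

    // Hadoop instantiates the bean by reflection, so a no-arg constructor is required
    public FlowBean() {}

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(totalFlow);
    }

    // Fields must be read in exactly the order in which write() wrote them
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.totalFlow = in.readLong();
    }

    // toString() determines how the value appears in the text output file
    @Override
    public String toString() {
        return "FlowBean{" +
                "upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", totalFlow=" + totalFlow +
                '}';
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getTotalFlow() {
        return totalFlow;
    }

    // Derives the total from the two fields that are already set
    public void setTotalFlow() {
        this.totalFlow = this.upFlow + this.downFlow;
    }
}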
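To make the write/readFields contract concrete, here is a minimal round-trip sketch that runs with the JDK alone. FlowRecord and RoundTripDemo are hypothetical names introduced only for this illustration: FlowRecord copies FlowBean's field order but deliberately does not implement Hadoop's Writable, so no Hadoop jar is needed to try it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RoundTripDemo {
    // Writes the record into an in-memory byte array.
    static byte[] serialize(FlowRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a record from bytes produced by serialize().
    static FlowRecord deserialize(byte[] bytes) {
        FlowRecord record = new FlowRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        FlowRecord original = new FlowRecord();
        original.upFlow = 1116;
        original.downFlow = 854;
        original.totalFlow = original.upFlow + original.downFlow;

        byte[] bytes = serialize(original);
        FlowRecord copy = deserialize(bytes);
        System.out.println(bytes.length + " bytes, totalFlow=" + copy.totalFlow);
    }
}

// Hypothetical stand-in for FlowBean (same field order, no Hadoop dependency).
class FlowRecord {
    long upFlow;
    long downFlow;
    long totalFlow;

    void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(totalFlow);
    }

    void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        totalFlow = in.readLong();
    }
}
```

Three long fields occupy 24 bytes. Swapping two reads in readFields would still succeed but would silently put values in the wrong fields, which is why the order has to match write.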
1.3 Write the FlowMapper class
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    // Reused for every record to avoid allocating new objects per line
    private Text keyOut = new Text();
    private FlowBean valueOut = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        // Fields per line: id, phone, ip, upstream, downstream, status
        String line = value.toString();
        String[] split = line.split(" ");
        String phone = split[1];
        String up = split[3];
        String down = split[4];
        keyOut.set(phone);
        valueOut.setUpFlow(Long.parseLong(up));
        valueOut.setDownFlow(Long.parseLong(down));
        valueOut.setTotalFlow();
        context.write(keyOut, valueOut);
    }
}
1.4 Write the FlowReduce class
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowReduce extends Reducer<Text, FlowBean, Text, FlowBean> {
    private FlowBean valueOut = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        // Sum the upstream and downstream traffic of all records for this phone number
        long totalUp = 0;
        long totalDown = 0;
        for (FlowBean flowBean : values) {
            totalUp += flowBean.getUpFlow();
            totalDown += flowBean.getDownFlow();
        }
        valueOut.setUpFlow(totalUp);
        valueOut.setDownFlow(totalDown);
        valueOut.setTotalFlow();
        context.write(key, valueOut);
    }
}
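1.5 Write the FlowDriver class
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "flow");
        job.setJarByClass(FlowDriver.class);
        job.setMapperClass(FlowMapper.class);
        // The reducer can double as the combiner: it only sums, and its input and output types match
        job.setCombinerClass(FlowReduce.class);
        job.setReducerClass(FlowReduce.class);
        // Map output and final output share the same types, so setting them once is enough
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}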
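1.6 Run with arguments
Pass the input file and the output directory as program arguments. The second path is created as a directory despite its .txt suffix, and it must not exist before the job runs.
E:\Java\blogCode\hadoop\src\main\resources\phone.txt E:\Java\blogCode\hadoop\src\main\resources\phone_ret.txt
The part-r-00000 file in the output directory contains: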
13764368888 FlowBean{upFlow=4514, downFlow=3316, totalFlow=7830}
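The totals can be cross-checked without a cluster. The sketch below is plain Java, not MapReduce: FlowTotalsCheck is a hypothetical helper that applies the same parsing as FlowMapper and the same summation as FlowReduce to the four sample lines of phone.txt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FlowTotalsCheck {
    // Returns phone -> {upstream, downstream, total}, aggregated per phone number.
    static Map<String, long[]> aggregate(String[] lines) {
        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String line : lines) {
            // Same field positions as FlowMapper: [1] phone, [3] upstream, [4] downstream
            String[] fields = line.split(" ");
            long up = Long.parseLong(fields[3]);
            long down = Long.parseLong(fields[4]);
            long[] sums = totals.computeIfAbsent(fields[1], k -> new long[3]);
            sums[0] += up;
            sums[1] += down;
            sums[2] = sums[0] + sums[1];
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1 13764368888 196.168.0.11 1116 854 200",
            "2 13764368888 196.168.0.11 1136 834 200",
            "3 13764368888 196.168.0.11 1146 824 200",
            "4 13764368888 196.168.0.11 1116 804 200"
        };
        aggregate(lines).forEach((phone, s) ->
            System.out.println(phone + " upFlow=" + s[0] + ", downFlow=" + s[1] + ", totalFlow=" + s[2]));
        // prints: 13764368888 upFlow=4514, downFlow=3316, totalFlow=7830
    }
}
```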
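2. InputFormat: data input
First, a brief look at how InputFormat reads input data.
2.1 Data blocks and input splits
Data block: a block is how HDFS physically divides data, 128 MB by default. The block is the unit in which HDFS stores data.
Input split: a split divides the input only logically; nothing is physically cut up or stored separately. The split is the unit of input for a MapReduce computation, and each split starts one MapTask.
2.2 How splits determine MapTask parallelism
- The parallelism of a job's map phase is determined by the number of splits the client computes when it submits the job
- Each split is assigned to one MapTask instance, and the instances run in parallel
- By default, split size = block size
- Splitting does not consider the data set as a whole; each file is split individually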
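The rules above can be turned into a small split-count estimate. This is a sketch under assumptions, not Hadoop's code: SplitCountSketch and its method names are hypothetical, and the max(minSize, min(maxSize, blockSize)) formula and the 1.1 tolerance on the last split are my understanding of FileInputFormat, which the text above does not state. Check them against your Hadoop version.

```java
class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    // Assumption: the final split may be up to 10% larger than splitSize
    static final double SPLIT_SLOP = 1.1;

    // With the default min and max sizes this evaluates to the block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one file; assumes fileLength > 0.
    static int splitsForFile(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    // Each file is split on its own, so the job total is a per-file sum.
    static int splitsForJob(long[] fileLengths, long splitSize) {
        int total = 0;
        for (long length : fileLengths) {
            total += splitsForFile(length, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128 * MB, 1, Long.MAX_VALUE);
        long[] files = {300 * MB, 10 * MB, 10 * MB};
        System.out.println("MapTasks: " + splitsForJob(files, splitSize));
    }
}
```

Under these assumptions the 300 MB file yields three splits and each 10 MB file yields one, so the job starts five MapTasks even though the 320 MB of input would fit in three blocks.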
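2.3 TextInputFormat
TextInputFormat is the default FileInputFormat implementation and reads each record line by line. The key is the position of the line within the file (its starting byte offset), of type LongWritable; the value is the line's text, of type Text.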
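As an illustration of those key/value pairs, the following plain-Java sketch computes the pairs a line reader would produce for a small piece of text. LineOffsetSketch is a hypothetical helper, and it assumes UTF-8 content with "\n" line endings; it does not call Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsetSketch {
    // Maps each line's starting byte offset (the key) to the line text (the value).
    static Map<Long, String> records(String content) {
        Map<Long, String> result = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            result.put(offset, line);
            // Advance by the line's byte length plus one byte for the newline
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "shenjian.online\nbitcoin.org\nshenjian.online\n";
        records(content).forEach((offset, line) -> System.out.println(offset + "\t" + line));
        // prints offsets 0, 16 and 28, each followed by its line
    }
}
```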
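2.4 CombineTextInputFormat
2.4.1 When to use it
TextInputFormat splits per file: however small a file is, it becomes a separate split and is handed to its own MapTask. A large number of small files therefore produces a large number of MapTasks, which hurts performance.
CombineTextInputFormat is meant for the many-small-files case. It logically groups several small files into one split, so that a single MapTask can process all of them.
2.4.2 Driver configuration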
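// CombineFileInputFormat is abstract; CombineTextInputFormat is its concrete implementation for text files
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum virtual-storage split size to 4 MB; adjust it to the size of your small files
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);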
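2.4.3 Split mechanism
Generating splits has two stages: a virtual-storage stage and a split stage.
1) Suppose there are four small files a, b, c, and d (read in dictionary order) with sizes 1.7 MB, 5.1 MB, 3.4 MB, and 6.8 MB.
2) Virtual storage produces 6 file blocks with sizes 1.7 MB, (2.55 MB, 2.55 MB), 3.4 MB, and (3.4 MB, 3.4 MB). Files a and c are smaller than 4 MB, so each stays as one block. Files b and d are larger than 4 MB but smaller than 2 * 4 MB, so each is divided evenly into two blocks to keep the blocks balanced. (If there were a file e of 8.2 MB, that is, 8.2 > 2 * 4, a 4 MB block would be carved off first, and the remaining 4.2 MB would then be divided evenly by the same rule.)
3) The split stage checks whether a virtual-storage block is greater than or equal to 4 MB. If it is, the block forms a split on its own; otherwise it is merged with the next virtual-storage block and the two form a split together. Files a, b, c, and d therefore end up as 3 splits with sizes (1.7 + 2.55) MB, (2.55 + 3.4) MB, and (3.4 + 3.4) MB.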
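The two stages can be simulated in a few lines. This is a sketch of the procedure as described in the three steps above, not Hadoop's actual source; CombineSplitSketch and its methods are hypothetical names. Sizes are given in units of 0.01 MB so that the arithmetic stays in exact integers (170 means 1.7 MB, and the 4 MB maximum is 400).

```java
import java.util.ArrayList;
import java.util.List;

class CombineSplitSketch {
    // Stage 1: cut every file into virtual-storage blocks.
    static List<Long> virtualBlocks(long[] fileSizes, long maxSize) {
        List<Long> blocks = new ArrayList<>();
        for (long size : fileSizes) {
            long left = size;
            while (left > 0) {
                long length = Math.min(maxSize, left);
                // Between maxSize and 2 * maxSize: halve instead of leaving a tiny tail
                if (length < left && left < 2 * maxSize) {
                    length = left / 2;
                }
                blocks.add(length);
                left -= length;
            }
        }
        return blocks;
    }

    // Stage 2: merge consecutive blocks until a split reaches maxSize.
    static List<Long> splits(List<Long> blocks, long maxSize) {
        List<Long> result = new ArrayList<>();
        long current = 0;
        for (long block : blocks) {
            current += block;
            if (current >= maxSize) {
                result.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            result.add(current); // leftover blocks form the final split
        }
        return result;
    }

    public static void main(String[] args) {
        long[] files = {170, 510, 340, 680}; // a, b, c, d in units of 0.01 MB
        List<Long> blocks = virtualBlocks(files, 400);
        System.out.println(blocks);              // [170, 255, 255, 340, 340, 340]
        System.out.println(splits(blocks, 400)); // [425, 595, 680]
    }
}
```

The output matches the worked example: six virtual blocks and three splits of 4.25 MB, 5.95 MB, and 6.8 MB. Passing a single 820 (the 8.2 MB file e) to virtualBlocks gives 400, 210, 210.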
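3. OutputFormat: data output
3.1 Requirement
We want to export the records in the log file site.txt that access shenjian.online to shenjian.log, and the records for all other domains to other.log.
site.txt contains: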
shenjian.online
bitcoin.org
shenjian.online
bitcoin.org
bitcoin.org
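3.2 Create the LogMapper and LogReducer classes
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Pass the whole line through as the key
        context.write(value, NullWritable.get());
    }
}

public class LogReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Reducer<Text, NullWritable, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Write the key once per value so that identical lines are not lost
        for (NullWritable value : values) {
            context.write(key, NullWritable.get());
        }
    }
}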
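3.3 Create LogOutputFormat and LogRecordWriter
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        return new LogRecordWriter(job);
    }
}

public class LogRecordWriter extends RecordWriter<Text, NullWritable> {
    private FSDataOutputStream dataOutputStreamOne;
    private FSDataOutputStream dataOutputStreamTwo;

    // Let IOException propagate: swallowing it would leave the streams null and fail later in write()
    public LogRecordWriter(TaskAttemptContext job) throws IOException {
        FileSystem fileSystem = FileSystem.get(job.getConfiguration());
        // One output stream per target file; the paths are hard-coded for this demo
        dataOutputStreamOne = fileSystem.create(new Path("hadoop/src/main/resources/site/shenjian.log"));
        dataOutputStreamTwo = fileSystem.create(new Path("hadoop/src/main/resources/site/other.log"));
    }

    @Override
    public void write(Text key, NullWritable value) throws IOException, InterruptedException {
        String log = key.toString();
        // Write UTF-8 bytes; writeBytes() would keep only the low byte of each char
        byte[] line = (log + "\n").getBytes(StandardCharsets.UTF_8);
        if (log.contains("shenjian.online")) {
            dataOutputStreamOne.write(line);
        } else {
            dataOutputStreamTwo.write(line);
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        IOUtils.closeStream(dataOutputStreamOne);
        IOUtils.closeStream(dataOutputStreamTwo);
    }
}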
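3.4 Create the LogDriver class
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log output format");
        job.setJarByClass(LogDriver.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Use the custom output format instead of the default TextOutputFormat
        job.setOutputFormatClass(LogOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat still needs an output directory, even though the record writer uses its own paths
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}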
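The result is as shown in the figure: shenjian.log holds the shenjian.online records and other.log holds the rest.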
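The routing done by LogRecordWriter can also be tried without Hadoop. In the sketch below, LogRoutingSketch is a hypothetical stand-in that collects lines into two in-memory lists instead of two files; sorting the input first imitates the fact that the reducer receives its keys in sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LogRoutingSketch {
    final List<String> shenjianLog = new ArrayList<>(); // stands in for shenjian.log
    final List<String> otherLog = new ArrayList<>();    // stands in for other.log

    // Same branching as LogRecordWriter.write
    void write(String log) {
        if (log.contains("shenjian.online")) {
            shenjianLog.add(log);
        } else {
            otherLog.add(log);
        }
    }

    // Feeds every line through write() in sorted key order.
    static LogRoutingSketch run(String[] lines) {
        String[] sorted = lines.clone();
        Arrays.sort(sorted);
        LogRoutingSketch sketch = new LogRoutingSketch();
        for (String line : sorted) {
            sketch.write(line);
        }
        return sketch;
    }

    public static void main(String[] args) {
        String[] site = {
            "shenjian.online", "bitcoin.org", "shenjian.online", "bitcoin.org", "bitcoin.org"
        };
        LogRoutingSketch result = run(site);
        System.out.println("shenjian.log: " + result.shenjianLog);
        System.out.println("other.log: " + result.otherLog);
    }
}
```

For the sample site.txt this routes two lines to the first list and three to the second, and no duplicate line is dropped.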