1. MapReduce Programming Conventions

Mapper phase
  extends Mapper<LongWritable, Text, Text, IntWritable>
    LongWritable, Text: input K1/V1
    Text, IntWritable: output K2/V2
  map()
    implements our business logic
    runs inside a MapTask
    called once per K1/V1 pair
Reducer phase
  extends Reducer<Text, IntWritable, Text, IntWritable>
    Text, IntWritable: Reducer input types = Mapper output types
    Text, IntWritable: Reducer output types
  reduce()
    completes the business logic
    runs inside a ReduceTask
    called once per distinct key
Driver
  wires the job together
    the main class to run
    the Mapper/Reducer classes
    the Map output K/V types
    the Reduce output K/V types
    the input and output: the output path must be configured and must not already exist (a minimal skeleton following these conventions is sketched below)
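As a point of reference, here is a minimal skeleton that follows the conventions above end to end (a hypothetical word-count job; the class name and argument handling are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCountDriver {
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private static final IntWritable ONE = new IntWritable(1);
        @Override   // called once per input K1/V1 (offset, line)
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (w.isEmpty()) continue;
                word.set(w);
                context.write(word, ONE);           // emit K2/V2
            }
        }
    }

    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override   // called once per distinct key
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountDriver.class);                // main class
        job.setMapperClass(WcMapper.class);                      // Mapper/Reducer
        job.setReducerClass(WcReducer.class);
        job.setMapOutputKeyClass(Text.class);                    // Map output K/V types
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);                       // Reduce output K/V types
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output: must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}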
2. Serialization

XXXWritable, Text: serializable types
Serialization matters in Hadoop, Spark, Flink alike:
  serialization: in-memory object ==> byte array
  deserialization: byte array ==> in-memory object
A distributed compute framework needs serialization/deserialization for network transfer.
Java has its own serialization (java.io.Serializable);
Hadoop implements its own serialization through Writable:
  compact, fast, extensible, interoperable
Requirement: the key/value we need is not covered by Hadoop's built-in types
  ==> write a custom serializable class
public interface WritableComparable<T> extends Writable, Comparable<T> {
  Writable is the top-level interface for custom serializable classes
Requirement: per phone number, compute the upstream and downstream traffic and the total traffic
  ==>
  phone number: the 2nd field
  upstream traffic: the 3rd field from the end
  downstream traffic: the 2nd field from the end
  total traffic: upstream + downstream
Mapper: <LongWritable, Text, Text, Access>
  Access: phone, up, down, sum
Reducer: <Text, Access, NullWritable, Access>
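For orientation, the field positions described above assume log lines shaped roughly like the made-up example below (not the actual dataset; only the 2nd field and the last three fields matter here):

1363157985066  13726230503  00-FD-07-A4-72-B8:CMCC  120.196.100.82  ...  2481  24681  200
  2nd field = phone (13726230503), 3rd from the end = up (2481), 2nd from the end = down (24681), total = 2481 + 24681 = 27162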
java.lang.NoSuchMethodException:
  com.ruozedata.bigdata.hadoop.mapreduce.ser.Access.<init>()
  a constructor problem: the class must have a no-arg constructor
Steps to implement a custom serializable class
  1) implements Writable
  2) keep a no-arg constructor
  3) implement write() and readFields()
  4) the order the fields are written out must match the order they are read back in (see the round-trip check sketched below)
  5) optional: toString()
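Step 4) can be verified with a quick local round trip: write the object to a buffer, read it back into a fresh instance, and compare. A minimal sketch, assuming the Access class shown further down in these notes (and that the test class sits in the same package):

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;

public class AccessRoundTrip {
    public static void main(String[] args) throws Exception {
        Access original = new Access("13800000000", 100L, 200L);

        // serialization: in-memory object ==> byte array
        DataOutputBuffer out = new DataOutputBuffer();
        original.write(out);

        // deserialization: byte array ==> in-memory object;
        // readFields() must consume the fields in exactly the order write() produced them
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        Access copy = new Access();   // this is why the no-arg constructor is required
        copy.readFields(in);

        System.out.println(copy);     // phone \t up \t down \t sum
    }
}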
Tracing the job-submission code: how the input is turned into InputSplits.

int maps = writeSplits(job, submitJobDir);
InputFormat<?, ?> input =
    ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
At this point the concrete InputFormat in use is already known.
In MapReduce, reading data always goes through an InputFormat --
it is just some subclass of InputFormat.
One InputSplit is processed by one Mapper; by default a split corresponds to one Block.
  e.g. 200M ==> 128M + 72M
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  // defaults to 1
long maxSize = getMaxSplitSize(job);
  // defaults to Long.MAX_VALUE
isSplitable: can the input file be split?
  splittable:     200M ==> 128M + 72M ==> 2 MapTasks
  not splittable: 200M ==> 1 MapTask
long blockSize = file.getBlockSize(); // 128M
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
  return Math.max(minSize, Math.min(maxSize, blockSize));
  = max(1, min(Long.MAX_VALUE, 128M))
  = max(1, 128M)
  = 128M
long bytesRemaining = length; // e.g. 150M
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP)
  SPLIT_SLOP = 1.1, i.e. the last split is allowed to be up to 10% larger than splitSize
  150M / 128M ≈ 1.17 > 1.1 ==> cut a 128M split; the remaining 22M becomes the last split
  a 129M file: 129M / 128M ≈ 1.01 < 1.1 ==> only 1 InputSplit of 129M, not 128M + 1M
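Putting the numbers together, the splitting loop behaves like the sketch below (a simplification of FileInputFormat.getSplits written only to make the arithmetic concrete, not the real implementation):

import java.util.ArrayList;
import java.util.List;

public class SplitMath {
    private static final double SPLIT_SLOP = 1.1;  // the last split may be up to 10% larger

    static List<Long> splitSizes(long length, long splitSize) {
        List<Long> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(bytesRemaining);             // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long M = 1024L * 1024L;
        print(splitSizes(200 * M, 128 * M), M);     // 128M 72M  ==> 2 MapTasks
        print(splitSizes(150 * M, 128 * M), M);     // 128M 22M
        print(splitSizes(129 * M, 128 * M), M);     // 129M      ==> only 1 split
    }

    private static void print(List<Long> splits, long unit) {
        StringBuilder sb = new StringBuilder();
        for (long s : splits) sb.append(s / unit).append("M ");
        System.out.println(sb.toString().trim());
    }
}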
FileInputFormat.setInputPaths(job, new Path(input));
  input: can be a single file or a directory
  walk the directory ==> Path ==> BlockLocation <== which nodes hold your input data
  for each file, get its size
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
  Math.max(minSize, Math.min(maxSize, blockSize));
  = max(1, min(Long.MAX_VALUE, 128M))
By default maxSize > blockSize, so splitSize = blockSize.
  You could lower maxSize below the block size to get smaller splits -- but in practice leave it alone:
  e.g. maxSize = 100M with a 128M block ==> splits of 100M + 28M
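If smaller splits were ever wanted, the knob is the max split size (the standard FileInputFormat helper; as noted above, leave it at the default). A hypothetical driver fragment:

// with a 128M block this makes every block arrive as a 100M split plus a 28M split
FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);
// equivalent configuration property: mapreduce.input.fileinputformat.split.maxsize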
TextInputFormat extends FileInputFormat {
isSplitable
RecordReader<LongWritable, Text> createRecordReader
}
TextInputFormat is an implementation (subclass) of FileInputFormat.
It reads the data line by line:
  K: LongWritable -- the offset of this line within the file
  V: Text -- the content of this line
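For instance, with a hypothetical two-line input file, map() would be called with:

hello world        ==> key = 0  (offset of the line),   value = "hello world"
hadoop mapreduce   ==> key = 12 (11 chars + 1 newline),  value = "hadoop mapreduce"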
MySQLReadDriver
  definitely runs locally,
  but it cannot run on the server as-is!!!
  locally: the mysql driver is already on the classpath
  ==> the mysql driver jar has to be shipped to the server
MySQLReadDriverV2: this version is the one recommended for production
  extends Configured implements Tool, so ToolRunner can parse generic options such as -libjars
  put the mysql jar on a path Hadoop can actually see ***** (see the commands at the end of these notes)
Traffic statistics
package com.ccj.pxj.phone;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class Access implements Writable {
private String phone;
private long up;
private long down;
private long sum;
public String getPhone() {
return phone;
}
public void setPhone(String phone) {
this.phone = phone;
}
public long getUp() {
return up;
}
public void setUp(long up) {
this.up = up;
}
public long getDown() {
return down;
}
public void setDown(long down) {
this.down = down;
}
public Access(String phone, long up, long down) {
this.phone = phone;
this.up = up;
this.down = down;
this.sum=up+down;
}
public Access() {
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(phone);
out.writeLong(up);
out.writeLong(down);
out.writeLong(sum);
}
@Override
public void readFields(DataInput in) throws IOException {
this.phone= in.readUTF();
this.up= in.readLong();
this.down=in.readLong();
this.sum=in.readLong();
}
@Override
public String toString() {
return
phone + '\t' +
up +
"\t" + down +
"\t" + sum ;
}
}
package com.ccj.pxj.phone;
import com.ccj.pxj.phone.utils.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class SerDriver {
public static void main(String[] args) throws Exception {
String input = "data";
String output = "out";
// 1) Get the Job object
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
FileUtils.deleteOutput(configuration, output);
// 2) Set the main class this job should run
job.setJarByClass(SerDriver.class);
// 3) Set the Mapper and Reducer
job.setMapperClass(MyMaper.class);
job.setReducerClass(MyReduce.class);
// 4) Set the Mapper output key/value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Access.class);
// 5) Set the Reducer output key/value types
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Access.class);
// 6) Set the input and output paths
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
// 7) Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
public static class MyMaper extends Mapper<LongWritable, Text,Text,Access>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] data = value.toString().split("\t");
String phone = data[1];
// upstream traffic
long up = Long.parseLong(data[data.length - 3]);
// downstream traffic
long down = Long.parseLong(data[data.length - 2]);
context.write(new Text(phone),new Access(phone,up,down));
}
}
public static class MyReduce extends Reducer<Text,Access, NullWritable,Access>{
@Override
protected void reduce(Text key, Iterable<Access> values, Context context) throws IOException, InterruptedException {
long ups=0;
long downs=0;
for (Access value : values) {
ups+=value.getUp();
downs+=value.getDown();
}
context.write(NullWritable.get(),new Access(key.toString(),ups,downs));
}
}
}
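FileUtils.deleteOutput above is a small project-local helper that is not listed in these notes; it simply deletes the output directory if it already exists, since the job refuses to start otherwise. A minimal sketch of what it might look like (an assumption, not the original code):

package com.ccj.pxj.phone.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileUtils {
    // delete the output path (recursively) if it exists, so the job can be re-run
    public static void deleteOutput(Configuration configuration, String output) throws Exception {
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(output);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
    }
}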
MySQL
package com.ccj.wfy.mysql.mr;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
public class DeptWritable implements DBWritable, Writable {
private int deptno;
private String dname;
private String loc;
public int getDeptno() {
return deptno;
}
public void setDeptno(int deptno) {
this.deptno = deptno;
}
public String getDname() {
return dname;
}
public void setDname(String dname) {
this.dname = dname;
}
public String getLoc() {
return loc;
}
public DeptWritable() {
}
public DeptWritable(int deptno, String dname, String loc) {
this.deptno = deptno;
this.dname = dname;
this.loc = loc;
}
public void setLoc(String loc) {
this.loc = loc;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(deptno);
out.writeUTF(dname);
out.writeUTF(loc);
}
@Override
public void readFields(DataInput in) throws IOException {
this.deptno= in.readInt();
this.dname=in.readUTF();
this.loc=in.readUTF();
}
@Override
public void write(PreparedStatement statement) throws SQLException {
statement.setInt(1,deptno);
statement.setString(2,dname);
statement.setString(3,loc);
}
@Override
public void readFields(ResultSet result) throws SQLException {
deptno=result.getInt(1);
dname=result.getString(2);
loc=result.getString(3);
}
@Override
public String toString() {
return deptno + "\t" + dname + "\t" + loc;
}
}
package com.ccj.wfy.mysql.mr;
import com.ccj.pxj.phone.utils.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MySQLReadDriver {
public static void main(String[] args) throws Exception {
String output = "out";
// 1) Get the Job object
Configuration configuration = new Configuration();
// configuration.set(DBConfiguration.DRIVER_CLASS_PROPERTY, "com.mysql.jdbc.Driver");
DBConfiguration.configureDB(configuration, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/mrtest", "root", "");
Job job = Job.getInstance(configuration);
FileUtils.deleteOutput(configuration, output);
// 2) Set the main class this job should run
job.setJarByClass(MySQLReadDriver.class);
// 3) Set the Mapper
job.setMapperClass(MyMapper.class);
// 4) Set the Mapper output key/value types
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(DeptWritable.class);
// 6) Set the input (MySQL table) and the output path
String[] fields = {"deptno", "dname", "loc"};
DBInputFormat.setInput(job, DeptWritable.class, "dept", null, null, fields);
FileOutputFormat.setOutputPath(job, new Path(output));
// 7) Submit the job
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
public static class MyMapper extends Mapper<LongWritable, DeptWritable, NullWritable, DeptWritable> {
@Override
protected void map(LongWritable key, DeptWritable value, Context context) throws IOException, InterruptedException {
context.write(NullWritable.get(), value);
}
}
}
package com.ccj.wfy.mysql.mr;
import com.ccj.pxj.phone.utils.FileUtils;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;
public class MySQLReadDriverV2 extends Configured implements Tool {
public static void main(String[] args) throws Exception {
Configuration configuration = new Configuration();
int run = ToolRunner.run(configuration, new MySQLReadDriverV2(), args);
System.exit(run);
}
@Override
public int run(String[] strings) throws Exception {
String output = "out1";
// 1) Get the Job object
Configuration configuration = super.getConf();
// configuration.set(DBConfiguration.DRIVER_CLASS_PROPERTY, "com.mysql.jdbc.Driver");
DBConfiguration.configureDB(configuration, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/mrtest", "root", "");
Job job = Job.getInstance(configuration);
FileUtils.deleteOutput(configuration, output);
// 2) Set the main class this job should run
job.setJarByClass(MySQLReadDriverV2.class);
// 3) Set the Mapper
job.setMapperClass(MyMapper.class);
// 4) Set the Mapper output key/value types
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(DeptWritable.class);
// 6) Set the input (MySQL table) and the output path
String[] fields = {"deptno", "dname", "loc"};
DBInputFormat.setInput(job, DeptWritable.class, "dept", null, null, fields);
FileOutputFormat.setOutputPath(job, new Path(output));
// 7) Submit the job
boolean result = job.waitForCompletion(true);
return result ? 0 : 1;
}
public static class MyMapper extends Mapper<LongWritable, DeptWritable, NullWritable, DeptWritable> {
@Override
protected void map(LongWritable key, DeptWritable value, Context context) throws IOException, InterruptedException {
context.write(NullWritable.get(), value);
}
}
}
pom configuration
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cc.pxj.wfy</groupId>
<artifactId>phoneWcRuoZe</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
</properties>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<!-- add the Hadoop dependency -->
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.17</version>
</dependency>
</dependencies>
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
Supplement: making the MySQL driver available when running on the server:
[pxj@pxj /home/pxj/lib]$export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/pxj/app/hive-1.1.0-cdh5.16.2/lib/mysql-connector-java-5.1.27-bin.jar
[pxj@pxj /home/pxj/app/hive-1.1.0-cdh5.16.2/lib]$hadoop jar /home/pxj/lib/phoneWcRuoZe-1.0-SNAPSHOT.jar com.ccj.wfy.mysql.mr.MySQLReadDriverV2 -libjars ~/lib/mysql-connector-java-5.1.27-bin.jar
Author: pxj (潘陈)
Date: 2020-01-07 1:41:32