Note: Alt+/ and Ctrl+1 are very important keyboard shortcuts.
1. Hadoop programming template (for operations on files in HDFS)
// Create a Configuration object; it loads Hadoop's configuration files
Configuration conf = new Configuration();
// Set parameters on conf (if you don't do it in code, change them in the configuration files instead)
conf.set("dfs.replication", "2");
conf.set("dfs.blocksize", "64m");
FileSystem fs = FileSystem.get(new URI("hdfs://hdp-001:9000"), conf, "root");
The steps above are needed whenever a client connects to HDFS.
Below are some of the Hadoop APIs:
// Upload a file to HDFS
fs.copyFromLocalFile(new Path("C:\\Users\\liu-xiao-ge\\Desktop\\问题.txt"), new Path("/"));
// Download a file from HDFS to the local file system
fs.copyToLocalFile(new Path("/问题.txt"), new Path("C:\\Users\\liu-xiao-ge\\Desktop\\haha.txt"));
// Move/rename a file within HDFS
fs.rename(new Path("/问题.txt"), new Path("/liu/liu.txt"));
fs.close();
2. Reposted: common usage of the Hadoop FileSystem API
https://blog.csdn.net/pzsoftchen/article/details/17632173
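The reposted article covers more of the API; as a rough sketch (continuing from the fs handle created in the template above, with made-up paths, and assuming imports for org.apache.hadoop.fs.RemoteIterator and org.apache.hadoop.fs.LocatedFileStatus), a few other commonly used FileSystem calls look like this:
// Create a directory (parent directories are created as needed)
fs.mkdirs(new Path("/liu"));
// Check whether a path exists
boolean exists = fs.exists(new Path("/liu/liu.txt"));
// Recursively list the files under a directory and print each name and size
RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
while (it.hasNext()) {
    LocatedFileStatus status = it.next();
    System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
}
// Delete a path (true = recursive)
fs.delete(new Path("/liu"), true);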
3. MapReduce programming
(1) You need to provide implementations of the Mapper and Reducer classes
① The Mapper implementation class WordcountMapper
KEYIN: the key type of the data read by the map task; it is the starting byte offset of a line, a Long
VALUEIN: the value type of the data read by the map task; it is the content of a line, a String
KEYOUT: the key type of the kv results returned by the user-defined map method; in the wordcount logic we return the word, a String
VALUEOUT: the value type of the kv results returned by the user-defined map method; in the wordcount logic we return a count, an Integer
However, in MapReduce the data produced by map has to be shipped to reduce, which requires serialization and deserialization. The JDK's native serialization mechanism produces rather bloated output, which would make data transfer during a MapReduce job inefficient.
Hadoop therefore designed its own serialization mechanism, so the data types passed around in MapReduce must implement Hadoop's serialization interface.
For the common JDK types Long, String, Integer, Float, etc., Hadoop provides wrapper types that implement its serialization interface: LongWritable, Text, IntWritable, FloatWritable.
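As a minimal, self-contained sketch (not from the original notes; the class name is arbitrary), the size difference can be made concrete by serializing the integer 1 once with JDK serialization and once with Hadoop's IntWritable:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.IntWritable;

public class SerializationSizeDemo {
    public static void main(String[] args) throws Exception {
        // JDK serialization of an Integer: writes a stream header plus a class descriptor
        ByteArrayOutputStream jdkBytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(jdkBytes);
        oos.writeObject(Integer.valueOf(1));
        oos.close();
        // Hadoop Writable serialization of the same value: just the raw 4-byte int
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(writableBytes);
        new IntWritable(1).write(dos);
        dos.close();
        System.out.println("JDK serialized size:      " + jdkBytes.size() + " bytes");
        System.out.println("Writable serialized size: " + writableBytes.size() + " bytes");
    }
}
Running it shows the Writable form is only 4 bytes, while the JDK form also carries the stream header and class descriptor.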
The code is as follows:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words
        String line = value.toString();
        String[] words = line.split(" ");
        // Emit <word, 1> for every word
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
② The Reducer implementation class WordcountReducer
The code is as follows:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up the 1s emitted by the mappers for this word
        int count = 0;
        Iterator<IntWritable> iterator = values.iterator();
        while (iterator.hasNext()) {
            IntWritable value = iterator.next();
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
③ You need a driver class that submits the job (it must contain the main method)
Note that there are the following ways to submit a job:
1) Submit from Windows to YARN
The code is as follows:
public static void main(String[] args) throws Exception {
    // Set a JVM system property so the job object can obtain the user identity used to access HDFS
    System.setProperty("HADOOP_USER_NAME", "root");
    Configuration conf = new Configuration();
    // 1. Set the default file system the job will access at runtime
    conf.set("fs.defaultFS", "hdfs://hdp-01:9000");
    // 2. Set where the job is submitted to run
    conf.set("mapreduce.framework.name", "yarn");
    conf.set("yarn.resourcemanager.hostname", "hdp-01");
    // 3. When this submitter program runs on Windows, this cross-platform parameter must be added
    conf.set("mapreduce.app-submission.cross-platform", "true");
    Job job = Job.getInstance(conf);
    // 1. Parameter: location of the jar containing the job code
    job.setJar("D:\\appdev\\hadoop-16\\mapreduce24\\target\\mapreduce24-0.0.1-SNAPSHOT.jar");
    //job.setJarByClass(JobSubmitter.class);
    // 2. Parameters: the Mapper and Reducer implementation classes this job calls
    job.setMapperClass(WordcountMapper.class);
    job.setReducerClass(WordcountReducer.class);
    // 3. Parameters: the key/value types of the results produced by the Mapper and Reducer implementation classes
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Delete the output directory if it already exists
    Path output = new Path("/wordcount/output");
    FileSystem fs = FileSystem.get(new URI("hdfs://hdp-01:9000"), conf, "root");
    if (fs.exists(output)) {
        fs.delete(output, true);
    }
    // 4. Parameters: the input path of the data set to process and the output path of the final results
    FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
    FileOutputFormat.setOutputPath(job, output); // Note: the output path must not exist
    // 5. Parameter: the number of reduce tasks to start
    job.setNumReduceTasks(2);
    // 6. Submit the job to YARN and wait for it to finish
    boolean res = job.waitForCompletion(true);
    System.exit(res ? 0 : -1);
}
2) Submit from Linux to YARN
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://hdp-01:9000");
    conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
    // The default file system and where the MapReduce job runs do not have to be specified here;
    // when submitting from a cluster node they are picked up from the node's Hadoop config files
    Job job = Job.getInstance(conf);
    job.setJarByClass(JobSubmitterLinuxToYarn.class);
    job.setMapperClass(WordcountMapper.class);
    job.setReducerClass(WordcountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
    FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
    job.setNumReduceTasks(3);
    boolean res = job.waitForCompletion(true);
    System.exit(res ? 0 : 1);
}
3) Local test run on Windows
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With nothing set here, fs.defaultFS defaults to file:/// and mapreduce.framework.name defaults to local,
    // so the job runs in the local MapReduce simulator against the local file system
    //conf.set("fs.defaultFS", "file:///");
    //conf.set("mapreduce.framework.name", "local");
    Job job = Job.getInstance(conf);
    job.setJarByClass(JobSubmitterLinuxToYarn.class);
    job.setMapperClass(WordcountMapper.class);
    job.setReducerClass(WordcountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path("f:/mrdata/wordcount/input"));
    FileOutputFormat.setOutputPath(job, new Path("f:/mrdata/wordcount/output"));
    job.setNumReduceTasks(3);
    boolean res = job.waitForCompletion(true);
    System.exit(res ? 0 : 1);
}
4. A step up from the basic MapReduce example: per-phone traffic statistics with a custom Writable
(1) The job submitter class (runs locally on Windows)
The code is as follows:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(JobSubmit.class);
    job.setMapperClass(MapperIp.class);
    job.setReducerClass(ReduceIp.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(FlowB.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FlowB.class);
    // Note: the output directory must not exist when the job starts
    FileInputFormat.setInputPaths(job, new Path("C:\\Users\\liu-xiao-ge\\Desktop\\1"));
    FileOutputFormat.setOutputPath(job, new Path("C:\\Users\\liu-xiao-ge\\Desktop\\1\\1"));
    job.setNumReduceTasks(2);
    job.waitForCompletion(true);
}
(2) The Mapper implementation class
The code is as follows:
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the tab-separated log line into fields
    String[] ls = value.toString().split("\t");
    // Field layout assumed by this example: the phone number is the field at index 1,
    // the upstream and downstream flow are the 3rd- and 2nd-to-last fields
    String phone = ls[1];
    int upflow = Integer.parseInt(ls[ls.length - 3]);
    int dflow = Integer.parseInt(ls[ls.length - 2]);
    // Emit <phone, FlowB(phone, upflow, dflow)>
    context.write(new Text(phone), new FlowB(phone, upflow, dflow));
}
(3) The Reducer implementation class:
The code is as follows:
@Override
protected void reduce(Text key, Iterable<FlowB> value, Reducer<Text, FlowB, Text, FlowB>.Context context)
        throws IOException, InterruptedException {
    // Sum the upstream and downstream flow of all records for this phone number
    int upsum = 0;
    int dsum = 0;
    for (FlowB flowB : value) {
        upsum += flowB.getUpFlow();
        dsum += flowB.getdFlow();
    }
    context.write(key, new FlowB(key.toString(), upsum, dsum));
}
(4) The FlowB class
Purpose of this example: demonstrate how a custom data type implements Hadoop's serialization interface
The class must keep a no-argument constructor
The order in which the write method serializes the fields' binary data must match the order in which the readFields method reads them back
The code is as follows:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class FlowB implements Writable {
    private int upFlow;
    private int dFlow;
    private String phone;
    private int amountFlow;

    public FlowB(){}

    public FlowB(String phone, int upFlow, int dFlow) {
        this.phone = phone;
        this.upFlow = upFlow;
        this.dFlow = dFlow;
        this.amountFlow = upFlow + dFlow;
    }

    public String getPhone() {
        return phone;
    }
    public void setPhone(String phone) {
        this.phone = phone;
    }
    public int getUpFlow() {
        return upFlow;
    }
    public void setUpFlow(int upFlow) {
        this.upFlow = upFlow;
    }
    public int getdFlow() {
        return dFlow;
    }
    public void setdFlow(int dFlow) {
        this.dFlow = dFlow;
    }
    public int getAmountFlow() {
        return amountFlow;
    }
    public void setAmountFlow(int amountFlow) {
        this.amountFlow = amountFlow;
    }

    /**
     * Called by Hadoop when it serializes an object of this class
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(upFlow);
        out.writeUTF(phone);
        out.writeInt(dFlow);
        out.writeInt(amountFlow);
    }

    /**
     * Called by Hadoop when it deserializes an object of this class
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readInt();
        this.phone = in.readUTF();
        this.dFlow = in.readInt();
        this.amountFlow = in.readInt();
    }

    @Override
    public String toString() {
        return this.phone + "," + this.upFlow + "," + this.dFlow + "," + this.amountFlow;
    }
}
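A minimal round-trip sketch (not part of the original notes; the phone number and flow values are made up for illustration) shows why the write/readFields order matters:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class FlowBRoundTripTest {
    public static void main(String[] args) throws Exception {
        FlowB original = new FlowB("13800000000", 100, 200);
        // Serialize with write(): upFlow, phone, dFlow, amountFlow, in that order
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        // Deserialize with readFields(): it must read the fields back in exactly the same order
        FlowB copy = new FlowB();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        // If write() and readFields() disagree on the order, the fields come back scrambled
        System.out.println("original: " + original);
        System.out.println("copy:     " + copy);
    }
}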