New to Avro, I used Avro files as the input and output of MapReduce jobs and hit countless pitfalls along the way. My environment is a CentOS 6.10 + Hadoop 3.0.3 server cluster, with MapReduce development done on Windows.
1 Avro class not found
I had happily finished writing my MapReduce job, but when I ran the jar from cmd, it failed with java.lang.ClassNotFoundException: org.apache.avro.hadoop.io.AvroKeyComparator:
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/avro/hadoop/io/AvroKeyComparator
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.NoClassDefFoundError: org/apache/avro/hadoop/io/AvroKeyComparator
at org.apache.avro.mapreduce.AvroJob.setMapOutputKeySchema(AvroJob.java:93)
at org.apache.avro.mapreduce.AvroMultipleOutputs.setSchema(AvroMultipleOutputs.java:511)
at org.apache.avro.mapreduce.AvroMultipleOutputs.getContext(AvroMultipleOutputs.java:547)
at org.apache.avro.mapreduce.AvroMultipleOutputs.write(AvroMultipleOutputs.java:399)
at org.apache.avro.mapreduce.AvroMultipleOutputs.write(AvroMultipleOutputs.java:378)
at com.visa.edp.common.vssParser.CustomTestMapper.map(CustomTestMapper.java:98)
at com.visa.edp.common.vssParser.CustomTestMapper.map(CustomTestMapper.java:32)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.hadoop.io.AvroKeyComparator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
Most search results for this problem involve endless classpath fiddling, or bundling the avro jars into your own jar. After going around in circles with the Java classpath, the Hadoop classpath (including but not limited to editing hadoop-env.sh, hadoop-env.cmd, yarn-site.xml and mapred-site.xml, and copying the avro jars everywhere), and the various ways of packaging third-party dependencies, I found a blog post that solved it: http://blog.sina.com.cn/s/blog_634ddf3b01019ov7.html. It uses the conf.set("tmpjars", newJarPath) API to ship the avro jar to HDFS. In the end I added the following code to my MapReduce job, and it ran:
Configuration conf = getConf();
JarUtils.addTmpJar("C:/Software/Hadoop-3.0.3/lib/avro/avro-mapred-1.7.7-hadoop2.jar", conf);
Job job = new Job(conf);
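The JarUtils helper itself is not shown above, so here is a minimal sketch of what addTmpJar could look like, based purely on the description of the trick: copy the local jar to HDFS and append its qualified path to the "tmpjars" property (the same comma-separated list that the -libjars option populates). The /tmp/hadoop-libjars staging directory is my assumption.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JarUtils {
    // Sketch only: upload a local jar to HDFS and register it in "tmpjars"
    public static void addTmpJar(String localJarPath, Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path(localJarPath);
        Path dst = new Path("/tmp/hadoop-libjars/" + src.getName()); // hypothetical staging dir
        fs.copyFromLocalFile(false, true, src, dst);
        String newJarPath = fs.makeQualified(dst).toString();
        String tmpJars = conf.get("tmpjars");
        conf.set("tmpjars", (tmpJars == null || tmpJars.isEmpty())
                ? newJarPath
                : tmpJars + "," + newJarPath);
    }
}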
The other approaches aren't necessarily wrong; another blog post covers several ways to handle third-party dependencies in Hadoop, and perhaps I just misconfigured something. If you want to try them, see https://blog.csdn.net/lazy0zz/article/details/7505712
2 org.apache.hadoop.mapreduce.TaskAttemptContext: interface found where a class was expected
2018-07-24 10:13:40,229 INFO mapreduce.Job: Task Id : attempt_1532395774750_0011_m_000000_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
2018-07-24 10:13:47,289 INFO mapreduce.Job: Task Id : attempt_1532395774750_0011_m_000000_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Right after the job started running, this second error appeared. Some searching led to the conclusion that avro-mapred is compiled into two separate jars, one against Hadoop 1 and one against Hadoop 2 (TaskAttemptContext was a class in Hadoop 1 but became an interface in Hadoop 2), so you have to pick the jar matching your Hadoop version. Adding a classifier to the avro-mapred dependency in pom.xml fixes it. I'm on Hadoop 3.0.3, and the following works fine:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>1.7.7</version>
    <classifier>hadoop2</classifier>
</dependency>
3 org.apache.avro.generic.GenericData$Record cannot be cast to... problem
This error came up when reading the avro file: a GenericData$Record could not be cast to the class I had generated with avro-maven-plugin. After more searching, it turned out that Hadoop ships its own avro jar. I had uploaded both avro.jar and avro-mapred.jar, and they conflicted, so I removed the conf.set upload of avro.jar and uploaded only avro-mapred.jar, which fixed it.
Error: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to com.tuan.hadoopLearn.avro.TemperaturePair
at com.tuan.hadoopLearn.mapreduce.AvroMaxTemperature$AvroMaxTemperatureMapper.map(AvroMaxTemperature.java:35)
at com.tuan.hadoopLearn.mapreduce.AvroMaxTemperature$AvroMaxTemperatureMapper.map(AvroMaxTemperature.java:29)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
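Incidentally, if such a jar conflict can't be resolved, the records can still be read generically instead of as the specific generated class. A minimal sketch (not part of my actual job) of the same mapper written against GenericRecord, with fields looked up by name:
import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public static class GenericMaxTemperatureMapper extends
        Mapper<AvroKey<GenericRecord>, NullWritable, IntWritable, IntWritable> {
    @Override
    public void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        GenericRecord record = key.datum(); // no cast to a generated class needed
        context.write(new IntWritable((Integer) record.get("year")),
                new IntWritable((Integer) record.get("temperature")));
    }
}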
4 A few other small pitfalls
1 avro-maven-plugin does not generate classes automatically
I doubt anyone else will hit this one the same way: I had simply written the wrong file extension on the schema file (the plugin's schema goal expects .avsc). Once that was fixed, the classes were generated automatically.
2 Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
This error was a side effect of all my classpath fiddling. To fix it, first run hadoop classpath on the server (!!! key point: on the server), then copy the output into the yarn.application.classpath property of yarn-site.xml in the Hadoop configuration on Windows (!!! key point: copy it into the Windows config):
<property>
    <name>yarn.application.classpath</name>
    <value>
        /home/vivian/hadoop-3.0.3/etc/hadoop:
        /home/vivian/hadoop-3.0.3/share/hadoop/common/lib/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/common/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/hdfs:
        /home/vivian/hadoop-3.0.3/share/hadoop/hdfs/lib/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/hdfs/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/mapreduce/lib/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/mapreduce/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/yarn:
        /home/vivian/hadoop-3.0.3/share/hadoop/yarn/lib/*:
        /home/vivian/hadoop-3.0.3/share/hadoop/yarn/*
    </value>
</property>
3 Classes generated by avro-maven-plugin cannot be imported
This one also took a long time. Initially I wanted the generated classes in a separate avro directory, so I set outputDirectory in the avro-maven-plugin configuration to ${project.basedir}/src/main/java/avro. After compiling, the generated class did appear in that directory, but other classes could never import it, and IntelliJ IDEA kept showing baffling messages about duplicate classes. Looking at the compiled target folder, I found the generated class sitting in the root of the classes directory.
A bit more digging showed that the generated class had no package declaration in its header, so of course it ended up in the root directory. This is easy to fix: the namespace in the schema file determines the package of the generated class. Add a line "namespace": "****" to the schema, and the generated class will carry package ****, so it can be imported.
outputDirectory then only needs to be set to ${project.basedir}/src/main/java/, as in the plugin configuration sketched below.
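For reference, a typical avro-maven-plugin configuration matching this setup (the sourceDirectory here is my assumption; point it at wherever your .avsc files live):
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.7.7</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>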
5 MapReduce job source code
The code is as follows; it's written rather casually.
package com.tuan.hadoopLearn.mapreduce;
import java.io.IOException;
import com.tuan.hadoopLearn.avro.TemperaturePair;
import com.tuan.hadoopLearn.utils.JarUtils;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class AvroMaxTemperature extends Configured implements Tool {

    // Mapper: unpacks each AvroKey<TemperaturePair> into a (year, temperature) pair
    public static class AvroMaxTemperatureMapper extends
            Mapper<AvroKey<TemperaturePair>, NullWritable, IntWritable, IntWritable> {
        @Override
        public void map(AvroKey<TemperaturePair> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            Integer year = key.datum().getYear();
            Integer temperature = key.datum().getTemperature();
            context.write(new IntWritable(year), new IntWritable(temperature));
        }
    }

    // Reducer: keeps the maximum temperature per year and emits it as Avro key/value
    public static class AvroMaxTemperatureReducer extends
            Reducer<IntWritable, IntWritable, AvroKey<Integer>, AvroValue<Integer>> {
        @Override
        public void reduce(IntWritable key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE; // start below any reading so negative temperatures are handled
            for (IntWritable value : values) {
                max = Math.max(max, value.get());
            }
            context.write(new AvroKey<Integer>(key.get()), new AvroValue<Integer>(max));
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MapReduceMaxTemperature <input path> <output path>");
            return -1;
        }
        Configuration conf = getConf();
        // Ship avro-mapred to the cluster via "tmpjars" (see pitfall 1)
        JarUtils.addTmpJar("C:/Software/Hadoop-3.0.3/lib/avro/avro-mapred-1.7.7-hadoop2.jar", conf);
        Job job = new Job(conf);
        job.setJarByClass(AvroMaxTemperature.class);
        job.setJobName("Avro Max Temperature");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Input: Avro container file of TemperaturePair records
        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setMapperClass(AvroMaxTemperatureMapper.class);
        AvroJob.setInputKeySchema(job, TemperaturePair.getClassSchema());
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Output: Avro key/value file of (year, max temperature) int pairs
        job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
        job.setReducerClass(AvroMaxTemperatureReducer.class);
        AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.INT));
        AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new AvroMaxTemperature(), args);
        System.exit(res);
    }
}
The TemperaturePair.avsc file is as follows:
{
    "namespace": "com.tuan.hadoopLearn.avro",
    "type": "record",
    "name": "TemperaturePair",
    "doc": "A weather reading.",
    "fields": [
        {"name": "year", "type": "int"},
        {"name": "temperature", "type": "int"}
    ]
}
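To produce a test input for the job, records can be written with Avro's DataFileWriter. A minimal sketch (this writer class is not part of the original post; the file name and sample values are made up):
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import com.tuan.hadoopLearn.avro.TemperaturePair;

public class WriteTemperaturePairs {
    public static void main(String[] args) throws IOException {
        // Write a few TemperaturePair records into an Avro container file
        DataFileWriter<TemperaturePair> writer = new DataFileWriter<TemperaturePair>(
                new SpecificDatumWriter<TemperaturePair>(TemperaturePair.class));
        writer.create(TemperaturePair.getClassSchema(), new File("temperature.avro"));
        writer.append(new TemperaturePair(1949, 111));
        writer.append(new TemperaturePair(1949, 78));
        writer.append(new TemperaturePair(1950, 22));
        writer.close();
    }
}
Upload the resulting file to HDFS and pass its directory to the job as the input path.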