- Configuration
- conf.addDefaultResource, conf.addResource; resources added later override earlier ones, except properties marked final (see sketch below)
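A minimal sketch of loading and overriding configuration resources; the two XML file names are hypothetical classpath resources:

import org.apache.hadoop.conf.Configuration;

public class ConfExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // hypothetical resource files; a property set in configuration-2.xml
    // overrides the same property in configuration-1.xml, unless the
    // earlier definition is marked <final>true</final>
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");
    System.out.println(conf.get("fs.defaultFS"));
  }
}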
<property>
  <name>fs.defaultFS</name>
  <value>file:/// or hdfs://namenode</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>local or yarn</value>
</property>
hadoop fs -conf conf/config-file.xml -ls .
hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
-D property=value -> set a single configuration property
-conf filename -> add a configuration resource file
-fs uri -> set fs.defaultFS (HDFS namenode)
-jt host:port -> resource manager address
-files f1,f2,... -> copy files to the distributed cache
-archives a1,a2,... -> copy archives to the distributed cache (unarchived on task nodes)
-libjars jar1,jar2,... -> add JARs to the distributed cache and the task classpath
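For example, setting the number of reducers through a generic option (paths as above):

hadoop v2.MaxTemperatureDriver -D mapreduce.job.reduces=2 input/ncdc/micro output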
HADOOP_USER_NAME environment variable -> set the user identity seen by HDFS (on clusters without Kerberos)
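For example (user name hypothetical):

HADOOP_USER_NAME=alice hadoop fs -ls /user/alice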
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "file:///");          // local filesystem
conf.set("mapreduce.framework.name", "local"); // local job runner
conf.setInt("mapreduce.task.io.sort.mb", 1);   // minimal sort buffer for tests
- Maven setup: hadoop-client, mrunit, minicluster dependencies
class YourJobDriver extends Configured implements Tool
ToolRunner.run(new YourJobDriver(), args); a minimal driver sketch follows
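A minimal sketch of this driver pattern; class name, job name, and the mapper/reducer wiring are placeholders:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class YourJobDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already reflects the -conf/-D/-fs/-jt generic options
    Job job = Job.getInstance(getConf(), "your job");
    job.setJarByClass(YourJobDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // set mapper, reducer, and output key/value classes here
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new YourJobDriver(), args));
  }
}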
Unit test with MRUnit: MapDriver<>.withMapper(...).withInput(...).withOutput(...).runTest(); ReduceDriver<> works the same way (sketch below)
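A self-contained MRUnit sketch; the mapper and its input/output pairs are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordLengthMapperTest {
  // hypothetical mapper: emits (line, line length) for each input line
  static class WordLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, new IntWritable(value.getLength()));
    }
  }

  @Test
  public void emitsLineLength() throws IOException {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordLengthMapper())
        .withInput(new LongWritable(0), new Text("hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(6))
        .runTest();
  }
}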
- classpath
- client (in driver JVM): the job JAR, JARs in its lib/ directory and its classes/ directory, plus HADOOP_CLASSPATH
- cluster (task JVMs): the job JAR, JARs in its lib/ directory and its classes/ directory, plus JARs added to the distributed cache via -libjars or Job.addFileToClassPath()
- user classes may conflict with Hadoop's own dependencies; set HADOOP_USER_CLASSPATH_FIRST=true (client) and mapreduce.job.user.classpath.first=true (tasks) to put user classes first (example below)
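For example, adding a dependency for both the client and the tasks (JAR name hypothetical):

export HADOOP_CLASSPATH=mylib.jar
hadoop v2.MaxTemperatureDriver -libjars mylib.jar input/ncdc/micro output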
- Debug job
- collect info
context.setStatus(...), context.getCounter(Enum).increment(1); see the mapper sketch after this list
mapreduce.task.files.preserve.failedtasks to keep intermediate files of failed tasks
mapreduce.task.files.preserve.filepattern to keep intermediate files of tasks whose ID matches the pattern (even successful ones)
yarn.nodemanager.delete.debug-delay-sec to keep a container's localized files and logs for the given number of seconds after the application finishes
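A sketch of counters and status from inside a mapper; the enum and the "empty record" check are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class YourMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  enum Quality { MALFORMED_RECORD }  // hypothetical counter

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.getLength() == 0) {
      // count bad records and flag the task in the web UI
      context.getCounter(Quality.MALFORMED_RECORD).increment(1);
      context.setStatus("Detected possibly corrupt record");
      return;
    }
    context.write(new Text("key"), new LongWritable(1));
  }
}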
- logs
hadoop daemon logs under HADOOP_LOG_DIR
mapreduce task logs under YARN_LOG_DIR
HDFS audit log (one log4j entry per HDFS request, written by the namenode)
job history logs
- HPROF profiler: set mapreduce.task.profile to true; mapreduce.task.profile.maps and mapreduce.task.profile.reduces select which task IDs to profile (example below)
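For example, profiling only the first two map tasks and no reducers (paths as above; profiler arguments can be tuned via mapreduce.task.profile.params):

hadoop v2.MaxTemperatureDriver -D mapreduce.task.profile=true -D mapreduce.task.profile.maps=0-1 -D mapreduce.task.profile.reduces= input/ncdc/micro output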