第二章:小朱笔记hadoop之源码分析-脚本分析
第一节:start-all.sh
第二节:hadoop-config.sh
第三节:hadoop-env.sh
第四节:start-dfs.sh
第五节:hadoop-daemon.sh
第六节:hadoop-daemons.sh
第七节:slaves.sh
第八节:start-mapred.sh
第九节:hadoop
第一节:start-all.sh
(1)内容:
# Start all hadoop daemons. Run this on master node.
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
. "$bin"/../libexec/hadoop-config.sh
else
. "$bin/hadoop-config.sh"
fi
# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR
# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR
(2)分析
根据运行此脚本的目录进入安装hadoop目录下的bin目录,获取hadoop-config.sh 配置,对一些变量进行赋值,然后运行启动hdfs和mapred的启动脚本。
第二节:hadoop-config.sh
(1)内容
# Allow alternate conf dir location.
if [ -e "${HADOOP_PREFIX}/conf/hadoop-env.sh" ]; then
DEFAULT_CONF_DIR="conf"
else
DEFAULT_CONF_DIR="etc/hadoop"
fi
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HADOOP_PREFIX/$DEFAULT_CONF_DIR}"
#check to see it is specified whether to use the slaves or the
# masters file
if [ $# -gt 1 ]
then
if [ "--hosts" = "$1" ]
then
shift
slavesfile=$1
shift
export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
fi
fi
#判断配置文件所在的目录下是否有hadoop-env.sh脚本,有就执行
if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
. "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi
(2)分析
对一些变量进行赋值,这些变量有HADOOP_HOME(hadoop的安装目录),HADOOP_CONF_DIR(hadoop的配置文件目录),HADOOP_SLAVES(--slaves指定的文件的地址)。这个脚本又会执行hadoop-env.sh脚本来设置用户配置的相关环境变量。
第三节:hadoop-env.sh
(1)内容
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/opt/java
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
# Extra Java runtime options. Empty by default.
# export HADOOP_OPTS=-server
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
(2)分析
设置java执行的相关参数,例如JAVA_HOME变量(必须设置)、运行jvm的最大堆空间等
第四节:start-dfs.sh
(1)内容
usage="Usage: start-dfs.sh [-upgrade|-rollback]"
.....
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
(2)分析
此脚本只支持upgrade和rollback两个选项参数,一个参数用于更新文件系统,另一个是回滚文件系统。然后启动namenode、datanode和secondarynamenode节点
第五节:hadoop-daemon.sh
(1)分析
HADOOP_LOG_DIR 存放日志文件的目录。默认是 PWD 命令产生的目录
HADOOP_MASTER host:path where hadoop code should be rsync'd from
HADOOP_PID_DIR The pid files are stored. /tmp by default.
HADOOP_IDENT_STRING A string representing this instance of hadoop. $USER by default
HADOOP_NICENESS The scheduling priority for daemons. Defaults to 0.
第一步:判断所带的参数是否小于1,如果小于就打印使用此脚本的使用帮助
# if no args specified, show usage
if [ $# -le 1 ]; then
echo $usage
exit 1
fi
第二步:执行hadoop-config.sh设值变量,并保存启动还是停止的命令和相关参数
if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
. "$bin"/../libexec/hadoop-config.sh
else
. "$bin/hadoop-config.sh"
fi
# get arguments
startStop=$1
shift
command=$1
shift
第三步:定义回滚日志函数hadoop_rotate_log
hadoop_rotate_log ()
{
log=$1;
num=5;
if [ -n "$2" ]; then
num=$2
fi
if [ -f "$log" ]; then # rotate logs
while [ $num -gt 1 ]; do
prev=`expr $num - 1`
[ -f "$log.$prev" ] && mv "$log.$prev" "$log.$num"
num=$prev
done
mv "$log" "$log.$num";
fi
}
第四步、执行hadoop-env.sh并设值环境变量
if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
. "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi
第五步、执行start或者stop,我们分开分析
start:
启动namenode的命令,那么首先创建存放pid文件的目录,如果存放pid的文件已经存在说明已经有namenode节点已经在运行了,那么就先停止在启动。然后根据日志滚动函数生成日志文件,最后就用nice根据调度优先级启动namenode,但是最终的启动还在另一个脚本hadoop,这个脚本是启动所有节点的终极脚本,它会选择一个带有main函数的类用java启动,这样才到达真正的启动java守护进程的效果,这个脚本是启动的重点,也是我们分析hadoop源码的入口处。
(1)验证pid的文件
mkdir -p "$HADOOP_PID_DIR"
if [ -f $pid ]; then
if kill -0 `cat $pid` > /dev/null 2>&1; then
echo $command running as process `cat $pid`. Stop it first.
exit 1
fi
fi
(2)根据日志滚动函数生成日志文件
if [ "$HADOOP_MASTER" != "" ]; then
echo rsync from $HADOOP_MASTER
rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
fi
hadoop_rotate_log $log
echo starting $command, logging to $log
(3)nice根据调度优先级启动
cd "$HADOOP_PREFIX"
nohup nice -n $HADOOP_NICENESS "$HADOOP_PREFIX"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
echo $! > $pid
注意启动后会发现在/tmp目录中出现下面的文件,这些文件保存了该进程的进程号,可用于kill操作
-rw-rw-r-- 1 zhuhui zhuhui 6 3月 6 17:25 hadoop-zhuhui-jobtracker.pid
-rw-rw-r-- 1 zhuhui zhuhui 6 3月 6 17:25 hadoop-zhuhui-namenode.pid
-rw-rw-r-- 1 zhuhui zhuhui 6 3月 6 17:25 hadoop-zhuhui-secondarynamenode.pid
-rw-rw-r-- 1 zhuhui zhuhui 6 3月 6 17:25 hadoop-zhuhui-tasktracker.pid
stop
stop命令就执行简单的停止命令,执行 kill `cat $pid` 操作
if [ -f $pid ]; then
if kill -0 `cat $pid` > /dev/null 2>&1; then
echo stopping $command
kill `cat $pid`
else
echo no $command to stop
fi
else
echo no $command to stop
fi
;;
第六节:hadoop-daemons.sh
(1)内容
usage="Usage: hadoop-daemons.sh [--config confdir] [--hosts hostlistfile] [start|stop] command args..."
# if no args specified, show usage
if [ $# -le 1 ]; then
echo $usage
exit 1
fi
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
. "$bin"/../libexec/hadoop-config.sh
else
. "$bin/hadoop-config.sh"
fi
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
(2)分析
通过hadoop-daemon.sh 脚本来启动的,只是在这之前做了一些特殊处理,就是执行另一个脚本slaves.sh
第七节:slaves.sh
(1)内容
# If the slaves file is specified in the command line,
# then it takes precedence over the definition in
# hadoop-env.sh. Save it here.
HOSTLIST=$HADOOP_SLAVES
if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
. "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi
if [ "$HOSTLIST" = "" ]; then
if [ "$HADOOP_SLAVES" = "" ]; then
export HOSTLIST="${HADOOP_CONF_DIR}/slaves"
else
export HOSTLIST="${HADOOP_SLAVES}"
fi
fi
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
2>&1 | sed "s/^/$slave: /" &
if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
sleep $HADOOP_SLAVE_SLEEP
fi
done
wait
(2)分析
首先找到所有从节点的主机名称(在slaves文件中,或者配置文件中配置有),然后通过for循环依次通过ssh远程后台运行启动脚本程序,最后等待程序完成才退出此shell脚本。
因此这个脚本主要完成的功能就是在所有从节点执行启动相应节点的脚本。这个脚本执行datanode是从slaves文件中找到datanode节点,执行secondarynamenode是在master文件找到节点主机(在start-dfs.sh脚本中用-hosts master指定的,不然默认会找到slaves文件,datanode就是按默认找到的)
第八节:start-mapred.sh
(1)内容
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
(2)分析
分别启动jobtracker和tasktracker节点
第九节:hadoop
可以启动各个节点的服务,还能够执行很多命令和工具。它会根据传入的参数来决定执行什么样的功能(包括启动各个节点服务)
echo "where COMMAND is one of:"
echo " namenode -format format the DFS filesystem"
echo " secondarynamenode run the DFS secondary namenode"
echo " namenode run the DFS namenode"
echo " datanode run a DFS datanode"
echo " dfsadmin run a DFS admin client"
echo " mradmin run a Map-Reduce admin client"
echo " fsck run a DFS filesystem checking utility"
echo " fs run a generic filesystem user client"
echo " balancer run a cluster balancing utility"
echo " fetchdt fetch a delegation token from the NameNode"
echo " jobtracker run the MapReduce job Tracker node"
echo " pipes run a Pipes job"
echo " tasktracker run a MapReduce task Tracker node"
echo " historyserver run job history servers as a standalone daemon"
echo " job manipulate MapReduce jobs"
echo " queue get information regarding JobQueues"
echo " version print the version"
echo " jar <jar> run a jar file"
echo " distcp <srcurl> <desturl> copy file or directories recursively"
echo " archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive"
echo " classpath prints the class path needed to get the"
echo " Hadoop jar and the required libraries"
echo " daemonlog get/set the log level for each daemon"
echo " or"
echo " CLASSNAME run the class named CLASSNAME"
echo "Most commands print help when invoked w/o parameters."
exit 1
第一步:执行hadoop-config.sh设值环境变量
if [ -e "$bin"/../libexec/hadoop-config.sh ]; then
. "$bin"/../libexec/hadoop-config.sh
else
. "$bin"/hadoop-config.sh
fi
第二步:判断是否为cygwin环境
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac
第三步:参数校验
判断参数是否为 namenode -format、secondarynamenode、namenode、datanode、dfsadmin、mradmin、fsck 、fs、balancer、fetchdt、jobtracker、pipes、tasktracker、historyserver、job、queue、version、jar <jar> 、distcp 、archive、classpath 、daemonlog等
第四步:设值java环境变量JAVA、JAVA_HEAP_MAX、CLASSPATH
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
# CLASSPATH initially contains $HADOOP_CONF_DIR
CLASSPATH="${HADOOP_CONF_DIR}"
if [ "$HADOOP_USER_CLASSPATH_FIRST" != "" ] && [ "$HADOOP_CLASSPATH" != "" ] ; then
CLASSPATH=${CLASSPATH}:${HADOOP_CLASSPATH}
fi
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# for developers, add Hadoop classes to CLASSPATH
if [ -d "$HADOOP_HOME/build/classes" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build/classes
fi
if [ -d "$HADOOP_HOME/build/webapps" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build
fi
if [ -d "$HADOOP_HOME/build/test/classes" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build/test/classes
fi
if [ -d "$HADOOP_HOME/build/tools" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build/tools
fi
第五步:设值业务CLASS执行类以及保存的命令选择对应的启动java类
# figure out which class to run
if [ "$COMMAND" = "classpath" ] ; then
if $cygwin; then
CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi
echo $CLASSPATH
exit
elif [ "$COMMAND" = "namenode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
elif [ "$COMMAND" = "secondarynamenode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
elif [ "$COMMAND" = "datanode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
if [ "$starting_secure_dn" = "true" ]; then
HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
else
HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
fi
elif [ "$COMMAND" = "fs" ] ; then
CLASS=org.apache.hadoop.fs.FsShell
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
CLASS=org.apache.hadoop.fs.FsShell
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "mradmin" ] ; then
CLASS=org.apache.hadoop.mapred.tools.MRAdmin
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "fsck" ] ; then
CLASS=org.apache.hadoop.hdfs.tools.DFSck
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "balancer" ] ; then
CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
elif [ "$COMMAND" = "fetchdt" ] ; then
CLASS=org.apache.hadoop.hdfs.tools.DelegationTokenFetcher
elif [ "$COMMAND" = "jobtracker" ] ; then
CLASS=org.apache.hadoop.mapred.JobTracker
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
elif [ "$COMMAND" = "historyserver" ] ; then
CLASS=org.apache.hadoop.mapred.JobHistoryServer
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOB_HISTORYSERVER_OPTS"
elif [ "$COMMAND" = "tasktracker" ] ; then
CLASS=org.apache.hadoop.mapred.TaskTracker
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
elif [ "$COMMAND" = "job" ] ; then
CLASS=org.apache.hadoop.mapred.JobClient
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "queue" ] ; then
CLASS=org.apache.hadoop.mapred.JobQueueClient
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "pipes" ] ; then
CLASS=org.apache.hadoop.mapred.pipes.Submitter
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "version" ] ; then
CLASS=org.apache.hadoop.util.VersionInfo
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "jar" ] ; then
CLASS=org.apache.hadoop.util.RunJar
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "distcp" ] ; then
CLASS=org.apache.hadoop.tools.DistCp
CLASSPATH=${CLASSPATH}:${TOOL_PATH}
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "daemonlog" ] ; then
CLASS=org.apache.hadoop.log.LogLevel
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "archive" ] ; then
CLASS=org.apache.hadoop.tools.HadoopArchives
CLASSPATH=${CLASSPATH}:${TOOL_PATH}
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "sampler" ] ; then
CLASS=org.apache.hadoop.mapred.lib.InputSampler
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
else
CLASS=$COMMAND
fi
第六步、jsvc或者java执行命令
exec "$HADOOP_HOME/libexec/jsvc.${JSVC_ARCH}" -Dproc_$COMMAND -outfile "$HADOOP_LOG_DIR/jsvc.out" \
-errfile "$HADOOP_LOG_DIR/jsvc.err" \
-pidfile "$HADOOP_SECURE_DN_PID" \
-nodetach \
-user "$HADOOP_SECURE_DN_USER" \
-cp "$CLASSPATH" \
$JAVA_HEAP_MAX $HADOOP_OPTS \
echo COMMAND=$COMMAND
echo HADOOP_OPTS=$HADOOP_OPTS
echo CLASS=$CLASS
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
fi