Problem:
- After deploying Hadoop at the company, running the cluster stop script fails to kill the datanode and nodemanager processes.
Surface cause:
- A custom pid directory was set via HADOOP_PID_DIR in hadoop-env.sh.
Official documentation:
HADOOP_PID_DIR - The directory where the daemons' process id files are stored. In most cases, you should specify the HADOOP_PID_DIR and HADOOP_LOG_DIR directories such that they can only be written to by the users that are going to run the hadoop daemons. Otherwise there is the potential for a symlink attack.
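The override in hadoop-env.sh looked like this (the directory value matches the configuration shown later; since $HADOOP_HOME sits on NFS-shared storage, every node ends up using the same pid directory):

```shell
# hadoop-env.sh — the custom pid location that caused the problem.
# $HADOOP_HOME is on NFS here, so all nodes share this directory.
export HADOOP_PID_DIR=${HADOOP_HOME}/pids
```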
1. First, look at the relevant part of the start-dfs.sh startup script:
# namenodes
NAMENODES=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -namenodes 2>/dev/null)
if [[ -z "${NAMENODES}" ]]; then
NAMENODES=$(hostname)
fi
echo "Starting namenodes on [${NAMENODES}]"
hadoop_uservar_su hdfs namenode "${HADOOP_HDFS_HOME}/bin/hdfs" \
--workers \
--config "${HADOOP_CONF_DIR}" \
--hostnames "${NAMENODES}" \
--daemon start \
namenode ${nameStartOpt}
HADOOP_JUMBO_RETCOUNTER=$?
#---------------------------------------------------------
# datanodes (using default workers file)
echo "Starting datanodes"
hadoop_uservar_su hdfs datanode "${HADOOP_HDFS_HOME}/bin/hdfs" \
--workers \
--config "${HADOOP_CONF_DIR}" \
--daemon start \
datanode ${dataStartOpt}
(( HADOOP_JUMBO_RETCOUNTER=HADOOP_JUMBO_RETCOUNTER + $? ))
As shown, the processes are started according to the configuration under $HADOOP_CONF_DIR, where
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
2. Now look at the stop-dfs.sh script:
# namenodes
NAMENODES=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -namenodes 2>/dev/null)
if [[ -z "${NAMENODES}" ]]; then
NAMENODES=$(hostname)
fi
echo "Stopping namenodes on [${NAMENODES}]"
hadoop_uservar_su hdfs namenode "${HADOOP_HDFS_HOME}/bin/hdfs" \
--workers \
--config "${HADOOP_CONF_DIR}" \
--hostnames "${NAMENODES}" \
--daemon stop \
namenode
#---------------------------------------------------------
# datanodes (using default workers file)
echo "Stopping datanodes"
hadoop_uservar_su hdfs datanode "${HADOOP_HDFS_HOME}/bin/hdfs" \
--workers \
--config "${HADOOP_CONF_DIR}" \
--daemon stop \
datanode
As with the startup script, the stop script also shuts the processes down based on parameters from the configuration files.
The parameter most relevant to stopping the daemons is the pid directory:
HADOOP_PID_DIR=$HADOOP_HOME/pids
After startup, Hadoop writes each daemon's pid into this configured path:
$ ll pids
total 12
-rw-r--r-- 1 dtcot021 dtco 6 Apr 20 09:15 hadoop-dtcot021-datanode.pid
-rw-r--r-- 1 dtcot021 dtco 7 Apr 20 09:15 hadoop-dtcot021-namenode.pid
-rw-r--r-- 1 dtcot021 dtco 6 Apr 20 09:15 hadoop-dtcot021-secondarynamenode.pid
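A minimal sketch of what the stop path does with these files (an assumption: simplified from hadoop_stop_daemon in hadoop-functions.sh, not the actual code; a background sleep stands in for the datanode JVM):

```shell
#!/usr/bin/env bash
# Sketch of the stop path: resolve the daemon's pid file, then signal it.

pid_dir=$(mktemp -d)                                  # stand-in for HADOOP_PID_DIR
pid_file="${pid_dir}/hadoop-$(id -un)-datanode.pid"

sleep 300 &                                           # stand-in for the datanode JVM
daemon_pid=$!
echo "${daemon_pid}" > "${pid_file}"                  # what the start path records

# --- stop path: read the pid file, then signal the process ---
if [[ -f "${pid_file}" ]]; then
  pid=$(cat "${pid_file}")
  kill "${pid}" 2>/dev/null || true                   # polite SIGTERM first
  wait "${pid}" 2>/dev/null || true                   # demo only: reap our own child
  kill -0 "${pid}" 2>/dev/null && kill -9 "${pid}" || true  # escalate if still alive
  rm -f "${pid_file}"
  echo "stopped ${pid}"
else
  echo "ERROR: ${pid_file} missing; daemon left running"
fi
```

If the pid file is missing or holds a stale/foreign pid, this logic has nothing valid to signal and the daemon keeps running — which is exactly the failure observed here.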
Comparing the pid directories on the two machines shows that only the master host has a datanode pid file.
Root cause:
- The company's servers use NFS-based shared storage, so both machines effectively write their pid files into the same directory. The later write overwrites the earlier one, so when the stop script runs, the pid lookup fails (or returns the wrong pid) and neither datanode can be shut down.
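The collision can be reproduced without Hadoop: as the pid listing above shows, the filename embeds the user name (hadoop-&lt;user&gt;-datanode.pid), not the host name, so two nodes running the daemons as the same user over a shared directory write the same file. A small sketch:

```shell
#!/usr/bin/env bash
# Reproduces the overwrite: same user, same shared pid directory, same filename.

shared=$(mktemp -d)                                   # stand-in for the NFS-mounted dir
pid_file="${shared}/hadoop-$(id -un)-datanode.pid"    # name contains user, not host

sleep 300 & pid_a=$!                                  # "datanode on host A"
echo "${pid_a}" > "${pid_file}"                       # host A records its pid

sleep 300 & pid_b=$!                                  # "datanode on host B"
echo "${pid_b}" > "${pid_file}"                       # ...and silently overwrites it

cat "${pid_file}"                                     # host B's pid; host A's is lost
```

After the overwrite, the stop script on host A reads host B's pid; on host A that pid either does not exist or belongs to an unrelated process, so the local datanode is never signalled.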
Solution:
- Remove the custom pid path setting and fall back to the default location, or
- Configure HADOOP_PID_DIR to a node-local (non-shared) path.
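For example, in hadoop-env.sh (the /var paths below are illustrative assumptions; any node-local disk works):

```shell
# hadoop-env.sh — point pid (and ideally log) files at node-local storage
# instead of the NFS-shared $HADOOP_HOME. Paths are illustrative.
export HADOOP_PID_DIR=/var/run/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
```

Alternatively, simply delete the HADOOP_PID_DIR override; the daemons then fall back to the default pid location (/tmp), which is local to each node.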