Troubleshooting
TensorFlow is installed in the local Python, and the zip package has been uploaded to HDFS, but the job fails with:
no module named tensorflow
Most likely the job did not use the Python inside the zip package, but some other Python installed on the system.
Add the following options to the submit command to state explicitly which Python to use:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.pyspark.python=./Python/bin/python \
--conf spark.pyspark.driver.python=./Python/bin/python \
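To verify which interpreter the executors actually picked up, one hedged trick is to log `sys.executable` from inside a task (the `sc.parallelize(...).map(...)` line in the comment is illustrative, not from the original job). Locally the same check is just:

```python
import sys

# On the cluster you would log the interpreter path from inside a task, e.g.
# (illustrative): sc.parallelize([0]).map(lambda _: sys.executable).collect()
# Locally:
print(sys.executable)  # on the executors this should print ./Python/bin/python
```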
java.lang.ClassNotFoundException: org.tensorflow.hadoop.io.TFRecordFileOutputFormat
solution:
git clone https://github.com/tensorflow/ecosystem.git
cp -r ecosystem/hadoop /tmp
cd /tmp/hadoop
mvn clean package
cd target/
hadoop fs -put tensorflow-hadoop-1.10.0.jar
then add the following option to your submit command:
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.10.0.jar \
2021-04-14 15:55:10,148 ERROR (Thread-3-41283) Exception in TF background thread
AttributeError: Can't pickle local object 'start.<locals>.<lambda>'
Root cause not found yet; I suspect a bug in the multiprocessing library on macOS, similar to this issue and this discussion
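The error itself is easy to reproduce with the standard library: `pickle` cannot serialize a function-local lambda, and serializing the target callable is exactly what `multiprocessing`'s `spawn` start method (the default on macOS since Python 3.8) has to do. A minimal sketch:

```python
import pickle

def start():
    # A lambda defined inside a function is a "local object"; pickle cannot
    # serialize it by qualified name, so sending it to a spawned child fails.
    return lambda: 42

try:
    pickle.dumps(start())
except Exception as e:
    print("pickle failed:", e)  # Can't pickle local object 'start.<locals>.<lambda>'
```

Replacing the lambda with a module-level function is the usual workaround when the background-thread code is under your control.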
Failed to download resource { { hdfs://centos7:9000/user/zhangchen/python38.zip, 1618426581429, ARCHIVE, null },pending,[(container_1618449935987_0002_01_000001)],409455709874,DOWNLOADING} ENOENT: No such file or directory
solution:
add
export LIB_HDFS=$HADOOP_HOME/lib/native
export LIB_JVM=$JAVA_HOME/lib/server
to your ~/.bashrc
and source ~/.bashrc
then add --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
to your submit command.
If the steps above don't work, use hadoop fs -ls /user/zhangchen/python38.zip
to check the file size, and hadoop fs -get /user/zhangchen/python38.zip
to download it locally and inspect its contents.
What I found was an empty directory tree with no files in it at all.
Re-compressing the zip and uploading it to HDFS again solved the problem!
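The broken-archive case above can be detected without a cluster: in the zip format, member names ending in `/` are directory entries, so an archive containing only such entries is an empty tree. A small stdlib sketch (the in-memory zip is a stand-in for `python38.zip`):

```python
import io
import zipfile

def has_real_files(zf):
    # zip members whose names end in "/" are directory entries, not files
    return any(not name.endswith("/") for name in zf.namelist())

# Stand-in for the broken python38.zip: directory entries only, no files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Python/", b"")
    zf.writestr("Python/bin/", b"")

with zipfile.ZipFile(buf) as zf:
    print(has_real_files(zf))  # False -> re-create and re-upload the zip
```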
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
When I checked the node manager log around that timestamp, I found the following error:
2021-04-15 11:30:30,878 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /usr/local/share/data/hadoop/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /opt/hadoop-3.2.2/logs/userlogs : used space above threshold of 90.0% ]
reason: df -h
shows that the disk usage of the virtual machine is above 90%.
solution: migrate the Hadoop data files to a directory shared from the host
on host
mkdir ~/virtual_machines/vmdata
share vmdata with the VM
on VM
$HADOOP_HOME/sbin/stop-all.sh
mv /usr/local/share/data/hadoop /mnt/hgfs/vmdata/
sed -i 's#/usr/local/share/data#/mnt/hgfs/vmdata#g' $HADOOP_HOME/etc/hadoop/core-site.xml
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
use df -h
to check again and make sure the disk usage has dropped.
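The 90% threshold in the log is YARN's default for `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`: the node manager marks a local-dir or log-dir bad once used space crosses it. The check boils down to simple arithmetic, sketched here:

```python
def dir_is_healthy(total_bytes, free_bytes, max_util_pct=90.0):
    """Mimic YARN's per-disk utilization check (default threshold 90.0%)."""
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    return used_pct < max_util_pct

print(dir_is_healthy(100, 15))  # True: 85% used, below the threshold
print(dir_is_healthy(100, 5))   # False: 95% used, the dir is marked bad
```

Raising the threshold in yarn-site.xml is an alternative to freeing disk space, but freeing space (as done above) is the safer fix.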
exit code 137
reason: manual intervention or OOM
If the job was not killed manually, check the memory configuration and resubmit it.
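Exit code 137 is 128 + 9: the process died from SIGKILL, the signal delivered both by a manual `kill -9` and by the kernel OOM killer, which is why those are the two suspects. A quick demonstration (assumes a Unix-like system):

```python
import signal
import subprocess
import sys

# Kill a child process with SIGKILL and inspect how its death is reported.
p = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(p.returncode)          # Python reports signal deaths as -signum: -9
print(128 + signal.SIGKILL)  # shells report 128 + signum: 137
```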
Exception: Timeout while feeding partition
When this error occurred, I checked the container log and also found the following error:
2021-04-15 16:22:33.646363: E tensorflow/core/platform/hadoop/hadoop_file_system.cc:115] HadoopFileSystem load error: libjvm.so: cannot open shared object file: No such file or directory
so I just added $LIB_JVM
to --conf spark.executorEnv.LD_LIBRARY_PATH
and everything worked:
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
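TensorFlow's HDFS support loads `libjvm.so` and `libhdfs.so` at runtime through the dynamic loader, which searches the directories listed in `LD_LIBRARY_PATH`; when `$LIB_JVM` is missing from the executor environment, that load fails as in the log above. A rough sketch of the search (the `find_lib` helper is hypothetical, not a real loader API):

```python
import os
import tempfile

def find_lib(name, search_path):
    """Walk a colon-separated directory list the way the dynamic loader
    walks LD_LIBRARY_PATH, returning the first match or None."""
    for d in search_path.split(os.pathsep):
        candidate = os.path.join(d, name)
        if d and os.path.exists(candidate):
            return candidate
    return None

# Simulate an executor whose LD_LIBRARY_PATH includes $LIB_JVM.
with tempfile.TemporaryDirectory() as lib_jvm:
    open(os.path.join(lib_jvm, "libjvm.so"), "w").close()  # stand-in file
    print(find_lib("libjvm.so", "/usr/lib" + os.pathsep + lib_jvm) is not None)  # True
    print(find_lib("libjvm.so", "/definitely-not-here") is not None)             # False
```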
Another time, when I went through the log, I found that the problem was that the HDFS NameNode had entered standby mode.
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
reason: I misused the TFRecord API to read CSV files; once I switched to real TFRecord files, everything worked fine.
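The "corrupted record at 0" message follows directly from the TFRecord framing: each record is stored as a little-endian uint64 length, a masked CRC of that length, the payload, and a masked CRC of the payload. Feeding a CSV file to the TFRecord reader makes it interpret the first 8 bytes of text as a record length, which is nonsensical from the very first record:

```python
import struct

# First 8 bytes of a CSV file, decoded as a TFRecord length header.
csv_bytes = b"col1,col2\n1,2\n"
(length,) = struct.unpack("<Q", csv_bytes[:8])
print(length > 2**50)  # True: an absurd "record length" -> corrupted record at 0
```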
import pyspark
reports the error:
cloudpickle.py line 127 in _make_cell_set_template_code
TypeError: an integer is required (got type bytes)
reason: maybe the version of pyspark (and its bundled cloudpickle) shipped with Spark 2.3.3 is not compatible with my Python version 3.8.8, as described here
solution: I’m not willing to downgrade my Python version (because life is short), and since the Hadoop version at my company is so old (2.6.0), I decided to use the Hadoop-free build of Spark and run the following command:
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> $SPARK_HOME/conf/spark-env.sh
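For context on the cloudpickle failure: `types.CodeType` gained a new leading `posonlyargcount` parameter in Python 3.8, and the old cloudpickle bundled with Spark 2.3 constructs code objects with positional arguments, so every argument shifts by one and a `bytes` value lands where an `int` is expected. A quick probe for the changed field (requires Python 3.8+):

```python
# Positional-only parameters (the "/" marker) arrived in Python 3.8 together
# with the new co_posonlyargcount field on code objects; the matching extra
# CodeType constructor argument is what breaks Spark 2.3's bundled cloudpickle.
def f(x, /):
    return x

print(hasattr(f.__code__, "co_posonlyargcount"))  # True on Python 3.8+
```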
SparkException: When running with master ‘yarn’ either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
solution:
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> $SPARK_HOME/conf/spark-env.sh
Exception: No executor_id file found on this node
I added --conf spark.cores.max=${TOTAL_CORES}
to the submit command, as suggested by the author of TFoS, but the problem remained.
I checked the log again and found an earlier error:
Lost executor 1 on node3: Container killed by YARN for exceeding physical memory limits.
so I raised --conf spark.executor.memoryOverhead to 2G and the job ran fine.
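YARN kills a container once its physical memory passes `spark.executor.memory` plus `spark.executor.memoryOverhead`. The default overhead, max(384 MiB, 10% of executor memory), is often too small for TensorFlow jobs, because TensorFlow's native allocations live outside the JVM heap and count only against the overhead. The arithmetic, sketched:

```python
def container_limit_mb(executor_memory_mb, overhead_mb=None):
    """Physical memory limit YARN enforces on a Spark executor container.
    Default overhead is max(384 MiB, 10% of executor memory)."""
    if overhead_mb is None:
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb

print(container_limit_mb(4096))        # 4505: default overhead of only 409 MiB
print(container_limit_mb(4096, 2048))  # 6144: with memoryOverhead=2G
```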