Personal notes (CDH 5.14; these notes supplement the reposted steps further below):
config.mk under CDH 5.14
The config.mk settings should be changed to the following:
USE_HDFS = 1
HDFS_LIB_PATH = /home/user/xgboost/xgboost-package/libhdfs/lib
HADOOP_HOME = /opt/cloudera/parcels/CDH
HADOOP_HDFS_HOME = /opt/cloudera/parcels/CDH
Environment variables
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH
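To keep these settings across shell sessions, they can also be appended to ~/.bashrc (a convenience step not in the original; paths as above):
echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> ~/.bashrc
echo 'export HADOOP_HOME=/opt/cloudera/parcels/CDH' >> ~/.bashrc
source ~/.bashrc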
Modify yarn.py
Edit xgboost-package/dmlc-core/tracker/dmlc_tracker/yarn.py,
changing line 48 to:
out = out.decode('utf-8').split('\n')[0].split()
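For context: under Python 3 the subprocess output being split here is a bytes object, so without decode('utf-8') the split raises a TypeError. The stock line is presumably the same statement minus the decode call (an assumption inferred from the fix; the exact text may differ by dmlc-core revision):
out = out.split('\n')[0].split()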
Build dmlc-yarn.jar
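The build command itself is left implicit here. dmlc-core ships a build script next to the YARN tracker sources, so something like the following should produce dmlc-yarn.jar (a sketch; the script name and exact location are assumptions based on the dmlc-core layout):
cd ${HOME}/xgboost/dmlc-core/tracker/yarn
./build.sh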
Fixing the "No FileSystem for scheme: hdfs" error
Edit xgboost-package/dmlc-core/tracker/yarn/src/main/java/org/apache/hadoop/yarn/dmlc/Client.java under the xgboost directory:
/**
* constructor
* @throws IOException
*/
private Client() throws IOException {
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()); // add this
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());// add this
conf.addResource(new Path(System.getenv("HADOOP_CONF_DIR") +"/core-site.xml"));
conf.addResource(new Path(System.getenv("HADOOP_CONF_DIR") +"/hdfs-site.xml"));
dfs = FileSystem.get(conf);
userName = UserGroupInformation.getCurrentUser().getShortUserName();
credentials = UserGroupInformation.getCurrentUser().getCredentials();
}
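Why this works: the error usually means the FileSystem implementations for the hdfs and file schemes (normally discovered via META-INF/services entries on the classpath) cannot be resolved, so registering DistributedFileSystem and LocalFileSystem explicitly bypasses that lookup. After the edit, rebuild dmlc-yarn.jar so the change takes effect.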
Run the job from the command line
cd ${HOME}/xgboost/dmlc-core/tracker/yarn/
hadoop jar dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client ../../../xgboost -DMLC_WORKER_CORES 2 -file dmlc-yarn.jar
The following content is reposted from: https://samperson1997.github.io/2018/05/16/hadoop-xgboost/
Distributed xgboost deployment
Download xgboost
[1] Clone the latest xgboost: git clone --recursive https://github.com/dmlc/xgboost
[2] Under ${HOME}, create an xgboost-package directory and copy xgboost into it (a sketch of the copy follows the mkdir):
mkdir xgboost-package
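The copy itself isn't shown in the original; assuming the clone from step [1] sits in the current directory, it would be something like:
cp -r xgboost ${HOME}/xgboost-package/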
Install the packages needed for the build
[1] Install gcc: sudo yum install gcc
[2] Since the system is a minimal install, the package for compiling C++ is also needed: yum install gcc-c++
[3] Install cmake: sudo yum install cmake
[4] Install the JDK: sudo yum install java-1.7.0-openjdk-devel
Be sure to install OpenJDK; installing Oracle's JDK may cause packages to go missing later.
Also, install the devel variant, which includes the JRE and the JVM.
[5] Install Python: CentOS ships with the Python 2.x packages out of the box.
Download and build libhdfs
[1] Download hadoop-common-cdh5-2.6.0_5.5.0: https://github.com/cloudera/hadoop-common/tree/cdh5-2.6.0_5.5.0/
[2] Unzip it: unzip hadoop-common-cdh5-2.6.0_5.5.0.zip
[3] Enter the src directory: cd hadoop-common-cdh5-2.6.0_5.5.0/hadoop-hdfs-project/hadoop-hdfs/src
[4] Run cmake (pointing both variables at your own JDK install; the blog uses JDK 8 here even though step [4] above installed OpenJDK 7): cmake -DGENERATED_JAVAH=/opt/jdk1.8.0_60 -DJAVA_HOME=/opt/jdk1.8.0_60 .
[5] Build: make
[6] Copy the built artifacts into xgboost-package; both locations are needed:
cp -r target/usr/local/lib ${HOME}/xgboost/libhdfs
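Only one of the two copy destinations is spelled out above. Given the HDFS_LIB_PATH value in the config.mk below, the second is presumably:
cp -r target/usr/local/lib ${HOME}/xgboost-package/libhdfs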
Build xgboost
[1] Enter the xgboost directory and copy the config file: cp make/config.mk config.mk
[2] Edit config.mk to enable HDFS support, adjusting $(HOME) as needed:
USE_HDFS = 1
HADOOP_HOME = /usr/local/hadoop
HDFS_LIB_PATH = $(HOME)/xgboost-package/libhdfs
[3] Build: make -j4 (this can take a while)
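If the build succeeds, there should be an xgboost executable at the repository root; the dmlc-submit command below refers to it as ../../xgboost. A quick check, assuming the layout above:
ls ${HOME}/xgboost-package/xgboost/xgboost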
Testing
[1] Fix a broken file
The current xgboost code contains an error. Edit $(HOME)/xgboost-package/xgboost/dmlc-core/tracker/dmlc_tracker/submit.py and comment out the lines that reference kubernetes (lines 11, 53, and 54).
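For reference, the edit amounts to commenting out the kubernetes import and the kubernetes dispatch branch, roughly as follows (a sketch; the exact statements depend on the dmlc-core revision):
# from . import kubernetes
# elif args.cluster == 'kubernetes':
#     kubernetes.submit(args)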
[2] Create the data directory: hadoop fs -mkdir -p /xgboost/data
[3] Upload the files:
cd $(HOME)/xgboost-package
hadoop fs -put ./xgboost/demo/data/agaricus.txt.test /xgboost/data
hadoop fs -put ./xgboost/demo/data/agaricus.txt.train /xgboost/data
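To confirm the upload landed where expected, an optional check:
hadoop fs -ls /xgboost/data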
[4] Enter the test directory: cd xgboost/dmlc-core/tracker
[5] Submit the job:
dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2 \
    ../../xgboost mushroom.aws.conf nthread=2 \
    data=hdfs://0.0.0.0:9000/xgboost/data/agaricus.txt.train \
    eval[test]=hdfs://0.0.0.0:9000/xgboost/data/agaricus.txt.test \
    model_dir=hdfs://0.0.0.0:9000/xgboost/model_new
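A note on the URIs: hdfs://0.0.0.0:9000 reflects the original blog's single-node setup. On a real CDH cluster, replace it with the namenode address from fs.defaultFS in core-site.xml, or write hdfs:///path to fall back to the configured default filesystem.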