I have used xgboost under Python before; now I wanted to get the Java version of xgboost working on my own machine (macOS). I hit quite a few pitfalls along the way, so I am writing them down here.
1. Download the xgboost source
git clone --recursive https://github.com/dmlc/xgboost
2. Build xgboost
First check whether gcc and g++ are already installed on your machine; if not, install them with Homebrew:
brew install gcc --without-multilib
This step takes a long time; on my machine it ran for about two hours.
Then confirm the new compilers are there:
ls /usr/local/bin/*
2.1 The official docs give two ways to build xgboost on a Mac: one supports multithreading and the other does not.
To build without multithreading support:
cd xgboost
cp make/minimum.mk ./config.mk
make -j4
To build with multithreading support (this needs the gcc installed above):
cd xgboost
cp make/config.mk ./config.mk
make -j4
If gcc installed successfully, the build should go through cleanly. If instead you get an error like clang: error: unsupported option '-fopenmp',
then point the build at the Homebrew gcc by adding these lines to config.mk (my gcc is at version 7):
export CC = /usr/local/bin/gcc-7
export CXX = /usr/local/bin/g++-7
3. Install the Python xgboost package:
cd python-package; sudo python setup.py install
On my machine this succeeded and the package works. Test code:
import numpy as np
import xgboost as xgb
data = np.loadtxt('train.csv', delimiter=',',converters={14: lambda x:int(x == '?'), 15: lambda x:int(x) } )
sz = data.shape
np.random.shuffle(data)  # shuffle the data, otherwise the test split may be all zeros
train = data[:int(sz[0] * 0.7), :]
test = data[int(sz[0] * 0.7):, :]
train_X = train[:,0:14]
train_Y = train[:, 15]
print(type(train_Y))
test_X = test[:,0:14]
test_Y = test[:, 15]
xg_train = xgb.DMatrix( train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'scale_pos_weight': 1,
    'eval_metric': 'auc',
    'gamma': 0.1,
    'max_depth': 8,
    'lambda': 550,
    'subsample': 0.7,
    'colsample_bytree': 0.4,
    'min_child_weight': 3,
    'eta': 0.02,
    'seed': 27,
    'nthread': 7,
}
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
# early_stopping_rounds is an argument of train(), not a params entry
xgboost_model = xgb.train(params, xg_train, num_boost_round=3000,
                          evals=watchlist, early_stopping_rounds=100)
xgboost_model.save_model('xgb.model')
pred= xgboost_model.predict(xg_test)
print(pred)
4. Build the Java version of xgboost
cd jvm-packages
mvn package
If the build fails with an error like:
Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.8.0:check (checkstyle)
on project xgboost-jvm: Execution checkstyle of goal org.scalastyle:scalastyle-maven-plugin:0.8.0:check
failed: A required class was missing while executing org.scalastyle:scalastyle-maven-plugin:0.8.0:check: scala/xml/Node
then comment out the style-check plugin in the pom.xml under jvm-packages, like so:
<!-- <plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-checkstyle-plugin</artifactId>
<version>2.17</version>
<configuration>
<configLocation>checkstyle.xml</configLocation>
<failOnViolation>true</failOnViolation>
</configuration>
<executions>
<execution>
<id>checkstyle</id>
<phase>validate</phase>
<goals>
<goal>check</goal>
</goals>
</execution>
</executions>
</plugin> -->
5. Changing the Scala version used for the build
Change it in the properties block of the same pom.xml:
<properties>
<spark.version>2.0.1</spark.version>
<flink.suffix>_2.11</flink.suffix>
<scala.version>2.10.6</scala.version>
<scala.binary.version>2.10</scala.binary.version>
</properties>
Then package with mvn clean install. If all goes well, two jars are generated under xgboost4j: a plain xgboost4j jar and a jar-with-dependencies.
6. The two libraries xgboost4j depends on
If you use the plain jar, add these two dependencies to your project yourself:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.4</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.2</version>
</dependency>
7. Install the built jar into your local maven repository
mvn install:install-file -Dfile=xgboost4j-0.7-jar-with-dependencies.jar -DgroupId=ml.dmlc -DartifactId=xgboost4j -Dversion=0.7 -Dpackaging=jar
8. Add the xgboost4j dependency to your own maven project and run a test
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>0.7</version>
</dependency>
package com.meituan.model.xgboost;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class PredictFirstNtree {

    private static String path = "/Users/shuubiasahi/Documents/workspace/xgboost/demo/data/";
    private static String trainString = "agaricus.txt.train";
    private static String testString = "agaricus.txt.test";

    public static void main(String[] args) throws XGBoostError {
        DMatrix trainMat = new DMatrix(path + trainString);
        DMatrix testMat = new DMatrix(path + testString);

        // specify parameters
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("eta", 1.0);
        params.put("max_depth", 2);
        params.put("silent", 1);
        params.put("objective", "binary:logistic");

        // specify watchList
        HashMap<String, DMatrix> watches = new HashMap<String, DMatrix>();
        watches.put("train", trainMat);
        watches.put("test", testMat);

        // train a booster
        int round = 3;
        Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);

        // predict leaf indices using only the first 2 trees
        float[][] leafindex = booster.predictLeaf(testMat, 2);
        for (float[] leafs : leafindex) {
            System.out.println(Arrays.toString(leafs));
        }

        // predict leaf indices using all trees
        leafindex = booster.predictLeaf(testMat, 0);
        for (float[] leafs : leafindex) {
            System.out.println(Arrays.toString(leafs));
        }
    }
}
Sample output (truncated), leaf indices followed by the training eval log:
[5.0, 4.0, 5.0]
[3.0, 3.0, 3.0]
[5.0, 4.0, 5.0]
[3.0, 3.0, 3.0]
[0] test-error:0.042831 train-error:0.046522
[1] test-error:0.021726 train-error:0.022263
[2] test-error:0.006207 train-error:0.007063
If you have read all of this and still can't get it working, feel free to message me.