Label size and predict size do not match (XGBoost)

XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'

 

I am training an XGBClassifier on my training set.

My training features are a numpy array of shape (45001, 10338) and my training labels are a numpy array of shape (45001,). There are 1161 unique labels, so I have label-encoded them.
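The label encoding was done along these lines (a minimal sketch assuming scikit-learn's LabelEncoder; raw_labels is a hypothetical name for the original label array):

from sklearn.preprocessing import LabelEncoder

# raw_labels is a hypothetical array holding the original 45001 labels.
encoder = LabelEncoder()
train_y = encoder.fit_transform(raw_labels)  # integers 0..1160 for the 1161 classes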

The documentation clearly says that I can create a DMatrix from a numpy array, so I am passing the training features and labels in as numpy arrays directly. But I am getting the following error:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-30-3de36245534e> in <module>()
     13  scale_pos_weight=1,
     14  seed=27)
---> 15 modelfit(xgb1, train_x, train_y)

<ipython-input-27-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
      6         xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
      7         cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8             metrics='auc',early_stopping_rounds=early_stopping_rounds)
      9         alg.set_params(n_estimators=cvresult.shape[0])
     10 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
    399         for fold in cvfolds:
    400             fold.update(i, obj)
--> 401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 
    403         for key, mean, std in res:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in <listcomp>(.0)
    399         for fold in cvfolds:
    400             fold.update(i, obj)
--> 401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 
    403         for key, mean, std in res:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in eval(self, iteration, feval)
    221     def eval(self, iteration, feval):
    222         """"Evaluate the CVPack for one iteration."""
--> 223         return self.bst.eval_set(self.watchlist, iteration, feval)
    224 
    225 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
    865             _check_call(_LIB.XGBoosterEvalOneIter(self.handle, iteration,
    866                                                   dmats, evnames, len(evals),
--> 867                                                   ctypes.byref(msg)))
    868             return msg.value
    869         else:

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'

My model code is below:

def modelfit(alg, train_data_features, train_labels, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 1161
        xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(train_data_features, train_labels, eval_metric='auc')

    # Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:, 1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(train_labels, dtrain_predictions))

Where am I going wrong here?

My classifier is as follows:

xgb1 = xgb.XGBClassifier(
    learning_rate=0.1,
    n_estimators=50,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    nthread=4,
    scale_pos_weight=1,
    seed=27)

EDIT 2: After changing the evaluation metric, I get:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-9-30c62a886c2e> in <module>()
     13  scale_pos_weight=1,
     14  seed=27)
---> 15 modelfit(xgb1, train_x_trail, train_y_trail)

<ipython-input-8-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
      6         xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
      7         cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8             metrics='auc',early_stopping_rounds=early_stopping_rounds)
      9         alg.set_params(n_estimators=cvresult.shape[0])
     10 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
    398                            evaluation_result_list=None))
    399         for fold in cvfolds:
--> 400             fold.update(i, obj)
    401         res = aggcv([f.eval(i, feval) for f in cvfolds])
    402 

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in update(self, iteration, fobj)
    217     def update(self, iteration, fobj):
    218         """"Update the boosters for one iteration"""
--> 219         self.bst.update(self.dtrain, iteration, fobj)
    220 
    221     def eval(self, iteration, feval):

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
    804 
    805         if fobj is None:
--> 806             _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
    807         else:
    808             pred = self.predict(dtrain)

/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
    125     """
    126     if ret != 0:
--> 127         raise XGBoostError(_LIB.XGBGetLastError())
    128 
    129 

XGBoostError: b'[03:43:03] src/objective/multiclass_obj.cc:42: Check failed: (info.labels.size()) != (0) label set cannot be empty'

Tags: python, numpy, xgboost

asked Jul 23 '17 at 4:56 by Kathiravan Natarajan (edited Aug 4 '17 at 3:45)

========================================================================

2 Answers

Accepted answer (score 5, +50 bounty):

The original error you got is because the AUC metric was not designed for multi-class classification (see here).

You could use the scikit-learn wrapper of xgboost to overcome this issue. I modified your code with this wrapper to produce a similar function. I am not sure why you are doing a grid search, though, as you are not enumerating over parameters; instead, you are using the single set of parameters you specified in xgb1. Here is the modified code:

import xgboost as xgb
import sklearn
import numpy as np
from sklearn.model_selection import GridSearchCV

def modelfit(alg, train_data_features, train_labels, useTrainCV=True, cv_folds=5):

    if useTrainCV:
        # Wrap each parameter value in a one-element list so the fixed
        # parameter set can serve as a (single-point) grid for GridSearchCV.
        params = alg.get_xgb_params()
        xgb_param = dict([(key, [params[key]]) for key in params])

        boost = xgb.sklearn.XGBClassifier()
        cvresult = GridSearchCV(boost, xgb_param, cv=cv_folds)
        cvresult.fit(train_data_features, train_labels)
        alg = cvresult.best_estimator_

    # Fit the algorithm on the data
    alg.fit(train_data_features, train_labels)

    # Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:, 1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % sklearn.metrics.accuracy_score(train_labels, dtrain_predictions))

xgb1 = xgb.sklearn.XGBClassifier(
    learning_rate=0.1,
    n_estimators=50,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    nthread=4,
    scale_pos_weight=1,
    seed=27)


# Small synthetic data set: 200 samples, 30 features, 5 classes.
X = np.random.normal(size=(200, 30))
y = np.random.randint(0, 5, 200)

modelfit(xgb1, X, y)

The output I get is:

Model Report
Accuracy : 1

Note that I used a much smaller data set. With the size you mentioned, the algorithm may be very slow.

answered Aug 4 '17 at 15:41 by Miriam Farber

  • In tensorflow, we create batches and run them. Can I run this algorithm batch-wise, say 100 records at a time? How can I save this model and train it again? I will accept your answer – Kathiravan Natarajan Aug 5 '17 at 0:21

  • When you train a neural network in tensorflow you use batch gradient descent, so you can work in chunks. However, xgboost operates differently, so you cannot just split the training into chunks. That said, the xgboost FAQ page (xgboost.readthedocs.io/en/latest/faq.html) says this about large data sets: "XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fit into your memory (this usually means millions of instances). If you are running out of memory, checkout external memory version or distributed version of xgboost." See the external-memory sketch after this list. – Miriam Farber Aug 5 '17 at 8:45

  • Thus, based on the above quote, it seems you can try to run the code on your computer as it is. You can also pass verbose=2 to GridSearchCV so that it prints more details while running. If that does not work, you could try the distributed version (linked from the FAQ page above). You could also pass useTrainCV=False: since you have only one set of parameters, you do not really need the grid search, so you can skip that part of your code (currently the heaviest part). – Miriam Farber Aug 5 '17 at 9:00
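For reference, a minimal sketch of the external-memory mode mentioned in the comments above. It assumes the training data has been dumped to a libsvm-format file (train.libsvm is a hypothetical name); in this era of xgboost, appending '#<cachefile>' to the file name asks xgboost to stream the data through an on-disk cache instead of loading it all into memory:

import xgboost as xgb

# Hypothetical libsvm dump of the training data; the '#dtrain.cache'
# suffix tells xgboost to build an external-memory cache on disk.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')

params = {'objective': 'multi:softmax', 'num_class': 1161,
          'eval_metric': 'merror'}
bst = xgb.train(params, dtrain, num_boost_round=50)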


========================================================================

Second answer (score 2):

The error is because you are trying to use the AUC evaluation metric for multiclass classification, but AUC is only applicable to two-class problems. In the xgboost implementation, "auc" expects the prediction size to equal the label size, while your multiclass prediction size would be 45001 * 1161. Use one of the multiclass metrics, "mlogloss" or "merror", instead.
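For illustration, a minimal sketch of that change applied to the cv call from the question (it assumes xgb_param and xgtrain are set up exactly as in the question, including num_class):

cvresult = xgb.cv(xgb_param, xgtrain,
                  num_boost_round=alg.get_params()['n_estimators'],
                  nfold=cv_folds,
                  metrics='mlogloss',  # or 'merror'
                  early_stopping_rounds=early_stopping_rounds)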

P.S.: currently, xgboost would be rather slow with so many classes, as there is some inefficiency in prediction caching during training.

answered Aug 3 '17 at 2:59 by Vadim Khotilovich

  • Please check the new error above after changing the evaluation metric – Kathiravan Natarajan Aug 4 '17 at 3:44
