Gradient Boosting Classifier sparse matrix issue using pandas and scikit

I am using the following code for multiclass classification with GradientBoostingClassifier from scikit-learn. I am running into a known issue where a sparse matrix has to be converted to a dense matrix.

I applied a solution from Stack Overflow, but it does not work in my case. That solution was written for RandomForestClassifier, but AFAIK it should work for GradientBoostingClassifier too!

Also, this code works perfectly if I replace GradientBoostingClassifier with RandomForestClassifier.

The data in this case is numeric: 93 features with 8 target classes. The data can be fetched from Kaggle.

import pandas as pd
from sklearn import ensemble, feature_extraction, preprocessing

# load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
sample = pd.read_csv('submissions/sampleSubmission.csv')
labels = train.target.values
ids = train.id.values
train = train.drop('id', axis=1)
train = train.drop('target', axis=1)
train_orig = train
test = test.drop('id', axis=1)

# transform counts to TFIDF features
tfidf = feature_extraction.text.TfidfTransformer()
train = tfidf.fit_transform(train)
test = tfidf.transform(test).toarray() # Update line

# encode labels 
lbl_enc = preprocessing.LabelEncoder()
labels = lbl_enc.fit_transform(labels)

# train a gradient boosting classifier
print('starting training ... ')
clf = ensemble.GradientBoostingClassifier(n_estimators=config.estimators)
clf.fit(train, labels)

# predict on test set
print('starting prediction ... ')
preds = clf.predict_proba(test) # Error on this line even when test is dense
train_pred = clf.predict(tfidf.transform(train_orig))

Traceback:

python boosted_trees.py 
starting training ... 
Traceback (most recent call last):
  File "boosted_trees.py", line 57, in <module>
    clf.fit(train, labels)
  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/gradient_boosting.py", line 941, in fit
    X, y = check_X_y(X, y, dtype=DTYPE)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 439, in check_X_y
    ensure_min_features)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 331, in check_array
    copy, force_all_finite)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Solution:

In case anyone else runs into this: the issue is in these lines.

train = tfidf.fit_transform(train)
test = tfidf.transform(test).toarray() # Update line

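Both lines need a toarray() call. TfidfTransformer returns a sparse matrix, and as the traceback shows, GradientBoostingClassifier.fit in this scikit-learn version requires dense input.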

train = tfidf.fit_transform(train).toarray()
test = tfidf.transform(test).toarray() # Update line
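
For anyone who wants to try the fix without downloading the Kaggle data, here is a minimal sketch on synthetic count data. The array shapes, the random seed, the n_estimators value, and the variable names are all assumptions made for illustration; they are not taken from the original script.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfTransformer

# Synthetic stand-in for the count features (assumed shapes, not the Kaggle data)
rng = np.random.RandomState(0)
X_train_counts = rng.poisson(1.0, size=(120, 10))
y_train = rng.randint(0, 3, size=120)
X_test_counts = rng.poisson(1.0, size=(30, 10))

# TfidfTransformer returns a SciPy sparse matrix even for dense input
tfidf = TfidfTransformer()
X_train_sparse = tfidf.fit_transform(X_train_counts)
print(sparse.issparse(X_train_sparse))

# Convert both the training and the test features to dense arrays
X_train = X_train_sparse.toarray()
X_test = tfidf.transform(X_test_counts).toarray()

clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)   # one row of class probabilities per test sample
train_pred = clf.predict(X_train)   # reuse the dense training matrix
print(proba.shape)
```

Note that toarray() materializes the full matrix in memory, which is fine for 93 features but may not be for very wide data. The last line of the original script, clf.predict(tfidf.transform(train_orig)), also passes a sparse matrix, so it probably needs the same treatment (or can simply reuse the already dense train).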

Source: http://stackoverflow.com/questions/29498106/gradient-boosting-classifier-sparse-matrix-issue-using-pandas-and-scikit
