I am trying to load training and test data from CSV files, fit a random forest regressor in scikit-learn (sklearn), and then predict the output for the test file.
The TrainLoanData.csv file contains 5 columns; the first column is the output and the next 4 columns are the features. The TestLoanData.csv file contains 4 columns, which are the features.
When I run the code, I get the following error:
predicted_probs = ["%f" % x[1] for x in predicted_probs]
IndexError: invalid index to scalar variable.
What does this mean?
Here is my code:
import numpy, scipy, sklearn, csv_io  # csv_io from https://raw.github.com/benhamner/BioResponse/master/Benchmarks/csv_io.py
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor

def main():
    # read in the training file
    train = csv_io.read_data("TrainLoanData.csv")
    # set the training responses
    target = [x[0] for x in train]
    # set the training features
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("TestLoanData.csv")
    # random forest code
    rf = RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
    # fit the training data
    print('fitting the model')
    rf.fit(train, target)
    # run model against test data
    predicted_probs = rf.predict(realtest)
    print(predicted_probs)
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("random_forest_solution.csv", predicted_probs)

main()
Solution
The return value of RandomForestRegressor.predict is a one-dimensional array of floats:
In [3]: rf = RandomForestRegressor(n_estimators=10, min_samples_split=2, n_jobs=-1)
In [4]: rf.fit([[1,2,3],[4,5,6]],[-1,1])
Out[4]:
RandomForestRegressor(bootstrap=True, compute_importances=False,
criterion='mse', max_depth=None, max_features='auto',
min_density=0.1, min_samples_leaf=1, min_samples_split=2,
n_estimators=10, n_jobs=-1, oob_score=False,
random_state=,
verbose=0)
In [5]: rf.predict([1,2,3])
Out[5]: array([-0.6])
In [6]: rf.predict([[1,2,3],[4,5,6]])
Out[6]: array([-0.6, 0.4])
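So each x in predicted_probs is a single float, and x[1] tries to index a scalar, like (-0.6)[1], which is not possible. Formatting the value directly ("%f" % x) avoids the error.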
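As a rough sketch of the difference, the snippet below uses a hand-written NumPy array as a stand-in for the output of rf.predict(realtest). The values are made up for illustration and no model is actually trained here. It first reproduces the IndexError, then formats each prediction without indexing into it.

```python
import numpy as np

# Stand-in for the output of rf.predict(realtest): a 1-D array of floats.
# These values are made up for illustration.
predicted = np.array([-0.6, 0.4])

# Each element is a NumPy scalar, so indexing into it raises IndexError.
try:
    predicted[0][1]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)

# Format each prediction directly, without indexing into it.
formatted = ["%f" % x for x in predicted]
print(formatted)
```

In the original script, this would amount to replacing x[1] with x in the list comprehension before writing the file.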
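As a side note, the model does not return probabilities: a regressor predicts continuous values. If you need class probabilities, use RandomForestClassifier and its predict_proba method instead.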
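If class probabilities are what you are after, something along these lines might work. This is only a sketch: it assumes the first column of the training file holds binary class labels (0 or 1), which the question does not confirm, and the toy data below is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: three features per row, binary labels 0 and 1.
X_train = [[1, 2, 3], [2, 3, 4], [1, 3, 2], [4, 5, 6], [5, 6, 7], [6, 5, 4]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[1, 2, 3], [4, 5, 6]]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

# predict_proba returns one row per sample and one column per class,
# ordered as in clf.classes_, so column 1 is the probability of class 1.
proba = clf.predict_proba(X_test)
formatted = ["%f" % row[1] for row in proba]
print(formatted)
```

Here each row of the result is itself an array, so indexing it with [1] is valid, unlike the scalar values returned by the regressor.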