A short Python example
Scikit-learn is a great way to get started with random forests. The scikit-learn API is remarkably consistent across all of its algorithms, so testing and switching between different models is very easy. Often I start with something simple and then move on to a random forest.
The best feature of the random forest implementation in scikit-learn is the n_jobs parameter. It automatically parallelizes the forest across however many cores you tell it to use. There is a great talk by scikit-learn contributor Olivier Grisel in which he discusses training a random forest on a 20-node EC2 cluster.
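To see the effect of n_jobs for yourself, a rough benchmark sketch like the following works; the synthetic dataset (make_classification) and its sizes are my own choices, not from the post, and absolute times depend entirely on your machine:

```python
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, large enough for parallelism to matter
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for n_jobs in (1, -1):  # -1 means "use every available core"
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    start = perf_counter()
    clf.fit(X, y)
    print(f"n_jobs={n_jobs}: fit took {perf_counter() - start:.2f}s")
```

On a multi-core machine the n_jobs=-1 fit should be noticeably faster, since each tree in the forest can be grown independently.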
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Randomly assign roughly 75% of the rows to the training set
df['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

train, test = df[df['is_train']], df[~df['is_train']]
features = df.columns[:4]

# n_jobs=2 grows the trees on two cores in parallel
clf = RandomForestClassifier(n_jobs=2)
# Encode the species labels as integer codes for fitting
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

# Map predicted integer codes back to species names
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
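The crosstab above is a per-class confusion matrix. If you only need a single accuracy number, a minimal self-contained sketch looks like this; train_test_split and accuracy_score are standard scikit-learn utilities not used in the original post, and the random_state values are my own:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
# A deterministic 75/25 split instead of the random-uniform mask above
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```

Because iris labels are already integers here, no pd.factorize step is needed.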
Reprinted from:
http://www.cnblogs.com/maybe2030/p/4585705.html
http://www.oschina.net/translate/random-forests-in-python?cmp
2. Regression
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Toy regression target: a log curve with Gaussian noise
x_train = np.random.uniform(1, 100, 1000).reshape(-1, 1)
y_train = np.log(x_train).ravel() + np.random.normal(0, .3, 1000)
x_test = np.random.uniform(1, 100, 1000).reshape(-1, 1)
y_test = np.log(x_test).ravel() + np.random.normal(0, .3, 1000)

def random_forest():
    clf = RandomForestRegressor(n_estimators=100, max_features=0.8,
                                oob_score=True, n_jobs=-1,
                                random_state=50, min_samples_leaf=1)
    clf.fit(x_train, y_train)  # y must be 1-D to avoid a DataConversionWarning
    pred = clf.predict(x_test)

    # Plot the noisy ground truth and the predictions in two panels
    plt.figure()
    ax = plt.subplot(211)
    ax.plot(x_test, y_test, 'b.')
    ax.legend(['real'])
    bx = plt.subplot(212)
    bx.plot(x_test, pred, 'r.')
    bx.legend(['pred'])
    plt.show()

if __name__ == '__main__':
    random_forest()
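Since the regressor above is constructed with oob_score=True, the fitted model also exposes an out-of-bag R² estimate, which gives you a validation score without holding out a test set. A minimal sketch, using the same kind of noisy log data (the seed is my own choice for reproducibility):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(50)  # assumed seed, for a reproducible sketch
x_train = rng.uniform(1, 100, 1000).reshape(-1, 1)
y_train = np.log(x_train).ravel() + rng.normal(0, .3, 1000)

clf = RandomForestRegressor(n_estimators=100, oob_score=True,
                            n_jobs=-1, random_state=50)
clf.fit(x_train, y_train)

# oob_score_ is the R^2 measured on out-of-bag samples: rows each
# tree never saw during its bootstrap draw, so they act as a free
# internal validation set.
print(f"OOB R^2: {clf.oob_score_:.3f}")
```

For this noisy log curve the OOB R² lands well above zero but below 1, reflecting the irreducible noise added to the target.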