Predicting Bike-Sharing Rental Counts with DecisionTree Regression in PySpark

Article link: https://blog.csdn.net/eylier/article/details/105519574

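Continued from the previous post: Predicting Bike-Sharing Rental Counts with LinearRegressionWithSGD Regression in PySpark

Step 1: Download the data. Same as in the previous post; omitted here.

Step 2: Load the data. Same as in the previous post; omitted here.

Step 3: Create the feature vectors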

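import numpy as np
from pyspark.mllib.regression import LabeledPoint

# records and extract_label are not defined here; they carry over from the previous post.
def extract_features_dt(record):
    # Use the raw values of fields 2..13 as floats.
    return np.array([float(record[i]) for i in range(2, 14)])

data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
print("Decision Tree feature vector: " + str(first_point_dt.features))
print("Decision Tree feature vector length: " + str(len(first_point_dt.features)))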
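To make the indexing concrete, here is a minimal plain-Python sketch of the same slicing, without Spark. The sample row is an assumption: it follows the hour.csv column layout of the UCI Bike Sharing dataset, which this series appears to use. The label helper is a hypothetical stand-in for extract_label from the previous post.

```python
# Hypothetical sample row, assuming this column layout:
# instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,
# weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
sample_record = "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16".split(",")


def extract_features_plain(record):
    # Same slice as extract_features_dt: fields 2..13 converted to float.
    return [float(record[i]) for i in range(2, 14)]


def extract_label_plain(record):
    # Assumption: the label is the last field (the rental count).
    return float(record[-1])


features = extract_features_plain(sample_record)
label = extract_label_plain(sample_record)
print(len(features))
print(features)
print(label)
```

Under that assumed layout, the slice skips the row id and the date at the front and the three count columns at the end, leaving 12 features.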

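Step 4: Apply a log transform to the target variable

# Log-transform the target values (np.log requires every label to be strictly positive).
# Note: the split in Step 5 uses data_dt, not data_dt_log.
data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))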
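A model trained on log-transformed labels predicts on the log scale, so its predictions need to be exponentiated before they are compared with raw rental counts. The round trip can be sketched in plain Python as follows; the label values are made up for illustration.

```python
import math

# Hypothetical rental counts, all strictly positive.
labels = [16.0, 40.0, 32.0]

# Forward transform, as applied to each LabeledPoint label.
log_labels = [math.log(y) for y in labels]

# Inverse transform: exponentiate to return to the original scale.
restored = [math.exp(z) for z in log_labels]

print(log_labels)
print(restored)
```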

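Step 5: Split the data into training and test sets

# Key every point by its row index, then split into training and test sets
data_with_idx_dt = data_dt.zipWithIndex().map(lambda point_index: (point_index[1], point_index[0]))
test_dt = data_with_idx_dt.sample(False, 0.2, 42)   # roughly 20%, sampled without replacement, seed 42
train_dt = data_with_idx_dt.subtractByKey(test_dt)  # every point whose index is not in the test sample
train_data_dt = train_dt.map(lambda index_point: index_point[1])
test_data_dt = test_dt.map(lambda index_point: index_point[1])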
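The key-based split can be illustrated on a plain Python list instead of an RDD. This is only a sketch of the logic: split_by_index is a hypothetical helper, and Python's random module will not pick the same rows as Spark's sampler for the same seed.

```python
import random


def split_by_index(points, fraction, seed):
    """Hypothetical local analogue of zipWithIndex + sample + subtractByKey."""
    rng = random.Random(seed)
    indexed = list(enumerate(points))  # (index, point) pairs
    # Keep each pair for the test set with probability `fraction`.
    test = [(i, p) for i, p in indexed if rng.random() < fraction]
    test_keys = {i for i, _ in test}
    # The training set is everything whose key is not in the test set.
    train = [(i, p) for i, p in indexed if i not in test_keys]
    return [p for _, p in train], [p for _, p in test]


points = list(range(100))  # stand-in for LabeledPoint objects
train_points, test_points = split_by_index(points, 0.2, 42)
print(len(train_points), len(test_points))
```

Because each row is sampled independently, the test set holds about 20% of the rows rather than exactly 20%, and removing rows by key guarantees the two sets do not overlap.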

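Step 6: Define the model evaluation metric

from pyspark.mllib.tree import DecisionTree

# squared_log_error is not defined in this post; it carries over from the previous one.
def evaluate_dt(train, test, maxDepth, maxBins):
    # The empty dict is categoricalFeaturesInfo: every feature is treated as continuous.
    model = DecisionTree.trainRegressor(train, {}, impurity='variance', maxDepth=maxDepth, maxBins=maxBins)
    preds = model.predict(test.map(lambda p: p.features))
    actual = test.map(lambda p: p.label)
    tp = actual.zip(preds)
    # Root mean squared log error (RMSLE) over the (actual, predicted) pairs
    rmsle = np.sqrt(tp.map(lambda ap: squared_log_error(ap[0], ap[1])).mean())
    return rmsle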
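Since squared_log_error is not shown in this post, the sketch below assumes the usual definition, (log(pred + 1) - log(actual + 1))^2, and computes RMSLE over a plain list of (actual, predicted) pairs. The helper rmsle_local and the sample pairs are hypothetical.

```python
import math


def squared_log_error(actual, pred):
    # Assumed definition; the +1 keeps the logarithm defined when a value is 0.
    return (math.log(pred + 1) - math.log(actual + 1)) ** 2


def rmsle_local(pairs):
    # Root of the mean squared log error over (actual, predicted) pairs.
    pairs = list(pairs)
    return math.sqrt(sum(squared_log_error(a, p) for a, p in pairs) / len(pairs))


perfect = rmsle_local([(16.0, 16.0), (40.0, 40.0)])
one_off = rmsle_local([(0.0, math.e - 1)])  # log(e) - log(1) = 1
print(perfect, one_off)
```

RMSLE penalizes relative rather than absolute errors, which suits count targets that span a wide range.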

Step 7: Tune the parameters. The tunable parameters include maxDepth and maxBins.

# Tune the tree depth (maxDepth), keeping maxBins fixed at 32
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [evaluate_dt(train_data_dt, test_data_dt, param, 32) for param in params]
print(params)
print(metrics)
plt.plot(params, metrics)
fig = plt.gcf()