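1. Gradient descent
Batch gradient descent (all samples per update)
Stochastic gradient descent (a single sample per update); suitable for online learning
Mini-batch gradient descent (a subset of samples per update)
Online learning: the model is updated immediately whenever a new data point arrives. It is commonly used in applications where data flows in continuously, such as financial market prediction and real-time ad serving.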
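The three variants differ only in how many samples feed each update, which a small pure-Python sketch can make concrete. The function name `gradient_descent`, the toy data, and the learning rate are illustrative assumptions, not part of the notes above.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by minimising the mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noiseless toy data generated from w = 3, b = 1 (made up for illustration).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [3.0 * x + 1.0 for x in xs]

for name, size in [("batch", len(xs)), ("mini-batch", 4), ("stochastic", 1)]:
    w, b = gradient_descent(xs, ys, batch_size=size)
    print(name, round(w, 3), round(b, 3))
```

With `batch_size=1` each update needs only the newest sample, which is why the stochastic variant fits the online-learning setting.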
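2. LinearRegression
Ridge regression adds an L2 penalty to the cost function, lasso regression adds an L1 penalty, and ElasticNet combines L1 and L2.
elasticNetParam: the ElasticNet mixing parameter alpha, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.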
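To see how the mixing parameter blends the two penalties, here is a minimal sketch of the penalty term in the form used by Spark's linear methods, reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2). The helper name `elastic_net_penalty` and the example weights are assumptions for illustration; in PySpark the two knobs correspond to `regParam` and `elasticNetParam`.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """Return reg_param * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

weights = [1.0, -2.0, 0.5]  # made-up coefficients
ridge = elastic_net_penalty(weights, reg_param=0.1, alpha=0.0)  # pure L2 (ridge)
lasso = elastic_net_penalty(weights, reg_param=0.1, alpha=1.0)  # pure L1 (lasso)
mixed = elastic_net_penalty(weights, reg_param=0.1, alpha=0.5)  # half of each
print(ridge, lasso, mixed)
```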
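from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Keep only rows whose Duration lies within the interquartile range
# (relativeError=0.0 requests exact quantiles)
p25, p75 = df.stat.approxQuantile("Duration", [0.25, 0.75], 0.0)
df = df.filter((df["Duration"] >= p25) & (df["Duration"] <= p75))
# df.select("Duration", "Start Terminal", "End Terminal", "Bike #").describe().show()
# Assemble the input columns into a single feature vector
vector = VectorAssembler(inputCols=["Duration", "Start Terminal", "End Terminal"],
                         outputCol="features")
df = vector.transform(df)
df = df.withColumn("label", df["Bike #"]).select("features", "label")
# df.show()
(trainingData, testData) = df.randomSplit([0.6, 0.4])
lr = LinearRegression().fit(trainingData)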
def modelsummary(model):
    """Print estimates, standard errors, t-values and p-values of a fitted linear model."""
    import numpy as np
    print("Note: the last rows are the information for Intercept")
    print("##", "-------------------------------------------------")
    print("##", " Estimate | Std.Error | t Values | P-value")
    # Append the intercept so coef lines up with the summary arrays
    coef = np.append(list(model.coefficients), model.intercept)
    Summary = model.summary
    for i in range(len(Summary.pValues)):
        print("##", '{:10.6f}'.format(coef[i]),
              '{:10.6f}'.format(Summary.coefficientStandardErrors[i]),
              '{:8.3f}'.format(Summary.tValues[i]),
              '{:10.6f}'.format(Summary.pValues[i]))
    print("##", '---')
    print("##", "Mean squared error: % .6f" % Summary.meanSquaredError,
          ", RMSE: % .6f" % Summary.rootMeanSquaredError)
    print("##", "Multiple R-squared: %f" % Summary.r2,
          ", Total iterations: %i" % Summary.totalIterations)
# Predict on the held-out test set
predictions = lr.transform(testData)
predictions.select("label", "prediction").show(5)
# Collect labels and predictions to the driver and score them with scikit-learn
import sklearn.metrics
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()
r2 = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2))
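For reference, the R-squared score reported above is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it in plain Python; the helper name `r_squared` and the sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot

labels = [3.0, 5.0, 7.0, 9.0]       # made-up true values
predicted = [2.5, 5.5, 7.0, 9.5]    # made-up predictions
score = r_squared(labels, predicted)
print(score)
```

A perfect prediction gives 1.0, and a model that does worse than always predicting the mean gives a negative value.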
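3. GeneralizedLinearRegression (generalized linear regression)
An activation function (the inverse of the link function) maps the output of the linear regression so that the response can follow different distributions, such as the Bernoulli, Poisson, or gamma distribution. Each distribution has its own activation function and is trained iteratively with its own loss function, which yields generalized linear regression models for different application scenarios. The normal and Bernoulli distributions both belong to the exponential family, so linear regression and logistic regression can be viewed as special cases of the generalized linear model.
Core hyperparameters: family: str = 'gaussian', link: Optional[str] = None
The link function converts the nonlinear relationship between Y and X into a linear one: applied to the expected value of Y, it gives the linear predictor.
Supported link functions for each family:
"gaussian" -> "identity", "log", "inverse"
"binomial" -> "logit", "probit", "cloglog"
"poisson" -> "log", "identity", "sqrt"
"gamma" -> "inverse", "identity", "log"
"tweedie" -> power link function specified through "linkPower". The default link power in the tweedie family is 1 - variancePower.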
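As a rough illustration of what a link function does, the sketch below pairs a few of the links listed above with their inverses and uses the inverse to turn a linear predictor into a predicted mean. The table `LINKS`, the helper `predict_mean`, and the coefficients are hypothetical and only mirror the idea; they are not Spark's implementation.

```python
import math

# Each entry maps a link name to (link g, inverse link g^-1).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

def predict_mean(features, coefficients, intercept, link):
    """Compute the linear predictor eta, then map it to the mean via g^-1."""
    eta = intercept + sum(c * x for c, x in zip(coefficients, features))
    _, inverse_link = LINKS[link]
    return inverse_link(eta)

features = [1.0, 2.0]          # made-up feature values
coefficients = [0.5, -0.25]    # made-up coefficients, so eta = 0.0
for name in ("identity", "log", "logit"):
    print(name, predict_mean(features, coefficients, 0.0, name))
```

The same linear predictor yields an unbounded value under "identity", a positive value under "log", and a probability under "logit", which is how one linear model serves different response distributions.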
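4. DecisionTreeRegressor (decision tree regression)
Can report feature importances
# assumes model is a fitted PipelineModel whose second stage is the tree regressor
model.stages[1].featureImportances
5. RandomForestRegressor (random forest regression)
Can report feature importances
6. GBTRegressor (gradient-boosted tree regression)
Can report feature importances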
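The importances come back as a vector in the same order as the assembled feature columns, so it helps to pair them with the column names and sort. The helper `rank_importances` below is a hypothetical convenience and the numbers are made up; with a fitted tree model one would presumably pass `featureImportances.toArray()` as the second argument.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance and sort in descending order."""
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

names = ["Duration", "Start Terminal", "End Terminal"]  # VectorAssembler inputCols
values = [0.2, 0.5, 0.3]                                # made-up importances
ranked = rank_importances(names, values)
for name, value in ranked:
    print(name, value)
```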