This article is based on the scikit-learn official documentation, linked below:
https://scikit-learn.org/stable/modules/linear_model.html#lasso
As we know, Ridge is linear regression with l2 regularization, and Lasso is linear regression with l1 regularization. Both therefore have one extra hyperparameter to tune compared with plain linear regression: alpha. This is where RidgeCV and LassoCV come in — the model is only properly specified once a reasonable alpha has been found.
So when using these two models, I recommend preferring the CV variants over plain Lasso and Ridge.
1.RidgeCV:
from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(reg.alpha_)
??RidgeCV
# partial output
Init signature: RidgeCV(alphas=(0.1, 1.0, 10.0), fit_intercept=True, normalize=False, scoring=None, cv=None, gcv_mode=None, store_cv_values=False)
Source:
class RidgeCV(_BaseRidgeCV, RegressorMixin):
"""Ridge regression with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of
efficient Leave-One-Out cross-validation.
Read more in the :ref:`User Guide <ridge_regression>`.
Parameters
----------
alphas : numpy array of shape [n_alphas]
Array of alpha values to try.
Regularization strength; must be a positive float. Regularization
improves the conditioning of the problem and reduces the variance of
the estimates. Larger values specify stronger regularization.
Alpha corresponds to ``C^-1`` in other linear models such as
LogisticRegression or LinearSVC.
fit_intercept : boolean
Whether to calculate the intercept for this model. If set
to false, no intercept will be used in calculations
(e.g. data is expected to be already centered).
normalize : boolean, optional, default False
This parameter is ignored when ``fit_intercept`` is set to False.
If True, the regressors X will be normalized before regression by
subtracting the mean and dividing by the l2-norm.
If you wish to standardize, please use
:class:`sklearn.preprocessing.StandardScaler` before calling ``fit``
on an estimator with ``normalize=False``.
scoring : string, callable or None, optional, default: None
A string (see model evaluation documentation) or
a scorer callable object / function with signature
``scorer(estimator, X, y)``.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the efficient Leave-One-Out cross-validation
- integer, to specify the number of folds.
- An object to be used as a cross-validation generator.
- An iterable yielding train/test splits.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`sklearn.model_selection.StratifiedKFold` is used, else,
:class:`sklearn.model_selection.KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.
From the docstring above, the default cv=None performs Leave-One-Out cross-validation (via the efficient GCV formulation). If you prefer a different splitting strategy, you can pass your own cv, for example:
from sklearn.model_selection import KFold, ShuffleSplit
from sklearn.linear_model import RidgeCV
kfold = KFold(n_splits=3)
shuffle_split = ShuffleSplit(test_size=.5, n_splits=10)
ridge1 = RidgeCV(alphas=[.1, 1, 10, 100], cv=kfold)
ridge2 = RidgeCV(alphas=[.1, 1, 10, 100], cv=shuffle_split)
On the relationship between alpha and the C of LinearSVC: it follows directly from the two objectives. Alpha weights the regularization term while C weights the loss term, so alpha corresponds to ``C^-1``, as the docstring above notes.
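A minimal sketch of that correspondence, written with the scikit-learn Ridge objective and a LinearSVC-style objective:

```latex
% Ridge, as minimized by scikit-learn:
\min_w \ \|Xw - y\|_2^2 + \alpha \|w\|_2^2

% LinearSVC-style objective, with C weighting the data-fit (hinge loss) term:
\min_w \ \frac{1}{2}\|w\|_2^2 + C \sum_i \max\bigl(0,\, 1 - y_i w^\top x_i\bigr)

% Dividing the second objective by C leaves the loss with weight 1 and the
% regularizer with weight 1/(2C): large C = weak regularization, so \alpha \sim C^{-1}.
```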
2.LassoCV:
2.1 Basic Lasso
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
reg.predict([[1, 1]])
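One property worth a quick sketch here: unlike Ridge, the l1 penalty drives some coefficients exactly to zero, which is what makes the coefficient plot in section 2.3 interesting. A minimal illustration (the dataset settings below are arbitrary, chosen only to include uninformative features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# A regression problem where only 5 of the 20 features are informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=4, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeroes out uninformative features; Ridge only shrinks them toward zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```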
2.2 LassoCV
LassoCV supports much the same usage as RidgeCV; see the RidgeCV section above.
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
X, y = make_regression(noise=4, random_state=0)
reg = LassoCV(cv=5, random_state=0).fit(X, y)
reg.score(X, y)
reg.predict(X[:1,])
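Beyond score and predict, the fitted LassoCV exposes the chosen alpha and the cross-validation error path, which is useful for checking how the search behaved. A short sketch on the same data as above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(noise=4, random_state=0)
reg = LassoCV(cv=5, random_state=0).fit(X, y)

print("best alpha:", reg.alpha_)            # the alpha chosen by 5-fold CV
print("alphas tried:", reg.alphas_.shape)   # 100 candidates by default (n_alphas=100)
print("MSE path:", reg.mse_path_.shape)     # shape (n_alphas, n_folds) = (100, 5)
```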
2.3 Visualizing the Lasso coefficient weights (the same approach also works for Ridge).
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV

X, y = make_regression(noise=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.15)
scaler2 = StandardScaler()
X_train = scaler2.fit_transform(X_train)
X_test = scaler2.transform(X_test)
print("Data after StandardScaler:", X_test)
linreg = LassoCV()
linreg.fit(X_train, y_train)
print("Best alpha:", linreg.alpha_)
print("----------------------------")
print("Intercept:", linreg.intercept_)
print("Coefficients:", linreg.coef_)
y_pred4 = linreg.predict(X_test)
# X_train is a plain ndarray after scaling, so build feature names for the index
# (if X were a DataFrame you could use X.columns instead)
feature_names = ["f%d" % i for i in range(X_train.shape[1])]
coef = pd.Series(linreg.coef_, index=feature_names)
# take the 10 smallest and 10 largest coefficients; .sort_values() sorts the Series
imp_coef = pd.concat([coef.sort_values().head(10), coef.sort_values().tail(10)])
#matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Coefficients in the Lasso Model")
plt.show()
About this line in the code above:
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
Note that:
plt.rcParams['figure.figsize'] = (8.0, 4.0)      # set the figure size
plt.rcParams['image.interpolation'] = 'nearest'  # set the interpolation style
plt.rcParams['image.cmap'] = 'gray'              # set the colormap
As well as:
#figsize(12.5, 4)                   # set figsize
plt.rcParams['savefig.dpi'] = 300   # pixel density of saved figures
plt.rcParams['figure.dpi'] = 300    # display resolution
# with the default figsize [6.0, 4.0] and dpi=100, the image is 600*400 pixels
# with dpi=200, the image is 1200*800 pixels
# with dpi=300, the image is 1800*1200 pixels
# changing figsize changes the proportions without changing the resolution
The notes on sizes above are based on:
https://blog.csdn.net/NockinOnHeavensDoor/article/details/80565764
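The figsize/dpi arithmetic above can be checked directly: pixel size = figsize (inches) * dpi. A small sketch, using the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6.0, 4.0), dpi=100)
w, h = fig.get_size_inches()
print(w * fig.dpi, h * fig.dpi)  # 600.0 400.0

# The same 6x4-inch figure at dpi=300 is rendered at 1800x1200 pixels.
fig2 = plt.figure(figsize=(6.0, 4.0), dpi=300)
print(fig2.get_size_inches() * fig2.dpi)  # [1800. 1200.]
```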
About this line in the code:
imp_coef.plot(kind = "barh")
see:
https://blog.csdn.net/hustqb/article/details/54410670#fnref:1
https://blog.csdn.net/jinlong_xu/article/details/70175107
The official DataFrame.plot() signature:
DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False,
sharex=None, sharey=False, layout=None,figsize=None,
use_index=True, title=None, grid=None, legend=True,
style=None, logx=False, logy=False, loglog=False,
xticks=None, yticks=None, xlim=None, ylim=None, rot=None,
xerr=None,secondary_y=False, sort_columns=False, **kwds)
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : kernel density estimate plot (adds a density curve on top of a histogram)
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot (requires passing column indexes for x and y)
‘hexbin’ : hexbin plot (hexagonal binning, useful for dense scatter data)
From:
https://blog.csdn.net/brucewong0516/article/details/80524442