Knowledge points
- KNN parameters (a minimal wiring sketch follows this list):
- n_neighbors: number of neighbors
- weights: 'uniform' (equal weighting) or 'distance' (weight by inverse distance)
- n_jobs: number of parallel jobs; -1 uses all available processors
- p: 1 for Manhattan distance, 2 for Euclidean distance
- metric: 'minkowski'
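A minimal sketch wiring these parameters together (iris and the values shown are illustrative, not recommendations):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# p=2 with metric='minkowski' is ordinary Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', p=2,
                           metric='minkowski', n_jobs=-1)
knn.fit(X, y)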
- x.ravel() is equivalent to x.reshape(-1): both flatten the array to 1-D, whereas x.reshape(-1, 1) produces a column vector.
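A quick numpy check of the difference:
import numpy as np

x = np.array([[1, 2], [3, 4]])
x.ravel()         # array([1, 2, 3, 4]) -- same result as x.reshape(-1)
x.reshape(-1, 1)  # shape (4, 1): a column vector, not a flat array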
- from matplotlib.colors import ListedColormap
- from sklearn.metrics import classification_report
- from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold
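A sketch of how the imports above typically combine for KNN tuning (iris and the grid values are illustrative):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 7],
                                'weights': ['uniform', 'distance']},
                    cv=StratifiedKFold(n_splits=5))
grid.fit(X_train, y_train)
print(classification_report(y_test, grid.predict(X_test)))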
- Hand-rolled label-encoder function (assuming u holds the column's unique category values; each category is mapped to its index within u):
u = df[col].unique()
def convert(x):
    return np.argwhere(u == x)[0, 0]  # position of x within the unique values
df[col] = df[col].map(convert)
- from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder,LabelEncoder
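For comparison, a sketch of the built-in encoders doing the same job (the example column is illustrative):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
# LabelEncoder works on a single 1-D column, typically the target
df['color_le'] = LabelEncoder().fit_transform(df['color'])
# OrdinalEncoder expects 2-D input and can encode several feature columns at once
df['color_oe'] = OrdinalEncoder().fit_transform(df[['color']]).ravel()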
- Extra Trees (extremely randomized trees)
- Random samples
- Random split conditions (not the best split)
As in a random forest, a random subset of candidate features is used; but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule (usage sketch after the import below).
from sklearn.ensemble import ExtraTreesClassifier
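A minimal usage sketch (iris is just a stand-in dataset):
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
et = ExtraTreesClassifier(n_estimators=100, random_state=42)
print(cross_val_score(et, X, y, cv=5).mean())  # the extra split randomness tends to reduce variance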
- HTTPS certificate errors (e.g. when a library downloads data over HTTPS):
import ssl
# fall back to an unverified HTTPS context (skips certificate checks; use with care)
ssl._create_default_https_context = ssl._create_unverified_context
- XGBoost parameters:
Reference: https://blog.csdn.net/wzmsltw/article/details/50994481
XGBoost pros and cons (a parameter sketch follows the list):
- Pros:
- Splits nodes using second-order gradient information, giving higher accuracy than GBDT
- Optimizes the greedy split-finding algorithm with a local approximate algorithm; with a suitable eps the method keeps its advantages
- Adds L1/L2 penalties to the loss function, controlling model complexity and improving robustness
- Supports parallel computation
- Further refinements such as tree shrinkage and column subsampling
- Cons:
- Requires pre-sorted feature values, which consumes a lot of memory (about 2 * #data * #features)
- Even with the local approximate algorithm, the processing granularity is still too fine
- Because the data is pre-sorted, scanning for split points causes many cache misses
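A hedged sketch of where the points above surface in the xgboost sklearn wrapper (values are illustrative, not recommendations):
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,     # tree shrinkage
    reg_alpha=0.1,         # L1 penalty
    reg_lambda=1.0,        # L2 penalty
    colsample_bytree=0.8,  # column subsampling
    tree_method='approx',  # approximate (quantile-based) split finding
    n_jobs=-1,             # parallel computation
)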
- LightGBM: "light" gradient boosting machine. Tuning reference:
https://blog.csdn.net/aliceyangxi1987/article/details/80711014?biz_id=102&utm_term=lightgbm%E8%B0%83%E5%8F%82&utm_medium=distribute.pc_search_result.none-task-blog-2allsobaiduweb~default-0-80711014&spm=1018.2118.3001.4187
Improvements targeting XGBoost's weaknesses (see the sketch after this list):
- LightGBM replaces the pre-sorted data structure with a histogram-based algorithm; histograms enable many tricks and improve the cache hit rate (mainly thanks to leaf-wise growth)
- On large datasets, training can be sped up by sampling, or by assigning weights to samples during training; LightGBM does this with the GOSS algorithm
- The histogram algorithm handles sparse data with worse time complexity than pre-sorting, since histograms ignore whether a feature value is 0; LightGBM therefore uses EFB to preprocess sparse data
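A hedged sketch of the corresponding LightGBM knobs (values are illustrative; histograms and EFB are enabled by default):
from lightgbm import LGBMRegressor

model = LGBMRegressor(
    boosting_type='goss',  # GOSS sampling (newer LightGBM versions select it via data_sample_strategy='goss')
    num_leaves=31,         # leaf-wise growth is bounded by leaf count rather than depth
    max_bin=255,           # number of histogram bins per feature
    learning_rate=0.1,
    n_estimators=300,
)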
- Comparison of XGBoost and LightGBM:
https://blog.csdn.net/weixin_38664232/article/details/88969341?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522160372408819724822545545%2522%252C%2522scm%2522%253A%252220140713.130102334…%2522%257D&request_id=160372408819724822545545&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2allfirst_rank_v2~rank_v28-1-88969341.pc_first_rank_v2_rank_v28&utm_term=xgboost%E5%92%8Clightgbm%E5%8C%BA%E5%88%AB&spm=1018.2118.3001.4187
- Heatmap (of pairwise feature correlations):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 16))
mcorr = train.corr()                     # pairwise correlations of the training features
mask = np.zeros_like(mcorr, dtype=bool)  # plain bool; np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True  # hide the redundant upper triangle
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='.2f')
plt.show()
- Filtering outliers (fit a ridge model, then flag samples whose residual exceeds one standard deviation of the target):
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=[0.0001, 0.001, 0.01, 0.1, 0.2, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 50])
cond = data_all_norm['origin'] == 'train'
X_train = data_all_norm[cond].iloc[:, :-2]
y_train = data_all_norm[cond]['target']
ridge.fit(X_train, y_train)
y_ = ridge.predict(X_train)
# flag samples whose absolute residual exceeds one std of the target
cond = abs(y_ - y_train) > y_train.std()
print(cond.sum())
plt.figure(figsize=(12, 6))
# panel 1: predicted vs. true values, outliers in red
axes = plt.subplot(1, 3, 1)
axes.scatter(y_train, y_)
axes.scatter(y_train[cond], y_[cond], c='red', s=20)
# panel 2: residuals vs. true values
axes = plt.subplot(1, 3, 2)
axes.scatter(y_train, y_train - y_)
axes.scatter(y_train[cond], (y_train - y_)[cond], c='red')
# panel 3: residual histogram, outliers in red
axes = plt.subplot(1, 3, 3)
(y_train - y_).plot.hist(bins=50, ax=axes)
(y_train - y_).loc[cond].plot.hist(bins=50, ax=axes, color='r')
# drop the flagged outlier rows
index = cond[cond].index
data_all_norm.drop(index, axis=0, inplace=True)