Knowledge points
- KNN parameters (a minimal wiring sketch follows this list):
- n_neighbors: number of neighbors
- weights: 'uniform' (equal weighting) or 'distance' (weight by inverse distance)
- n_jobs: number of parallel jobs; -1 uses all available processors
- p: 1 for Manhattan distance, 2 for Euclidean distance
- metric: 'minkowski'
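A minimal sketch wiring these parameters together (iris and the values shown are illustrative, not recommendations):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# p=2 with metric='minkowski' is ordinary Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', p=2,
                           metric='minkowski', n_jobs=-1)
knn.fit(X, y)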
- x.ravel() is equivalent to x.reshape(-1): both flatten the array to 1-D, whereas x.reshape(-1, 1) produces a column vector.
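A quick numpy check of the difference:
import numpy as np

x = np.array([[1, 2], [3, 4]])
x.ravel()         # array([1, 2, 3, 4]) -- same result as x.reshape(-1)
x.reshape(-1, 1)  # shape (4, 1): a column vector, not a flat array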
- from matplotlib.colors import ListedColormap
- from sklearn.metrics import classification_report
- from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold
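A sketch of how the imports above typically combine for KNN tuning (iris and the grid values are illustrative):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [3, 5, 7],
                                'weights': ['uniform', 'distance']},
                    cv=StratifiedKFold(n_splits=5))
grid.fit(X_train, y_train)
print(classification_report(y_test, grid.predict(X_test)))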
- Hand-rolled label-encoder function (assuming u holds the column's unique category values; each category is mapped to its index within u):
u = df[col].unique()
def convert(x):
    return np.argwhere(u == x)[0, 0]  # position of x within the unique values
df[col] = df[col].map(convert)
- from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder,LabelEncoder
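For comparison, a sketch of the built-in encoders doing the same job (the example column is illustrative):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
# LabelEncoder works on a single 1-D column, typically the target
df['color_le'] = LabelEncoder().fit_transform(df['color'])
# OrdinalEncoder expects 2-D input and can encode several feature columns at once
df['color_oe'] = OrdinalEncoder().fit_transform(df[['color']]).ravel()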
- Extra Trees (extremely randomized trees)
- Random samples
- Random split conditions (not the best split)
As in a random forest, a random subset of candidate features is used; but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule (usage sketch after the import below).
from sklearn.ensemble import ExtraTreesClassifier
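A minimal usage sketch (iris is just a stand-in dataset):
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
et = ExtraTreesClassifier(n_estimators=100, random_state=42)
print(cross_val_score(et, X, y, cv=5).mean())  # the extra split randomness tends to reduce variance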
- HTTPS certificate errors (e.g. when a library downloads data over HTTPS):
import ssl
# fall back to an unverified HTTPS context (skips certificate checks; use with care)
ssl._create_default_https_context = ssl._create_unverified_context
- XGBoost parameters:
Reference: https://blog.csdn.net/wzmsltw/article/details/50994481
XGBoost pros and cons (a parameter sketch follows the list):
- Pros:
- Splits nodes using second-order gradient information, giving higher accuracy than GBDT
- Optimizes the greedy split-finding algorithm with a local approximate algorithm; with a suitable eps the method keeps its advantages
- Adds L1/L2 penalties to the loss function, controlling model complexity and improving robustness
- Supports parallel computation
- Further refinements such as tree shrinkage and column subsampling
- Cons:
- Requires pre-sorted feature values, which consumes a lot of memory (about 2 * #data * #features)
- Even with the local approximate algorithm, the processing granularity is still too fine
- Because the data is pre-sorted, scanning for split points causes many cache misses
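A hedged sketch of where the points above surface in the xgboost sklearn wrapper (values are illustrative, not recommendations):
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,     # tree shrinkage
    reg_alpha=0.1,         # L1 penalty
    reg_lambda=1.0,        # L2 penalty
    colsample_bytree=0.8,  # column subsampling
    tree_method='approx',  # approximate (quantile-based) split finding
    n_jobs=-1,             # parallel computation
)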
- LightGBM: "light" gradient boosting machine. Tuning reference:
https://blog.csdn.net/aliceyangxi1987/article/details/80711014?biz_id=102&utm_term=lightgbm%E8%B0%83%E5%8F%82&utm_medium=distribute.pc_search_result.none-task-blog-2allsobaiduweb~default-0-80711014&spm=1018.2118.3001.4187
Improvements targeting XGBoost's weaknesses (see the sketch after this list):
- LightGBM replaces the pre-sorted data structure with a histogram-based algorithm; histograms enable many tricks and improve the cache hit rate (mainly thanks to leaf-wise growth)
- On large datasets, training can be sped up by sampling, or by assigning weights to samples during training; LightGBM does this with the GOSS algorithm
- The histogram algorithm handles sparse data with worse time complexity than pre-sorting, since histograms ignore whether a feature value is 0; LightGBM therefore uses EFB to preprocess sparse data
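A hedged sketch of the corresponding LightGBM knobs (values are illustrative; histograms and EFB are enabled by default):
from lightgbm import LGBMRegressor

model = LGBMRegressor(
    boosting_type='goss',  # GOSS sampling (newer LightGBM versions select it via data_sample_strategy='goss')
    num_leaves=31,         # leaf-wise growth is bounded by leaf count rather than depth
    max_bin=255,           # number of histogram bins per feature
    learning_rate=0.1,
    n_estimators=300,
)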
- Comparison of XGBoost and LightGBM:
https://blog.csdn.net/weixin_38664232/article/details/88969341?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522160372408819724822545545%2522%252C%2522scm%2522%253A%252220140713.130102334…%2522%257D&request_id=160372408819724822545545&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2allfirst_rank_v2~rank_v28-1-88969341.pc_first_rank_v2_rank_v28&utm_term=xgboost%E5%92%8Clightgbm%E5%8C%BA%E5%88%AB&spm=1018.2118.3001.4187
- Heatmap (of pairwise feature correlations):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 16))
mcorr = train.corr()                     # pairwise correlations of the training features
mask = np.zeros_like(mcorr, dtype=bool)  # plain bool; np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True  # hide the redundant upper triangle
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='.2f')
plt.show()
- Filtering outliers (fit a ridge model, then flag samples whose residual exceeds one standard deviation of the target):
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=[0.0001, 0.001, 0.01, 0.1, 0.2, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 50])
cond = data_all_norm['origin'] == 'train'
X_train = data_all_norm[cond].iloc[:, :-2]
y_train = data_all_norm[cond]['target']
ridge.fit(X_train, y_train)
y_ = ridge.predict(X_train)
# flag samples whose absolute residual exceeds one std of the target
cond = abs(y_ - y_train) > y_train.std()
print(cond.sum())
plt.figure(figsize=(12, 6))
# panel 1: predicted vs. true values, outliers in red
axes = plt.subplot(1, 3, 1)
axes.scatter(y_train, y_)
axes.scatter(y_train[cond], y_[cond], c='red', s=20)
# panel 2: residuals vs. true values
axes = plt.subplot(1, 3, 2)
axes.scatter(y_train, y_train - y_)
axes.scatter(y_train[cond], (y_train - y_)[cond], c='red')
# panel 3: residual histogram, outliers in red
axes = plt.subplot(1, 3, 3)
(y_train - y_).plot.hist(bins=50, ax=axes)
(y_train - y_).loc[cond].plot.hist(bins=50, ax=axes, color='r')
# drop the flagged outlier rows
index = cond[cond].index
data_all_norm.drop(index, axis=0, inplace=True)