Data Analysis Notes

weixin_42663919

Last modified 2022-01-27 13:58:06

Category: Notes. Tags: data analysis, data mining, R language

First published 2020-09-30 09:47:03

Article link: https://blog.csdn.net/weixin_42663919/article/details/108879996


A record of routine data analysis workflows.

1. Data cleaning

```
one-hot: for discrete variables
```
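As a minimal sketch of one-hot encoding, the helper below turns a list of discrete values into 0/1 indicator columns. The function name `one_hot` and the colour values are hypothetical and only serve as illustration; it uses the standard library alone.

```python
def one_hot(values):
    """One-hot encode a list of discrete values.

    Returns (categories, rows): `categories` fixes the column order and
    each row holds a single 1 in the column of its own category.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical discrete feature
categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)
print(encoded)
```

The column order learned here would have to be reused when encoding new data, so that the same category always maps to the same column.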

Chi-square binning + WOE encoding: for continuous features https://zhuanlan.zhihu.com/p/146476834

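Chi-square binning + WOE encoding
"It can turn a nonlinear feature into a linear one." For example, in a risk-control setting we might use the customer's age as a feature. Risk certainly does not simply rise or simply fall with age; rather, some age band carries higher risk than the others.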
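The age example can be sketched with a small WOE computation. This assumes the bins are already chosen (the chi-square binning step is not shown) and uses one common convention, WOE = ln(bad share / good share); some references use the inverse ratio. The bin counts are made up for illustration, and every bin is assumed to contain both classes so that no logarithm of zero occurs.

```python
import math


def woe_per_bin(good_counts, bad_counts):
    """Weight of evidence for each bin: ln(bad share / good share)."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    woe = []
    for g, b in zip(good_counts, bad_counts):
        good_share = g / total_good   # fraction of all goods in this bin
        bad_share = b / total_bad     # fraction of all bads in this bin
        woe.append(math.log(bad_share / good_share))
    return woe


# Hypothetical age bins: <25, 25-40, 40-60, >60
good = [100, 400, 400, 100]
bad = [20, 30, 20, 30]
woe = woe_per_bin(good, bad)
for label, w in zip(["<25", "25-40", "40-60", ">60"], woe):
    print(label, round(w, 3))
```

With these counts the youngest and oldest bins get positive WOE and the middle bins negative WOE, so risk is not monotonic in age, yet a linear model fitted on the WOE values can still capture it.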

```
z-normalize: for continuous variables
```
```
min-max normalized: for continuous variables
```
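Both scalings can be sketched in a few lines of standard-library Python. The population standard deviation is used here as an assumption, since the notes do not say whether the population or the sample version is meant, and the sample values are made up.

```python
import statistics


def z_normalize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def min_max_normalize(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_normalize(values)
mm = min_max_normalize(values)
print(z)
print(mm)
```

Both functions divide by zero on a constant column, so such columns would need to be handled (or dropped) beforehand.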

Binarization: for continuous variables (group a continuous variable and convert it into a 0/1 categorical variable)

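z-normalize within age groups:
Use this for continuous variables when most variables are correlated with age and a variable's histogram splits into separate, discontinuous clumps.
e.g. if x1 belongs to Agegroup6: x1 -> (x1 - mean(xi, xi in Agegroup6)) / std(xi, xi in Agegroup6)
Note: after z-normalizing the continuous variables by age group, the age variable itself should be min-max normalized, and all categorical variables should be binarized.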
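The group-wise formula above can be sketched as follows. The function name, the group labels, and the values are hypothetical, and the population standard deviation is again an assumption.

```python
import statistics
from collections import defaultdict


def z_normalize_by_group(values, groups):
    """z-normalize each value using the mean and std of its own group."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    # Per-group (mean, population std)
    stats = {
        g: (statistics.fmean(vs), statistics.pstdev(vs))
        for g, vs in by_group.items()
    }
    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, groups)]


# Two clumps that differ mainly because of the age group
x = [10.0, 12.0, 14.0, 100.0, 110.0, 120.0]
age_group = [1, 1, 1, 6, 6, 6]
z = z_normalize_by_group(x, age_group)
print(z)
```

After the transformation the two clumps land on the same scale, which is the point of normalizing within the age group rather than over the whole column.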

```
Binning
```

Binning:
Equal-width, equal-frequency, chi-square, and minimum-entropy binning: https://cloud.tencent.com/developer/article/1388206
KS binning: https://blog.csdn.net/hxcaifly/article/details/84593770
Other: (see the comments) missing values and binning: https://blog.csdn.net/happy5205205/article/details/95062467; code: https://zhuanlan.zhihu.com/p/355796708

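Tree-based binning: https://blog.csdn.net/fulk6667g78o8/article/details/120318104

Chi-square binning and tree-based binning are supervised. Binning the training set yields several adjacent, contiguous intervals of the feature together with their lower and upper bounds. The bins do not overlap, so these bounds can be applied directly to bin the same feature in the test set.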
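Reusing training-set bin boundaries on the test set can be sketched as below. To keep the example short, the cut points come from unsupervised equal-frequency binning rather than from chi-square or tree-based binning; the reuse step is the same either way. All names and values are hypothetical.

```python
import bisect


def equal_frequency_edges(train_values, n_bins):
    """Interior cut points that split the training values into n_bins
    groups of roughly equal size."""
    ordered = sorted(train_values)
    n = len(ordered)
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    return sorted(set(edges))  # drop duplicate cut points


def assign_bins(values, edges):
    """Bin i covers [edges[i-1], edges[i]); values below the first edge go
    to bin 0 and values at or above the last edge go to the last bin."""
    return [bisect.bisect_right(edges, v) for v in values]


train = [18, 22, 25, 31, 38, 44, 52, 60, 67, 75, 81, 90]
test = [10, 31, 50, 99]

edges = equal_frequency_edges(train, 4)   # learned on the training set only
train_bins = assign_bins(train, edges)
test_bins = assign_bins(test, edges)      # same edges reused on the test set
print(edges, test_bins)
```

Because the outer bins are open-ended, test values outside the training range (10 and 99 here) still fall into a valid bin.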

2. Feature selection

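Select variables by missingness (balance variable relevance against sample size: when a key variable has too many missing values, you can drop samples in order to keep the variable)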
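One way to sketch this trade-off: drop columns whose missing rate exceeds a threshold, but keep a key variable regardless and drop the incomplete rows instead. The column names, the 30% threshold, and the `keep` parameter are all hypothetical; missing values are represented as `None`.

```python
def missing_rate(column):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def select_by_missing(table, max_missing, keep=()):
    """Keep columns at or below the missing-rate threshold, plus any
    column listed in `keep` regardless of its missing rate."""
    return [
        name for name, col in table.items()
        if name in keep or missing_rate(col) <= max_missing
    ]


def drop_incomplete_rows(table, columns):
    """Keep only the rows where every selected column is present."""
    n_rows = len(next(iter(table.values())))
    rows = [i for i in range(n_rows)
            if all(table[c][i] is not None for c in columns)]
    return {c: [table[c][i] for i in rows] for c in columns}


table = {
    "age": [34, 51, 29, 62, 45],
    "creatinine": [1.1, None, 0.9, None, None],    # key variable, 60% missing
    "rare_marker": [None, None, None, None, 2.0],  # 80% missing
}
selected = select_by_missing(table, max_missing=0.3, keep=("creatinine",))
complete = drop_incomplete_rows(table, selected)
print(selected)
print(complete)
```

Here the key variable survives at the cost of three of the five samples, which is exactly the balance the note asks you to weigh.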
Select variables by statistical analysis

Quoted from: Early Recognition of Burn- and Trauma-Related Acute Kidney Injury: A Pilot Comparison of Machine Learning Techniques

The Shapiro-Wilkes test and histogram analysis were used to determine normality.

Continuous normally distributed variables were compared using means (standard deviation [SD]) using the 2-sample t-test, while discrete variables were compared using the non-parametric Chi-square test. Non-parametric continuous data compared using medians (interquartile range [IQR]), when appropriate, were analyzed using the Mann-Whitney U test. Categorical variables were represented by frequency (%).

Multivariate logistic regression was used to determine predictors of AKI with age and burn size serving as covariates. Repeated measures analysis of variance was used for time series data.
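Definitions of the F-test, the parameter t-tests, and R^2 in linear regression (note that logistic regression imposes no distributional requirement on the independent or dependent variables, whereas linear regression carries a number of distributional assumptions):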
https://zhuanlan.zhihu.com/p/48541799?ivk_sa=1024320u
https://www.cnblogs.com/wqbin/p/11109650.html
https://zhuanlan.zhihu.com/p/176688072
https://blog.csdn.net/Noob_daniel/article/details/76087829
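For quick reference, the standard textbook definitions can be written as follows for ordinary least squares with an intercept, $n$ observations, and $p$ predictors, under the usual normal-error assumptions. The symbols are a notational choice made here and are not taken from the linked articles.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad
\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
R^2 &= \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \\
F &= \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p-1)} \sim F(p,\, n-p-1)
  \quad \text{under } H_0: \beta_1 = \dots = \beta_p = 0, \\
t_j &= \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t(n-p-1)
  \quad \text{under } H_0: \beta_j = 0.
\end{aligned}
```

The F-test asks whether the predictors jointly explain anything, while each t-test asks whether a single coefficient differs from zero given the others.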

A p-value<0.05 was considered statistically significant with receiver operator characteristic (ROC) analysis also performed to compare AKI biomarker performance.

Select variables with machine learning methods

Quoted from: Using data mining techniques for multi-diseases prediction modeling of hypertension and hyperlipidemia by common risk factors

Stage I first selects the risk factors of hypertension and hyperlipidemia using six data mining approaches: logistic regression analysis, C5.0 decision tree, Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), exhaustive CHAID, and discriminant analysis

Select variables by correlation coefficient
Select variables by the IV (WOE) indicator
https://blog.csdn.net/shenxiaoming77/article/details/78771698
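IV-based screening can be sketched as follows, assuming each feature has already been binned and every bin contains both classes. IV sums (bad share - good share) * ln(bad share / good share) over the bins. The feature names, the counts, and the 0.02 cut-off are hypothetical values chosen for illustration.

```python
import math


def information_value(good_counts, bad_counts):
    """IV = sum over bins of (bad share - good share) * ln(bad share / good share)."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        good_share = g / total_good
        bad_share = b / total_bad
        iv += (bad_share - good_share) * math.log(bad_share / good_share)
    return iv


# Hypothetical per-bin (good, bad) counts for two binned features
features = {
    "age": ([100, 400, 400, 100], [20, 30, 20, 30]),
    "noise": ([250, 250, 250, 250], [25, 25, 25, 25]),
}
iv = {name: information_value(g, b) for name, (g, b) in features.items()}

threshold = 0.02  # hypothetical cut-off, tune for the task at hand
selected = [name for name, value in sorted(iv.items(), key=lambda kv: -kv[1])
            if value >= threshold]
print(iv)
print(selected)
```

A feature whose bins all have the same bad rate gets an IV of zero and is filtered out, while a feature whose bins separate good from bad gets a larger IV.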
