Stata study notes: handling outliers
Methods
1. Log transformation
sysuse nlsw88, clear
histogram wage
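In the histogram, observations with an hourly wage above 40 dollars sit in the far right tail and are treated as outliers.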
gen lwage = ln(wage)
histogram lwage
After the log transformation, the distribution is much closer to normal.
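The visual impression from the two histograms can also be checked numerically. The following is a minimal sketch that compares skewness before and after the transformation; the local macro names skew_raw and skew_log are illustrative only.

```stata
sysuse nlsw88, clear
gen lwage = ln(wage)

// summarize, detail stores the sample skewness in r(skewness)
quietly summarize wage, detail
local skew_raw = r(skewness)
quietly summarize lwage, detail
local skew_log = r(skewness)

// a value closer to 0 indicates a more symmetric distribution
display "skewness of wage:  " `skew_raw'
display "skewness of lwage: " `skew_log'
```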
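2. Deletion / trimming
Criteria:
① Standard deviation: values more than 2 or 3 standard deviations away from the mean
② Percentiles: values below the 1st percentile or above the 99th percentile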
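The two criteria above can be sketched by hand as indicator variables before reaching for a dedicated command. This is only an illustration: the scalar names (wage_mean, wage_sd, wage_p1, wage_p99) and the flag variables out_sd and out_pct are hypothetical, and the 3 standard deviation threshold is one of the two choices mentioned above.

```stata
sysuse nlsw88, clear
quietly summarize wage, detail
scalar wage_mean = r(mean)
scalar wage_sd   = r(sd)
scalar wage_p1   = r(p1)
scalar wage_p99  = r(p99)

// rule 1: more than 3 standard deviations away from the mean
gen byte out_sd = abs(wage - scalar(wage_mean)) > 3*scalar(wage_sd) if !missing(wage)

// rule 2: below the 1st or above the 99th percentile
gen byte out_pct = (wage < scalar(wage_p1) | wage > scalar(wage_p99)) if !missing(wage)

// cross-tabulate to see how far the two rules agree
tabulate out_sd out_pct
```

The scalar() wrapper keeps Stata from resolving a scalar name as an abbreviated variable name.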
Command: winsor2, trim (winsor2 is a user-written command; install it with ssc install winsor2)
Description
winsor2 winsorizes or trims (if the trim option is specified) the variables in varlist at the percentiles specified by the cuts(#1 #2) option. By default, new variables are generated with the suffix "_w" or "_tr", which can be changed with the suffix() option. The replace option replaces the variables with their winsorized or trimmed versions.
sysuse nlsw88, clear
sum wage,detail
sysuse nlsw88, clear
winsor2 wage, cuts(1 99) replace trim
// without trim the values are winsorized; adding trim is what makes this trimming
count if (wage<1.930993)
//0
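To make clear what trimming does to the data, here is a hand-rolled sketch of the same idea. It assumes that the percentiles reported by summarize, detail match the cut points winsor2 uses and that values strictly outside the cut points are dropped; the variable wage_trim and the scalars cut_lo and cut_hi are hypothetical names.

```stata
sysuse nlsw88, clear
quietly summarize wage, detail
scalar cut_lo = r(p1)
scalar cut_hi = r(p99)

// trimming: values outside the cut points become missing
gen wage_trim = wage
replace wage_trim = . if wage < scalar(cut_lo) | wage > scalar(cut_hi)

// Stata treats missing as larger than any number, so exclude
// missing values explicitly when checking the upper tail
count if wage_trim > scalar(cut_hi) & !missing(wage_trim)

// number of observations removed by trimming
count if missing(wage_trim)
```

Without the !missing() condition, the upper-tail count would include every trimmed observation.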
Creating a new variable from the trimmed data, with a suffix
With the ==suffix()== option
sysuse nlsw88, clear
winsor2 wage, cuts(1 99) suffix(wage_tr) trim
// the new variable is named wagewage_tr, because the suffix is appended to the variable name
winsor2 wage, cuts(1 99) suffix(_tr) trim
// the new variable is named wage_tr
count if (wage_tr<1.930993)
//0
Without the ==suffix()== option
sysuse nlsw88, clear
winsor2 wage, cuts(1 99) trim
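// by default a new variable wage_tr is created with the _tr suffix, so the suffix(_tr) option is not needed
// the original wage variable is left unchanged, unlike with the replace option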
count if (wage<1.930993)
//23
count if (wage_tr<1.930993)
//0
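A short check of how the original and trimmed variables differ can make the counts above easier to interpret. This sketch assumes winsor2 is installed and relies on the default _tr suffix described above.

```stata
sysuse nlsw88, clear
winsor2 wage, cuts(1 99) trim

// observations that trimming turned into missing values
count if missing(wage_tr) & !missing(wage)

// wage keeps its full range; wage_tr has fewer observations
// and a narrower minimum and maximum
summarize wage wage_tr
```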
3. Winsorizing
(1) Command: winsor2
Without the replace option, a new variable wage_w is created
sysuse nlsw88, clear
// by default, winsor2 winsorizes wage at the 1st and 99th percentiles
winsor2 wage
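// creates the new variable wage_w
Equivalent to the following command (an alternative to the call above, not an additional step, since wage_w would already exist)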
winsor2 wage, cuts(1 99)
// values below the 1st percentile are set to the 1st percentile, and values above the 99th percentile are set to the 99th percentile
count if (wage<1.930993)
//23
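Winsorizing at these cut points corresponds to the following manual replacements (these overwrite wage itself, whereas the call above writes to wage_w)
replace wage=1.930993 if wage<1.930993
// missing values count as larger than any number, so exclude them from the upper replacement
replace wage=38.70926 if wage>38.70926 & !missing(wage)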
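The manual replacement can also be written without hard-coding the cut points and without overwriting wage. This is a minimal sketch, assuming the percentiles from summarize, detail are the intended cut points; wage_win, cut_lo and cut_hi are hypothetical names.

```stata
sysuse nlsw88, clear
quietly summarize wage, detail
scalar cut_lo = r(p1)
scalar cut_hi = r(p99)

// clamp wage into [cut_lo, cut_hi] in a new variable, leaving wage untouched;
// min() and max() ignore missing arguments, so without the if qualifier
// a missing wage would be set to the upper cut point
gen wage_win = max(scalar(cut_lo), min(wage, scalar(cut_hi))) if !missing(wage)

// same number of observations, but a narrower range for wage_win
summarize wage wage_win
```

Unlike trimming, this keeps every observation, which is the main practical difference between the two approaches.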
(2) Command: winsor2, replace
With the replace option, no new variable wage_w is created; wage itself is overwritten
winsor2 wage, cuts(0.5 99.5) replace
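// values below the 0.5th percentile are set to the 0.5th percentile, and values above the 99.5th percentile are set to the 99.5th percentile
Equivalent to the following command (the order of the options does not matter)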
winsor2 wage, replace cuts(0.5 99.5)