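Outlier detection: outlier detection is the first step in data mining work, because outliers distort experimental results. An outlier (also called an anomalous point) is an individual sample value that deviates markedly from the rest of the observations. Two common detection methods are the 3σ rule and the box plot. The 3σ rule applies only to data that follow a normal distribution. Under this rule, an outlier is an observation whose deviation from the mean exceeds three standard deviations. Under the normality assumption, P(|x−μ|>3σ) ≤ 0.003: a value more than 3σ away from the mean occurs with probability below 0.003, which is a small-probability event, so such a value can be treated as an outlier.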
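The 3σ rule above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available and that the data are roughly normal; the function name `three_sigma_outliers` and the sample play counts are made up for the example.

```python
import numpy as np

def three_sigma_outliers(values):
    """Return a boolean mask marking values more than 3 standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation (ddof=0)
    return np.abs(values - mu) > 3 * sigma

# Hypothetical daily play counts: 29 ordinary days and one spike.
plays = np.array([100.0] * 29 + [1000.0])
mask = three_sigma_outliers(plays)
print(plays[mask])  # only the spike is flagged
```

Note that the mean and standard deviation are themselves pulled toward extreme values, which is one reason the rule can miss outliers in small samples.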
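The 3σ rule places restrictions on the distribution of the data, whereas the box plot does not: it simply shows the data as they are actually distributed. Its outlier identification is fairly objective, because the criterion is built on the quartiles and the interquartile range. Up to 25% of the data can be moved arbitrarily far away without disturbing this criterion, so the method is more robust and is therefore more widely favored.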
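Box-plot criterion for outliers: an outlier is a value greater than QU + 1.5·IQR or less than QL − 1.5·IQR. QU is the upper quartile: one quarter of all observations are larger than it. QL is the lower quartile: one quarter of all observations are smaller than it. IQR is the interquartile range, the difference between QU and QL, which spans half of the observations.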
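The two fences in this criterion can be computed directly. The sketch below assumes pandas and its default (linear-interpolation) quantile definition; the helper name `iqr_bounds` and the sample values are invented for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: QL - k*IQR and QU + k*IQR."""
    ql = series.quantile(0.25)  # lower quartile QL
    qu = series.quantile(0.75)  # upper quartile QU
    iqr = qu - ql               # interquartile range
    return ql - k * iqr, qu + k * iqr

# Hypothetical play counts with one obvious spike.
plays = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
lower, upper = iqr_bounds(plays)
outliers = plays[(plays < lower) | (plays > upper)]
print(lower, upper)
print(outliers.tolist())
```

Different quantile definitions give slightly different fences on small samples, so values right at the boundary may be classified differently by different tools.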
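Box-plot outlier detection in practice:
We run outlier detection on a dataset of play counts for 10 singers over the past 6 months. Each column holds one singer's play counts over the 6 months (10 columns in total), and each row holds the play counts for one day (183 days in total).
Music play-count data.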
<pre><code># -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt  # plotting library

number = '../data/all_musicers.xlsx'  # path to the play-count data: the data folder one level above the script
data = pd.read_excel(number)
data1 = data.iloc[:, 0:10].copy()  # 183 days of play counts for 10 singers
# data2 = data.iloc[:, 10:20]
# data3 = data.iloc[:, 20:30]
# data4 = data.iloc[:, 30:40]
# data5 = data.iloc[:, 40:50]

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
plt.figure(1, figsize=(13, 26))  # set the figure size
# plt.figure()  # create a figure with the default size

# Draw the box plot with the DataFrame method. return_type='dict' is needed on
# recent pandas versions so that p['fliers'] below is available.
# The plot already shows the outliers at this point; the rest only labels their values.
p = data1.boxplot(return_type='dict')

# for i in range(0, 4):
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier markers; [0] is the 1st singer, [i] the (i+1)-th
y = p['fliers'][0].get_ydata()
y.sort()  # sort in ascending order
for i in range(len(x)):
    if i > 0:
        # shift the label left when two outliers are close together so the labels do not overlap
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i-1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))
plt.show()  # display the box plot
</code></pre>
To annotate the outlier values of all singers on the same figure, you can do this:
x0 = p['fliers'][0].get_xdata()  # 'fliers' is the key for the outlier markers
y0 = p['fliers'][0].get_ydata()
y0.sort()  # sort in ascending order
for i in range(len(x0)):
    if i > 0:
        plt.annotate(y0[i], xy=(x0[i], y0[i]), xytext=(x0[i] + 0.05 - 0.8 / (y0[i] - y0[i-1]), y0[i]))
    else:
        plt.annotate(y0[i], xy=(x0[i], y0[i]), xytext=(x0[i] + 0.08, y0[i]))
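Replacing x0 and y0 in the code above with xi and yi (and the index [0] with [i]) annotates the outliers of singer i+1. Once the outliers of all singers have been annotated, call plt.show() to display the box plot with every annotation.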
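Rather than copying the block once per singer, the same annotation can be applied in a loop over every entry of p['fliers']. The following is only a sketch: it assumes a pandas version where `boxplot` accepts `return_type='dict'`, the `demo` frame is a made-up stand-in for data1, and the fixed 0.08 label offset is a simplification of the spacing formula used above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for data1: one column per singer, one row per day.
demo = pd.DataFrame({
    "singer_a": [10, 11, 12, 11, 10, 12, 11, 60],
    "singer_b": [20, 21, 22, 21, 20, 22, 21, 23],
    "singer_c": [5, 6, 5, 6, 5, 6, 40, 55],
})

p = demo.boxplot(return_type="dict")

labelled = 0
for flier in p["fliers"]:  # one Line2D of outlier markers per column
    xs = flier.get_xdata()
    ys = flier.get_ydata()
    for x, y in zip(xs, ys):
        # write the outlier's value slightly to the right of its marker
        plt.annotate(f"{y:g}", xy=(x, y), xytext=(x + 0.08, y))
        labelled += 1

print(labelled, "outliers labelled")
```

Calling plt.show() afterwards displays the figure, as in the original script.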
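The output is as follows: every point marked with + is an outlier (in the statistical sense). In practice, take the application context of the data into account: points very close to the upper or lower bound of the box plot can be treated as normal values.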
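Handling outliers:
- Delete: for small datasets, deletion leaves too few samples and discards useful information.
- Treat as missing values: fill them in with the mean, interpolation, or a similar method.
- Leave untreated: treat the outliers as a feature in their own right, for example by counting them and using the count as a feature.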
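The three options can be compared side by side on a small series. This is a sketch under stated assumptions: the values and the threshold of 50 are invented, and filling with the mean of the remaining values is just one of the fill methods mentioned above.

```python
import pandas as pd

# Hypothetical play counts; 95 is the outlier.
plays = pd.Series([10.0, 12.0, 11.0, 95.0, 12.0, 11.0])
is_outlier = plays > 50  # hypothetical threshold, e.g. read off a box plot

# 1. Delete: drop the outlying rows (shrinks the sample).
dropped = plays[~is_outlier]

# 2. Treat as missing: blank the outliers, then fill with the mean of the rest.
as_missing = plays.mask(is_outlier)
filled = as_missing.fillna(as_missing.mean())

# 3. Leave untreated: keep the values and record the outliers as a feature.
features = pd.DataFrame({"plays": plays, "is_outlier": is_outlier.astype(int)})
outlier_count = int(is_outlier.sum())

print(len(dropped), filled.iloc[3], outlier_count)
```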
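This article treats outliers as missing values and fills each one with the mean of the values immediately before and after it. The code is as follows: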
<pre><code># Upper thresholds per column position (presumably read off the box plot above).
thresholds = {1: 125, 2: 600, 4: 225, 7: 60, 8: 2500}

data1 = data1.astype(float)  # averaged values may be fractional

for col, upper in thresholds.items():
    # Skip the first and last rows: they lack a neighbour on one side.
    for i in range(1, len(data1) - 1):
        if data1.iloc[i, col] > upper:
            # Replace the outlier with the mean of the previous and next values
            # in the same column.
            data1.iloc[i, col] = (data1.iloc[i - 1, col] + data1.iloc[i + 1, col]) / 2
</code></pre>
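The same neighbour-mean fill can also be sketched without explicit row loops by masking the outliers and letting pandas interpolate. Note that this swaps in linear interpolation, which equals the mean of the two neighbours only for an isolated outlier; the function name `replace_with_neighbour_mean` and the `demo` frame are hypothetical.

```python
import pandas as pd

def replace_with_neighbour_mean(frame, thresholds):
    """Mask values above each column's threshold and fill them by linear interpolation.

    thresholds maps a column position to its upper threshold.
    """
    cleaned = frame.astype(float)  # astype returns a copy, so frame is left unchanged
    for col, upper in thresholds.items():
        column = cleaned.iloc[:, col]
        cleaned.iloc[:, col] = column.mask(column > upper).interpolate(method="linear")
    return cleaned

# Hypothetical data: column 0 has one spike at row 2.
demo = pd.DataFrame({"a": [10, 12, 300, 14, 16], "b": [5, 6, 7, 8, 9]})
cleaned = replace_with_neighbour_mean(demo, {0: 100})
print(cleaned["a"].tolist())
```

A run of several consecutive outliers is filled along the straight line between the nearest valid values on either side, which differs from the row-by-row loop.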
After handling the outliers, export and save the data:
<pre><code># datan = pd.concat([data1, data2, data3, data4, data5], axis=1)
data1.to_csv("train_innoraml.csv")
</code></pre>
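Saving sometimes fails with this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Workaround (Python 2 only, since reload and sys.setdefaultencoding are not available in Python 3): run the following code before saving: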
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
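On Python 3 the usual approach is to state the encoding explicitly when writing the file instead of changing the interpreter default. A minimal sketch, assuming Python 3 and pandas; the `demo` frame and the output file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a non-ASCII column name and values.
demo = pd.DataFrame({"歌手": ["甲", "乙"], "plays": [120, 95]})

out_path = os.path.join(tempfile.gettempdir(), "train_demo.csv")
demo.to_csv(out_path, index=False, encoding="utf-8")  # write with an explicit encoding

restored = pd.read_csv(out_path, encoding="utf-8")    # read back with the same encoding
print(restored)
```

If the file is meant to be opened in Excel, `encoding="utf-8-sig"` is a common alternative because it writes a byte-order mark.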
Reposted from http://blog.csdn.net/shuaishuai3409/article/details/51428106