Outlier Detection with Box Plots in Python

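Outlier detection: Outlier detection is typically the first step in data mining work, because outliers distort experimental results. An outlier is an individual value in a sample whose magnitude clearly deviates from the rest of the observations. Two common detection methods are the 3σ rule and the box plot. The 3σ rule applies only to data that follow a normal distribution. Under this rule, an outlier is an observation whose deviation from the mean exceeds 3 standard deviations. Since $P(|x - \mu| > 3\sigma) \le 0.003$ under the normality assumption (the exact value is about 0.0027), a value more than 3σ from the mean is a low-probability event and can therefore be treated as an outlier.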
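As a rough illustration of the 3σ rule described above, here is a minimal sketch using only the standard library. The function name `three_sigma_outliers` and the sample play counts are made up for this example, and the population standard deviation is an assumed choice.

```python
import statistics

def three_sigma_outliers(values):
    """Return the values lying more than 3 standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (an assumed choice)
    return [v for v in values if abs(v - mu) > 3 * sigma]

# Hypothetical daily play counts: 29 ordinary days plus one spike.
plays = [100 + (i % 5) for i in range(29)] + [500]
print(three_sigma_outliers(plays))  # → [500]
```

Note that the spike itself inflates both the mean and σ, so in small samples a single extreme value can hide behind the threshold it helped raise. This is one reason the quartile-based box plot rule is considered more robust.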

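  • The 3σ rule constrains the data distribution, whereas the box plot makes no distributional assumption and simply shows the data as they are. Its outlier identification is fairly objective, because the criterion is built on the quartiles and the interquartile range: up to 25% of the data can be moved arbitrarily far away without disturbing the quartiles. This makes the method more robust, which is why it is so widely favored.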

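  • Box plot criterion for outliers: an outlier is a value greater than $Q_U + 1.5\,IQR$ or less than $Q_L - 1.5\,IQR$. $Q_U$ is the upper quartile: one quarter of all observations are larger than it. $Q_L$ is the lower quartile: one quarter of all observations are smaller than it. $IQR$ is the interquartile range, $IQR = Q_U - Q_L$, and it spans the middle half of the observations.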
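The box plot criterion above can be sketched directly with pandas. The helper name `iqr_outliers` and the sample values are assumptions for illustration. pandas computes quartiles by linear interpolation by default, so other quartile conventions may give slightly different fences.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return (lower fence, upper fence, outliers) under the k * IQR rule."""
    q_l = series.quantile(0.25)  # lower quartile Q_L
    q_u = series.quantile(0.75)  # upper quartile Q_U
    iqr = q_u - q_l              # interquartile range
    lower, upper = q_l - k * iqr, q_u + k * iqr
    outliers = series[(series < lower) | (series > upper)]
    return lower, upper, outliers

# Hypothetical play counts with one obvious spike.
plays = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 12, 95])
lower, upper, outliers = iqr_outliers(plays)
print(lower, upper, outliers.tolist())  # → 8.375 15.375 [95]
```

Here $Q_L = 11$ and $Q_U = 12.75$, so $IQR = 1.75$ and the fences are $11 - 2.625$ and $12.75 + 2.625$.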


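Box plot outlier detection in practice:
We run outlier detection on a dataset of play counts for 10 singers over the past 6 months. Each column holds one singer's play counts over the 6 months (10 columns in total), and each row holds the play counts for one day (183 days in total).
Dataset: music play-count data.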

<code class="hljs livecodeserver has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">#-*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt  # plotting library

number = '../data/all_musicers.xlsx'  # path to the play-count data: the data directory one level above this script
data = pd.read_excel(number)

data1 = data.iloc[:, 0:10].copy()  # 183 days of play counts for 10 singers
#data2=data.iloc[:,10:20]
#data3=data.iloc[:,20:30]
#data4=data.iloc[:,30:40]
#data5=data.iloc[:,40:50]
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.figure(1, figsize=(13, 26))  # set the figure size
# Draw the box plot with the DataFrame method. return_type='dict' is required in
# current pandas so that p['fliers'] can be indexed below. The plot already shows
# the outliers at this point; the remaining code only labels them with their values.
p = data1.boxplot(return_type='dict')
# 'fliers' holds the outlier markers: [0] selects the 1st singer, [i] the (i+1)-th singer.
x = p['fliers'][0].get_xdata()
y = sorted(p['fliers'][0].get_ydata())  # ascending order

for i in range(len(x)):
    if i > 0 and y[i] != y[i-1]:
        # shift the label left, further when it sits close to the previous value
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i-1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()  # display the box plot
</code>
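Because the Excel file is not included with this article, the mechanics of reading outliers out of the box plot can be tried on made-up data. The following is a sketch under stated assumptions: the column names, the random data, and the injected value 400 are all hypothetical, and the `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the play-count table: 183 days, 3 singers.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(100, 5, size=(183, 3)),
                    columns=["singer_1", "singer_2", "singer_3"])
demo.iloc[50, 0] = 400  # inject one obvious outlier for singer_1

fig, ax = plt.subplots(figsize=(8, 5))
p = demo.boxplot(ax=ax, return_type="dict")

# p["fliers"] holds one Line2D per column; its y-data are the outlier values.
fliers = {col: line.get_ydata().tolist()
          for col, line in zip(demo.columns, p["fliers"])}
print(fliers["singer_1"])
plt.close(fig)
```

The injected value 400 appears among the fliers of `singer_1`. A few ordinary random values may be flagged as well, since the 1.5 × IQR rule marks a small fraction of normally distributed data.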

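To annotate the outlier values of all singers on the same figure, loop over every entry of p['fliers']:

    for flier in p['fliers']:  # one entry per singer
        x = flier.get_xdata()
        y = sorted(flier.get_ydata())  # ascending order
        for i in range(len(x)):
            if i > 0 and y[i] != y[i-1]:
                plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i-1]), y[i]))
            else:
                plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

Here p['fliers'][i] holds the outliers of the (i+1)-th singer. Once every singer's outliers have been annotated, call plt.show() to display the box plot with all outlier labels.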


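In the output, every point marked + is an outlier in the statistical sense. In practice, take the application context into account: points that lie very close to the upper or lower fence of the box plot can be treated as normal values.
[Figure: box plot of the 10 singers' play counts with outlier values annotated]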


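Handling outliers:

  • Delete: on small datasets, deletion can leave too few samples and discards useful information.
  • Treat as missing values: fill them in with the mean, interpolation, or a similar method.
  • Leave as-is: treat the outliers as a feature in their own right, for example by recording how many there are.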
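The three options above can be compared side by side on a toy series. This is only a sketch: the values and the threshold of 50 are hypothetical, and mean filling is just one of the possible fill methods.

```python
import pandas as pd

plays = pd.Series([10.0, 12.0, 11.0, 95.0, 12.0, 11.0, 10.0])
is_outlier = plays > 50  # hypothetical threshold, e.g. read off a box plot

# Option 1: delete the outlying rows (shrinks the sample).
dropped = plays[~is_outlier]

# Option 2: mark outliers as missing, then fill with the mean of the remaining values.
as_missing = plays.mask(is_outlier)
filled = as_missing.fillna(as_missing.mean())

# Option 3: keep the values untouched and record the outlier count as a feature.
n_outliers = int(is_outlier.sum())

print(len(dropped), filled.tolist(), n_outliers)
```

Deletion changes the length of the series, which matters for time series such as daily play counts. Filling keeps every day in place.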

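This article treats outliers as missing values and fills each one with the mean of the values immediately before and after it. The code is as follows: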

<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">data1 = data1.astype(float)  # neighbor means can be fractional; assumes every column is numeric
# column position -> upper threshold above which a value counts as an outlier
thresholds = {1: 125, 2: 600, 4: 225, 7: 60, 8: 2500}
for col, upper in thresholds.items():
    # skip the first and last rows, which lack one of the two neighbors
    for i in range(1, len(data1) - 1):
        if data1.iloc[i, col] > upper:
            # replace the outlier with the mean of the previous and next values
            data1.iloc[i, col] = (data1.iloc[i - 1, col] + data1.iloc[i + 1, col]) / 2</code>
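As an alternative to the explicit loops, the same idea can be sketched in vectorized form: mask the outliers as NaN and let pandas interpolate linearly. The helper name `replace_above` and the sample values are assumptions for illustration. This is a swapped-in technique, not the article's own code.

```python
import pandas as pd

def replace_above(series, upper):
    """Replace values above `upper` by linear interpolation between their neighbors."""
    cleaned = series.astype(float).mask(series > upper)  # outliers become NaN
    # For an isolated NaN, linear interpolation equals the mean of the
    # previous and next values.
    return cleaned.interpolate(method="linear")

col = pd.Series([100, 110, 900, 120, 105])
print(replace_above(col, 125).tolist())  # → [100.0, 110.0, 115.0, 120.0, 105.0]
```

Unlike the loop version, a run of consecutive outliers is bridged between the nearest valid values on either side, so one outlier is never averaged with another. Outliers at the very start or end of the series have only one neighbor and need separate handling.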

After the outliers have been handled, export and save the data:

<code class="hljs vala has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">#datan=pd.concat([data1,data2,data3,data4,data5],axis=1)
data1.to_csv("train_innoraml.csv")</code>

Saving sometimes fails with an error like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

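Fix (Python 2 only): run the following before saving:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
On Python 3, reload is no longer a builtin and sys.setdefaultencoding does not exist. Pass encoding='utf-8' to to_csv instead.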
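On Python 3, the encoding can be set explicitly when writing the file. The sketch below assumes a small made-up table with non-ASCII column names and writes to a temporary directory. The `utf-8-sig` variant is an optional choice that adds a byte-order mark so Excel detects the encoding.

```python
import os
import tempfile
import pandas as pd

# Hypothetical table with non-ASCII column names.
df = pd.DataFrame({"歌手A": [100, 120], "歌手B": [80, 95]})

out_path = os.path.join(tempfile.mkdtemp(), "train_innoraml.csv")
# utf-8-sig writes a BOM so that Excel recognizes the file as UTF-8
df.to_csv(out_path, index=False, encoding="utf-8-sig")

restored = pd.read_csv(out_path, encoding="utf-8-sig")
print(restored.columns.tolist())
```

Reading the file back with the same encoding returns the original column names unchanged.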



Reposted from http://blog.csdn.net/shuaishuai3409/article/details/51428106
