Python Data Analysis: Filtering Out Invalid Data

Latest recommended article published on 2023-05-26 08:54:35

天东烛



Tags: data analysis, python, excel, list

Article link: https://blog.csdn.net/ApplePay1/article/details/106597217


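I recently took on a small mathematical modeling project. One of the tasks was to filter erroneous and invalid records out of a large data set. When I opened the Excel file, wow: more than two hundred thousand rows and fifteen columns, millions of values in total. The invalid data was unevenly distributed and turned up in odd, unpredictable places, which made the job a real headache.

To make matters worse, the data also contained distractor entries. Cleaning this with Excel's find-and-replace would take forever. Luckily... we have learned fluent Python!

Below I solve the problem with Python. Without further ado, here is the code.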

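import pandas as pd

path = "F:/数学建模/校赛题目/2020校赛C题/附件/附件1.xlsx"
s1 = pd.read_excel(path, sheet_name=0)  # read the first sheet into a pandas DataFrame

s1 = s1.values  # convert to a NumPy array
s1 = pd.DataFrame(s1)  # back to a DataFrame, now with integer column labels

# Drop every row that holds the string 'NULL' in any column.
# Looping over s1.shape[1] instead of a hard-coded 16 avoids an IndexError
# when the sheet has fewer columns.
for each in range(s1.shape[1]):
    s1 = s1[s1.iloc[:, each] != 'NULL']

print(s1)

# Positions of the columns used for the outlier check.
# (Renamed from `list`, which shadowed the built-in.)
cols = [9, 10]

stds = []   # standard deviation of each selected column
means = []  # mean of each selected column
for c in cols:
    values = pd.to_numeric(s1.iloc[:, c])  # object dtype -> numeric
    stds.append(values.std())
    means.append(values.mean())
    print(values.mean(), values.std(), '\n')

for c, mean, std in zip(cols, means, stds):
    # Keep rows within one standard deviation of the mean.
    # (A common, looser rule flags only values beyond two standard deviations.)
    themin = mean - std
    themax = mean + std
    print(themin, '  ', themax)
    values = pd.to_numeric(s1.iloc[:, c])
    s1 = s1[(values >= themin) & (values <= themax)]  # select the rows that pass
    print(s1)
    print('\n', s1.shape[0], '\n')

path = "F:/数学建模/"  # output folder
s1.to_excel(path + "sheet1.xlsx")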

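First, read the Excel file from the appropriate path and convert it to a DataFrame (a more experienced author has written a detailed DataFrame guide), then use s1.iloc to pull out data by row and column position. (Readers interested in iloc may want to read up on it.)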
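To make the iloc step concrete, here is a minimal sketch. It assumes a tiny hand-made table in place of the real spreadsheet; the column names `id`, `speed`, and `temp` and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet (the real file is far larger).
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "speed": [10.5, "NULL", 12.0],
    "temp": [20.1, 19.8, 21.3],
})

# Same round trip as in the script: DataFrame -> NumPy array -> DataFrame.
# The column names are replaced by integer labels 0, 1, 2.
s1 = pd.DataFrame(raw.values)

second_column = s1.iloc[:, 1]     # every row, column at position 1
first_two_rows = s1.iloc[0:2, :]  # rows at positions 0 and 1, every column
single_cell = s1.iloc[2, 2]       # row position 2, column position 2
```

Because iloc works purely by position, it keeps working after the round trip through `.values` has discarded the original column names.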

# Remove rows that contain the invalid 'NULL' marker in any column
for each in range(s1.shape[1]):
    s1 = s1[s1.iloc[:, each] != 'NULL']
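The column-by-column loop can also be written as a single vectorized comparison. The sketch below is one possible alternative, not the method used in the script; the helper name `drop_null_rows` and the sample values are hypothetical.

```python
import pandas as pd

def drop_null_rows(df: pd.DataFrame, marker: str = "NULL") -> pd.DataFrame:
    """Return only the rows in which no cell equals `marker`."""
    keep = (df != marker).all(axis=1)  # True where the whole row is clean
    return df[keep]

# Made-up sample: rows 1 and 2 each contain one 'NULL' cell.
sample = pd.DataFrame([
    [1, 2.0, "a"],
    [2, "NULL", "b"],
    ["NULL", 3.5, "c"],
    [4, 4.0, "d"],
])
cleaned = drop_null_rows(sample)
```

This only matches cells that literally hold the string 'NULL'. Truly empty cells are read by pandas as NaN and would need `dropna()` instead.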

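Inspecting the data shows that columns nine and ten (positions 9 and 10 in the code, counting from zero) discriminate best, so we use these two columns for the comparison.

The comparison uses the interval from the mean minus one standard deviation to the mean plus one standard deviation (not the variance). A row whose values fall inside the interval is judged to be good data, and the result is saved at the end.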
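The interval check can be wrapped in a small reusable function. This is a sketch under stated assumptions: the function name `keep_within_k_std`, the parameter `k`, and the sample numbers are all hypothetical, and the bounds are computed from the unfiltered data for every column, as in the script.

```python
import pandas as pd

def keep_within_k_std(df, col_positions, k=1.0):
    """Keep rows whose values in the given columns lie within mean ± k*std."""
    mask = pd.Series(True, index=df.index)
    for c in col_positions:
        values = pd.to_numeric(df.iloc[:, c])    # make sure the column is numeric
        mean, std = values.mean(), values.std()  # pandas uses the sample std (ddof=1)
        mask &= (values >= mean - k * std) & (values <= mean + k * std)
    return df[mask]

# Made-up sample with one extreme value (100.0) in the first column.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
strict = keep_within_k_std(sample, [0, 1], k=1.0)
loose = keep_within_k_std(sample, [0, 1], k=2.0)
```

With `k=1.0` the filter is strict and also discards ordinary values near the edges of column `b`. With `k=2.0` every row of this sample survives, including the 100.0, because a single extreme value inflates the standard deviation. The width of the interval is therefore a judgment call that depends on the data.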

The code is adapted from another source.

Interested readers can send me a private message to request the data.
