Scientific Computing in Python: Pandas (Part 2)

Latest recommended article published on 2023-12-06 14:37:50

嘎嘣儿脆

Tags: data analysis, python

Article link: https://blog.csdn.net/weixin_44844361/article/details/105618458

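Data cleaning is an essential part of data preparation. This post briefly introduces how to use Pandas for data cleaning, reusing the fictional exam scores from the previous part.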

import pandas as pd
from pandas import Series,DataFrame
data={'Chinese':[66,89,65,67,67],'English':[87,64,86,88,88],'Math':[90,89,87,98,98]}
df2=DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','DianWei','DianWei'],columns=['English','Chinese','Math'])

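The following covers several situations that come up during data cleaning:

Deleting unnecessary columns or rows from a DataFrame

drop()
For example, drop the 'Chinese' column: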

print(df2)  # table before dropping the column
df2=df2.drop(columns=['Chinese'])
print(df2)  # table after dropping the column

Output:

            English  Chinese  Math
ZhangFei       87       66    90
GuanYu         64       89    89
ZhaoYun        86       65    87
DianWei        88       67    98
DianWei        88       67    98
            English  Math
ZhangFei       87    90
GuanYu         64    89
ZhaoYun        86    87
DianWei        88    98
DianWei        88    98

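The first table shows the data before the Chinese scores are removed, and the second shows it afterwards. Next, delete the "ZhangFei" row. The output below assumes df2 has been rebuilt from the original data, so the "Chinese" column is still present.

df2=df2.drop(index=['ZhangFei'])
print(df2)

Output: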

           English  Chinese  Math
GuanYu        64       89    89
ZhaoYun       86       65    87
DianWei       88       67    98
DianWei       88       67    98
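
As a minimal sketch of the two drop() calls above, the following rebuilds the same table and checks the results. drop() returns a new DataFrame and leaves the original untouched. The variable names without_chinese, without_zhangfei, and without_dianwei are illustrative only and do not appear in the original post.

```python
import pandas as pd

data = {'Chinese': [66, 89, 65, 67, 67],
        'English': [87, 64, 86, 88, 88],
        'Math': [90, 89, 87, 98, 98]}
df = pd.DataFrame(data,
                  index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'DianWei', 'DianWei'],
                  columns=['English', 'Chinese', 'Math'])

# Drop a column: a new DataFrame is returned, df itself is unchanged
without_chinese = df.drop(columns=['Chinese'])

# Drop a row by its index label
without_zhangfei = df.drop(index=['ZhangFei'])

# A label that appears twice in the index removes every matching row
without_dianwei = df.drop(index=['DianWei'])

print(without_chinese.columns.tolist())
print(without_zhangfei.index.tolist())
print(len(without_dianwei))
```

Because the result is a new object, reassigning it (df2=df2.drop(...)) as the post does is what makes the change stick.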

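Renaming columns to make them easier to recognize

rename(columns=new_names,inplace=True)
For example, starting again from the original table, rename the column "Chinese" to "语文", "English" to "英语", and "Math" to "数学".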

df2.rename(columns={'Chinese':'语文','English':'英语','Math':'数学'},inplace=True)
print (df2)

Output:

        英语  语文  数学
ZhangFei  87   66    90
GuanYu    64   89    89
ZhaoYun   86   65    87
DianWei   88   67    98
DianWei   88   67    98
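
The rename() call above modifies df2 in place. As a small sketch of the alternative, the example below omits inplace=True so that rename() returns a new DataFrame instead. The two-row table and the names mapping and renamed are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'English': [87, 64], 'Chinese': [66, 89], 'Math': [90, 89]},
                  index=['ZhangFei', 'GuanYu'])

# Map old column names to new ones
mapping = {'Chinese': '语文', 'English': '英语', 'Math': '数学'}

# Without inplace=True, rename returns a renamed copy and df keeps its columns
renamed = df.rename(columns=mapping)

print(renamed.columns.tolist())
print(df.columns.tolist())
```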

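Removing duplicate values

Collected data may contain duplicate rows. drop_duplicates() removes rows whose column values repeat an earlier row, keeping the first occurrence by default.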

df2=df2.drop_duplicates()
print (df2)

Output:

        英语  语文  数学
ZhangFei  87   66    90
GuanYu    64   89    89
ZhaoYun   86   65    87
DianWei   88   67    98

As shown, the duplicated "DianWei" row has been filtered out.
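
To see which rows drop_duplicates() would remove, and how its subset and keep parameters change the result, here is a minimal sketch on the same scores. The variable names are illustrative. The comparison is made on column values only; the index labels are not taken into account.

```python
import pandas as pd

df = pd.DataFrame(
    {'English': [87, 64, 86, 88, 88],
     'Chinese': [66, 89, 65, 67, 67],
     'Math': [90, 89, 87, 98, 98]},
    index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'DianWei', 'DianWei'])

# duplicated() marks rows that repeat an earlier row
flags = df.duplicated()

# Default behaviour: compare all columns, keep the first occurrence
deduped = df.drop_duplicates()

# Compare only the 'Chinese' column and keep the last occurrence instead
by_chinese_last = df.drop_duplicates(subset=['Chinese'], keep='last')

print(flags.tolist())
print(len(deduped), len(by_chinese_last))
```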

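Formatting issues

Changing data types

Data is often stored in an inconsistent type, and astype() converts it. For example, convert the "语文" column (renamed from "Chinese" above) to str or to int64.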

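df2['语文'].astype('str')    # returns a new Series; reassign it to change df2
df2['语文'].astype('int64')  # the string 'int64' avoids needing "import numpy as np"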
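
A short sketch of the round trip, assuming a small standalone Series named scores (a hypothetical name used only here). It shows that astype() returns a converted copy, so the result has to be assigned to take effect.

```python
import pandas as pd

scores = pd.Series([66, 89, 65], name='语文')

# Convert the numbers to text; scores itself is not modified
as_text = scores.astype('str')

# Convert the text back to 64-bit integers
as_int = as_text.astype('int64')

print(type(as_text.iloc[0]).__name__)
print(as_int.dtype)
```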

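Whitespace around data

We first convert the data to str so that string operations can be applied to it. Leading and trailing whitespace can then be removed with strip():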

df2['语文']=df2['语文'].astype('str')
df2['语文']=df2['语文'].map(str.strip)   # strip whitespace on both sides
df2['语文']=df2['语文'].map(str.lstrip)  # strip whitespace on the left
df2['语文']=df2['语文'].map(str.rstrip)  # strip whitespace on the right
print (df2)

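If the data contains special characters, for example dollar signs at the start or end of values in the "语文" column, they can be removed as follows.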

df2['语文']=df2['语文'].str.strip('$')
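
The following sketch combines the whitespace and symbol cleanup on made-up messy values (the Series raw and its contents are assumptions for illustration). Note that strip('$') only removes the symbol from the ends of each string; a symbol in the middle needs str.replace.

```python
import pandas as pd

# Hypothetical messy input: stray spaces and dollar signs around the numbers
raw = pd.Series(['  $66 ', '$89', ' 65$ '], name='语文')

# Remove surrounding whitespace first, then surrounding dollar signs
cleaned = raw.str.strip().str.strip('$')

# Once the text is clean it can be converted to integers
numbers = cleaned.astype('int64')

# strip('$') leaves interior symbols alone; replace removes them everywhere
interior = pd.Series(['6$6'])
stripped_only = interior.str.strip('$')
replaced = interior.str.replace('$', '', regex=False)

print(cleaned.tolist())
print(numbers.tolist())
print(stripped_only.tolist(), replaced.tolist())
```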

Case conversion

Unifying the names of people, cities, and so on often requires case conversion. In Python you can use upper(), lower(), and title() directly, as follows:

import pandas as pd
from pandas import Series,DataFrame
data={'Chinese':[66,89,65,67,67],'English':[87,64,86,88,88],'Math':[90,89,87,98,98]}
df2=DataFrame(data,index=['ZhangFei','GuanYu','ZhaoYun','DianWei','DianWei'],columns=['English','Chinese','Math'])
df2['Chinese']=df2['Chinese'].astype('str')
df2.columns=df2.columns.str.upper()  # all uppercase
print(df2)
df2.columns=df2.columns.str.lower()  # all lowercase
df2.columns=df2.columns.str.title()  # capitalize the first letter

Output:

           ENGLISH CHINESE  MATH
ZhangFei       87      66    90
GuanYu         64      89    89
ZhaoYun        86      65    87
DianWei        88      67    98
DianWei        88      67    98

The output shown is for the all-uppercase case.
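
The same string methods work on the values of a column, not just on column names. A minimal sketch, assuming a hypothetical Series of inconsistently cased names:

```python
import pandas as pd

# Hypothetical names with inconsistent casing
names = pd.Series(['zhangfei', 'GUANYU', 'zhao yun'])

upper = names.str.upper()  # every letter uppercase
lower = names.str.lower()  # every letter lowercase
title = names.str.title()  # first letter of each word uppercase

print(upper.tolist())
print(lower.tolist())
print(title.tolist())
```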

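Finding missing values

With large datasets, some fields may contain missing values (NaN). The Pandas isnull() method finds them. Take the following table as an example: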

           English  Chinese  Math
ZhangFei       87       66   NaN
GuanYu         64       89  89.0
ZhaoYun        86       65  87.0
DianWei        88       67  98.0
DianWei        88       67  98.0

To see which positions hold missing values, run print(df2.isnull()). The output is:

          English  Chinese   Math
ZhangFei    False    False   True
GuanYu      False    False  False
ZhaoYun     False    False  False
DianWei     False    False  False
DianWei     False    False  False

To find out which columns contain missing values, use df2.isnull().any(). The result is:

English    False
Chinese    False
Math        True
dtype: bool
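
Building on isnull(), the sketch below counts the missing values per column and selects the rows that contain at least one. The three-row table and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'English': [87, 64, 86],
                   'Chinese': [66, 89, 65],
                   'Math': [np.nan, 89, 87]},
                  index=['ZhangFei', 'GuanYu', 'ZhaoYun'])

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# any(axis=1) checks across the columns of each row
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing.index.tolist())
```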

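Cleaning data with the apply function

apply is used frequently in Pandas and is a very flexible function. For example, to convert every value in the name column to uppercase: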

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
data={'name':['ZhangFei','GuanYu','ZhaoYun','DianWei','DianWei'],'Chinese':[66,89,65,67,67],'English':[87,64,86,88,88],'Math':[None,89,87,98,98]}
df2=DataFrame(data)
df2['Chinese']=df2['Chinese'].astype('str')
df2['name']=df2['name'].apply(str.upper)
print(df2)

Output:

     name    Chinese  English  Math
0  ZHANGFEI      66       87   NaN
1    GUANYU      89       64  89.0
2   ZHAOYUN      65       86  87.0
3   DIANWEI      67       88  98.0
4   DIANWEI      67       88  98.0

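We can also define our own function and use it with apply. For example, define a function double_df that multiplies the values in the "Chinese" column by 2. The output below assumes df2 has been rebuilt from the original data, so "Chinese" is still numeric and the names are not uppercased. On the str column produced above, 2*x would repeat the text ('66' becomes '6666') instead of doubling the number.

def double_df(x):
    return 2*x
df2['Chinese']=df2['Chinese'].apply(double_df)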
print (df2)

Output:

    name      Chinese  English  Math
0  ZhangFei      132       87   NaN
1    GuanYu      178       64  89.0
2   ZhaoYun      130       86  87.0
3   DianWei      134       88  98.0
4   DianWei      134       88  98.0

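As shown, the Chinese scores have all been doubled. We can also define more complex functions, for example one that adds two columns to the table: 'new1' is m times the sum of the Chinese and English scores, and 'new2' is n times that sum. The output below again starts from the original, undoubled scores.

def plus(row,n,m):
    # with axis=1, apply passes one row at a time (as a Series)
    row['new1']=(row['Chinese']+row['English'])*m
    row['new2']=(row['Chinese']+row['English'])*n
    return row
df2=df2.apply(plus,axis=1,args=(2,3,))
print (df2)

Output: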

    name      Chinese  English  Math  new1  new2
0  ZhangFei       66       87   NaN   459   306
1    GuanYu       89       64  89.0   459   306
2   ZhaoYun       65       86  87.0   453   302
3   DianWei       67       88  98.0   465   310
4   DianWei       67       88  98.0   465   310

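Here axis=1 means the function is applied to each row (it receives one row at a time, running across the columns), while axis=0 means it is applied to each column. args passes the two extra positional arguments, n=2 and m=3, which plus uses to build the new df2.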
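
To make the axis argument concrete, here is a minimal sketch on a two-row table (the table and all variable names are assumptions for illustration). It also shows that the new1 column from the example above can be computed with plain column arithmetic, which gives the same numbers without calling apply.

```python
import pandas as pd

df = pd.DataFrame({'Chinese': [66, 89], 'English': [87, 64]},
                  index=['ZhangFei', 'GuanYu'])

# axis=0: the function receives each column in turn
col_max = df.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row in turn
row_total = df.apply(lambda row: row['Chinese'] + row['English'], axis=1)

# Same calculation as 'new1' with m=3, using whole-column arithmetic
m = 3
new1 = (df['Chinese'] + df['English']) * m

print(col_max.tolist())
print(row_total.tolist())
print(new1.tolist())
```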