pands最快入门(三)-统计与缺失值处理

pands统计

本系列教程采用的数据为上证01号股票数据,感谢Tushare金融社区的数据分享

isnull和notnull:用于检测是否为空值,可用于df和series
dropna:删除缺失值
fillna:填充缺失值

import pandas as pd
df=pd.read_excel('000001.SZ.xlsx',encoding='gbk')
#inpalce控制生成新的df或者替换原来的df
df.set_index('trade_date',inplace=True)
df.head()
ts_codeopenhighlowclosechangepct_chgvolamountpre_close
trade_date
20160104000001.SZ12.0012.0311.2311.33-0.66-5.50563497.87660376.153111.99
20160105000001.SZ11.2711.5711.1511.400.070.62663269.95755531.353711.33
20160106000001.SZ11.4211.5611.3911.530.131.14515706.44591698.520411.40
20160107000001.SZ11.4111.4110.9110.94-0.59-5.12174761.10194869.493511.53
20160108000001.SZ11.2111.2910.9011.120.181.65747527.58831334.546210.94
汇总类统计
#提取常用统计结果
df.describe()
openhighlowclosechangepct_chgvolamountpre_close
count1218.0000001218.0000001218.0000001218.0000001218.0000001218.0000001.218000e+031.218000e+031218.000000
mean11.99837412.15709411.85669112.0118720.0081120.0776239.472223e+051.190617e+0612.003760
std2.6706322.7385982.6059822.6739900.2417131.8678645.871951e+058.540574e+052.667239
min8.5600008.6000008.4500008.570000-1.550000-9.9743001.707778e+051.755915e+058.570000
25%9.5700009.7000009.4800009.570000-0.100000-0.8185505.377827e+055.597184e+059.562500
50%11.20500011.35500011.03500011.2100000.0000000.0000007.969810e+059.731886e+0511.210000
75%13.87000014.10000013.66000013.9300000.1000000.8800001.183685e+061.553497e+0613.927500
max20.00000020.88000019.59000020.0500001.43000010.0351004.711461e+067.168653e+0620.050000
#查看单个Series的数据
df['high'].mean()
12.157093596059095
唯一去重和按值计数

一般不用于数值列,而是枚举,分类列

df['ts_code'].unique()
array(['000001.SZ'], dtype=object)
df['ts_code'].value_counts()
000001.SZ    1218
Name: ts_code, dtype: int64
相关系数和协方差
df=df.drop('ts_code',axis=1)
df.head()
openhighlowclosechangepct_chgvolamountpre_close
trade_date
2016010412.0012.0311.2311.33-0.66-5.50563497.87660376.153111.99
2016010511.2711.5711.1511.400.070.62663269.95755531.353711.33
2016010611.4211.5611.3911.530.131.14515706.44591698.520411.40
2016010711.4111.4110.9110.94-0.59-5.12174761.10194869.493511.53
2016010811.2111.2910.9011.120.181.65747527.58831334.546210.94
#协方差矩阵
df.corr()
openhighlowclosechangepct_chgvolamountpre_close
open1.0000000.9980880.9986060.9967110.000257-0.0072850.3259690.5525110.999211
high0.9980881.0000000.9981660.9987390.0433910.0359160.3578920.5812540.997335
low0.9986060.9981661.0000000.9985390.0354250.0267930.3170930.5454380.997856
close0.9967110.9987390.9985391.0000000.0730930.0646210.3425350.5675940.995907
change0.0002570.0433910.0354250.0730931.0000000.9827890.1961580.193215-0.017345
pct_chg-0.0072850.0359160.0267930.0646210.9827891.0000000.2129190.199264-0.024279
vol0.3259690.3578920.3170930.3425350.1961580.2129191.0000000.9554980.325626
amount0.5525110.5812540.5454380.5675940.1932150.1992640.9554981.0000000.551521
pre_close0.9992110.9973350.9978560.995907-0.017345-0.0242790.3256260.5515211.000000
#查看两个变量的相关系数
df['high'].corr(df['low'])
0.9981656586604757

pandas缺失值处理

df=pd.read_excel('fraction.xlsx',encoding='utf-8',skiprows=2)
df
Unnamed: 0姓名科目分数
0NaN小明语文85.0
1NaNNaN数学90.0
2NaNNaN英语80.0
3NaNNaNNaNNaN
4NaN小王语文85.0
5NaNNaN数学NaN
6NaNNaN英语90.0
7NaNNaNNaNNaN
8NaN小刚语文80.0
9NaNNaN数学85.0
10NaNNaN英语90.0
检测空值
df.isnull()
Unnamed: 0姓名科目分数
0TrueFalseFalseFalse
1TrueTrueFalseFalse
2TrueTrueFalseFalse
3TrueTrueTrueTrue
4TrueFalseFalseFalse
5TrueTrueFalseTrue
6TrueTrueFalseFalse
7TrueTrueTrueTrue
8TrueFalseFalseFalse
9TrueTrueFalseFalse
10TrueTrueFalseFalse
df['分数'].isnull()
0     False
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9     False
10    False
Name: 分数, dtype: bool
#筛选没有空分数的行
df.loc[df['分数'].notnull(),:]
Unnamed: 0姓名科目分数
0NaN小明语文85.0
1NaNNaN数学90.0
2NaNNaN英语80.0
4NaN小王语文85.0
6NaNNaN英语90.0
8NaN小刚语文80.0
9NaNNaN数学85.0
10NaNNaN英语90.0
删掉全是空值的列
#how可以为all或any
df.dropna(axis=1,how='all',inplace=True)
df.dropna(axis=0,how='all',inplace=True)
df
姓名科目分数
0小明语文85.0
1NaN数学90.0
2NaN英语80.0
4小王语文85.0
5NaN数学NaN
6NaN英语90.0
8小刚语文80.0
9NaN数学85.0
10NaN英语90.0
将单个空值填充
#填充函数
df.fillna({'分数':0})
姓名科目分数
0小明语文85.0
1NaN数学90.0
2NaN英语80.0
4小王语文85.0
5NaN数学0.0
6NaN英语90.0
8小刚语文80.0
9NaN数学85.0
10NaN英语90.0
#单列修改,与上面功能相同
df['分数']=df['分数'].fillna(0)
df
姓名科目分数
0小明语文85.0
1NaN数学90.0
2NaN英语80.0
4小王语文85.0
5NaN数学0.0
6NaN英语90.0
8小刚语文80.0
9NaN数学85.0
10NaN英语90.0
#ffill=forward fill
df['姓名']=df['姓名'].fillna(method='ffill')
df
姓名科目分数
0小明语文85.0
1小明数学90.0
2小明英语80.0
4小王语文85.0
5小王数学0.0
6小王英语90.0
8小刚语文80.0
9小刚数学85.0
10小刚英语90.0
#将清洗好的数据保存
df.to_excel('fraction_updated.xlsx',index=False)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值