3.9 Aggregation and Grouping
3.9.1 Planets Data
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
(1035, 6)
planets.head()
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
The dataset describes the more than one thousand exoplanets discovered up to 2014.
3.9.2 Simple Aggregation in Pandas
import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
ser.sum()
2.811925491708157
ser.mean()
0.5623850983416314
For a DataFrame, the aggregation methods operate on each column by default.
df = pd.DataFrame({
'a':rng.rand(5),
'b':rng.rand(5)})
df
          a         b
0  0.155995  0.020584
1  0.058084  0.969910
2  0.866176  0.832443
3  0.601115  0.212339
4  0.708073  0.181825
df.mean()
a 0.477888
b 0.443420
dtype: float64
Setting the axis argument aggregates within each row instead.
df.mean(axis='columns')
0 0.088290
1 0.513997
2 0.849309
3 0.406727
4 0.444949
dtype: float64
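The axis argument accepts either a name or an integer. As a minimal sketch (the frame `demo` is a made-up example, not data from this section), axis=0 or 'index' collapses the rows and gives one result per column, while axis=1 or 'columns' collapses the columns and gives one result per row:

```python
import pandas as pd

# Hypothetical 2x2 frame used only for illustration
demo = pd.DataFrame({'a': [1.0, 3.0], 'b': [2.0, 6.0]})

# axis=0 (or 'index'): collapse the rows, one result per column
col_means = demo.mean(axis=0)

# axis=1 (or 'columns'): collapse the columns, one result per row
row_means = demo.mean(axis=1)

print(col_means)
print(row_means)
```

The name says which axis is collapsed, not which axis survives, which is why axis='columns' yields one value per row.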
Drop the rows that contain missing values, then compute summary statistics for each column with describe():
planets.dropna().describe()
          number  orbital_period        mass    distance         year
count  498.00000      498.000000  498.000000  498.000000   498.000000
mean     1.73494      835.778671    2.509320   52.068213  2007.377510
std      1.17572     1469.128259    3.636274   46.596041     4.167284
min      1.00000        1.328300    0.003600    1.350000  1989.000000
25%      1.00000       38.272250    0.212500   24.497500  2005.000000
50%      1.00000      357.000000    1.245000   39.940000  2009.000000
75%      2.00000      999.600000    2.867500   59.332500  2011.000000
max      6.00000    17337.500000   25.000000  354.000000  2014.000000
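Every column above reports a count of 498 although the dataset has 1035 rows, because dropna() removed each row with at least one missing value before describe() ran. A small sketch of that behaviour, using a hypothetical frame `toy` as a stand-in for the planets data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, standing in for the planets data
toy = pd.DataFrame({'mass': [7.1, np.nan, 2.6],
                    'distance': [77.4, 56.95, np.nan]})

# dropna() with default arguments drops every row containing at least one NaN
complete = toy.dropna()

# describe() on its own skips NaN column by column, so without dropna()
# the counts are taken per column rather than over complete rows
per_column_counts = toy.describe().loc['count']

print(len(toy), len(complete))
print(per_column_counts)
```

Whether to drop incomplete rows first is a judgment call: it keeps the statistics comparable across columns, at the cost of discarding partially observed rows.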
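Pandas aggregation methods

Aggregation        Description
count()            Total number of items
first(), last()    First and last item
mean(), median()   Mean and median
min(), max()       Minimum and maximum
std(), var()       Standard deviation and variance
mad()              Mean absolute deviation (deprecated in pandas 1.5, removed in 2.0)
prod()             Product of all items
sum()              Sum of all items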
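As a quick illustration of the methods listed above, the sketch below applies them to a hypothetical four-element Series. Two assumptions are worth flagging: the first and last items are taken with iloc here, since first() and last() are used as aggregations on GroupBy objects, and the mean absolute deviation is computed by hand because mad() is not available in every pandas version.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical sample values

print(s.count())               # number of non-null items
print(s.iloc[0], s.iloc[-1])   # first and last item of a plain Series
print(s.mean(), s.median())
print(s.min(), s.max())
print(s.std(), s.var())        # sample statistics (ddof=1 by default)
print(s.prod(), s.sum())

# Mean absolute deviation computed by hand, since mad() is not
# available in every pandas version
mean_abs_dev = (s - s.mean()).abs().mean()
print(mean_abs_dev)
```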
3.9.3 GroupBy: Split, Apply, Combine
df = pd.DataFrame({
'key':['a','b','c','a','b','c'],
'data':range(6)})
df
  key  data
0   a     0
1   b     1
2   c     2
3   a     3
4   b     4
5   c     5
df.groupby('key')
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000000F7FD8A1048>
df.groupby('key').sum()
df.groupby('key').mean()
     data
key
a     1.5
b     2.5
c     3.5
df.groupby('key').last()
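To make the three steps in the section title concrete, here is a rough sketch that performs split, apply, and combine by hand on the same key/data frame. The names `pieces`, `sums`, and `manual` are illustrative only; in practice a single call such as df.groupby('key').sum() does all of this internally.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'data': range(6)})

# Split: iterating over a GroupBy yields (group label, sub-frame) pairs
pieces = {label: group for label, group in df.groupby('key')}

# Apply: reduce each piece to a single number
sums = {label: group['data'].sum() for label, group in pieces.items()}

# Combine: assemble the per-group results into one labelled Series
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

print(manual)
```

The result holds the same per-group sums as df.groupby('key')['data'].sum().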
The most important GroupBy operations are probably aggregate, filter, transform, and apply.
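A minimal sketch of these four operations on the key/data frame from earlier in this section follows. The lambdas and the threshold of 4 are arbitrary choices made for illustration, not part of the original notes.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'data': range(6)})
grouped = df.groupby('key')['data']

# aggregate: several reductions per group at once
summary = grouped.aggregate(['min', 'max'])

# filter: keep only the rows of groups whose sum exceeds 4 (drops group 'a')
kept = df.groupby('key').filter(lambda g: g['data'].sum() > 4)

# transform: same shape as the input, here centred on each group's mean
centred = grouped.transform(lambda x: x - x.mean())

# apply: an arbitrary function per group, here the range within each group
spread = grouped.apply(lambda x: x.max() - x.min())

print(summary)
print(kept)
print(centred)
print(spread)
```

Roughly speaking, aggregate and apply return one result per group, filter returns a subset of the original rows, and transform returns a result aligned with the original index.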
Column indexing: selecting a column from a GroupBy object returns a modified GroupBy object.
planets.groupby('method')
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000000F7FAA18E10>
planets.groupby('method')['orbital_period']
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000000F7FD8B90B8>
planets.groupby('method')['orbital_period'].median()
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit