Python Data Analysis, Part 3: A Brief Introduction to Pandas

Contents

1. Pandas data structures
2. Pandas data operations
3. Pandas statistical computation and description
4. Grouping and aggregation
5. Data cleaning, merging, transformation, and reshaping

Pandas
A powerful toolkit for analyzing structured data.

Built on NumPy, it provides high-performance array operations.

It also provides data-cleaning facilities.

1.1 Series

A Series is a one-dimensional, array-like object of values with an associated index.

import pandas as pd

# Build a Series from a list
ser_obj = pd.Series(range(10, 20))
print(type(ser_obj))
# Get the values
print(ser_obj.values)

# Get the index
print(ser_obj.index)
<class 'pandas.core.series.Series'>
[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)
# Preview the data
print(ser_obj.head(3))
0    10
1    11
2    12
dtype: int64
print(ser_obj)
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
# Access elements by index
print(ser_obj[0])
print(ser_obj[8])
10
18
# The index-to-value mapping is preserved in the result of array operations
print(ser_obj * 2)
print(ser_obj > 15)
0    20
1    22
2    24
3    26
4    28
5    30
6    32
7    34
8    36
9    38
dtype: int64
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool
# Build a Series from a dict
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())
print(ser_obj2.index)
2001    17.8
2002    20.1
2003    16.5
dtype: float64
Int64Index([2001, 2002, 2003], dtype='int64')
# The name attribute
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())
year
2001    17.8
2002    20.1
2003    16.5
Name: temp, dtype: float64
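Building on the dict example above, a Series can also take an explicit index alongside the dict; labels missing from the dict come out as NaN. A minimal sketch (variable names are illustrative):

```python
import math
import pandas as pd

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}

# Passing an explicit index reorders the values; labels with no matching
# dict key (2000 here) are filled with NaN, so the dtype becomes float64.
ser = pd.Series(year_data, index=[2003, 2002, 2001, 2000])
print(ser)
print(math.isnan(ser[2000]))  # True: 2000 has no entry in year_data
```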

1.2 DataFrame

A DataFrame is a two-dimensional, table-like object; its index consists of a column index and a row index.

import numpy as np

# Build a DataFrame from an ndarray
array = np.random.randn(5,4)
print(array)

df_obj = pd.DataFrame(array)
print(df_obj.head())
[[-0.81322788 -1.51137049 -1.06359499  1.28648638]
 [ 1.31333252  0.25461725 -0.7915332   0.75370275]
 [-0.09241866 -1.29103143 -0.32951959 -0.76066101]
 [-0.40326652  1.04621326  0.99044616 -0.35491204]
 [ 2.8170386  -0.58422902  2.84249502 -0.62538581]]
          0         1         2         3
0 -0.813228 -1.511370 -1.063595  1.286486
1  1.313333  0.254617 -0.791533  0.753703
2 -0.092419 -1.291031 -0.329520 -0.760661
3 -0.403267  1.046213  0.990446 -0.354912
4  2.817039 -0.584229  2.842495 -0.625386
# Build a DataFrame from a dict
dict_data = {'A': 1., 
             'B': pd.Timestamp('20161217'),
             'C': pd.Series(1, index=list(range(4)),dtype='float32'),
             'D': np.array([3] * 4,dtype='int32'),
             'E' : pd.Categorical(["Python","Java","C++","C#"]),
             'F' : 'ChinaHadoop' }
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2.head())
   A          B    C  D       E            F
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop
# Access a column by its label
print(df_obj2['A'])
print(type(df_obj2['A']))

print(df_obj2.A)
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
# Add a column
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())
    A          B    C  D       E            F  G
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop  7
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop  7
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop  7
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop  7
# Delete a column
del df_obj2['G']
print(df_obj2.head())
     A          B    C  D       E            F
0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop
1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop
2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop
3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop
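Besides `del`, columns (and rows) can be removed with `drop`, which returns a new DataFrame instead of mutating in place. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# axis=1 selects columns; drop returns a copy, leaving df unchanged
df_no_b = df.drop('B', axis=1)
print(df_no_b.columns.tolist())  # ['A', 'C']
print(df.columns.tolist())       # ['A', 'B', 'C'] -- df itself is untouched
```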

1.3 Index

The index of a Series or DataFrame is an Index object. Index objects are immutable, which keeps the data safe.

Common Index types:

Index

Int64Index

MultiIndex: a hierarchical index

DatetimeIndex: timestamps

print(type(ser_obj.index))
print(type(df_obj2.index))

print(df_obj2.index)
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')
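The immutability mentioned above is easy to demonstrate: element assignment on an Index raises TypeError, while replacing the whole index at once is allowed. A minimal sketch:

```python
import pandas as pd

ser = pd.Series(range(3), index=['a', 'b', 'c'])

# Index objects are immutable: assigning to a single element fails
try:
    ser.index[0] = 'z'
    mutated = True
except TypeError:
    mutated = False
print('element assignment allowed?', mutated)  # False

# To relabel, assign an entirely new index instead
ser.index = ['x', 'y', 'z']
print(ser.index.tolist())  # ['x', 'y', 'z']
```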

2. Pandas data operations

2.1 Indexing

Note on slicing: a positional slice excludes its end point, while a label slice includes it.

2.1.1 Series indexing
import pandas as pd

ser_obj = pd.Series(range(5), index = ['a', 'b', 'c', 'd', 'e'])
print(ser_obj.head())
a    0
b    1
c    2
d    3
e    4
dtype: int32
# Row indexing (by label and by position)
print(ser_obj['a'])
print(ser_obj[0])
0
0
# Slice indexing
print(ser_obj[1:3])
print(ser_obj['b':'d'])
b    1
c    2
dtype: int32
b    1
c    2
d    3
dtype: int32
# Non-contiguous (fancy) indexing
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])
a    0
c    2
e    4
dtype: int32
a    0
e    4
dtype: int32
# Boolean indexing
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])

print(ser_obj[ser_obj > 2])
a    False
b    False
c    False
d     True
e     True
dtype: bool
d    3
e    4
dtype: int32
d    3
e    4
dtype: int32
2.1.2 DataFrame indexing
import numpy as np

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df_obj.head())
         a         b         c         d
0 -0.595692  0.813699 -0.551327 -0.059703
1  0.339194 -2.335579  0.230472 -0.680213
2 -0.252306  0.212406 -0.979523  0.408522
3  0.216677  0.574524 -0.819607  2.170009
4 -1.099175 -0.665488  0.391421 -0.400642
# Column indexing
print('column indexing')
print(df_obj['a']) # returns a Series
print(type(df_obj[[0]])) # returns a DataFrame (positional fallback; removed in pandas 1.0 -- prefer df_obj[['a']])

# Non-contiguous indexing
print('non-contiguous indexing')
print(df_obj[['a','c']])
print(df_obj[[1, 3]]) # positional column selection, also removed in pandas 1.0
column indexing
0   -0.595692
1    0.339194
2   -0.252306
3    0.216677
4   -1.099175
Name: a, dtype: float64
<class 'pandas.core.frame.DataFrame'>
non-contiguous indexing
          a         c
0 -0.595692 -0.551327
1  0.339194  0.230472
2 -0.252306 -0.979523
3  0.216677 -0.819607
4 -1.099175  0.391421
          b         d
0  0.813699 -0.059703
1 -2.335579 -0.680213
2  0.212406  0.408522
3  0.574524  2.170009
4 -0.665488 -0.400642
2.1.3 Three ways to index (label-based, integer-position-based, mixed)
# Label-based indexing: loc
# Series
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])

# DataFrame
print(df_obj['a'])
print(df_obj.loc[0:2, 'a'])
b    1
c    2
d    3
dtype: int32
b    1
c    2
d    3
dtype: int32
0   -0.595692
1    0.339194
2   -0.252306
3    0.216677
4   -1.099175
Name: a, dtype: float64
0   -0.595692
1    0.339194
2   -0.252306
Name: a, dtype: float64
# Integer-position indexing: iloc
print(ser_obj[1:3])
print(ser_obj.iloc[1:3])

# DataFrame
print(df_obj.iloc[0:2, 0]) # note the difference from df_obj.loc[0:2, 'a']
b    1
c    2
dtype: int32
b    1
c    2
dtype: int32
0   -0.595692
1    0.339194
Name: a, dtype: float64
# Mixed indexing: ix (deprecated in pandas 0.20 and removed in 1.0)
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])

# DataFrame
print(df_obj.ix[0:2, 0]) # tries a label-based lookup first, then falls back to position-based
b    1
c    2
dtype: int32
b    1
c    2
dtype: int32
0   -0.595692
1    0.339194
2   -0.252306
Name: a, dtype: float64
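Since `.ix` no longer exists in modern pandas, the lookups above are written today with `.loc` for labels and `.iloc` for positions. A sketch of the replacements:

```python
import pandas as pd

ser = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

# position-based: ser.ix[1:3]   ->  ser.iloc[1:3]   (end point excluded)
pos = ser.iloc[1:3]
# label-based:    ser.ix['b':'c'] -> ser.loc['b':'c'] (end point included)
lab = ser.loc['b':'c']

print(pos.tolist())  # [1, 2]
print(lab.tolist())  # [1, 2]
```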

2.2 Arithmetic and alignment

s1 = pd.Series(range(10, 20), index = range(10))
s2 = pd.Series(range(20, 25), index = range(5))

print('s1: ' )
print(s1)

print('') 

print('s2: ')
print(s2)
s1: 
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

s2: 
0    20
1    21
2    22
3    23
4    24
dtype: int32
# Aligned arithmetic on Series
s1 + s2
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
dtype: float64
import numpy as np

df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])

print('df1: ')
print(df1)

print('') 
print('df2: ')
print(df2)
df1: 
     a    b
0  1.0  1.0
1  1.0  1.0

df2: 
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
# Aligned arithmetic on DataFrames
df1 + df2
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN
# Fill unaligned positions before computing
print(s1)
print(s2)

s1.add(s2, fill_value = -1)
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
0    20
1    21
2    22
3    23
4    24
dtype: int64
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5    14.0
6    15.0
7    16.0
8    17.0
9    18.0
dtype: float64
df1.sub(df2, fill_value = 2.)
     a    b    c
0  0.0  0.0  1.0
1  0.0  0.0  1.0
2  1.0  1.0  1.0
# Filling NaN
s3 = s1 + s2
s3.fillna(-1., inplace=True)
print(s3)
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5    -1.0
6    -1.0
7    -1.0
8    -1.0
9    -1.0
dtype: float64
df3 = df1 + df2
print(df3)
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN
df3.fillna(100, inplace = True)
print(df3)
       a      b      c
0    2.0    2.0  100.0
1    2.0    2.0  100.0
2  100.0  100.0  100.0

2.3 Applying functions

# NumPy ufuncs work on DataFrames
df = pd.DataFrame(np.random.randn(5,4) - 1)
print(df)

print(np.abs(df))
          0         1         2         3
0 -2.193022 -2.090432 -2.288651 -0.026022
1 -0.720957 -1.501025 -1.734828 -1.858286
2  0.300216 -3.391127 -0.872570 -0.686669
3 -2.552131 -1.452268 -1.188845 -0.597845
4  2.111044 -1.203676 -1.143487 -0.542755
          0         1         2         3
0  2.193022  2.090432  2.288651  0.026022
1  0.720957  1.501025  1.734828  1.858286
2  0.300216  3.391127  0.872570  0.686669
3  2.552131  1.452268  1.188845  0.597845
4  2.111044  1.203676  1.143487  0.542755
# apply runs a function over rows or columns; by default the function receives each column
#f = lambda x : x.max()
print(df.apply(lambda x : x.max()))
0    2.111044
1   -1.203676
2   -0.872570
3   -0.026022
dtype: float64
# Specify the axis: with axis=1 the function receives each row
print(df.apply(lambda x : x.max(), axis=1))
0   -0.026022
1   -0.720957
2    0.300216
3   -0.597845
4    2.111044
dtype: float64
# applymap applies a function to every element
f2 = lambda x : '%.2f' % x
print(df.applymap(f2))
       0      1      2      3
0  -2.19  -2.09  -2.29  -0.03
1  -0.72  -1.50  -1.73  -1.86
2   0.30  -3.39  -0.87  -0.69
3  -2.55  -1.45  -1.19  -0.60
4   2.11  -1.20  -1.14  -0.54
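`apply` can also return a Series per column, which yields a DataFrame of several statistics at once. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]})

# The lambda returns a Series, so the result is a DataFrame with one row
# per statistic and one column per original column.
stats = df.apply(lambda col: pd.Series({'min': col.min(),
                                        'max': col.max(),
                                        'range': col.max() - col.min()}))
print(stats)
```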

Sorting

s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(s4)
4    10
3    11
1    12
4    13
4    14
dtype: int32
# Sorting by index
s4.sort_index()
1    12
3    11
4    10
4    13
4    14
dtype: int32
df4 = pd.DataFrame(np.random.randn(3, 4), 
                   index=np.random.randint(3, size=3),
                   columns=np.random.randint(4, size=4))
print(df4)
          3         2         2         1
2  0.244068 -1.977220  0.045238 -2.064546
2  0.218196 -0.419284 -0.698839  0.241649
2  0.296747 -0.021311  0.225724 -0.325439
#df4.sort_index(ascending=False)
df4.sort_index(axis=1)
          1         2         2         3
2 -2.064546 -1.977220  0.045238  0.244068
2  0.241649 -0.419284 -0.698839  0.218196
2 -0.325439 -0.021311  0.225724  0.296747
# Sorting by values
df4.sort_values(by=1)
          3         2         2         1
2  0.244068 -1.977220  0.045238 -2.064546
2  0.296747 -0.021311  0.225724 -0.325439
2  0.218196 -0.419284 -0.698839  0.241649
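`sort_values` also accepts several columns at once, with a per-column sort direction. A short sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['b', 'a', 'b', 'a'],
                   'val': [2, 3, 1, 3]})

# Sort by 'grp' ascending, then by 'val' descending within each group
out = df.sort_values(by=['grp', 'val'], ascending=[True, False])
print(out)
```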

2.4 Handling missing data

df_data = pd.DataFrame([np.random.randn(3), [1., np.nan, np.nan],
                       [4., np.nan, np.nan], [1., np.nan, 2.]])
df_data.head()
          0         1         2
0  0.106763  1.521225 -0.654447
1  1.000000       NaN       NaN
2  4.000000       NaN       NaN
3  1.000000       NaN  2.000000
# isnull
df_data.isnull()
       0      1      2
0  False  False  False
1  False   True   True
2  False   True   True
3  False   True  False
# dropna
df_data.dropna()
#df_data.dropna(axis=1)
          0         1         2
0  0.106763  1.521225 -0.654447
# dropna
#df_data.dropna()
df_data.dropna(axis=1)
          0
0  0.106763
1  1.000000
2  4.000000
3  1.000000
# fillna
df_data.fillna(-100.)
          0           1           2
0  0.106763    1.521225   -0.654447
1  1.000000 -100.000000 -100.000000
2  4.000000 -100.000000 -100.000000
3  1.000000 -100.000000    2.000000
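`dropna` also takes `how=` and `thresh=` parameters that give finer control than the all-or-nothing defaults above. A sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])

# how='all' drops only rows where every value is NaN
print(df.dropna(how='all'))   # keeps rows 0 and 1

# thresh=2 keeps rows with at least 2 non-NaN values
print(df.dropna(thresh=2))    # keeps only row 0
```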

3. Pandas statistical computation and description

import numpy as np
import pandas as pd
3.1 Common statistics
df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
df_obj
          a         b         c         d
0 -0.767557  0.637179  1.650174  1.046862
1  1.675605  0.438124 -0.119837 -0.637159
2 -0.367849  1.044564  0.977799 -0.185916
3 -0.453807 -0.786276 -1.917200  0.502135
4  0.230846 -0.056462  1.612659  0.426120
df_obj.sum()   # computed column-wise by default
a    0.317239
b    1.277129
c    2.203596
d    1.152041
dtype: float64
df_obj.max()
a    1.675605
b    1.044564
c    1.650174
d    1.046862
dtype: float64
df_obj.min(axis=1)   # axis=1: compute along each row
0   -0.767557
1   -0.637159
2   -0.367849
3   -1.917200
4   -0.056462
dtype: float64
3.2 Summary statistics (describe)
df_obj.describe()
              a         b         c         d
count  5.000000  5.000000  5.000000  5.000000
mean   0.063448  0.255426  0.440719  0.230408
std    0.970980  0.704321  1.499890  0.652827
min   -0.767557 -0.786276 -1.917200 -0.637159
25%   -0.453807 -0.056462 -0.119837 -0.185916
50%   -0.367849  0.438124  0.977799  0.426120
75%    0.230846  0.637179  1.612659  0.502135
max    1.675605  1.044564  1.650174  1.046862

4. Grouping and aggregation

4.1 Grouping (groupby)

Split a dataset into groups, then compute statistics on each group; pandas supports quite complex grouped computation through groupby.

The grouped-computation pipeline:

split -> apply -> combine

Split: the criterion used to form the groups

Apply: the computation run on each group

Combine: the per-group results merged back together

4.1.1 The GroupBy object
import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
    data1     data2 key1   key2
0 -0.943078  0.820645    a    one
1 -1.429043  0.142617    b    one
2  0.832261  0.843898    a    two
3  0.906262  0.688165    b  three
4  0.541173  0.117232    a    two
5 -0.213385 -0.098734    b    two
6 -1.291468 -1.186638    a    one
7  1.186941  0.809122    a  three
# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))
<class 'pandas.core.groupby.DataFrameGroupBy'>
# Group the data1 column by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))
<class 'pandas.core.groupby.SeriesGroupBy'>
# Grouped computation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
         data1     data2
key1                    
a     0.065166  0.280852
b    -0.245389  0.244016
key1
a    0.065166
b   -0.245389
Name: data1, dtype: float64
# size
print(grouped1.size())
print(grouped2.size())
key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64
# Group by a column name
df_obj.groupby('key1')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002091C820550>
# Group by a custom key (a list)
self_def_key = [1, 1, 2, 2, 2, 1, 1, 1]
df_obj.groupby(self_def_key).size()
1    5
2    3
dtype: int64
# Group by custom keys (multiple levels, a list of Series)
df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()
key1  key2 
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
# Group by multiple columns (multi-level)
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())
key1  key2 
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
# Multi-level grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
print()
print(grouped3.mean().unstack())
              data1     data2
key2  key1                    
one   a    -1.117273 -0.182997
      b    -1.429043  0.142617
three a     1.186941  0.809122
      b     0.906262  0.688165
two   a     0.686717  0.480565
      b    -0.213385 -0.098734

          data1               data2          
key1          a         b         a         b
key2                                         
one   -1.117273 -1.429043 -0.182997  0.142617
three  1.186941  0.906262  0.809122  0.688165
two    0.686717 -0.213385  0.480565 -0.098734
4.1.2 Iterating over a GroupBy object
# Single-level grouping
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)
a
      data1     data2 key1   key2
0 -0.943078  0.820645    a    one
2  0.832261  0.843898    a    two
4  0.541173  0.117232    a    two
6 -1.291468 -1.186638    a    one
7  1.186941  0.809122    a  three
b
      data1     data2 key1   key2
1 -1.429043  0.142617    b    one
3  0.906262  0.688165    b  three
5 -0.213385 -0.098734    b    two
# Multi-level grouping
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)
('a', 'one')
      data1     data2 key1 key2
0 -0.943078  0.820645    a  one
6 -1.291468 -1.186638    a  one
('a', 'three')
      data1     data2 key1   key2
7  1.186941  0.809122    a  three
('a', 'two')
      data1     data2 key1 key2
2  0.832261  0.843898    a  two
4  0.541173  0.117232    a  two
('b', 'one')
      data1     data2 key1 key2
1 -1.429043  0.142617    b  one
('b', 'three')
      data1     data2 key1   key2
3  0.906262  0.688165    b  three
('b', 'two')
      data1     data2 key1 key2
5 -0.213385 -0.098734    b  two
# Convert a GroupBy object to a list
list(grouped1)
[('a',       data1     data2 key1   key2
  0 -0.943078  0.820645    a    one
  2  0.832261  0.843898    a    two
  4  0.541173  0.117232    a    two
  6 -1.291468 -1.186638    a    one
  7  1.186941  0.809122    a  three), ('b',       data1     data2 key1   key2
  1 -1.429043  0.142617    b    one
  3  0.906262  0.688165    b  three
  5 -0.213385 -0.098734    b    two)]
# Convert a GroupBy object to a dict
dict(list(grouped1))
{'a':       data1     data2 key1   key2
 0 -0.943078  0.820645    a    one
 2  0.832261  0.843898    a    two
 4  0.541173  0.117232    a    two
 6 -1.291468 -1.186638    a    one
 7  1.186941  0.809122    a  three, 'b':       data1     data2 key1   key2
 1 -1.429043  0.142617    b    one
 3  0.906262  0.688165    b  three
 5 -0.213385 -0.098734    b    two}
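A single group can also be pulled out of a GroupBy object directly with `get_group`, avoiding the list/dict conversion above. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'data1': [1, 2, 3, 4]})
grouped = df.groupby('key1')

# get_group returns the sub-DataFrame for one key
group_a = grouped.get_group('a')
print(group_a)
```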
# Column dtypes
print(df_obj.dtypes)

# Group columns by dtype (axis=1 grouping is deprecated in recent pandas)
df_obj.groupby(df_obj.dtypes, axis=1).size()
df_obj.groupby(df_obj.dtypes, axis=1).sum()
data1    float64
data2    float64
key1      object
key2      object
dtype: object
    float64  object
0 -0.122433    aone
1 -1.286426    bone
2  1.676158    atwo
3  1.594427  bthree
4  0.658404    atwo
5 -0.312119    btwo
6 -2.478106    aone
7  1.996064  athree
4.1.3 Other grouping methods
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan  # (originally written with .ix, which has been removed from pandas)
df_obj2
   a    b    c    d  e
A  1  1.0  1.0  6.0  5
B  2  NaN  NaN  NaN  6
C  5  5.0  7.0  5.0  7
D  2  8.0  5.0  6.0  2
E  5  1.0  4.0  4.0  4
# Group columns via a dict
mapping_dict = {'a':'python', 'b':'python', 'c':'java', 'd':'C', 'e':'java'}
df_obj2.groupby(mapping_dict, axis=1).size()
df_obj2.groupby(mapping_dict, axis=1).count() # number of non-NaN values
df_obj2.groupby(mapping_dict, axis=1).sum()
     C  java  python
A  6.0   6.0     2.0
B  NaN   6.0     2.0
C  5.0  14.0    10.0
D  6.0   7.0    10.0
E  4.0   8.0     6.0
# Group via a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3

def group_key(idx):
    """
        idx is a row (or column) index label
    """
    #return idx
    return len(idx)

df_obj3.groupby(group_key).size()

# The custom function above is equivalent to
#df_obj3.groupby(len).size()
1    1
2    3
3    1
dtype: int64
# Group by an index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df_obj4
language  Python  Java  Python  Java  Python
index          A     A       B     C       B
0              1     6       4     7       2
1              9     7       2     2       4
2              3     9       9     7       5
3              1     6       1     6       6
4              5     1       7     3       6
# Group by the 'language' level
df_obj4.groupby(level='language', axis=1).sum()
df_obj4.groupby(level='index', axis=1).sum()
index   A   B  C
0       7   6  7
1      16   6  2
2      12  14  7
3       7   7  6
4       6  13  3

4.2 Aggregation

Aggregation runs a computation on each group that reduces an array to a scalar, e.g. mean() or count().

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1,10, 8),
            'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

   data1  data2 key1   key2
0      4      2    a    one
1      7      1    b    one
2      2      8    a    two
3      9      4    b  three
4      3      2    a    two
5      8      5    b    two
6      6      8    a    one
7      9      3    a  three
# Built-in aggregation functions
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
 data1  data2
key1              
a        24     23
b        24     10
      data1  data2 key2
key1                   
a         9      8  two
b         9      5  two
      data1  data2 key2
key1                   
a         2      2  one
b         7      1  one
      data1     data2
key1                 
a       4.8  4.600000
b       8.0  3.333333
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1                    
a         5      5     5
b         3      3     3
               data1     data2
key1                          
a    count  5.000000  5.000000
     mean   4.800000  4.600000
     std    2.774887  3.130495
     min    2.000000  2.000000
     25%    3.000000  2.000000
     50%    4.000000  3.000000
     75%    6.000000  8.000000
     max    9.000000  8.000000
b    count  3.000000  3.000000
     mean   8.000000  3.333333
     std    1.000000  2.081666
     min    7.000000  1.000000
     25%    7.500000  2.500000
     50%    8.000000  4.000000
     75%    8.500000  4.500000
     max    9.000000  5.000000
# A custom aggregation function
def peak_range(df):
    """
        Return the range (max - min) of the values
    """
    #print(type(df)) # the argument is the set of records for each group
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))
   data1  data2
key1              
a         7      6
b         2      4
         data1     data2
key1                    
a     2.478410  2.030536
b     2.335305  0.786899
# Apply several aggregation functions at once
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range])) # column names default to the function names
         data1                                data2                           
          mean       std count peak_range      mean       std count peak_range
key1                                                                          
a     0.065166  1.110226     5   2.478410  0.280852  0.875752     5   2.030536
b    -0.245389  1.167982     3   2.335305  0.244016  0.403130     3   0.786899
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)])) # supply new column names via tuples
         data1                               data2                          
          mean       std count     range      mean       std count     range
key1                                                                        
a     0.065166  1.110226     5  2.478410  0.280852  0.875752     5  2.030536
b    -0.245389  1.167982     3  2.335305  0.244016  0.403130     3  0.786899
# Apply a different aggregation function to each column
dict_mapping = {'data1':'mean',
                'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
         data2     data1
key1                    
a     1.404259  0.065166
b     0.732047 -0.245389
dict_mapping = {'data1':['mean','max'],
                'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
         data2     data1          
           sum      mean       max
key1                              
a     1.404259  0.065166  1.186941
b     0.732047 -0.245389  0.906262
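Newer pandas (0.25+) also supports "named aggregation", which sets the output column names directly instead of relying on tuples or function names. A sketch, assuming a recent pandas version:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'data1': [1, 3, 2, 6]})

# Each keyword becomes an output column: (input column, aggregation)
out = df.groupby('key1').agg(mean_d1=('data1', 'mean'),
                             range_d1=('data1', lambda s: s.max() - s.min()))
print(out)
```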

4.3 Grouped computation

import pandas as pd
import numpy as np

# Grouped computation that preserves the original shape
dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
df_obj
  key1   key2  data1  data2
0    a    one      3      9
1    b    one      4      5
2    a    two      8      9
3    b  three      4      8
4    a    two      5      2
5    b    two      3      2
6    a    one      4      9
7    a  three      9      4
# Group by key1, compute the sums of data1 and data2, and attach them to the original table
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
k1_sum
      sum_data1  sum_data2
key1                      
a            29         33
b            11         15
# Method 1: merge
pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
  key1   key2  data1  data2  sum_data1  sum_data2
0    a    one      3      9         29         33
2    a    two      8      9         29         33
4    a    two      5      2         29         33
6    a    one      4      9         29         33
7    a  three      9      4         29         33
1    b    one      4      5         11         15
3    b  three      4      8         11         15
5    b    two      3      2         11         15
4.3.1 The transform method
# Method 2: transform (note that summing the string column key2 concatenates it per group)
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
df_obj
  key1   key2  data1  data2           sum_key2  sum_data1  sum_data2
0    a    one      3      9  onetwotwoonethree         29         33
1    b    one      4      5        onethreetwo         11         15
2    a    two      8      9  onetwotwoonethree         29         33
3    b  three      4      8        onethreetwo         11         15
4    a    two      5      2  onetwotwoonethree         29         33
5    b    two      3      2        onethreetwo         11         15
6    a    one      4      9  onetwotwoonethree         29         33
7    a  three      9      4  onetwotwoonethree         29         33
# Passing a custom function to transform
def diff_mean(s):
    """
        Return the difference between each value and its group mean
    """
    return s - s.mean()

df_obj.groupby('key1').transform(diff_mean)
      data1  data2  sum_data1  sum_data2
0 -2.800000  2.400        0.0        0.0
1  0.333333  0.000        0.0        0.0
2  2.200000  2.400        0.0        0.0
3  0.333333  3.000        0.0        0.0
4 -0.800000 -4.600        0.0        0.0
5 -0.666667 -3.000        0.0        0.0
6 -1.800000  2.400        0.0        0.0
7  3.200000 -2.600        0.0        0.0
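`transform` also accepts a function-name string, and a common use is broadcasting a per-group statistic back to the original shape, e.g. attaching each row's group mean. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 3.0, 10.0, 30.0]})

# Each row receives its group's mean; the result keeps the original shape
df['grp_mean'] = df.groupby('key')['val'].transform('mean')
print(df)
```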
dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 
                                             'TotalHours', 'APM'])
4.3.2 apply
def top_n(df, n=3, column='APM'):
    """
        Return each group's top n rows, sorted by `column`
    """
    return df.sort_values(by=column, ascending=False)[:n]

df_data.groupby('LeagueIndex').apply(top_n)
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex                                                            
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518
# Extra arguments to apply are forwarded to the custom function
df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex                                                            
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           2920            2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
4.3.3 Suppressing the group keys: group_keys=False
df_data.groupby('LeagueIndex', group_keys=False).apply(top_n)
      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518

5. Data cleaning, merging, transformation, and reshaping

5.1 Data cleaning

5.2 Joining data (merge)

import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2' : np.random.randint(0,10,3)})

print(df_obj1)
print(df_obj2)
  data1 key
0      5   b
1      9   b
2      1   a
3      0   c
4      3   a
5      9   a
6      0   b
   data2 key
0      9   a
1      3   b
2      8   d
# By default, columns with the same name in both tables are used as the join key
pd.merge(df_obj1, df_obj2)
   data1 key  data2
0      5   b      3
1      9   b      3
2      0   b      3
3      1   a      9
4      3   a      9
5      9   a      9
# on= names the join key explicitly
pd.merge(df_obj1, df_obj2, on='key')
   data1 key  data2
0      5   b      3
1      9   b      3
2      0   b      3
3      1   a      9
4      3   a      9
5      9   a      9
# left_on / right_on name the join key of the left and right tables separately

# Rename the columns
df_obj1 = df_obj1.rename(columns={'key':'key1'})
df_obj2 = df_obj2.rename(columns={'key':'key2'})
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2')
   data1 key1  data2 key2
0      5    b      3    b
1      9    b      3    b
2      0    b      3    b
3      1    a      9    a
4      3    a      9    a
5      9    a      9    a
# Outer join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer')
   data1 key1  data2 key2
0    5.0    b    3.0    b
1    9.0    b    3.0    b
2    0.0    b    3.0    b
3    1.0    a    9.0    a
4    3.0    a    9.0    a
5    9.0    a    9.0    a
6    0.0    c    NaN  NaN
7    NaN  NaN    8.0    d
# Left join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left')
   data1 key1  data2 key2
0      5    b    3.0    b
1      9    b    3.0    b
2      1    a    9.0    a
3      0    c    NaN  NaN
4      3    a    9.0    a
5      9    a    9.0    a
6      0    b    3.0    b
# Right join
pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right')
   data1 key1  data2 key2
0    5.0    b      3    b
1    9.0    b      3    b
2    0.0    b      3    b
3    1.0    a      9    a
4    3.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      8    d
# Handling duplicated column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data' : np.random.randint(0,10,3)})

pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right'))
   data_left key  data_right
0          4   b           5
1          2   b           5
2          5   b           5
3          9   a           7
4          6   a           7
5          6   a           7
# Join on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
pd.merge(df_obj1, df_obj2, left_on='key', right_index=True)
   data1 key  data2
0      6   b      5
1      7   b      5
6      9   b      5
2      2   a      3
4      4   a      3
5      4   a      3
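When debugging joins like the ones above, `merge` can tag each output row with its provenance via `indicator=True`. A sketch with illustrative data:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'l': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'r': [4, 5, 6]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
out = pd.merge(left, right, on='key', how='outer', indicator=True)
print(out)
```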

5.3 Concatenating data

import numpy as np
import pandas as pd
5.3.1 NumPy concatenate
arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)
[[8 5 2 4]
 [2 9 9 0]
 [6 2 1 3]]
[[0 4 5 3]
 [6 4 1 3]
 [1 2 9 9]]
np.concatenate([arr1, arr2])
array([[8, 5, 2, 4],
       [2, 9, 9, 0],
       [6, 2, 1, 3],
       [0, 4, 5, 3],
       [6, 4, 1, 3],
       [1, 2, 9, 9]])
np.concatenate([arr1, arr2], axis=1)
array([[8, 5, 2, 4, 0, 4, 5, 3],
       [2, 9, 9, 0, 6, 4, 1, 3],
       [6, 2, 1, 3, 1, 2, 9, 9]])
5.3.2 concat on Series
# No overlapping index values
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0    9
1    9
2    5
3    6
4    4
dtype: int32
5    1
6    9
7    2
8    5
dtype: int32
9     4
10    7
11    4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0     9
1     9
2     5
3     6
4     4
5     1
6     9
7     2
8     5
9     4
10    7
11    4
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1)
      0    1    2
0   9.0  NaN  NaN
1   9.0  NaN  NaN
2   5.0  NaN  NaN
3   6.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  1.0  NaN
6   NaN  9.0  NaN
7   NaN  2.0  NaN
8   NaN  5.0  NaN
9   NaN  NaN  4.0
10  NaN  NaN  7.0
11  NaN  NaN  4.0
# Overlapping index values
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
0    7
1    3
2    9
3    1
4    7
dtype: int32
0    6
1    1
2    4
3    7
dtype: int32
0    7
1    2
2    3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3])
0    7
1    3
2    9
3    1
4    7
0    6
1    1
2    4
3    7
0    7
1    2
2    3
dtype: int32
pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')
   0  1  2
0  7  6  7
1  3  1  2
2  9  4  3
5.3.3 concat on DataFrames
df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
   A  B
a  1  6
b  1  0
c  1  6
   C  D
a  2  1
b  7  4
pd.concat([df_obj1, df_obj2])
     A    B    C    D
a  1.0  6.0  NaN  NaN
b  1.0  0.0  NaN  NaN
c  1.0  6.0  NaN  NaN
a  NaN  NaN  2.0  1.0
b  NaN  NaN  7.0  4.0
pd.concat([df_obj1, df_obj2], axis=1)
   A  B    C    D
a  1  6  2.0  1.0
b  1  0  7.0  4.0
c  1  6  NaN  NaN
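When the concatenated pieces share index labels, `concat` can label each input block with `keys=`, producing a hierarchical index that records where each row came from. A sketch with illustrative data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# keys= labels each input block, producing a two-level row index
out = pd.concat([df1, df2], keys=['first', 'second'])
print(out)
print(out.loc['second'])  # select the rows that came from df2
```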

5.4 Reshaping data

5.4.1 stack
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
df_obj
   data1  data2
0      0      8
1      9      5
2      1      1
3      8      5
4      9      4
stacked = df_obj.stack()
print(stacked)
0  data1    0
   data2    8
1  data1    9
   data2    5
2  data1    1
   data2    1
3  data1    8
   data2    5
4  data1    9
   data2    4
dtype: int32
# By default, unstack operates on the innermost index level
stacked.unstack()
   data1  data2
0      0      8
1      9      5
2      1      1
3      8      5
4      9      4
# Choose the index level to operate on with level=
stacked.unstack(level=0)
       0  1  2  3  4
data1  0  9  1  8  9
data2  8  5  1  5  4

5.5 Data transformation

import numpy as np
import pandas as pd
5.5.1 Duplicate data
df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
                       'data2' : np.random.randint(0, 4, 8)})
df_obj
  data1  data2
0     a      1
1     a      3
2     a      3
3     a      3
4     b      3
5     b      3
6     b      1
7     b      1
df_obj.duplicated()
0    False
1    False
2     True
3     True
4    False
5     True
6    False
7     True
dtype: bool
df_obj.drop_duplicates()
  data1  data2
0     a      1
1     a      3
4     b      3
6     b      1
df_obj.drop_duplicates('data2')
  data1  data2
0     a      1
1     a      3
5.5.2 The map function
ser_obj = pd.Series(np.random.randint(0,10,10))
ser_obj
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
ser_obj.map(lambda x : x ** 2)
0    36
1    25
2    16
3    16
4    64
5     1
6    36
7    25
8    16
9     1
dtype: int64
5.5.3 Replacing values with replace
# Replace a single value
ser_obj.replace(0, -100)
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
# Replace several values with a single value
ser_obj.replace([0, 2], -100)
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
# Replace several values pairwise
ser_obj.replace([0, 2], [-100, -200])
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
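`replace` also accepts a dict, which maps each old value to its replacement in one call. A sketch with illustrative data (the series here does contain 0 and 2, so the replacement is visible):

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 1, 0])

# A dict maps each old value to its replacement
out = ser.replace({0: -100, 2: -200})
print(out.tolist())  # [-100, 1, -200, 1, -100]
```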

| —: | ----: | ----: |
| 0 | 0 | 8 |
| 1 | 9 | 5 |
| 2 | 1 | 1 |
| 3 | 8 | 5 |
| 4 | 9 | 4 |

# 通过level指定操作索引的级别
stacked.unstack(level=0)
01234
data109189
data285154

5.5 数据转换

import numpy as np
import pandas as pd
5.5.1 重复数据
df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
                       'data2' : np.random.randint(0, 4, 8)})
df_obj
data1data2
0a1
1a3
2a3
3a3
4b3
5b3
6b1
7b1
df_obj.duplicated()
0    False
1    False
2     True
3     True
4    False
5     True
6    False
7     True
dtype: bool
df_obj.drop_duplicates()
data1data2
0a1
1a3
4b3
6b1
df_obj.drop_duplicates('data2')
data1data2
0a1
1a3
5.5.2 map函数
ser_obj = pd.Series(np.random.randint(0,10,10))
ser_obj
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
ser_obj.map(lambda x : x ** 2)
0    36
1    25
2    16
3    16
4    64
5     1
6    36
7    25
8    16
9     1
dtype: int64
5.5.4 数据替换repalce
# 替换单个值
ser_obj.replace(0, -100)
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
# 替换多个值
ser_obj.replace([0, 2], -100)
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32
# 替换多个值
ser_obj.replace([0, 2], [-100, -200])
0    6
1    5
2    4
3    4
4    8
5    1
6    6
7    5
8    4
9    1
dtype: int32