pandas: The Basics

Latest recommended article published 2022-08-13 20:45:33

Dis_illusion


Article link: https://blog.csdn.net/qq_36733722/article/details/103035805

Introduction

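Pandas is a core data analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and intuitive. Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python, and has the broader goal of becoming the most powerful and flexible open source data analysis tool available in any language. After years of steady work, pandas is well on its way toward this goal.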

Usage

A plain import pandas as pd is all it takes to start using pandas. The examples that call np additionally assume import numpy as np.

In [2]: import pandas as pd

In [3]: df = pd.DataFrame()

In [4]: df
Out[4]:
Empty DataFrame
Columns: []
Index: []

Data structures

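Name	Dimensions	Description
Series	1	Labeled one-dimensional homogeneously-typed array
DataFrame	2	Labeled, size-mutable, two-dimensional table with potentially heterogeneously-typed columns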

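Series
A Series is a one-dimensional labeled array: a sequence of values together with their data labels (the index).
DataFrame
A DataFrame is a rectangular table of data. It holds an ordered collection of columns, each of which may have a different value type. It has both a row index and a column index, and can be thought of as a dict of Series that share one index.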

Basic operations

Series

A Series can be created from a simple array-like object:

In [3]: a = pd.Series([5,-3,7,1.4])
In [4]: a
Out[4]:
0    5.0
1   -3.0
2    7.0
3    1.4
dtype: float64

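As the output above shows, the string representation of a Series puts the index on the left and the values on the right. Both are available through the index and values attributes of the Series: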

In [5]: a.index
Out[5]: RangeIndex(start=0, stop=4, step=1)

In [6]: a.values
Out[6]: array([ 5. , -3. ,  7. ,  1.4])

If the default index is not what we want, we can define our own. The code below sets a custom index.

In [8]: a = pd.Series([5,-3,7,1.4],index = ['a','b','c','d'])

In [9]: a
Out[9]:
a    5.0
b   -3.0
c    7.0
d    1.4
dtype: float64

As with NumPy, we can use the index to read and modify a Series.

# Single label
In [14]: a['a']
Out[14]: 5.0
# Several labels at once, passed as a list
In [15]: a[['a','c']]
Out[15]:
a    5.0
c    7.0
dtype: float64
# Assignment
In [16]: a[['a','c']] = 5,9
In [17]: a
Out[17]:
a    5.0
b   -3.0
c    9.0
d    1.4
dtype: float64

Beyond that, a Series can be filtered with a NumPy-style boolean mask, and, as in NumPy, arithmetic in pandas is vectorized, so no for loop is needed.

# Filter with a boolean mask
In [19]: a[a>2]
Out[19]:
a    5.0
c    9.0
dtype: float64

# Arithmetic applies to every element
In [20]: a*2
Out[20]:
a    10.0
b    -6.0
c    18.0
d     2.8
dtype: float64

If the data is stored in a dict, a Series can be created from it directly.

In [28]: dic = {'a':1,'b':5,'asd':-5,'c':7}

In [29]: pd.Series(dic)
Out[29]:
a      1
b      5
asd   -5
c      7
dtype: int64

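By default the Series follows the key order of the dict. To impose a specific order, pass a custom index; any index label that is not a key in the dict gets the missing value NaN (not a number).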

In [31]:a = pd.Series(dic,index= ['a','b','c','d'])
In [32]:a
Out[32]:
a    1.0
b    5.0
c    7.0
d    NaN
dtype: float64

# Use the isnull and notnull methods to detect missing values.
In [38]: a.isnull()
Out[38]:
a    False
b    False
c    False
d     True
dtype: bool

In [39]: a.notnull()
Out[39]:
a     True
b     True
c     True
d    False
dtype: bool

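The most important feature of a Series is that arithmetic automatically aligns data by index label. What does that mean? Suppose Series A and B have indexes ['a','b','c'] and ['b','c','d']. An operation on A and B lines the values up by label automatically, somewhat like a join in a database.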

In [4]: A = pd.Series([1,5,-7], index = ['a','b','c'])
In [5]: B = pd.Series([2,5,-2], index = ['b','c','d'])

In [6]: A+B
Out[6]:
a    NaN
b    7.0
c   -2.0
d    NaN
dtype: float64
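
As a rough sketch of what alignment does, the snippet below repeats the addition and also computes which labels exist on both sides. The variable names total and both are illustrative only.

```python
import pandas as pd

A = pd.Series([1, 5, -7], index=['a', 'b', 'c'])
B = pd.Series([2, 5, -2], index=['b', 'c', 'd'])

total = A + B                          # aligned on the union of both indexes
both = A.index.intersection(B.index)   # labels present on both sides

print(total)
print(list(both))
```

Only the labels in both carry a real result; every other label in the union is NaN.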

Both the Series object and its index have a name attribute; it will come up again later.

In [13]: a = pd.Series([1,-2,4])
In [14]: a.name = 'num'
In [15]: a.index.name = 'ind'

In [16]: a
Out[16]:
ind
0    1
1   -2
2    4
Name: num, dtype: int64

To change the index, assign to it as shown below. Because the index is replaced, the index name is lost as well.

In [17]: a.index = ['a','b','c']

In [18]: a
Out[18]:
a    1
b   -2
c    4
Name: num, dtype: int64

DataFrame

A DataFrame is a tabular data structure; you can think of it as a dict of Series that share a single index.
A DataFrame can be built in the following ways:

# Pass a dict of equal-length lists or NumPy arrays
In [6]: dic = {'name':['zhao','qian','sun'],
   ...:         'old':[20,18,19],
   ...:         'sex':['male','female','male']}
In [7]: df = pd.DataFrame(dic)

In [8]: df
Out[8]:
   name  old     sex
0  zhao   20    male
1  qian   18  female
2   sun   19    male

# Build from a nested dict; missing entries are filled with NaN
In [52]: dic = {'name':{1:'zhang',2:'li'},'age':{1:24,2:23,0:19}}

In [53]: pd.DataFrame(dic)
Out[53]:
    name  age
1  zhang   24
2     li   23
0    NaN   19

When a DataFrame is too large to display in full and we only want a quick look at its layout, head and tail show the first and last five rows (the df here has only three rows).

In [12]: df.head()
Out[12]:
   name  old     sex
0  zhao   20    male
1  qian   18  female
2   sun   19    male

In [13]: df.tail()
Out[13]:
   name  old     sex
0  zhao   20    male
1  qian   18  female
2   sun   19    male

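When creating a DataFrame from a dict, we can also specify the column order (a column with no matching dict key is filled with NaN) and, as with a Series, supply a custom index: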

In [16]: pd.DataFrame(dic,columns = ['old','name','sex','time'],index = np.arange(1,4))
Out[16]:
   old  name     sex time
1   20  zhao    male  NaN
2   18  qian  female  NaN
3   19   sun    male  NaN

Reading columns
One or several columns of a DataFrame can be read with dict-like or attribute-like access.

In [17]: df['name']
Out[17]:
0    zhao
1    qian
2     sun
Name: name, dtype: object

In [18]: df.sex
Out[18]:
0      male
1    female
2      male
Name: sex, dtype: object

In [19]: df[['name','sex']]
Out[19]:
   name     sex
0  zhao    male
1  qian  female
2   sun    male

Reading rows
Rows can be read with the loc attribute.

In [26]: df.loc[1]
Out[26]:
name      qian
old         18
sex     female
Name: 1, dtype: object

Since we can read the data in a DataFrame, we can modify it too.

In [31]: df['old'] = [22,19,17]
In [32]: df.sex = pd.Series(['male','male','male'])
In [33]: df.loc[1] = ['li',14,'femal']

In [34]: df
Out[34]:
   name  old    sex
0  zhao   22   male
1    li   14  femal
2   sun   17   male

How do we modify, insert, or delete a row or a column?

se = pd.Series(['asd','asw','df'])
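# Append a row (DataFrame.append was removed in pandas 2.0; this transcript comes from an older version).
# The labels of se (0, 1, 2) do not match the columns of df, so new columns are created.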
In [37]: df.append(se,ignore_index=True)
Out[37]:
   name   old    sex    0    1    2
0  zhao  22.0   male  NaN  NaN  NaN
1    li  14.0  femal  NaN  NaN  NaN
2   sun  17.0   male  NaN  NaN  NaN
3   NaN   NaN    NaN  asd  asw   df
# Append a row that carries its own label through se.name
In [38]: se.name = 3
In [39]: df.append(se)
Out[39]:
   name   old    sex    0    1    2
0  zhao  22.0   male  NaN  NaN  NaN
1    li  14.0  femal  NaN  NaN  NaN
2   sun  17.0   male  NaN  NaN  NaN
3   NaN   NaN    NaN  asd  asw   df
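
DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the calls above fail on current versions. A minimal sketch of the usual replacement, pd.concat, follows; the row values ('wang', 30, 'male') are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['zhao', 'li', 'sun'],
                   'old': [22, 14, 17],
                   'sex': ['male', 'femal', 'male']})

# A one-row DataFrame whose columns match df, so the values land
# under the existing columns instead of creating new ones.
new_row = pd.DataFrame([{'name': 'wang', 'old': 30, 'sex': 'male'}])  # made-up row

extended = pd.concat([df, new_row], ignore_index=True)
print(extended)
```

Like append, concat returns a new DataFrame and leaves df itself unchanged.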

# Add a column; note that a new column cannot be created with attribute syntax (df.test = ...)
In [40]: df['test']=se
In [41]: df
Out[41]:
   name  old    sex test
0  zhao   22   male  asd
1    li   14  femal  asw
2   sun   17   male   df
# Assign by index label: values are aligned on the Series index, and rows with no match become NaN
In [42]: df['old']=pd.Series([10,24],index = [2,1])
In [43]: df
Out[43]:
   name   old    sex test
0  zhao   NaN   male  asd
1    li  24.0  femal  asw
2   sun  10.0   male   df

# Delete a column
In [50]: del df['test']

In [51]: df
Out[51]:
   name   old    sex
0  zhao   NaN   male
1    li  24.0  femal
2   sun  10.0   male

A DataFrame can also be transposed, using the same syntax as a NumPy array:

In [55]: df.T
Out[55]:
         0      1     2
name  zhao     li   sun
old    NaN     24    10
sex   male  femal  male

If the name attributes of the DataFrame's index and columns are set, they are displayed as well:

In [57]: df
Out[57]:
   name   old    sex
0  zhao   NaN   male
1    li  24.0  femal
2   sun  10.0   male

# Name the row index and the column index
In [58]: df.index.name = 'num'
In [59]: df.columns.name = 'state'

In [60]: df
Out[60]:
state  name   old    sex
num
0      zhao   NaN   male
1        li  24.0  femal
2       sun  10.0   male

The row labels, column labels, and values of a DataFrame can be obtained as follows.

# Row labels
In [64]: df.index
Out[64]: RangeIndex(start=0, stop=3, step=1, name='num')
# Column labels
In [65]: df.columns
Out[65]: Index(['name', 'old', 'sex'], dtype='object', name='state')
# Values
In [66]: df.values
Out[66]:
array([['zhao', nan, 'male'],
       ['li', 24.0, 'femal'],
       ['sun', 10.0, 'male']], dtype=object)
# Labels can be tested for membership, e.g. 'name' in df.columns

Note that, unlike a Python set, a pandas Index can contain duplicate labels.

# Give every row of the DataFrame the same index label
In [70]: df.index = [0,0,0]
In [71]: df
Out[71]:
state  name   old    sex
0      zhao   NaN   male
0        li  24.0  femal
0       sun  10.0   male

Essential functionality

reindex: rebuilding the index

Series

pd.Series.reindex(self, index=None, **kwargs)

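Parameter	Meaning
index	Array-like new index. Values are taken from the original `Series`; labels that did not exist are filled with `NaN`.
method	How to fill the gaps; only applicable to a monotonically increasing or decreasing index. `None`: do not fill. `backfill`/`bfill`: fill a gap from the next valid value. `pad`/`ffill`: fill a gap from the previous valid value. `nearest`: fill from the nearest valid value (the index labels must support subtraction, e.g. numbers).
copy	Default `True`: return a new object even if the passed index is identical. With `False`, the original object may be returned when the new index equals the current one.
level	Broadcast across a level, matching index values on the passed `MultiIndex` level.
fill_value	Value to use for missing entries. Defaults to `NaN`, but can be any "compatible" value.
limit	Maximum number of consecutive elements to fill (counted in the reindexed result).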

In [9]: s =pd.Series([2,7,3,-2])
# Assigning to index modifies the original Series
In [10]: s.index = [1,2,3,4]
In [11]: s
Out[11]:
1    2
2    7
3    3
4   -2
dtype: int64
# reindex instead creates a new object with the new index;
# labels that did not exist get NaN
In [12]: s.reindex([1,2,3,'a'])
Out[12]:
1    2.0
2    7.0
3    3.0
a    NaN
dtype: float64

We can fill the gaps with a fixed value via fill_value, or use method to fill them from the neighboring existing values.

In [6]: a = pd.Series([ 1,  5,  8,  4, -2,  3,  7,  9, -4],
   ...:    index =['a','b','c','d','e','f','g','h','i'])
# Fill the gaps with the given value
In [7]: a.reindex(index = ['a','e','r','d'],fill_value = 0)
Out[7]:
a    1
e   -2
r    0
d    4
dtype: int64
# Fill each gap from the previous value
In [8]: a.reindex(index = ['a','y','z','r','d'],method = 'ffill')
Out[8]:
a    1
y   -4
z   -4
r   -4
d    4
dtype: int64

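Why is the filled value -4 rather than 1? The method argument only works on a monotonically increasing or decreasing index, and filling follows the sort order of the original index, not the order of the labels passed in. The original index runs from a to i, so the last valid label before y, z, and r is i, whose value is -4. Here is a more typical use: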

In [4]: a = pd.Series([1,5,8,4],index = ['a','e','f','g'])
In [5]: b = pd.Series([1,5,8,4],index = [0,4,5,6])

# ffill: fill each gap from the previous valid value
In [6]: a.reindex(index = ['a','b','c','e','f'],method = 'ffill')
Out[6]:
a    1
b    1
c    1
e    5
f    8
dtype: int64

# bfill: fill each gap from the next valid value
In [7]: a.reindex(index = ['a','b','c','e','f'],method = 'bfill')
Out[7]:
a    1
b    5
c    5
e    5
f    8
dtype: int64

# nearest: fill each gap from the nearest valid label.
# When a gap is equally far from both neighbors, the bfill side wins, as at label 2 below.
In [8]: b.reindex(index = [0,1,2,3,4,5],method = 'nearest')
Out[8]:
0    1
1    1
2    5
3    5
4    5
5    8
dtype: int64

# Use limit to cap the number of consecutive fills.
In [8]: b.reindex(index = [0,1,2,3,4,5],method ='nearest',limit=1)
Out[8]:
0    1.0
1    1.0
2    NaN
3    5.0
4    5.0
5    8.0
dtype: float64
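
To compare the fill strategies side by side, here is a small sketch on the same Series b. The names target, forward, backward, and constant are illustrative only.

```python
import pandas as pd

b = pd.Series([1, 5, 8, 4], index=[0, 4, 5, 6])
target = [0, 1, 2, 3, 4, 5]

forward = b.reindex(target, method='ffill')    # carry the previous value forward
backward = b.reindex(target, method='bfill')   # pull the next value backward
constant = b.reindex(target, fill_value=0)     # plain constant, neighbors ignored

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

Because the index of b is monotonically increasing, "previous" and "next" are well defined for every new label.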

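With copy=True (the default), reindex returns a new object even when the index passed in is identical to the current one. With copy=False, the original object may be returned, but only when the new index equals the current one; whenever the index actually changes, a new object has to be built either way. In the example below the index changes, so modifying either result leaves the original data untouched.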

In [15]: a = pd.Series([5,1,-7,3])
In [16]: copy_true = a.reindex(np.arange(1,5),copy = True)
In [17]: copy_false = a.reindex(np.arange(1,5),copy = False)
# Modifying copy_true leaves the original data unchanged
In [18]: copy_true[2] = 999
In [19]: copy_true
Out[19]:
1      1.0
2    999.0
3      3.0
4      NaN
dtype: float64

In [20]: a
Out[20]:
0    5
1    1
2   -7
3    3
dtype: int64
# Modifying copy_false also leaves the original unchanged here, because the index differs and a new object was built
In [21]: copy_false[2] = 999
In [22]: copy_false
Out[22]:
1      1.0
2    999.0
3      3.0
4      NaN
dtype: float64

In [23]: a
Out[23]:
0    5
1    1
2   -7
3    3
dtype: int64
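
A small check of the point that matters in practice: when reindex is given labels that differ from the current index, the result is a separate object, so writing to it does not touch the original. The name shifted is illustrative only.

```python
import numpy as np
import pandas as pd

a = pd.Series([5, 1, -7, 3])
shifted = a.reindex(np.arange(1, 5))   # label 4 is new, so a new object is built

shifted.loc[2] = 999                   # write to the reindexed result only

print(shifted.tolist())
print(a.tolist())
```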

DataFrame

pd.DataFrame.reindex(
    self,
    labels=None,
    index=None,
    columns=None,
    axis=None,
    method=None,
    copy=True,
    level=None,
    fill_value=nan,
    limit=None,
    tolerance=None,
)

The parameters are described below; those that behave as they do for Series are not repeated.

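Parameter	Description
labels	New labels/index that the axis given by `axis` is made to conform to.
axis	The axis the labels apply to: an axis name (`index`, `columns`) or number (`0`, `1`).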

As with Series, labels with no existing data get NaN by default, and fill_value can be used to fill them instead.

In [12]: df = pd.DataFrame(np.arange(9).reshape(3,3),
    ...:                   index = ['a','b','c'],
    ...:                   columns = ['A','B','C'])
# Original object
In [13]: df
Out[13]:
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8
# Missing entries are filled with NaN by default
In [14]: df.reindex(index=['b','e'],columns=['A','C','D'])
Out[14]:
     A    C   D
b  3.0  5.0 NaN
e  NaN  NaN NaN
# Fill missing entries with the given value
In [15]: df.reindex(index=['b','e'],columns=['A','C','D'],fill_value = -1)
Out[15]:
   A  C  D
b  3  5 -1
e -1 -1 -1

Filling can also be done with method.

# Fill with ffill; bfill and nearest work analogously
In [16]: df.reindex(index=['b','e'],columns=['A','C','D'],method='ffill')
Out[16]:
   A  C  D
b  3  5  5
e  6  8  8

Note that limit is counted on the reindexed data: it caps how many consecutive new labels are filled.

In [21]: df.reindex(index=['a','b','c','d','e'],
    ...:            columns=['A','B','C','D'],
    ...:            method='ffill',limit = 1)
Out[21]:
     A    B    C    D
a  0.0  1.0  2.0  2.0
b  3.0  4.0  5.0  5.0
c  6.0  7.0  8.0  8.0
d  6.0  7.0  8.0  8.0
e  NaN  NaN  NaN  NaN

In [22]: df.reindex(index=['b','e'],
    ...:            columns=['A','C','D'],
    ...:            method='ffill',limit=1)
Out[22]:
   A  C  D
b  3  5  5
e  6  8  8

With axis we can choose which axis the positional list of labels applies to.

In [24]: df
Out[24]:
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8
# The default axis is the index, so 'A' is looked up among the row labels and not found
In [25]: df.reindex(['A'])
Out[25]:
    A   B   C
A NaN NaN NaN
# Use axis='columns' or axis=1 to target the column labels
In [26]: df.reindex(['A'],axis = 1)
Out[26]:
   A
a  0
b  3
c  6

loc and iloc: label-based and position-based indexing

loc
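Like reindex, loc selects by label. The difference is that loc works on the original object, so assigning through it modifies the original data, whereas reindex returns a new object. loc also accepts boolean masks, much like NumPy. Label slices include both endpoints, unlike ordinary Python slices, which exclude the right endpoint. The following label types are accepted: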

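Type	Explanation
Single label	e.g. `2` or `'a'`. Here `2` is a label, not an integer position (the code below shows the difference)
List or array	A list or array of labels, e.g. `['a','c','d']`
Slice	A slice of labels, e.g. `'a':'f'`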

Series

In [28]: a = pd.Series([5,8,6,-7,3],index = ['a','b','c','d','e'])
In [29]: b = pd.Series([5,8,6,-7,3],index = range(0,5))
# Slicing by integer position is left-closed, right-open
In [30]: b[:3]
Out[30]:
0    5
1    8
2    6
dtype: int64
# Slicing by label: the numbers here are labels of integer type,
# not positions. Label slices are closed on both ends.
In [31]: b.loc[:3]
Out[31]:
0    5
1    8
2    6
3   -7
dtype: int64
# Single label
In [32]: b.loc[3]
Out[32]: -7
In [33]: a.loc['c']
Out[33]: 6

# A list of labels
In [35]: a.loc[['c','b','d']]
Out[35]:
c    6
b    8
d   -7
dtype: int64
In [36]: b.loc[[1,4,2]]
Out[36]:
1    8
4    3
2    6
dtype: int64

# Slices
In [37]: a.loc[:'d']
Out[37]:
a    5
b    8
c    6
d   -7
dtype: int64
In [38]: b.loc[:4]
Out[38]:
0    5
1    8
2    6
3   -7
4    3
dtype: int64

DataFrame

In [10]: df = pd.DataFrame(np.arange(12).reshape(4,3),
    ...:                 index=['a','b','c','d'],
    ...:                 columns=['A','B','C'])

# Single label; note that the row comes back as a Series
In [11]: df.loc['a']
Out[11]:
A    0
B    1
C    2
Name: a, dtype: int32
# Use [[ ]] to get it back as a DataFrame
In [12]: df.loc[['a']]
Out[12]:
   A  B  C
a  0  1  2
# A row label and a column label together pick out one value
In [13]: df.loc['a','A']
Out[13]: 0
# Slices and labels can be combined
In [14]: df.loc['a':'c','A']
Out[14]:
a    0
b    3
c    6
Name: A, dtype: int32
# Select rows with a boolean list (one flag per row)
In [15]: df.loc[[True,False,True,False]]
Out[15]:
   A  B  C
a  0  1  2
c  6  7  8
# Get a column
In [22]: df.loc[:,'A']
Out[22]:
a    0
b    3
c    6
d    9

iloc

iloc indexes purely by integer position. The accepted input types are listed below.

Type	Description
Single integer	e.g. `5`
Array or list	e.g. `[4,3,0]`
Slice	e.g. `1:7`

In [4]: se = pd.Series([3,1,-5,7])

In [5]: df = pd.DataFrame(np.arange(12).reshape(4,3),
   ...:                 index=['a','b','c','d'],
   ...:               columns=['A','B','C'])

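With iloc, a one-dimensional Series takes a single integer or a list of integers; a two-dimensional DataFrame takes up to two indexers (row, column), each a single integer or a list.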

# A single integer returns one value from a Series
In [6]: se.iloc[0]
Out[6]: 3
# For a DataFrame it returns one row as a Series
In [7]: df.iloc[0]
Out[7]:
A    0
B    1
C    2
Name: a, dtype: int32
# Pass a list to get the row back as a DataFrame
In [8]: df.iloc[[0]]
Out[8]:
   A  B  C
a  0  1  2

# Note that two integers passed together are
# the row position and the column position
In [9]: df.iloc[0,1]
Out[9]: 1

# Likewise, pass a list on each dimension to get a DataFrame back
In [10]: df.iloc[[0],[1]]
Out[10]:
   B
a  1

# So two indexers raise an error for a one-dimensional Series
se.iloc[0,1]
IndexingError: Too many indexers

# To get several entries along one dimension,
# pass a list in that position
In [12]: se.iloc[[0,1]]
Out[12]:
0    3
1    1
dtype: int64
In [13]: df.iloc[[0,1]]
Out[13]:
   A  B  C
a  0  1  2
b  3  4  5

The following code shows indexing with slices.

In [14]: se.iloc[:2]
Out[14]:
0    3
1    1
dtype: int64

# For a DataFrame, each dimension can be sliced
In [15]: df.iloc[:2,1:]
Out[15]:
   B  C
a  1  2
b  4  5

Indexing with a boolean mask (the lengths must match).

In [20]: se.iloc[[True,False,True,False]]
Out[20]:
0    3
2   -5
dtype: int64
In [21]: df.iloc[[True,True,False,False],[False,False,True]]
Out[21]:
   C
a  2
b  5

A callable such as a lambda also works; it is called with the Series or DataFrame itself.

In [22]: se.iloc[lambda se:se.index%2==0]
Out[22]:
0    3
2   -5
dtype: int64

In [23]: df.iloc[:,lambda df:[1,2]]
Out[23]:
    B   C
a   1   2
b   4   5
c   7   8
d  10  11
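
To make the contrast between the two indexers concrete, here is a minimal sketch on a Series whose integer labels deliberately differ from its positions. The labels 10 to 40 are invented for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0..3.
s = pd.Series([5, 8, 6, -7], index=[10, 20, 30, 40])

by_label = s.loc[10:30]      # label slice: both ends included
by_position = s.iloc[0:2]    # position slice: end excluded

print(by_label.tolist())
print(by_position.tolist())
```

Spelling out loc or iloc avoids any ambiguity about whether an integer means a label or a position.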

drop: deleting entries along an axis

drop removes entries from an axis by label.

Series

In [10]: se
Out[10]:
a    5
b    7
c   -3
d   -6
dtype: int64
# Drop a single element
In [11]: se.drop('a')
Out[11]:
b    7
c   -3
d   -6
dtype: int64
# Drop several elements
In [12]: se.drop(['a','c'])
Out[12]:
b    7
d   -6
dtype: int64
# drop returns a new object; the original is not modified
In [13]: se
Out[13]:
a    5
b    7
c   -3
d   -6
dtype: int64
# inplace defaults to False; set inplace=True to delete
# from the original object, in which case nothing is returned
In [14]: se.drop(['a','c'],inplace = True)

In [15]: se
Out[15]:
b    7
d   -6
dtype: int64

DataFrame

In [20]: df
Out[20]:
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
# Drop several columns
In [21]: df.drop(['A','B'],axis=1)
Out[21]:
    C
0   2
1   5
2   8
3  11
# Drop several rows
In [22]: df.drop([1,3])
Out[22]:
   A  B  C
0  0  1  2
2  6  7  8
# Drop a single row
In [23]: df.drop(1)
Out[23]:
   A   B   C
0  0   1   2
2  6   7   8
3  9  10  11
# Drop a single column
In [24]: df.drop('A',axis =1)
Out[24]:
    B   C
0   1   2
1   4   5
2   7   8
3  10  11
In [25]: df.drop('A',axis ='columns')
Out[25]:
    B   C
0   1   2
1   4   5
2   7   8
3  10  11
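
A short sketch of the same operations in script form. The columns= keyword used here is an alternative spelling of axis=1 that the transcript above does not show; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 3, 6], 'B': [1, 4, 7], 'C': [2, 5, 8]})

without_rows = df.drop([0, 2])            # default axis=0: drop row labels
without_cols = df.drop(columns=['A'])     # same effect as df.drop('A', axis=1)

print(without_rows)
print(without_cols)
print(df.shape)                           # the original is untouched
```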

Indexing, selection, and filtering

Series
Label slices are closed on both ends. With a single value or a sequence you can pick one or several values out of a Series.

In [28]: se
Out[28]:
a    5
b    1
c    7
d   -6
# Positional slice: left-closed, right-open
In [29]: se[1:3]
Out[29]:
b    1
c    7
# Label slice: closed on both ends
In [30]: se['a':'c']
Out[30]:
a    5
b    1
c    7
# Integer positions, negative ones included (this positional fallback is deprecated in recent pandas; prefer se.iloc)
In [31]: se[1]
Out[31]: 1
In [32]: se[[1,3]]
Out[32]:
b    1
d   -6

# By label
In [33]: se['a']
Out[33]: 5
In [34]: se[['a','c']]
Out[34]:
a    5
c    7
# Assigning through an index modifies the original object
In [36]: se[[1,3]] = 5
In [37]: se
Out[37]:
a    5
b    5
c    7
d    5
# Boolean indexing also works
In [38]: se[se==7]
Out[38]:
c    7

DataFrame

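Type	Description
df[val]	Select a single column or a sequence of columns from the `DataFrame`; special cases: boolean array (filter rows), slice (slice rows), or boolean `DataFrame`
df.loc[val]	Select a single row or several rows by label
df.loc[:,val]	Select a single column or several columns by label
df.loc[val1,val2]	Select a single value by row and column label
df.iloc[where]	Select a single row or several rows by integer position
df.iloc[:,where]	Select a single column or several columns by integer position
df.iloc[where_i,where_j]	Select a single value by integer position
df.at[label_i,label_j]	Select a single value by row and column label
df.iat[i,j]	Select a single value by row and column integer position
reindex method	Select rows or columns by label
get_value, set_value methods	Get or set a single value by row and column label (removed in pandas 1.0; use `at`/`iat`)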

With a single value or a sequence you can select one or several columns of a DataFrame.

In [50]: df
Out[50]:
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
# A single value
In [51]: df['A']
Out[51]:
0    0
1    3
2    6
3    9
Name: A, dtype: int32
# A column slice inside a list is not valid syntax
In [52]: df[['A':'B']]
SyntaxError: invalid syntax
# A sequence of labels
In [53]: df[['A','B']]
Out[53]:
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

What if we want rows of a DataFrame instead? The code below shows how.

# Integer slices select rows; note that a single integer is not allowed here, because it is treated as a column label
In [62]: df[:2]
Out[62]:
   A  B  C
0  0  1  2
1  3  4  5

# Boolean indexing works too
In [63]: df[df['A']>2]
Out[63]:
   A   B   C
1  3   4   5
2  6   7   8
3  9  10  11

Besides the approaches above, we can index with iloc and loc; see the discussion of iloc and loc earlier in this article.

Axis indexes with duplicate labels

Series and DataFrame do not require labels to be unique, so use the index's is_unique property to check.

In [33]: se = pd.Series([2,4,-1,8],index=['a','a','b','c'])
In [34]: df = pd.DataFrame(np.arange(9).reshape(3,3),index=['a','b','a'])
# Check whether the index labels are unique
In [35]: se.index.is_unique
Out[35]: False
In [36]: df.index.is_unique
Out[36]: False

In [35]: se['a']
Out[35]:
a    2
a    4
In [42]: df.loc['a']
Out[42]:
   0  1  2
a  0  1  2
a  6  7  8

Selecting DataFrame rows by column value

# Rows whose column equals a value: use ==
df.loc[df['column_name'] == some_value]

# Rows whose column value is one of several values: use isin
df.loc[df['column_name'].isin(some_values)]

# Combine several conditions with &
df.loc[(df['column'] == some_value) & df['other_column'].isin(some_values)]

# Rows whose column differs from a value: use !=
df.loc[df['column_name'] != some_value]

# isin returns a boolean Series; use ~ to select the rows that do not match
df.loc[~df['column_name'].isin(some_values)]
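
The patterns above use placeholder names. A runnable sketch with made-up data (the city and year columns are invented for illustration) might look like this:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Oslo', 'Rome', 'Lima', 'Rome'],
                   'year': [2019, 2020, 2020, 2021]})   # made-up data

rome = df.loc[df['city'] == 'Rome']                      # equality
recent = df.loc[df['year'].isin([2020, 2021])]           # membership
rome_recent = df.loc[(df['city'] == 'Rome') & df['year'].isin([2021])]
not_rome = df.loc[~df['city'].isin(['Rome'])]            # negated membership

print(rome)
print(recent)
print(rome_recent)
print(not_rome)
```

The parentheses around each comparison matter, because & binds more tightly than == in Python.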

Arithmetic operations

Operations between objects of the same type

As mentioned above, when two objects are added the result's index is the union of the two indexes, much like an outer join in a database.

In [3]: se_a = pd.Series([9,-1,3,4],index=['a','c','e','f'])
In [4]: se_b = pd.Series([8,4,6,-7],index=['a','b','c','d'])
In [5]: df_a = pd.DataFrame(np.arange(6).reshape(2,3),
   ...:                     columns=['A','C','D'])
In [6]: df_b = pd.DataFrame(np.arange(6).reshape(2,3),
   ...:                     columns=['A','B','C'])

Values are aligned on matching labels during the operation; positions with no match are filled with NaN.

In [7]: se_a+se_b
Out[7]:
a    17.0
b     NaN
c     5.0
d     NaN
e     NaN
f     NaN
dtype: float64
In [8]: df_a+df_b
Out[8]:
   A   B  C   D
0  0 NaN  3 NaN
1  6 NaN  9 NaN

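Because NaN propagates, you may want to substitute a value for the missing entries during the operation. The arithmetic methods below accept a fill_value for this. Each method also has a counterpart prefixed with r whose operands are reversed; for example, a.div(b) and b.rdiv(a) are equivalent.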

Method	Operation
add/radd	Addition (+)
sub/rsub	Subtraction (-)
div/rdiv	Division (/)
floordiv/rfloordiv	Floor division (//)
mul/rmul	Multiplication (*)
pow/rpow	Exponentiation (**)

# Fill missing values
In [10]: df_a.add(df_b,fill_value=10)
Out[10]:
   A     B  C     D
0  0  11.0  3  12.0
1  6  14.0  9  15.0
# Division and rdiv
In [11]: 1/df_a
Out[11]:
          A     C    D
0       inf  1.00  0.5
1  0.333333  0.25  0.2

In [12]: df_a.rdiv(1)
Out[12]:
          A     C    D
0       inf  1.00  0.5
1  0.333333  0.25  0.2
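
The same ideas apply to Series. The sketch below reuses se_a and se_b from above; the names plain, filled, and same are illustrative only.

```python
import pandas as pd

se_a = pd.Series([9, -1, 3, 4], index=['a', 'c', 'e', 'f'])
se_b = pd.Series([8, 4, 6, -7], index=['a', 'b', 'c', 'd'])

plain = se_a + se_b                     # NaN wherever one side lacks the label
filled = se_a.add(se_b, fill_value=0)   # treat the missing side as 0

# The reversed method swaps the operands: x.rsub(y) computes y - x.
same = se_a.sub(se_b, fill_value=0).equals(se_b.rsub(se_a, fill_value=0))

print(plain)
print(filled)
print(same)
```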

Operations between Series and DataFrame

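Arithmetic between a Series and a DataFrame resembles arithmetic between NumPy arrays of different dimensions, where NumPy broadcasts the operation over every row.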

In [13]: arr_a = np.arange(6).reshape(3,2)
In [14]: arr_b = np.array([2,5])
# Broadcasting
In [15]: arr_a-arr_b
Out[15]:
array([[-2, -4],
       [ 0, -2],
       [ 2,  0]])

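Similarly, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts down the rows.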

In [19]: se = pd.Series([5,7,1],index=['A','B','C'])
In [20]: df = pd.DataFrame(np.arange(9).reshape(3,3),
    ...:                   columns=['A','B','C'])

In [21]: df-se
Out[21]:
   A  B  C
0 -5 -6  1
1 -2 -3  4
2  1  0  7

# If the Series has labels that the DataFrame lacks,
# both objects are reindexed to the union of the labels
In [22]: se_1 = pd.Series([5,7,1,2],
    ...:                  index=['A','B','C','D'])
In [23]: df-se_1
Out[23]:
   A  B  C   D
0 -5 -6  1 NaN
1 -2 -3  4 NaN
2  1  0  7 NaN

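To match on the row index and broadcast across the columns instead, you must use an arithmetic method with axis='index':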

In [29]: se_2 = df['A']
In [30]: se_2
Out[30]:
0    0
1    3
2    6
Name: A, dtype: int32

In [31]: df.sub(se_2,axis='index')
Out[31]:
   A  B  C
0  0  1  2
1  0  1  2
2  0  1  2
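
Both broadcasting directions can be put next to each other in a short sketch; row_wise and col_wise are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['A', 'B', 'C'])

row_wise = df - df.iloc[0]                 # Series index matches the columns
col_wise = df.sub(df['A'], axis='index')   # Series index matches the row labels

print(row_wise)
print(col_wise)
```

The plain operator always matches on columns, which is why the second form needs the method call with an explicit axis.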

Function application and mapping

NumPy functions
NumPy universal functions (element-wise array methods) also work on pandas objects.

In [3]: df = pd.DataFrame(np.random.randn(4,3),
   ...:             index=['a','b','c','d'],
   ...:             columns=['A','B','C'])

In [4]: df
Out[4]:
          A         B         C
a -0.584069  0.114854 -2.415498
b -0.550652  0.395374 -1.372510
c -0.315824  0.258919 -0.056640
d  0.036870 -0.445996  1.435676

In [5]: np.abs(df)
Out[5]:
          A         B         C
a  0.584069  0.114854  2.415498
b  0.550652  0.395374  1.372510
c  0.315824  0.258919  0.056640
d  0.036870  0.445996  1.435676

apply
Applies a function along an axis of the DataFrame.

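Parameter	Description
func	Function applied to each column or row.
axis	Axis along which the function is applied: `0`/`index` (default) applies it to each column; `1`/`columns` applies it to each row.
raw	Default `False`: each row or column is passed to the function as a `Series`. If `True`, it is passed as an `ndarray`.
result_type	`'expand'`: list-like results are turned into columns. `'reduce'`: return a Series if possible rather than expanding list-like results; the opposite of `'expand'`. `'broadcast'`: results are broadcast to the original shape of the `DataFrame`, keeping the original index and columns. `None` (default): behavior depends on the return value of the applied function; list-like results are returned as a `Series` of those, but if the function returns a `Series` they are expanded to columns.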

In [28]: df
Out[28]:
   A  B
a  2 -5
b  2 -5
c  2 -5
# By default (axis=0) the function is applied to each column, giving column sums
In [29]: df.apply(np.sum)
Out[29]:
A     6
B   -15
dtype: int64
# With axis=1 it is applied to each row, giving row sums
In [30]: df.apply(np.sum,axis=1)
Out[30]:
a   -3
b   -3
c   -3
dtype: int64
# result_type=None (the default)
# list-like results are returned as a Series of lists
In [31]: df.apply(lambda x:[1,2])
Out[31]:
A    [1, 2]
B    [1, 2]
# result_type='expand'
# list-like results are expanded into columns;
# note that the resulting index labels change
In [32]: df.apply(lambda x:[1,2],result_type='expand')
Out[32]:
   A  B
0  1  1
1  2  2
In [33]: df.apply(lambda x:[1,2],axis=1,result_type='expand')
Out[33]:
   0  1
a  1  2
b  1  2
c  1  2
# result_type='broadcast'
# broadcast the result to the original shape; a list-like result
# must match that shape, while a scalar always fits
In [34]: df.apply(lambda x:[1,2],result_type='broadcast')
ValueError: cannot broadcast result
# [1,2] fits along axis=1 without changing the shape
In [35]: df.apply(lambda x:[1,2],result_type='broadcast',axis=1)
Out[36]:
   A  B
a  1  2
b  1  2
c  1  2
# A scalar never causes a shape problem
In [37]: df.apply(lambda x:1,result_type='broadcast')
Out[37]:
   A  B
a  1  1
b  1  1
c  1  1

Exercises
1. Find the maximum and minimum

In [46]: df
Out[46]:
          A         B         C
0 -0.364608 -0.925359  0.251871
1  1.308153 -0.983261  0.780449
2 -0.138446 -0.187765 -0.555508
3  0.358057  0.944677 -0.127748

In [47]: def f(x):
    ...:     return pd.Series([x.max(),x.min()],index=['max','min'])
    ...:

In [48]: df.apply(f)
Out[48]:
            A         B         C
max  1.308153  0.944677  0.780449
min -0.364608 -0.983261 -0.555508

2. Format every element to two decimal places

In [49]: df.applymap(lambda x:'%.2f' %x)
Out[49]:
       A      B      C
0  -0.36  -0.93   0.25
1   1.31  -0.98   0.78
2  -0.14  -0.19  -0.56
3   0.36   0.94  -0.13
# applymap is the element-wise DataFrame counterpart of Series.map (it was renamed DataFrame.map in pandas 2.1)
In [50]: df['A'].map(lambda x:'{:.2f}'.format(x))
Out[50]:
0    -0.36
1     1.31
2    -0.14
3     0.36
Name: A, dtype: object
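
A compact sketch of the axis argument and of element-wise formatting. It maps column by column with Series.map, one way to stay independent of the applymap rename; the data and variable names are made up.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.234, -5.678], 'B': [0.5, 2.25]}, index=['x', 'y'])

col_range = df.apply(lambda col: col.max() - col.min())          # one value per column
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)  # one value per row

# Element-wise formatting, done column by column with Series.map.
formatted = df.apply(lambda col: col.map('{:.2f}'.format))

print(col_range)
print(row_range)
print(formatted)
```

The result index shows which axis was used: column labels for axis=0, row labels for axis=1.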

Sorting and ranking

sort_index

To sort by the row or column labels in lexicographic order, use the sort_index method.

pd.DataFrame.sort_index(
   self,
   axis=0,
   level=None,
   ascending=True,
   inplace=False,
   kind='quicksort',
   na_position='last',
   sort_remaining=True,
   by=None,)

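Parameter	Description
axis	The axis whose labels are sorted: `index` or `columns`
level	`int` or level name, or a list of these; sort on the values of the given index level(s)
ascending	Sort ascending or descending; default `True` (ascending)
kind	Sorting algorithm; valid values are `quicksort`, `mergesort`, `heapsort`; default `quicksort`
na_position	Where to put `NaN`: `last` (default) puts them at the end, `first` at the beginning
sort_remaining	If `True`, when sorting a MultiIndex by level, the other levels are sorted as well.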

In [6]: df
Out[6]:
   C  A  D  B
b  0  1  2  3
a  4  5  6  7

In [7]: df.sort_index()
Out[7]:
   C  A  D  B
a  4  5  6  7
b  0  1  2  3

In [8]: df.sort_index(axis=1)
Out[8]:
   A  B  C  D
b  1  3  0  2
a  5  7  4  6

sort_values

To sort a Series by its values, use the sort_values method.

pd.Series.sort_values(
    self,
    axis=0,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',)

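Parameter	Description
ascending	Default `True` (ascending)
na_position	Where to put `NaN`: `last` (default) puts them at the end, `first` at the beginning
kind	Sorting algorithm; valid values are `quicksort`, `mergesort`, `heapsort`; default `quicksort`
inplace	Default `False`: operate on a copy and return the result; if `True`, perform the operation in place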

In [9]: se = pd.Series([4,7,-2,3])
# Sort ascending
In [10]: se.sort_values()
Out[10]:
2   -2
3    3
0    4
1    7
# Sort descending
In [11]: se.sort_values(ascending=False)
Out[11]:
1    7
0    4
3    3
2   -2

For a DataFrame, use by to name the column or columns to sort on; everything else works the same way.

In [18]: df.sort_values(by='a')
Out[18]:
   a  b  c
2 -1  1  5
1  2  7 -7
0  2  3  4
3  6  8  2

In [19]: df.sort_values(by=['a','b'])
Out[19]:
   a  b  c
2 -1  1  5
0  2  3  4
1  2  7 -7
3  6  8  2
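
The ascending argument also accepts a list, one flag per sort key. A minimal sketch on the same data, with the illustrative name ordered:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 2, -1, 6], 'b': [3, 7, 1, 8], 'c': [4, -7, 5, 2]})

# Sort by 'a' ascending and break ties with 'b' descending.
ordered = df.sort_values(by=['a', 'b'], ascending=[True, False])

print(ordered)
```

Rows keep their original index labels, so the new order can be read off the index.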

rank

Ranking assigns each element a rank from 1 up to the number of valid data points. The rank method of Series and DataFrame does this.

pd.Series.rank(
    self,
    axis=0,
    method='average',
    numeric_only=None,
    na_option='keep',
    ascending=True,
    pct=False,)

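Parameter	Description
method	`average`: average rank of the tied group. `min`: lowest rank in the group. `max`: highest rank in the group. `first`: ranks assigned in the order the values appear in the array. `dense`: like `min`, but the rank always increases by `1` between groups.
numeric_only	Default `None`; include only float, int, and boolean data.
na_option	`keep`: leave `NaN` values ranked as `NaN`. `top`: give `NaN` values the smallest rank. `bottom`: give `NaN` values the largest rank.
ascending	Default `True` (ascending); `False` for descending.
pct	If `True`, show the ranks as percentiles.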

In [23]: se = pd.Series([7,-5,7,4,2,0,4])
# The default is average
In [24]: se.rank()
Out[24]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5

# Ties take the lowest rank
In [25]: se.rank(method='min')
Out[25]:
0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0

# Ties take the highest rank
In [26]: se.rank(method='max')
Out[26]:
0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0

# Ties are ranked in order of appearance
In [27]: se.rank(method='first')
Out[27]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0

# Like min, but dense never skips a rank
In [28]: se.rank(method='dense')
Out[28]:
0    5.0
1    1.0
2    5.0
3    4.0
4    3.0
5    2.0
6    4.0

# Show as a percentage; can be combined with method
In [29]: se.rank(pct=True)
Out[29]:
0    0.928571
1    0.142857
2    0.928571
3    0.642857
4    0.428571
5    0.285714
6    0.642857
dtype: float64
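
To see the tie-breaking methods next to each other, the sketch below collects them into one table. The names ranks and descending are illustrative.

```python
import pandas as pd

se = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: se.rank(method=m)
                      for m in ['average', 'min', 'max', 'first', 'dense']})

# Rank from largest to smallest instead.
descending = se.rank(ascending=False, method='min')

print(ranks)
print(descending.tolist())
```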

Reductions and statistics

Reductions

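pandas objects carry a set of reduction methods similar to NumPy's. Unlike the corresponding NumPy array methods, they have built-in handling for missing values.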

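Parameter	Description
axis	Axis to reduce over: `0`/`index` reduces down the rows (one result per column); `1`/`columns` reduces across the columns (one result per row)
skipna	Exclude missing values; default `True`
level	If the axis is a `MultiIndex`, reduce along the given level, collapsing into a `Series`.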

In [9]: df
Out[9]:
      A    B
a  1.40  NaN
b  7.10 -4.2
c   NaN  NaN
d  0.75 -1.3

In [10]: df.sum()
Out[10]:
A    9.25
B   -5.50
dtype: float64

In [11]: df.sum(axis=1)
Out[11]:
a    1.40
b    2.90
c    0.00
d   -0.55
dtype: float64

In [12]: df.mean(axis=1,skipna=False)
Out[12]:
a      NaN
b    1.450
c      NaN
d   -0.275
dtype: float64
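
A short sketch of how skipna changes a reduction, using the same table; col_mean and strict_sum are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.40, 7.10, np.nan, 0.75],
                   'B': [np.nan, -4.2, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

col_mean = df.mean()                        # NaN skipped by default
strict_sum = df.sum(axis=1, skipna=False)   # NaN if any value in the row is missing

print(col_mean)
print(strict_sum)
```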

Statistics

Some commonly used statistical methods:

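Method	Description
count	Number of non-NA values
describe	Compute a set of summary statistics for a `Series` or for each `DataFrame` column
min, max	Minimum and maximum values
argmin, argmax	Integer positions of the minimum and maximum values
idxmin, idxmax	Index labels of the minimum and maximum values
quantile	Sample quantile, ranging from 0 to 1
sum	Sum
mean	Mean
median	Median (50% quantile)
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
prod	Product of all values
var	Sample variance
std	Sample standard deviation
skew	Sample skewness (third moment)
kurt	Sample kurtosis (fourth moment)
cumsum	Cumulative sum
cummin, cummax	Cumulative minimum or maximum
cumprod	Cumulative product
diff	First arithmetic difference (useful for time series)
pct_change	Percent change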
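
A few of the methods in the table, tried on a small made-up Series (the values and labels are invented for illustration):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], index=list('abcde'))

summary = s.describe()        # count, mean, std, min, quartiles, max
label_of_max = s.idxmax()     # index label of the largest value
running_total = s.cumsum()    # cumulative sum
step = s.diff()               # difference from the previous element

print(summary)
print(label_of_max)
print(running_total.tolist())
print(step.tolist())
```

The first element of diff is NaN because it has no predecessor.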

Grouping and aggregation

Grouping

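Grouping and aggregation are very common operations in data processing. In pandas they are done mainly through the groupby function.
Let's start with a practical example: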

# Build a raw DataFrame
In [70]: raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
    ...:                          'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
    ...:                          'Scouts', 'Scouts', 'Scouts', 'Scouts'],
    ...:         'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd',
    ...:                     '1st', '1st', '2nd', '2nd'],
    ...:         'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
    ...:                  'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    ...:         'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    ...:         'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

In [71]: df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name',
    ...:                                        'preTestScore', 'postTestScore'])

In [72]: df
Out[72]:
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70

Calling groupby produces a groupby object:

# To group one column (here 'preTestScore'), pass the grouping key as df['column_name'] (here df['regiment'])
In [73]: groupby_regiment = df['preTestScore'].groupby(df['regiment'])

# The groupby object has not computed anything yet; it only holds the data split by key
In [74]: groupby_regiment
Out[74]: <pandas.core.groupby.SeriesGroupBy object at 0x11112cef0>

# Summary statistics per group
In [75]: groupby_regiment.describe()
Out[75]:
            count   mean        std  min   25%   50%    75%   max
regiment
Dragoons      4.0  15.50  14.153916  3.0  3.75  14.0  25.75  31.0
Nighthawks    4.0  15.25  14.453950  2.0  3.50  14.0  25.75  31.0
Scouts        4.0   2.50   0.577350  2.0  2.00   2.5   3.00   3.0

# A single statistic can also be computed on its own
In [76]: groupby_regiment.mean()
Out[76]:
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64

The whole grouping and aggregation process can be summarized in a diagram:
[Figure: the grouping and aggregation process]

An aggregation can use either pandas' built-in functions, such as the describe and mean calls above, or a user-defined function. A custom aggregation function looks like this:

In [81]: def my_agg(pre_test_score_group):
    ...:     return np.sum(np.power(pre_test_score_group, 2))
    ...:

In [82]: df['preTestScore'].groupby(df['regiment']).apply(my_agg)
Out[82]:
regiment
Dragoons      1562
Nighthawks    1557
Scouts          26
Name: preTestScore, dtype: int64

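As the example shows, apply can take the place of an explicit for loop. In pandas, prefer apply (or, better still, a built-in vectorized method when one exists) over hand-written for loops, both for conciseness and for performance.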

Grouping and aggregating by multiple keys

# With several keys, pass them to groupby as a list
In [77]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
Out[77]:
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

# unstack turns the result into a table, which is easier to read
In [78]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
Out[78]:
company      1st   2nd
regiment
Dragoons     3.5  27.5
Nighthawks  14.0  16.5
Scouts       2.5   2.5
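
The grouping keys can also be given as column names on the DataFrame itself, which is often shorter than passing the key Series. The sketch below uses a cut-down, made-up subset of the table above; agg with a list of function names computes several statistics at once.

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
    'company': ['1st', '2nd', '1st', '1st'],
    'preTestScore': [4, 31, 3, 4],
})   # a cut-down, made-up subset of the table above

by_regiment = df.groupby('regiment')['preTestScore'].mean()
by_both = df.groupby(['regiment', 'company'])['preTestScore'].agg(['mean', 'max'])

print(by_regiment)
print(by_both)
```

Only key combinations that actually occur in the data appear in the result.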

Unique values, value counts, and membership

A one-dimensional Series may contain many repeated values. The unique method returns the distinct values, and value_counts counts how often each occurs.

In [46]: se = pd.Series(['c','a','d','a','a','b','b','c','c'])
# Distinct values
In [47]: se.unique()
Out[47]: array(['c', 'a', 'd', 'b'], dtype=object)
# Value frequencies
In [48]: se.value_counts()
Out[48]:
a    3
c    3
b    2
d    1
# Pass sort=False to skip sorting by count
In [49]: se.value_counts(sort=False)
Out[49]:
d    1
c    3
a    3
b    2

The isin method builds a boolean mask that can be used to filter values.

In [50]: se.isin(['a','b'])
Out[50]:
0    False
1     True
2    False
3     True
4     True
5     True
6     True
7    False
8    False
dtype: bool

In [51]: mask =se.isin(['a','b'])

In [52]: se[mask]
Out[52]:
1    a
3    a
4    a
5    b
6    b
dtype: object

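Related to isin is the Index.get_indexer method, which returns an array of integer positions that maps an array of possibly non-unique values onto an index of unique values. For each value of the Series, index.get_indexer gives its position in the index, or -1 if the value is not found.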

In [60]: index=pd.Index(['a','b','c'])

In [61]: index.get_indexer(se)
Out[61]: array([ 2,  0, -1,  0,  0,  1,  1,  2,  2], dtype=int32)
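
A smaller version of the same lookup, with made-up values, makes the mapping easier to follow:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])
values = pd.Series(['c', 'a', 'd', 'b'])

# Position of each value within index; -1 marks a value that is absent.
positions = index.get_indexer(values)

print(positions.tolist())
```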

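To count how often each value occurs in every column of a DataFrame, pass pd.value_counts to the DataFrame's apply method (the top-level pd.value_counts is deprecated in recent pandas; lambda s: s.value_counts() does the same).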

In [70]: df
Out[70]:
   A  B  C
0  1  2  1
1  3  3  5
2  4  1  2
3  3  2  4
4  4  3  4

In [71]: df.apply(pd.value_counts)
Out[71]:
     A    B    C
1  1.0  1.0  1.0
2  NaN  2.0  1.0
3  2.0  2.0  NaN
4  2.0  NaN  2.0
5  NaN  NaN  1.0

In [72]: df.apply(pd.value_counts).fillna(0)
Out[72]:
     A    B    C
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
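
A sketch of the same per-column count that calls the Series method instead of the top-level function and converts the result back to integers. The name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 3, 4],
                   'B': [2, 3, 1, 2, 3],
                   'C': [1, 5, 2, 4, 4]})

# Call the Series method on each column, then replace the NaN gaps with 0.
counts = df.apply(lambda col: col.value_counts()).fillna(0).astype(int)

print(counts)
```

The row labels of counts are the distinct values found across all columns.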