Python for Data Analysis: pandas Basics

止步听风

Article link: https://blog.csdn.net/SAKURASANN/article/details/136030383

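These are my study notes on the book "Python for Data Analysis". An electronic copy of the Chinese translation of the second edition is available at:

https://github.com/iamseancheney/python_for_data_analysis_2nd_chinese_version
I am not sure whether this project was put together by the translator or by some other kind volunteer.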

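Data structures

pandas is built on top of NumPy arrays, but it is designed specifically for tabular and heterogeneous data, whereas NumPy is better suited to homogeneous numerical data.

pandas has two main data structures:

Series: similar to a one-dimensional array, with a label attached to each value
DataFrame: a two-dimensional, table-like structure whose columns can have different types

import numpy as np
import pandas as pd
from pandas import Series, DataFrame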

Series

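As mentioned above, a Series resembles a one-dimensional array. Unlike a NumPy array, it also carries a set of labels associated with the data, called the index:

a = pd.Series(range(5))
print(a, a.index, a.values,sep="\n\n")

Output: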

0    0
1    1
2    2
3    3
4    4
dtype: int64

RangeIndex(start=0, stop=5, step=1)

[0 1 2 3 4]

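As the output shows, a Series is made up of an index and its values. Just as a dtype can be specified for a NumPy array, the index of a Series can be specified explicitly:

a = pd.Series(range(5), index=["one","two","three","four","five"])
print(a, a.index, a.values,sep="\n\n")

Output: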

one      0
two      1
three    2
four     3
five     4
dtype: int64

Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

[0 1 2 3 4]

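The values are unchanged; only the index differs. In the earlier example no index was specified, so pandas fell back to the default of increasing integers starting from 0.

The data can therefore be accessed in several ways:

a = pd.Series(range(5), index=["one","two","three","four","five"])
# a.iloc[1] selects by position; plain a[1] on a non-integer index is deprecated in recent pandas
print(a.iloc[1],a[0:2],a["two"],a[["two","one","five"]],a[a>2],sep="\n\n")

Output: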

1

one    0
two    1
dtype: int64

1

two     1
one     0
five    4
dtype: int64

four    3
five    4
dtype: int64

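That is, values can be selected by position, by slice, by index label, by a list of labels, or by a boolean array, much like indexing a NumPy array.

Arithmetic on a Series also behaves as in NumPy. And because the index is an integral part of pandas data, membership tests against the index are supported as well:

a = pd.Series(range(5), index=["one","two","three","four","five"])
print(a ** 2, "one" in a, sep="\n\n")

Output: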

one       0
two       1
three     4
four      9
five     16
dtype: int64

True

The membership test above works much like testing for a key in a dict, and indeed a Series can be created directly from a dict:

a = pd.Series({"one":1,"two":2,"three":3,"four":4,"five":5})
print(a)

Output:

one      1
two      2
three    3
four     4
five     5
dtype: int64

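The dict keys become the index and the dict values become the values of the Series.

If both a dict and an index are passed, the index labels are matched against the dict keys, and any label that has no matching key gets NaN:

a = pd.Series({"one":1,"two":2,"three":3,"four":4,"five":5}, index=["one","two","three","four","six"])
print(a)

Output: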

one      1.0
two      2.0
three    3.0
four     4.0
six      NaN
dtype: float64

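Note that "five" from the dict is not in the index, so it is excluded from the result.

NaN values can be detected with the isnull and notnull functions:

a.isnull()
pd.isnull(a)
a.notnull()
pd.notnull(a)

a.isnull() and pd.isnull(a) are equivalent and produce the result below; a.notnull() and pd.notnull(a) produce its element-wise inverse: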

one      False
two      False
three    False
four     False
six       True
dtype: bool
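
As a small illustration of how the two checks relate, here is a sketch with made-up data. It uses isna and notna, the newer aliases of isnull and notnull, verifies that one mask is the inverse of the other, and counts the missing entries:

```python
import pandas as pd

# Hypothetical data: "six" has no matching key, so it becomes NaN
a = pd.Series({"one": 1, "two": 2}, index=["one", "two", "six"])

missing = a.isna()    # alias of a.isnull()
present = a.notna()   # alias of a.notnull()

# Summing a boolean mask counts its True entries
n_missing = int(missing.sum())
print(n_missing)
```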

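As noted earlier, an operation between a Series and a scalar is applied to every element. In arithmetic between two Series, the values are aligned by index label before the operation is applied:

a = pd.Series(range(5), index=["one","two","three","four","five"])
b = pd.Series({"one":1,"two":2,"three":3,"four":4,"five":5}, index=["one","two","three","four","six"])
print(a,b,a+b,sep="\n\n")

Output: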

one      0
two      1
three    2
four     3
five     4
dtype: int64

one      1.0
two      2.0
three    3.0
four     4.0
six      NaN
dtype: float64

five     NaN
four     7.0
one      1.0
six      NaN
three    5.0
two      3.0
dtype: float64
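
To make the point that alignment goes by label rather than by position, here is a minimal sketch with two hypothetical Series that hold the same labels in a different order:

```python
import pandas as pd

# Same labels, different order (illustrative values)
s1 = pd.Series([1, 2, 3], index=["x", "y", "z"])
s2 = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Values are paired up by label: x -> 1 + 30, y -> 2 + 20, z -> 3 + 10
total = s1 + s2
print(total)
```

A purely positional addition would have produced 11, 22, 33 instead.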

Both the Series itself and its index have a name attribute that can be set:

a = pd.Series(range(5), index=["one","two","three","four","five"])
a.name = "test"
a.index.name = "index_test"
print(a,sep="\n\n")

Output:

index_test
one      0
two      1
three    2
four     3
five     4
Name: test, dtype: int64

DataFrame

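If a Series is one-dimensional data, a DataFrame can be thought of as two-dimensional, tabular data. It has both a row index and a column index, and holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on).

A DataFrame can be built directly from a dict of equal-length lists or NumPy arrays:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

frame is: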

    state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

For a large DataFrame, the head method selects only the first N rows (five by default):

frame.head(3)

Output:

    state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6

If a sequence of columns is specified, the columns of the DataFrame are arranged in that order:

pd.DataFrame(frame, columns=["year","state","pop"])

Output:

    year	state	pop
0	2000	Ohio	1.5
1	2001	Ohio	1.7
2	2002	Ohio	3.6
3	2001	Nevada	2.4
4	2002	Nevada	2.9
5	2003	Nevada	3.2

If a requested column is not present in the data, it is filled with NaN:

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])

frame2 is:

	    year	state	pop	debt
one	    2000	Ohio	1.5	NaN
two	    2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN
six	    2003	Nevada	3.2	NaN

A DataFrame exposes several attributes, such as index, values, and columns:

print(frame2.index, frame2.values, frame2.columns, sep="\n\n")

Output:

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

[[2000 'Ohio' 1.5 nan]
 [2001 'Ohio' 1.7 nan]
 [2002 'Ohio' 3.6 nan]
 [2001 'Nevada' 2.4 nan]
 [2002 'Nevada' 2.9 nan]
 [2003 'Nevada' 3.2 nan]]

Index(['year', 'state', 'pop', 'debt'], dtype='object')

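As with a Series, the name attribute of a DataFrame's index and columns can be set. The frame3 used here is the DataFrame built from a nested dict later in this section:

frame3.index.name = "year"
frame3.columns.name = "state"
print(frame3)

Output: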

state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

A column of a DataFrame can be retrieved by its name, and the result is a Series:

print(frame2.state, frame2["state"], type(frame2.state), sep="\n\n")

Output:

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

<class 'pandas.core.series.Series'>

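frame2.state and frame2["state"] return the same data. Both are Series with the same index as the DataFrame, and the name attribute of the Series is set to the column name. Attribute-style access only works when the column name is a valid Python identifier that does not clash with a DataFrame method.

Those are columns; rows can be retrieved as well, either by label or by position.

For example, rows can be selected by label with the loc attribute:

frame2.loc[["three","two"]]

Output: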

        year	state	pop	debt
three	2002	Ohio	3.6	NaN
two	    2001	Ohio	1.7	NaN

or by integer position with the iloc attribute:

frame2.iloc[[2,1]]

Output:

        year	state	pop	debt
three	2002	Ohio	3.6	NaN
two	    2001	Ohio	1.7	NaN
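
The difference between the two matters most when the labels are themselves integers. A minimal sketch, using a hypothetical Series whose integer labels are deliberately out of order:

```python
import pandas as pd

# Illustrative data: integer labels that do not match the positions
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label, by_position)
```

Spelling out loc or iloc avoids having to remember how plain [] treats integers in each case.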

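Assignment can target a single element, an entire row, or an entire column. When a list or array is assigned to a column, its length must match the length of the DataFrame:

# set a single cell by label; this avoids chained assignment such as frame2.debt[1] = 100
frame2.loc["two", "debt"] = 100
print(frame2,"\n")
frame2.loc["three"] = [2005, "abc",10,20]
print(frame2,"\n")
frame2["debt"] = range(6)
print(frame2,"\n")

Output: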

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  100
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN 

       year   state   pop debt
one    2000    Ohio   1.5  NaN
two    2001    Ohio   1.7  100
three  2005     abc  10.0   20
four   2001  Nevada   2.4  NaN
five   2002  Nevada   2.9  NaN
six    2003  Nevada   3.2  NaN 

       year   state   pop  debt
one    2000    Ohio   1.5     0
two    2001    Ohio   1.7     1
three  2005     abc  10.0     2
four   2001  Nevada   2.4     3
five   2002  Nevada   2.9     4
six    2003  Nevada   3.2     5

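A Series can also be assigned to a column. Its labels are aligned with the DataFrame's index, and any positions without a matching label are filled with NaN:

val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2["new"] = val
print(frame2)

Output: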

       year   state   pop  debt  new
one    2000    Ohio   1.5   NaN  NaN
two    2001    Ohio   1.7  -1.2 -1.2
three  2005     abc  10.0   NaN  NaN
four   2001  Nevada   2.4  -1.5 -1.5
five   2002  Nevada   2.9  -1.7 -1.7
six    2003  Nevada   3.2   NaN  NaN

As shown, assigning to a column that does not exist creates a new column.
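
The alignment also works in the other direction: labels in the assigned Series that do not exist in the DataFrame are silently dropped. A small sketch with made-up names (df, "y", and the label "zzz" are all hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

# "zzz" is not a row label of df, and "b" is missing from the Series
df["y"] = pd.Series([10.0, 30.0, 99.0], index=["a", "c", "zzz"])

print(df)
```

Row b ends up with NaN in the new column, and the value labelled "zzz" does not appear anywhere.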

Columns can be deleted with the del keyword:

del frame2["new"]
print(frame2)

Output:

       year   state   pop  debt
one    2000    Ohio   1.5   NaN
two    2001    Ohio   1.7  -1.2
three  2005     abc  10.0   NaN
four   2001  Nevada   2.4  -1.5
five   2002  Nevada   2.9  -1.7
six    2003  Nevada   3.2   NaN

A DataFrame can also be created from a nested dict. The outer keys become the columns and the inner keys become the row index:

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
print(frame3)

Output:

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

Because pandas is built on NumPy, a DataFrame supports some NumPy-style operations, such as transposing (swapping rows and columns):

frame3.T

Output:

        2000	2001	2002
Nevada	NaN	     2.4	2.9
Ohio	1.5	     1.7	3.6

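The DataFrame constructor accepts many kinds of input, including a 2D ndarray, a dict of arrays, lists, or tuples, a dict of Series, a dict of dicts, a list of dicts or Series, a list of lists or tuples, and another DataFrame.

Index objects

pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index internally:

index = frame3.index

index is: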

Int64Index([2000, 2001, 2002], dtype='int64', name='year')

Index objects are immutable: their elements cannot be modified or reassigned in place.
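
A quick way to see this is to attempt an in-place assignment. The sketch below (variable names are illustrative) catches the resulting TypeError and then derives a new Index instead, here with Index.map:

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])

# In-place modification is rejected with a TypeError
try:
    idx[1] = "d"
    mutated = True
except TypeError:
    mutated = False

# Changing labels means building a new Index; idx itself is untouched
renamed = idx.map(str.upper)
print(mutated, list(renamed), list(idx))
```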

An Index can also be constructed on its own:

pd.Index(range(5))

Output:

RangeIndex(start=0, stop=5, step=1)

This makes it safe to share one Index object among different pandas objects.

Unlike a Python set, however, a pandas Index can contain duplicate labels:

pd.Index(["one","two","three","one"])

Output:

Index(['one', 'two', 'three', 'one'], dtype='object')

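Commonly used Index methods and properties include append, difference, intersection, union, isin, delete, drop, insert, is_unique, and unique.

Essential functionality

Reindexing

pandas objects can be reindexed with the reindex method, which creates a new object with the data rearranged to conform to the new index:

tmp = pd.Series([5,3,6,98,7],index=["three","four","two","one","five"])
tmp.reindex(["one","two","three","four","five"])

Output: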

one      98
two       6
three     5
four      3
five      7
dtype: int64

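Note that reindex creates a new object and leaves the original unchanged.

As before, any label in the new index that is not present in the original introduces NaN.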
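
A minimal sketch of both points, using made-up data; it also shows the fill_value parameter of reindex, which substitutes a chosen value for the NaN:

```python
import pandas as pd

s = pd.Series([5, 3, 6], index=["a", "b", "c"])

# "d" does not exist in s, so it is introduced as NaN (and the dtype becomes float)
with_nan = s.reindex(["a", "b", "c", "d"])

# fill_value replaces the NaN that would otherwise be introduced
with_zero = s.reindex(["a", "b", "c", "d"], fill_value=0)

print(with_nan, with_zero, s, sep="\n\n")
```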

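Reindexing sometimes calls for interpolation or filling of values. The method parameter controls this: ffill fills forward from the previous valid value, bfill fills backward from the next one, and nearest uses the closest one. By default nothing is filled:

tmp = pd.Series(["red","orange","yellow","green","pink"],index=[0,1,2,4,6])
tmp.reindex(range(7),method="ffill")

Output: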

0       red
1    orange
2    yellow
3    yellow
4     green
5     green
6      pink
dtype: object

The same applies to a DataFrame:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, index=["one","two","three","four","five","six"])

frame.reindex(["six","five","four","three","two","one"])

Output:

        state	year	pop
six	    Nevada	2003	3.2
five	Nevada	2002	2.9
four	Nevada	2001	2.4
three	Ohio	2002	3.6
two	    Ohio	2001	1.7
one	    Ohio	2000	1.5

Besides reindexing the rows, the columns can be reindexed through the columns parameter:

frame.reindex(columns=["year","state","pop"])

Output:

	    year	state	pop
one	    2000	Ohio	1.5
two	    2001	Ohio	1.7
three	2002	Ohio	3.6
four	2001	Nevada	2.4
five	2002	Nevada	2.9
six	    2003	Nevada	3.2

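The main parameters of reindex are index (the new sequence of labels), method (the fill method: ffill, bfill, or nearest), fill_value (the value used for labels introduced by reindexing), limit (the maximum gap to fill), tolerance, level, and copy.

Dropping entries from an axis

The drop method removes entries from an axis and returns a new object. By default it removes rows:

frame.drop(["one","three"])

Output: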

        state	year	pop
two	    Ohio	2001	1.7
four	Nevada	2001	2.4
five	Nevada	2002	2.9
six	    Nevada	2003	3.2

Passing axis=1 or axis="columns" removes columns instead:

frame.drop(["year"],axis=1)

Output:

	    state	pop
one	    Ohio	1.5
two	    Ohio	1.7
three	Ohio	3.6
four	Nevada	2.4
five	Nevada	2.9
six	    Nevada	3.2

Setting inplace=True modifies the data in place:

frame.drop(["year"],axis=1,inplace=True)

frame is now:

	    state	pop
one	    Ohio	1.5
two	    Ohio	1.7
three	Ohio	3.6
four	Nevada	2.4
five	Nevada	2.9
six	    Nevada	3.2

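In this case no new object is returned; the original data is modified directly.

Indexing, selection, and filtering

As mentioned earlier, a Series can be indexed by label, by integer position, or with a boolean array:

tmp = pd.Series([5,3,6,98,7],index=["three","four","two","one","five"])
# positional access goes through iloc; plain tmp[0] on a non-integer index is deprecated in recent pandas
print(tmp["three"],tmp.iloc[0],tmp[["three","four"]],tmp.iloc[[0,1]],tmp[0:2],tmp[tmp>5], sep="\n\n")

Output: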

5

5

three    5
four     3
dtype: int64

three    5
four     3
dtype: int64

three    5
four     3
dtype: int64

two      6
one     98
five     7
dtype: int64

Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive:

tmp["three":"two"]

Output:

three    5
four     3
two      6
dtype: int64

All of the indexing methods above can also be used for assignment.
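
As a rough sketch of the last two points, the following compares an inclusive label slice with an exclusive positional slice and then assigns through the label slice. The explicit loc and iloc spellings are a choice made here for clarity, not something the original examples use:

```python
import pandas as pd

s = pd.Series([5, 3, 6, 98, 7], index=["three", "four", "two", "one", "five"])

# Label slice: the endpoint "two" is included
n_label = len(s.loc["three":"two"])

# Positional slice: position 2 is excluded, as in plain Python
n_position = len(s.iloc[0:2])

# Assigning through the label slice overwrites all three selected values
s.loc["three":"two"] = 0
print(n_label, n_position, s.tolist())
```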

For a DataFrame, indexing with a single value or a sequence retrieves one or more columns:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

print(data["one"],data[["one","two"]], sep="\n\n")

Output:

Ohio         0
Colorado     4
Utah         8
New York    12
Name: one, dtype: int32

          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9
New York   12   13

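Indexing with a slice or a boolean array, on the other hand, selects rows. Row selection through [] only works with slices and boolean arrays: a single integer or a list of integers would be interpreted as column labels:

print(data[:2],data[data["one"]>5],sep="\n\n")

Output: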

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

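These DataFrame indexing forms can also be used for assignment.

Selection with loc and iloc

As mentioned earlier, the loc and iloc attributes select rows; loc uses axis labels, while iloc uses integer positions:

print(data.loc[["Ohio","Colorado","Utah"]], data.iloc[[0,1,2]], sep="\n\n")

Output: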

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11

Both loc and iloc accept two sequences, one for rows and one for columns, similar to two-dimensional indexing in NumPy:

print(data.loc[["Ohio","Colorado","Utah"],["one","three","two"]], data.iloc[[0,1,2],[0,2,1]], sep="\n\n")

Output:

          one  three  two
Ohio        0      2    1
Colorado    4      6    5
Utah        8     10    9

          one  three  two
Ohio        0      2    1
Colorado    4      6    5
Utah        8     10    9

Both indexers also work with slices, in addition to single labels and lists of labels:

print(data.loc["Ohio":"Utah","one":"two"], data.iloc[0:3,0:3][data.one>4], sep="\n\n")

Output:

          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9

      one  two  three
Utah    8    9     10
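
A boolean condition and a column selection can also be combined in a single loc call, which avoids chaining two indexing steps. A minimal sketch, rebuilding the same data frame so that the block is self-contained (the name subset is illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Rows where column "one" exceeds 4, restricted to columns "one" and "two"
subset = data.loc[data["one"] > 4, ["one", "two"]]
print(subset)
```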

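The main ways to index a DataFrame are df[val] (select columns, or rows with a slice or boolean array), df.loc[val] (rows by label), df.loc[:, val] (columns by label), df.loc[val1, val2] (rows and columns by label), df.iloc[where] (rows by integer position), df.iloc[:, where] (columns by integer position), df.iloc[where_i, where_j] (both by integer position), df.at[label_i, label_j] and df.iat[i, j] (a single scalar by label or by position), and the reindex method.

Arithmetic and data alignment

As with the Series example earlier, when objects with different indexes are combined arithmetically, the index of the result is the union of the two indexes:

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1+s2

Output: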

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

As shown, NaN is introduced wherever the indexes do not overlap, and NaN propagates through subsequent arithmetic.

A DataFrame behaves the same way:

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1,df2,df1+df2,sep="\n\n")

Output:

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

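That is, for a DataFrame NaN is introduced wherever the row or column labels do not overlap, and any arithmetic involving NaN yields NaN.

Arithmetic methods with fill values

As just noted, any arithmetic involving NaN yields NaN. Sometimes a different value should be used where one operand is missing, which is what the fill_value parameter of the arithmetic methods is for.

For example, when an axis label is found in one object but not the other, a special value such as 0 can be substituted for the missing operand:

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(df1,df2,df1.add(df2, fill_value=0), sep="\n\n")

Output: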

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0

Positions that are missing from both objects, such as row Colorado in column e, still produce NaN.
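
The same parameter works for Series. As a small sketch, here are s1 and s2 from the alignment section above, added once with the + operator and once with add and fill_value=0:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

plain = s1 + s2                     # d, f, g are NaN: each exists in one Series only
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0

print(plain, filled, sep="\n\n")
```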

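The arithmetic methods of Series and DataFrame are add, sub, div, floordiv, mul, and pow. Each has a counterpart starting with the letter r (radd, rsub, rdiv, rfloordiv, rmul, rpow) that reverses the arguments, so df1.div(df2) and df2.rdiv(df1) are equivalent.

Operations between DataFrame and Series

pandas is built on NumPy, and NumPy has a broadcasting mechanism, so operations between a DataFrame and a Series follow a similar mechanism:

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = pd.Series([1,2,3], index=list('bdf'))

print(frame, series, frame - series, sep="\n\n")

Output: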

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

b    1
d    2
f    3
dtype: int64

          b    d   e   f
Utah   -1.0 -1.0 NaN NaN
Ohio    2.0  2.0 NaN NaN
Texas   5.0  5.0 NaN NaN
Oregon  8.0  8.0 NaN NaN

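As the output shows, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows. If a label is not found in either the DataFrame's columns or the Series's index, both objects are reindexed to form the union.

To match on the rows and broadcast across the columns instead, use one of the arithmetic methods and pass axis=0 (or axis="index"):

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = pd.Series([1,2,3], index=['Utah', 'Ohio', 'Colorado'])

print(frame, series, frame.sub(series, axis=0), sep="\n\n")

Output: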

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

Utah        1
Ohio        2
Colorado    3
dtype: int64

            b    d    e
Colorado  NaN  NaN  NaN
Ohio      1.0  2.0  3.0
Oregon    NaN  NaN  NaN
Texas     NaN  NaN  NaN
Utah     -1.0  0.0  1.0
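
A common use of this row-wise matching is centering each row on its own mean. The sketch below is one illustration of that idea, with variable names chosen here for the example:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# One mean per row, indexed by the row labels
row_means = frame.mean(axis=1)

# axis=0 matches row_means against the row index and broadcasts across columns
centered = frame.sub(row_means, axis=0)
print(centered)
```

Every row of centered now has mean zero.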

Function application and mapping

NumPy's element-wise array functions (ufuncs) also work on pandas objects:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame, np.abs(frame),sep="\n\n")

Output:

               b         d         e
Utah    0.697534 -0.134106 -0.329323
Ohio   -0.907970 -0.448474 -0.498009
Texas  -0.327485  1.706504  1.692652
Oregon -0.129142  1.167562 -1.035300

               b         d         e
Utah    0.697534  0.134106  0.329323
Ohio    0.907970  0.448474  0.498009
Texas   0.327485  1.706504  1.692652
Oregon  0.129142  1.167562  1.035300

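A function can also be applied to the one-dimensional arrays formed by each column or each row, using the DataFrame's apply method. By default the function is called once per column; with axis=1 (or axis="columns") it is called once per row:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame, frame.apply(lambda x:x.max() - x.min()), frame.apply(lambda x:x.max() - x.min(), axis=1),sep="\n\n")

Output: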

               b         d         e
Utah    0.677041  0.330069 -0.593199
Ohio    0.022976  0.095901 -1.184464
Texas   0.379823  0.799666  1.980348
Oregon -0.240406 -0.948640 -0.573052

b    0.917448
d    1.748305
e    3.164812
dtype: float64

Utah      1.270240
Ohio      1.280365
Texas     1.600525
Oregon    0.708233
dtype: float64

The function passed to apply does not have to return a scalar; it can also return a Series with multiple values:

def func(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame, frame.apply(func),frame.apply(func,axis=1),sep="\n\n")

Output:

               b         d         e
Utah   -0.707579  0.565288  1.650140
Ohio   -2.921518 -0.063220  0.666870
Texas   0.395668  0.787787 -1.924530
Oregon -0.438882 -0.750354  0.168851

            b         d        e
min -2.921518 -0.750354 -1.92453
max  0.395668  0.787787  1.65014

             min       max
Utah   -0.707579  1.650140
Ohio   -2.921518  0.666870
Texas  -1.924530  0.787787
Oregon -0.750354  0.168851

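apply works on whole rows or columns, whereas applymap works element by element. In pandas 2.1 and later, applymap is deprecated in favor of the equivalent DataFrame.map:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame, frame.applymap(lambda x: "%.2f" % x),sep="\n\n")

Output: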

               b         d         e
Utah   -0.217009 -0.330181 -0.008281
Ohio   -0.427106  1.840410  0.380936
Texas   0.993709  1.778476 -0.348506
Oregon -1.781456  0.092446  0.399635

            b      d      e
Utah    -0.22  -0.33  -0.01
Ohio    -0.43   1.84   0.38
Texas    0.99   1.78  -0.35
Oregon  -1.78   0.09   0.40

The element-wise counterpart for a Series is the map method:

series = pd.Series(np.random.randn(4))

print(series, series.map(lambda x: "%.2f" % x),sep="\n\n")

Output:

0    0.912111
1   -1.035146
2   -2.283053
3   -0.891400
dtype: float64

0     0.91
1    -1.04
2    -2.28
3    -0.89
dtype: object

Sorting and ranking

To sort by row or column index, use the sort_index method, which returns a new, sorted object:

series = pd.Series(np.random.randn(4), index = list("dbac"))

print(series, series.sort_index(),sep="\n\n")

Output:

d   -2.972148
b   -0.374894
a   -0.185849
c    0.571640
dtype: float64

a   -0.185849
b   -0.374894
c    0.571640
d   -2.972148
dtype: float64

sort_index sorts by index, whereas sort_values sorts by value:

series = pd.Series(np.random.randn(4), index = list("dbac"))

print(series, series.sort_values(),sep="\n\n")

Output:

d   -0.063629
b    1.178513
a    0.562040
c   -1.761235
dtype: float64

c   -1.761235
d   -0.063629
a    0.562040
b    1.178513
dtype: float64

A DataFrame can be sorted by the index on either axis, or by the values of one or more columns or rows:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=list("dbac"))

print(frame, frame.sort_index(),frame.sort_index(axis=1),frame.sort_values(by=["b","d"]),frame.sort_values(by=["a","c"],axis=1),sep="\n\n")

Output:

          d         b         e
d -0.261446  1.342736  0.431499
b  2.020019  1.143758 -2.011755
a -1.059436  0.533637 -1.694812
c  0.017930  0.197699  1.062465

          d         b         e
a -1.059436  0.533637 -1.694812
b  2.020019  1.143758 -2.011755
c  0.017930  0.197699  1.062465
d -0.261446  1.342736  0.431499

          b         d         e
d  1.342736 -0.261446  0.431499
b  1.143758  2.020019 -2.011755
a  0.533637 -1.059436 -1.694812
c  0.197699  0.017930  1.062465

          d         b         e
c  0.017930  0.197699  1.062465
a -1.059436  0.533637 -1.694812
b  2.020019  1.143758 -2.011755
d -0.261446  1.342736  0.431499

          e         d         b
d  0.431499 -0.261446  1.342736
b -2.011755  2.020019  1.143758
a -1.694812 -1.059436  0.533637
c  1.062465  0.017930  0.197699

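Again, sort_index sorts by index, with axis selecting which axis to sort. sort_values sorts by value; by names the columns to sort by (or, with axis=1, the rows whose values determine the column order). The ascending parameter chooses ascending or descending order; it is not shown here.

When sorting, NaN values are placed at the end by default.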
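
Since ascending is not demonstrated above, here is a minimal sketch on made-up data. It also checks that NaN stays at the end in both directions:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 7, -3, 2])

asc = s.sort_values()                  # smallest first (the default)
desc = s.sort_values(ascending=False)  # largest first

print(asc, desc, sep="\n\n")
```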

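The rank method assigns ranks from 1 up to the number of valid data points. By default, ties are broken by giving each group of equal values the average of their ranks:

series = pd.Series([7, -5, 7, 4, 2, 0, 4])
series.rank()

Output: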

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which the values appear in the data:

series = pd.Series([7, -5, 7, 4, 2, 0, 4])
series.rank(method="first")

Output:

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

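Here entries 0 and 2 are assigned ranks 6 and 7 instead of the average rank 6.5, because label 0 comes before label 2 in the data. As before, the ascending parameter chooses ascending or descending order; it is not shown here.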
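
For comparison, the sketch below ranks the same Series with two other tie-breaking methods, min and dense, and once in descending order. The variable names are chosen here for illustration:

```python
import pandas as pd

series = pd.Series([7, -5, 7, 4, 2, 0, 4])

# min: every member of a tied group gets the lowest rank of the group
rank_min = series.rank(method="min")

# dense: like min, but ranks increase by 1 between groups, leaving no gaps
rank_dense = series.rank(method="dense")

# descending order with the default average method
rank_desc = series.rank(ascending=False)

print(rank_min.tolist(), rank_dense.tolist(), rank_desc.tolist(), sep="\n")
```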

rank also works on a DataFrame:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=list("dbac"))

print(frame, frame.rank(),sep="\n\n")

Output:

          d         b         e
d -1.120805  0.172054  0.470200
b -0.108554  0.161372  1.401574
a -1.083892 -0.726145 -0.323556
c  0.502917 -2.040901  0.160475

     d    b    e
d  1.0  4.0  3.0
b  3.0  3.0  4.0
a  2.0  2.0  1.0
c  4.0  1.0  2.0

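The tie-breaking options for method are average (the default), min, max, first, and dense.

Axis indexes with duplicate labels

As mentioned earlier, index labels do not have to be unique:

series = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
index = series.index
print(index, index.is_unique,sep="\n\n")

Output: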

Index(['a', 'a', 'b', 'b', 'c'], dtype='object')

False

Selecting a label that occurs more than once returns a Series:

series["a"]

Output:

a    0
a    1
dtype: int64

whereas a label that occurs only once returns a scalar:

series["c"]

Output:

4

The same holds for a DataFrame:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=list("aabc"))

print(frame.loc["a"])

Output:

          d         b         e
a  1.204948 -0.323096  0.718391
a  1.251217 -0.916471 -1.732916

Summarizing and computing descriptive statistics

Here are a few of the pandas methods for this:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('abc'), index=list("defg"))

print(frame,frame.sum(),frame.mean(),frame.idxmax(),frame.idxmin(),frame.cumsum(),frame.describe(),sep="\n\n")

Output:

          a         b         c
d -1.125236 -0.285921 -0.805632
e -0.497351  1.246323  1.887993
f -0.588805  2.746995 -1.245783
g  0.934575  0.346072 -1.075893

a   -1.276816
b    4.053470
c   -1.239315
dtype: float64

a   -0.319204
b    1.013367
c   -0.309829
dtype: float64

a    g
b    f
c    e
dtype: object

a    d
b    d
c    f
dtype: object

          a         b         c
d -1.125236 -0.285921 -0.805632
e -1.622586  0.960402  1.082361
f -2.211391  3.707397 -0.163422
g -1.276816  4.053470 -1.239315

              a         b         c
count  4.000000  4.000000  4.000000
mean  -0.319204  1.013367 -0.309829
std    0.880543  1.315696  1.476381
min   -1.125236 -0.285921 -1.245783
25%   -0.722913  0.188074 -1.118365
50%   -0.543078  0.796198 -0.940762
75%   -0.139369  1.621491 -0.132226
max    0.934575  2.746995  1.887993

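Common options for the reduction methods above are axis (the axis to reduce over: 0 reduces down the rows and gives one result per column, 1 reduces across the columns and gives one result per row) and skipna (whether to exclude missing values, True by default).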
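
A small sketch of axis and skipna on made-up data with one missing value (all names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 4.0]}, index=["x", "y"])

col_sum = df.sum()                      # one value per column; NaN is skipped
row_sum = df.sum(axis=1)                # one value per row; NaN is skipped
strict = df.sum(axis=1, skipna=False)   # any NaN in a row makes that row's sum NaN

print(col_sum, row_sum, strict, sep="\n\n")
```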

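pandas methods related to descriptive statistics include count, describe, min, max, idxmin, idxmax, quantile, sum, mean, median, var, std, skew, kurt, cumsum, cummin, cummax, cumprod, diff, and pct_change.

Unique values, value counts, and membership

The unique method of a Series returns an array of its distinct values, in order of first appearance:

series = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(series.values,series.unique(), sep="\n\n")

Output: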

['c' 'a' 'd' 'a' 'a' 'b' 'b' 'c' 'c']

['c' 'a' 'd' 'b']

The value_counts method counts how many times each value occurs in a Series, sorted by count in descending order:

series.value_counts()

Output:

c    3
a    3
b    2
d    1
dtype: int64

isin performs a vectorized set-membership check and is useful for filtering a Series or a DataFrame column down to a subset of values:

print(series.isin(list("ab")), series[series.isin(list("ab"))], sep="\n\n")

Output:

0    False
1     True
2    False
3     True
4     True
5     True
6     True
7    False
8    False
dtype: bool

1    a
3    a
4    a
5    b
6    b
dtype: object

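The Index.get_indexer method returns an array of integer positions that maps an array of possibly repeated values onto another array of distinct values:

series1 = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
series2 = pd.Series(['c', 'b', 'a'])

pd.Index(series2).get_indexer(series1)

Output: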

array([0, 2, 1, 1, 0, 2], dtype=int64)
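
One detail not visible in the example above is what happens to a value that is absent from the Index: get_indexer marks it with -1. A minimal sketch, where "z" is a deliberately missing label:

```python
import pandas as pd

idx = pd.Index(["c", "b", "a"])

# "a" sits at position 2, "c" at position 0, and "z" is not in idx at all
positions = idx.get_indexer(["a", "c", "z"])
print(positions)
```

Because -1 is also a valid position in NumPy indexing (the last element), it is worth filtering out before using the result to index an array.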

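In summary, isin tests each value for membership in a given sequence, unique returns the distinct values in order of appearance, and value_counts returns the number of occurrences of each value.

These can be combined:

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
# pd.Series.value_counts replaces the top-level pd.value_counts, which is deprecated in recent pandas
print(data.apply(pd.Series.value_counts).fillna(0))

Output: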

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0

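Here the row labels of the result are the distinct values found across all columns, and each cell holds the number of times that value occurs in the corresponding column; values that never occur in a column are filled with 0.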
