2.2 pandas Basic Functionality

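This article covers data operations on pandas Series and DataFrame objects: reindexing, dropping entries from an axis, and indexing, selection, and filtering. It explains the reindex method and how drop removes data, walks through arithmetic and data alignment (addition, subtraction, and so on) including how missing values are handled, and finishes with operations between DataFrame and Series, function application and mapping, and sorting and ranking with sort_index, sort_values, and rank.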


2.2 Basic Functionality

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

2.2.1 Reindexing

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d','b','a','c'])

obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

reindex

Rearranges the data to match the new index
Labels that were not in the original index get NaN

obj2 = obj.reindex(['a','b','c','d','e'])

obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

The method option

method='ffill' (forward fill)

obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])

obj3

0      blue
2    purple
4    yellow
dtype: object

obj3.reindex(np.arange(6),method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Using reindex on a DataFrame

frame = pd.DataFrame(np.arange(9).reshape(3,3), index=['a','c','d'], columns=['Ohio','Texas','California'])

frame

	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8

frame2 = frame.reindex(['a','b','c','d'])

frame2

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0

* The columns argument reindexes the columns (DataFrame only)

frame2.reindex(columns=['Texas','Utah','California'])

	Texas	Utah	California
a	1.0	NaN	2.0
b	NaN	NaN	NaN
c	4.0	NaN	5.0
d	7.0	NaN	8.0

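Description of the reindex arguments

Argument	Description
index	New sequence to use as the index. It can be an Index instance, a NumPy array, or any other sequence. Every requested label appears in the result (NaN where it did not exist before); labels that are not requested are dropped even if they existed
method	Interpolation (fill) method, e.g. 'ffill' to fill forward or 'bfill' to fill backward
fill_value	Substitute value to use for missing data introduced by reindexing
limit	When forward- or backward-filling, the maximum number of consecutive elements to fill
tolerance	When forward- or backward-filling, the maximum distance (absolute value) allowed between an inexact match and the requested label
level	Match a simple Index on the given level of a MultiIndex; otherwise select a subset
copy	Defaults to True; if False, the data is not copied when the new index is the same as the old one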
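The fill_value and limit arguments from the table are not exercised by the examples above, so here is a minimal sketch of both. The Series `colors` and the placeholder string 'unknown' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: an integer index with gaps
colors = pd.Series(['blue', 'yellow'], index=[0, 4])

# fill_value: every label introduced by reindexing gets the same constant
with_constant = colors.reindex(np.arange(6), fill_value='unknown')

# method='ffill' with limit=1: carry the previous value forward,
# but across at most one consecutive missing label
with_limit = colors.reindex(np.arange(6), method='ffill', limit=1)

print(with_constant)
print(with_limit)
```

With limit=1 the labels 2 and 3 stay NaN, because they are more than one step away from the last known value.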

2.2.2 Dropping Entries from an Axis

drop (already mentioned in 2.1 when discussing del)

On a Series

obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])

obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

new_obj = obj.drop('c')

new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

obj.drop(['d','c'])

a    0.0
b    1.0
e    4.0
dtype: float64

On a DataFrame

data = pd.DataFrame(np.arange(16).reshape(4,4), index=['Ohio','Colorado','Utah','New York'], columns=['one','two','three','four'])

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

# Drop rows (axis=0 is the default)
data.drop(['Colorado','Ohio'])

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

# Drop columns (axis=1)
data.drop(['two','four'],axis=1)

	one	three
Ohio	0	2
Colorado	4	6
Utah	8	10
New York	12	14

# Drop columns (axis='columns')
data.drop(['two','four'],axis='columns')

	one	three
Ohio	0	2
Colorado	4	6
Utah	8	10
New York	12	14

The inplace option

obj.drop('c',inplace=True) # modifies obj itself and returns None instead of a new object

obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
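The difference between the default behavior and inplace=True is easy to get wrong, so here is a small sketch contrasting the two. The errors='ignore' call at the end is an extra illustration that is not part of the original notes, and the label 'zzz' is a made-up missing label.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new object and leaves the original untouched
trimmed = obj.drop('c')
print('c' in obj.index, 'c' in trimmed.index)

# inplace=True mutates obj and returns None, so do not assign the result
returned = obj.drop('c', inplace=True)
print(returned)

# errors='ignore' skips labels that do not exist instead of raising KeyError
safe = obj.drop(['c', 'zzz'], errors='ignore')
print(safe)
```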

2.2.3 Indexing, Selection, and Filtering

obj = pd.Series(np.arange(4.),index=['a','b','c','d'])

obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

Some examples

Series

obj['b'] # by index label

1.0

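obj[1] # by integer position (this positional fallback is deprecated in recent pandas versions; obj.iloc[1] is the explicit form)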

1.0

obj[2:4]

c    2.0
d    3.0
dtype: float64

obj[['b','a','d']] # several labels at once

b    1.0
a    0.0
d    3.0
dtype: float64

obj[[1,3]] # several integer positions (obj.iloc[[1,3]] is the explicit form)

b    1.0
d    3.0
dtype: float64

obj[obj<2] # boolean filtering

a    0.0
b    1.0
dtype: float64

Slicing with labels differs from ordinary Python slicing: the endpoint is inclusive

obj['a':'c']

a    0.0
b    1.0
c    2.0
dtype: float64
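To make the inclusive endpoint concrete, the sketch below puts a label slice next to a positional slice on the same Series. It uses loc and iloc so that it is explicit which kind of slice is meant.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: the endpoint 'c' is included
by_label = obj.loc['a':'c']

# Positional slice: position 2 is excluded, as in plain Python
by_position = obj.iloc[0:2]

print(len(by_label), len(by_position))
```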

obj['b':'c']=5 # assign to the slice directly

obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

obj['b':'c']=[6,6]

obj

a    0.0
b    6.0
c    6.0
d    3.0
dtype: float64

DataFrame

data = pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'], columns=['one','two','three','four'])

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three','one']]

	three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12

data[:2]

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

data[data['three']>5] # select all rows whose value in column 'three' is greater than 5

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data < 5

	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False

data[data<5]=0

data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

loc and iloc (available on both DataFrame and Series)

loc: selects by axis labels
iloc: selects by integer position (row number, column number)

data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data.loc[['Colorado','New York'],['two','three']] # select two rows and two columns by label

	two	three
Colorado	5	6
New York	13	14

data.iloc[[1,2],[1,2]] # select two rows and two columns by position

	two	three
Colorado	5	6
Utah	9	10

data.iloc[2] # select the row at position 2 (the third row)

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

Other ways to index

data.loc[:'Utah','two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

data.loc[:,'two':'four']

	two	three	four
Ohio	0	0	0
Colorado	5	6	7
Utah	9	10	11
New York	13	14	15

data.loc[:,'two':'four'][data>6] # combined: slice the columns, then mask with a boolean DataFrame

	two	three	four
Ohio	NaN	NaN	NaN
Colorado	NaN	NaN	7.0
Utah	9.0	10.0	11.0
New York	13.0	14.0	15.0

data.iloc[1:3,2:4][data < 8]

	three	four
Colorado	6.0	7.0
Utah	NaN	NaN

data.get_value(1,2)

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-73-c2c89d22c563> in <module>
----> 1 data.get_value(1,2)


F:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5273                 return self[name]
-> 5274             return object.__getattribute__(self, name)
   5275 
   5276     def __setattr__(self, name: str, value) -> None:


AttributeError: 'DataFrame' object has no attribute 'get_value'

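Integer indexes

Indexing pandas objects with integers does not behave quite like indexing built-in Python sequences. For example, the following raises an error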

ser = pd.Series(np.arange(3.))

ser

0    0.0
1    1.0
2    2.0
dtype: float64

ser[-1] # KeyError: -1

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-76-44969a759c20> in <module>
----> 1 ser[-1]


F:\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    869         key = com.apply_if_callable(key, self)
    870         try:
--> 871             result = self.index.get_value(self, key)
    872 
    873             if not is_scalar(result):


F:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4403         k = self._convert_scalar_indexer(k, kind="getitem")
   4404         try:
-> 4405             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4406         except KeyError as e1:
   4407             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()


pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()


KeyError: -1

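The error arises because the index holds integers, so ser[-1] is ambiguous: does it mean the label -1 or the last row? pandas treats it as a label, and no such label exists.
When the index holds strings there is no such ambiguity, so the same expression works (recent pandas versions deprecate this positional fallback in favor of iloc):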

ser2 = pd.Series(np.arange(3.),index=['a','b','c'])

ser2

a    0.0
b    1.0
c    2.0
dtype: float64

ser2[-1]

2.0

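In the end, loc and iloc are still the recommended way, since they state explicitly whether labels or positions are meant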
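A short sketch of why the explicit accessors help: with an integer index, the same number can mean two different things. The Series below is made up for illustration, with its integer labels deliberately out of positional order.

```python
import pandas as pd

# Integer labels that do not match the positions
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

by_label = ser.loc[0]      # the entry whose label is 0
by_position = ser.iloc[0]  # the first entry
last = ser.iloc[-1]        # the last entry, regardless of labels

print(by_label, by_position, last)
```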

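Summary

Type	Description
df[val]	Select a single column or a list of columns from the DataFrame; as special conveniences it also accepts a boolean array (filter rows), a slice (slice rows), or a boolean DataFrame (set values based on a condition)
df.loc[val]	Select a single row or a subset of rows by label
df.loc[:,val]	Select a single column or a subset of columns by label
df.loc[val1,val2]	Select both rows and columns by label
df.iloc[where]	Select a single row or a subset of rows by integer position
df.iloc[:,where]	Select a single column or a subset of columns by integer position
df.iloc[where_i,where_j]	Select both rows and columns by integer position
df.at[label_i,label_j]	Select a single scalar value by row and column label
df.iat[i,j]	Select a single scalar value by row and column integer position
df.reindex	Select rows or columns by label, in whatever order you specify

get_value and set_value were deprecated and have since been removed (hence the AttributeError above); use at and iat instead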
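As a sketch of the replacement for the failing get_value(1,2) call, at and iat read or write a single cell by label or by position. The frame is rebuilt here with its original values, and the value 100 is arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Scalar lookup by label and by integer position; both hit the same cell
by_label = data.at['Colorado', 'three']
by_position = data.iat[1, 2]

# at and iat can also assign a single value
data.at['Utah', 'one'] = 100

print(by_label, by_position, data.loc['Utah', 'one'])
```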

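2.2.4 Arithmetic and Data Alignment

For Series, values are aligned on the index before the operation is applied
For DataFrame, alignment happens on both rows and columns
Labels without a match in the other object get NaN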

Series

s1 = pd.Series(np.arange(4),index=['a','c','d','e'])

s2 = pd.Series(np.arange(2,6),index=['a','b','c','d'])

s1

a    0
c    1
d    2
e    3
dtype: int32

s2

a    2
b    3
c    4
d    5
dtype: int32

s1+s2

a    2.0
b    NaN
c    5.0
d    7.0
e    NaN
dtype: float64

DataFrame

df1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['Ohio','Texas','Colorado'],columns=['b','c','d'])

df2 = pd.DataFrame(np.arange(12.).reshape(4,3),index=['Utah','Ohio','Texas','Oregon'],columns=['b','d','e'])

df1

	b	c	d
Ohio	0	1	2
Texas	3	4	5
Colorado	6	7	8

df2

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

df1+df2

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

Filling values in arithmetic methods (the fill_value parameter)

df1 = pd.DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))

df1

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11

df2

	a	b	c	d	e
0	0	1	2	3	4
1	5	6	7	8	9
2	10	11	12	13	14
3	15	16	17	18	19

df1+df2

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	11.0	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

Use df1's add method, passing df2 and the fill_value argument

df1.add(df2,fill_value=0)

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	11.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

1/df1

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250	0.200000	0.166667	0.142857
2	0.125	0.111111	0.100000	0.090909

df1.rdiv(1)

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250	0.200000	0.166667	0.142857
2	0.125	0.111111	0.100000	0.090909

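Summary

Method	Description
add, radd	Addition (+)
sub, rsub	Subtraction (-)
mul, rmul	Multiplication (*)
div, rdiv	Division (/)
floordiv, rfloordiv	Floor division (//)
pow, rpow	Exponentiation (**)

Note: a method name prefixed with r has its arguments flipped; for example, 1/df1 is equivalent to df1.rdiv(1)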
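The same pattern holds for the other reflected methods, and fill_value works on Series too. A brief sketch, with a small made-up frame `df` and the two Series from the alignment example rebuilt with literal values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 7).reshape(2, 3), columns=list('abc'))

# rsub flips the operands: df.rsub(10) computes 10 - df
reflected = df.rsub(10)
direct = 10 - df

# fill_value treats a label missing on one side as 0 instead of producing NaN
s1 = pd.Series([0, 1, 2, 3], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])
total = s1.add(s2, fill_value=0)

print(reflected)
print(total)
```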

fill_value can also be used when reindexing

df1.reindex(columns=df2.columns,fill_value=0)

	a	b	c	d	e
0	0	1	2	3	0
1	4	5	6	7	0
2	8	9	10	11	0

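Operations between DataFrame and Series

By default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows
If an index label is not found in either of the two objects, the result is reindexed to the union of the labels and filled with NaN
To match on the rows and broadcast across the columns instead, pass axis='index' or axis=0 to the arithmetic method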

frame =pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

series = frame.iloc[0]

frame

	b	d	e
Utah	0	1	2
Ohio	3	4	5
Texas	6	7	8
Oregon	9	10	11

series

b    0
d    1
e    2
Name: Utah, dtype: int32

frame-series

	b	d	e
Utah	0	0	0
Ohio	3	3	3
Texas	6	6	6
Oregon	9	9	9

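As the example shows, 0, 1, and 2 were subtracted from columns b, d, and e of every row

If an index label is not found in either of the two objects, the result is reindexed to the union of the labels and filled with NaN

For example: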

series2 = pd.Series(np.arange(3),index=['b','e','f'])

frame + series2

	b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN

To match on the rows and broadcast across the columns, pass axis='index' or axis=0

series3 = frame.iloc[:,1]

series3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: int32

frame.sub(series3, axis=0)

	b	d	e
Utah	-1	0	1
Ohio	-1	0	1
Texas	-1	0	1
Oregon	-1	0	1
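A common use of axis='index' is subtracting a per-row statistic. The sketch below centers each row on its own mean, and also shows what happens if the axis argument is forgotten. The names `row_means`, `centered`, and `misaligned` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row, indexed by the row labels
row_means = frame.mean(axis=1)

# Match the Series on the row labels and broadcast across the columns
centered = frame.sub(row_means, axis='index')

# Without axis, the row labels are matched against the column labels,
# nothing lines up, and the result is all NaN
misaligned = frame - row_means

print(centered)
print(misaligned.shape)
```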

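2.2.5 Function Application and Mapping

NumPy ufuncs (element-wise array methods) work on pandas objects
apply() applies a custom function to each column or each row
applymap() applies a custom function to every element, for example to format values as strings (pandas 2.1 renamed it to DataFrame.map and deprecated applymap)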

frame = pd.DataFrame(np.arange(-6,6,).reshape(4,3), index=['Utah','Ohio','Texas','Oregon'])

frame

	0	1	2
Utah	-6	-5	-4
Ohio	-3	-2	-1
Texas	0	1	2
Oregon	3	4	5

np.abs(frame)

	0	1	2
Utah	6	5	4
Ohio	3	2	1
Texas	0	1	2
Oregon	3	4	5

A custom function f

f = lambda x: x.max()-x.min()

frame.apply(f) # axis=0 by default (computed once per column)

0    9
1    9
2    9
dtype: int64

frame.apply(f, axis=1)

Utah      2
Ohio      2
Texas     2
Oregon    2
dtype: int64

def f(x):
    return pd.Series([x.min(), x.max()],index=['min','max'])

frame.apply(f)

	0	1	2
min	-6	-5	-4
max	3	4	5

format1 = lambda x: '%.2f' % x

map and applymap

frame.loc['Utah'].map(format1)

0    -6.00
1    -5.00
2    -4.00
Name: Utah, dtype: object

applymap is the DataFrame counterpart of Series.map: it applies the function to every element

frame.applymap(format1)

	0	1	2
Utah	-6.00	-5.00	-4.00
Ohio	-3.00	-2.00	-1.00
Texas	0.00	1.00	2.00
Oregon	3.00	4.00	5.00
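Since pandas 2.1 deprecates applymap in favor of DataFrame.map, code that must run on both older and newer versions needs some care. One version-independent sketch is to apply Series.map column by column; the hasattr check is just one possible way to detect the newer method.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(-6, 6).reshape(4, 3),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def format1(x):
    # Format a number with two decimal places
    return '%.2f' % x

# Works on old and new pandas: Series.map applied to each column
formatted = frame.apply(lambda col: col.map(format1))

# On pandas 2.1 or later, DataFrame.map is the element-wise method
if hasattr(pd.DataFrame, 'map'):
    formatted_new = frame.map(format1)
    print(formatted_new.equals(formatted))

print(formatted)
```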

2.2.6 Sorting and Ranking

sort_index: sorts by labels
sort_values: sorts by values
rank: assigns each value a rank, with several ways to break ties

obj = pd.Series(np.arange(4), index = ['d','a','c','b'])

obj

d    0
a    1
c    2
b    3
dtype: int32

frame = pd.DataFrame(np.random.randint(100,size=(2,4)), index=[2,0], columns=['d','a','b','c'])

frame

	d	a	b	c
2	71	23	93	25
0	87	25	66	1

sort_index

obj.sort_index()

a    1
b    3
c    2
d    0
dtype: int32

frame.sort_index() # sorts the rows by index label, ascending, by default

	d	a	b	c
0	87	25	66	1
2	71	23	93	25

frame.sort_index(axis=1) # sort the columns by label

	a	b	c	d
2	23	93	25	71
0	25	66	1	87

frame.sort_index(axis=1,ascending=False) # sort the columns in descending order

	d	c	b	a
2	71	25	93	23
0	87	1	66	25

sort_values

obj.sort_values()

d    0
a    1
c    2
b    3
dtype: int32

frame2 = pd.DataFrame(np.random.randint(100,size=(4,4)),index=['4','2','1','3'], columns=['d','b','a','c'])

frame2

	d	b	a	c
4	71	57	78	20
2	89	92	67	70
1	43	9	23	27
3	76	41	14	77

frame2.sort_values(by='b')

	d	b	a	c
1	43	9	23	27
3	76	41	14	77
4	71	57	78	20
2	89	92	67	70

frame2.sort_values(by=list('ab'))

	d	b	a	c
3	76	41	14	77
1	43	9	23	27
2	89	92	67	70
4	71	57	78	20
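Two details of sort_values that the examples above do not show are how missing values are placed and how to mix sort directions across keys. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Missing values are placed at the end by default
s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_s = s.sort_values()

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
df = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
sorted_df = df.sort_values(by=['a', 'b'], ascending=[True, False])

print(sorted_s)
print(sorted_df)
```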

rank

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

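obj.rank(method='first') # ties are ranked in the order they appear in the data: the earlier occurrence gets the smaller rank, e.g. the first 7 gets 6.0 and the second 7 gets 7.0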

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(method='min')

0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

obj.rank(method='max')

0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(method='dense')

0    5.0
1    1.0
2    5.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

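Summary: tie-breaking methods for rank

Method	Description
'average'	Default: assign the average rank to each entry in the equal group (this is what produces the .5 values; in the example above both 7s get 6.5)
'min'	Use the minimum rank for the whole group (in the example above both 7s get 6)
'max'	Use the maximum rank for the whole group (in the example above both 7s get 7)
'first'	Assign ranks in the order the values appear in the data (in the example above the first 7 gets 6 and the second 7 gets 7)
'dense'	Like 'min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group (in the example above both 7s get 5)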
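The table can be checked in one pass by collecting the ranks that each method gives to the two 7s (labels 0 and 2). The sketch also shows descending ranking through the ascending argument, which the notes above do not cover.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Ranks assigned to the two 7s under each tie-breaking method
methods = ['average', 'min', 'max', 'first', 'dense']
ranks_of_sevens = {m: obj.rank(method=m).loc[[0, 2]].tolist() for m in methods}

# ascending=False gives rank 1 to the largest value instead of the smallest
descending = obj.rank(ascending=False)

print(ranks_of_sevens)
print(descending.tolist())
```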

2.2.7 Axis Indexes with Duplicate Labels

obj = pd.Series(np.arange(5),index=['a','a','b','b','c']) # an index may contain the same label more than once

obj

a    0
a    1
b    2
b    3
c    4
dtype: int32

is_unique: whether the index labels are unique

obj.index.is_unique

False

Indexing a duplicated label returns all matching entries

obj['a']

a    0
a    1
dtype: int32

obj[['a','b']]

a    0
a    1
b    2
b    3
dtype: int32

df = pd.DataFrame(np.random.randint(100,size=(4,3)),index=list('aabb'))

df

	0	1	2
a	31	31	44
a	90	5	92
b	84	6	40
b	32	32	38

df.loc['b']

	0	1	2
b	84	6	40
b	32	32	38

df.loc[['a','b']]

	0	1	2
a	31	31	44
a	90	5	92
b	84	6	40
b	32	32	38
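Because the return type depends on whether a label is duplicated, code that indexes such objects usually has to handle both cases or remove the duplicates first. A sketch of two possible clean-up strategies, neither of which appears in the original notes: keep the first occurrence, or aggregate per label.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])

# A unique label yields a scalar, a duplicated label yields a Series
unique_is_scalar = np.isscalar(obj['c'])
duplicate_is_series = isinstance(obj['a'], pd.Series)

# Strategy 1: keep only the first entry for each label
first_only = obj[~obj.index.duplicated(keep='first')]

# Strategy 2: combine the entries that share a label
summed = obj.groupby(level=0).sum()

print(first_only)
print(summed)
```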