2.2 Essential Functionality
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
2.2.1 Reindexing
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index = ['d','b','a','c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
reindex
obj2 = obj.reindex(['a','b','c','d','e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
The method option
obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])
obj3
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(np.arange(6),method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
Using reindex on a DataFrame
frame = pd.DataFrame(np.arange(9).reshape(3,3), index=['a','c','d'], columns=['Ohio','Texas','California'])
frame
| | Ohio | Texas | California |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |
frame2 = frame.reindex(['a','b','c','d'])
frame2
| | Ohio | Texas | California |
|---|---|---|---|
| a | 0.0 | 1.0 | 2.0 |
| b | NaN | NaN | NaN |
| c | 3.0 | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |
* For a DataFrame, the columns keyword reindexes the columns
frame2.reindex(columns=['Texas','Utah','California'])
| | Texas | Utah | California |
|---|---|---|---|
| a | 1.0 | NaN | 2.0 |
| b | NaN | NaN | NaN |
| c | 4.0 | NaN | 5.0 |
| d | 7.0 | NaN | 8.0 |
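Parameters of reindex
| Parameter | Description |
|---|---|
| index | New sequence to use as the index. It can be an Index instance, a NumPy array, or any other sequence. Labels in the new index are kept (those absent from the original are filled with NaN); labels not in the new index are dropped, even if they existed before |
| method | Interpolation (fill) method, such as 'ffill' to fill forward or 'bfill' to fill backward |
| fill_value | Value to substitute for the missing data introduced by reindexing |
| limit | When forward- or backward-filling, the maximum number of consecutive elements to fill |
| tolerance | When forward- or backward-filling, the maximum distance (absolute numeric difference) allowed between an original label and an inexact match |
| level | Match a simple Index on the given level of a MultiIndex; otherwise select a subset |
| copy | Defaults to True; if False, the data is not copied when the new index is identical to the old one |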
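The fill_value and limit parameters from the table are not exercised by the examples above. The following is a minimal sketch of how they might be used; the variable names filled and limited are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# fill_value replaces the NaN that would otherwise appear for the new label 'e'
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' requires a monotonic index; limit=1 forward-fills at most
# one consecutive missing label after each existing one
limited = colors.reindex(np.arange(7), method='ffill', limit=1)

print(filled)
print(limited)
```

With limit=1, label 5 is filled from label 4, while label 6 is two steps away from the last known value and stays missing.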
2.2.2 Dropping Entries from an Axis
On a Series
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d','c'])
a 0.0
b 1.0
e 4.0
dtype: float64
On a DataFrame
data = pd.DataFrame(np.arange(16).reshape(4,4), index=['Ohio','Colorado','Utah','New York'], columns=['one','two','three','four'])
data
| | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.drop(['Colorado','Ohio'])
| | one | two | three | four |
|---|---|---|---|---|
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.drop(['two','four'],axis=1)
| | one | three |
|---|---|---|
| Ohio | 0 | 2 |
| Colorado | 4 | 6 |
| Utah | 8 | 10 |
| New York | 12 | 14 |
data.drop(['two','four'],axis='columns')
| | one | three |
|---|---|---|
| Ohio | 0 | 2 |
| Colorado | 4 | 6 |
| Utah | 8 | 10 |
| New York | 12 | 14 |
The inplace option
obj.drop('c',inplace=True)
obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
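The difference between the default behavior of drop and inplace=True is easy to miss, so here is a small sketch that checks both. The names still_has_c and result are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# By default, drop returns a new object and leaves the original untouched
new_obj = obj.drop('c')
still_has_c = 'c' in obj.index

# With inplace=True the original is modified and the call returns None
result = obj.drop('c', inplace=True)

print(still_has_c, result, list(obj.index))
```

Because inplace=True destroys the dropped data and returns None, assigning its result to a variable is a common mistake.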
2.2.3 Indexing, Selection, and Filtering
obj = pd.Series(np.arange(4.),index=['a','b','c','d'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
Some examples
Series
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c 2.0
d 3.0
dtype: float64
obj[['b','a','d']]
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1,3]]
b 1.0
d 3.0
dtype: float64
obj[obj<2]
a 0.0
b 1.0
dtype: float64
Slicing with labels differs from ordinary Python slicing: the endpoint is inclusive
obj['a':'c']
a 0.0
b 1.0
c 2.0
dtype: float64
obj['b':'c']=5
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
obj['b':'c']=[6,6]
obj
a 0.0
b 6.0
c 6.0
d 3.0
dtype: float64
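To make the inclusive endpoint concrete, this sketch contrasts a label slice with a positional slice on the same Series. It uses loc and iloc so that the intent of each slice is explicit; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints are included
by_label = obj.loc['a':'c']

# Position-based slice: the stop position is excluded, as in plain Python
by_pos = obj.iloc[0:2]

print(list(by_label.index))
print(list(by_pos.index))
```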
DataFrame
data = pd.DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'], columns=['one','two','three','four'])
data
| | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three','one']]
| | three | one |
|---|---|---|
| Ohio | 2 | 0 |
| Colorado | 6 | 4 |
| Utah | 10 | 8 |
| New York | 14 | 12 |
data[:2]
| | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
data[data['three']>5]
| | one | two | three | four |
|---|---|---|---|---|
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data < 5
| | one | two | three | four |
|---|---|---|---|---|
| Ohio | True | True | True | True |
| Colorado | True | False | False | False |
| Utah | False | False | False | False |
| New York | False | False | False | False |
data[data<5]=0
data
| | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 0 | 0 | 0 |
| Colorado | 0 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
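loc and iloc (shown here on a DataFrame; both are also available on a Series)
- loc: select by axis labels
- iloc: select by integer position (row position, column position)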
data
| | one | two | three | four |
|---|---|---|---|---|
| Ohio | 0 | 0 | 0 | 0 |
| Colorado | 0 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.loc[['Colorado','New York'],['two','three']]
| | two | three |
|---|---|---|
| Colorado | 5 | 6 |
| New York | 13 | 14 |
data.iloc[[1,2],[1,2]]
| | two | three |
|---|---|---|
| Colorado | 5 | 6 |
| Utah | 9 | 10 |
data.iloc[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
Other ways to index
data.loc[:'Utah','two']
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
data.loc[:,'two':'four']
| | two | three | four |
|---|---|---|---|
| Ohio | 0 | 0 | 0 |
| Colorado | 5 | 6 | 7 |
| Utah | 9 | 10 | 11 |
| New York | 13 | 14 | 15 |
data.loc[:,'two':'four'][data>6]
| | two | three | four |
|---|---|---|---|
| Ohio | NaN | NaN | NaN |
| Colorado | NaN | NaN | 7.0 |
| Utah | 9.0 | 10.0 | 11.0 |
| New York | 13.0 | 14.0 | 15.0 |
data.iloc[1:3,2:4][data < 8]
| | three | four |
|---|---|---|
| Colorado | 6.0 | 7.0 |
| Utah | NaN | NaN |
data.get_value(1,2)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-73-c2c89d22c563> in <module>
----> 1 data.get_value(1,2)
F:\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'get_value'
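Integer indexes
Indexing pandas objects with integers works a little differently from indexing built-in Python sequences. For example, the following raises an error: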
ser = pd.Series(np.arange(3.))
ser
0 0.0
1 1.0
2 2.0
dtype: float64
ser[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-76-44969a759c20> in <module>
----> 1 ser[-1]
F:\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
869 key = com.apply_if_callable(key, self)
870 try:
--> 871 result = self.index.get_value(self, key)
872
873 if not is_scalar(result):
F:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
4403 k = self._convert_scalar_indexer(k, kind="getitem")
4404 try:
-> 4405 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4406 except KeyError as e1:
4407 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: -1
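The error comes from an ambiguity: the index could itself contain -1, so should ser[-1] mean the label -1 or the last row? With an integer index, pandas treats the key as a label, and -1 is not among the labels 0, 1, 2.
When the index holds strings instead, there is no such ambiguity and the expression works (pandas 2.1 and later warn that this positional fallback is deprecated):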
ser2 = pd.Series(np.arange(3.),index=['a','b','c'])
ser2
a 0.0
b 1.0
c 2.0
dtype: float64
ser2[-1]
2.0
In the end, loc (labels) and iloc (integer positions) remain the recommended way to index
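As a minimal sketch of that recommendation, the snippet below selects the last element with iloc regardless of the index type, and shows that loc treats -1 strictly as a label. The names last, found, and last2 are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                          # integer index 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])  # string index

# iloc is always positional, so -1 means "last element" for both Series
last = ser.iloc[-1]
last2 = ser2.iloc[-1]

# loc is always label-based, so -1 is looked up as a label and is not found
try:
    ser.loc[-1]
    found = True
except KeyError:
    found = False

print(last, last2, found)
```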
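Summary
| Type | Description |
|---|---|
| df[val] | Select a single column or a list of columns from a DataFrame; as conveniences it also accepts a boolean array (filters rows), a slice (slices rows), or a boolean DataFrame (sets values based on a condition) |
| df.loc[val] | Select a single row or a subset of rows by label |
| df.loc[:,val] | Select a single column or a subset of columns by label |
| df.loc[val1,val2] | Select rows and columns by label at the same time |
| df.iloc[where] | Select a single row or a subset of rows by integer position |
| df.iloc[:,where] | Select a single column or a subset of columns by integer position |
| df.iloc[where_i,where_j] | Select rows and columns by integer position at the same time |
| df.at[label_i,label_j] | Select a single scalar value by row and column label |
| df.iat[i,j] | Select a single scalar value by row and column integer position |
| df.reindex | Select rows or columns by label, in whatever order you specify |
get_value and set_value were deprecated and then removed in pandas 1.0, which is why the data.get_value call above raises AttributeError; use at and iat instead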
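Since at and iat replace get_value and set_value, here is a short sketch of scalar access with both, using the same data frame as earlier in this section. The variable names by_label and by_pos are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# at: scalar lookup by row label and column label
by_label = data.at['Colorado', 'three']

# iat: scalar lookup by row position and column position
by_pos = data.iat[1, 2]

# Both accessors also support assignment of a single value
data.at['Utah', 'one'] = 80

print(by_label, by_pos, data.iat[2, 0])
```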
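2.2.4 Arithmetic and Data Alignment
- For a Series, values are aligned on the index labels before the operation is applied
- For a DataFrame, alignment happens on both the rows and the columns
- Labels that are not present in both objects produce NaN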
Series
s1 = pd.Series(np.arange(4),index=['a','c','d','e'])
s2 = pd.Series(np.arange(2,6),index=['a','b','c','d'])
s1
a 0
c 1
d 2
e 3
dtype: int32
s2
a 2
b 3
c 4
d 5
dtype: int32
s1+s2
a 2.0
b NaN
c 5.0
d 7.0
e NaN
dtype: float64
DataFrame
df1 = pd.DataFrame(np.arange(9).reshape(3,3),index=['Ohio','Texas','Colorado'],columns=['b','c','d'])
df2 = pd.DataFrame(np.arange(12.).reshape(4,3),index=['Utah','Ohio','Texas','Oregon'],columns=['b','d','e'])
df1
| | b | c | d |
|---|---|---|---|
| Ohio | 0 | 1 | 2 |
| Texas | 3 | 4 | 5 |
| Colorado | 6 | 7 | 8 |
df2
| | b | d | e |
|---|---|---|---|
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |
df1+df2
| | b | c | d | e |
|---|---|---|---|---|
| Colorado | NaN | NaN | NaN | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Oregon | NaN | NaN | NaN | NaN |
| Texas | 9.0 | NaN | 12.0 | NaN |
| Utah | NaN | NaN | NaN | NaN |
Arithmetic methods with fill values (the fill_value parameter)
df1 = pd.DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))
df1
df2
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 | 9 |
| 2 | 10 | 11 | 12 | 13 | 14 |
| 3 | 15 | 16 | 17 | 18 | 19 |
df1+df2
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
| 1 | 9.0 | 11.0 | 13.0 | 15.0 | NaN |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
Call the add method on df1, passing df2 and a fill_value argument
df1.add(df2,fill_value=0)
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
| 1 | 9.0 | 11.0 | 13.0 | 15.0 | 9.0 |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
1/df1
| | a | b | c | d |
|---|---|---|---|---|
| 0 | inf | 1.000000 | 0.500000 | 0.333333 |
| 1 | 0.250 | 0.200000 | 0.166667 | 0.142857 |
| 2 | 0.125 | 0.111111 | 0.100000 | 0.090909 |
df1.rdiv(1)
| | a | b | c | d |
|---|---|---|---|---|
| 0 | inf | 1.000000 | 0.500000 | 0.333333 |
| 1 | 0.250 | 0.200000 | 0.166667 | 0.142857 |
| 2 | 0.125 | 0.111111 | 0.100000 | 0.090909 |
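Summary
| Method | Description |
|---|---|
| add, radd | Addition (+) |
| sub, rsub | Subtraction (-) |
| mul, rmul | Multiplication (*) |
| div, rdiv | Division (/) |
| floordiv, rfloordiv | Floor division (//) |
| pow, rpow | Exponentiation (**) |
**Note:** a method name prefixed with r has its arguments flipped; for example, 1/df1 is equivalent to df1.rdiv(1)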
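The sketch below puts the two ideas of this subsection side by side: fill_value in an arithmetic method, and the equivalence of an operator with its reversed method. It rebuilds df1 and df2 with float values; the names plain, filled, and same are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

# Plain addition: positions missing from either operand become NaN
plain = df1 + df2

# add with fill_value=0: a value missing from one operand is treated as 0,
# so only positions missing from both operands would remain NaN
filled = df1.add(df2, fill_value=0)

# The reversed method flips the operands: df1.rdiv(1) computes 1 / df1
same = (1 / df1).equals(df1.rdiv(1))

print(int(plain.isna().sum().sum()), int(filled.isna().sum().sum()), same)
```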
The fill_value parameter can also be used when reindexing
df1.reindex(columns=df2.columns,fill_value=0)
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 0 |
| 1 | 4 | 5 | 6 | 7 | 0 |
| 2 | 8 | 9 | 10 | 11 | 0 |
Operations between DataFrame and Series
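- By default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows
- If a label is found in only one of the two objects, the result is reindexed to the union of labels and filled with NaN
- To match on the rows and broadcast across the columns instead, pass axis='index' (or axis=0) to the arithmetic method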
frame =pd.DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
series = frame.iloc[0]
frame
| | b | d | e |
|---|---|---|---|
| Utah | 0 | 1 | 2 |
| Ohio | 3 | 4 | 5 |
| Texas | 6 | 7 | 8 |
| Oregon | 9 | 10 | 11 |
series
b 0
d 1
e 2
Name: Utah, dtype: int32
frame-series
| | b | d | e |
|---|---|---|---|
| Utah | 0 | 0 | 0 |
| Ohio | 3 | 3 | 3 |
| Texas | 6 | 6 | 6 |
| Oregon | 9 | 9 | 9 |
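As the example shows, 0, 1, and 2 were subtracted from columns b, d, and e respectively in every row
If a label is found in only one of the two objects, the result is reindexed to the union of labels and filled with NaN
For example: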
series2 = pd.Series(np.arange(3),index=['b','e','f'])
frame + series2
| | b | d | e | f |
|---|---|---|---|---|
| Utah | 0.0 | NaN | 3.0 | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Texas | 6.0 | NaN | 9.0 | NaN |
| Oregon | 9.0 | NaN | 12.0 | NaN |
To match on the rows and broadcast across the columns, pass axis='index' (or axis=0)
series3 = frame.iloc[:,1]
series3
Utah 1
Ohio 4
Texas 7
Oregon 10
Name: d, dtype: int32
frame.sub(series3, axis=0)
| | b | d | e |
|---|---|---|---|
| Utah | -1 | 0 | 1 |
| Ohio | -1 | 0 | 1 |
| Texas | -1 | 0 | 1 |
| Oregon | -1 | 0 | 1 |
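The two broadcasting directions can be compared directly. This sketch rebuilds the frame from above and subtracts first a row, then a column; the names row, col, by_row, and by_col are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
row = frame.iloc[0]   # Series indexed by the column labels b, d, e
col = frame['d']      # Series indexed by the row labels

# Default: match the Series index on the columns, repeat for every row
by_row = frame - row

# axis='index': match the Series index on the rows, repeat for every column
by_col = frame.sub(col, axis='index')

print(by_row)
print(by_col)
```

The operator form frame - row gives the same result as frame.sub(row, axis='columns'), so the method form is only needed when matching on the rows.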
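2.2.5 Function Application and Mapping
- NumPy ufuncs (element-wise array methods) work on pandas objects
- apply(): applies a custom function to each column or each row
- applymap(): applies a custom function to every element, for example to format each value as a string (pandas 2.1 deprecated it in favor of DataFrame.map)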
frame = pd.DataFrame(np.arange(-6,6,).reshape(4,3), index=['Utah','Ohio','Texas','Oregon'])
frame
| | 0 | 1 | 2 |
|---|---|---|---|
| Utah | -6 | -5 | -4 |
| Ohio | -3 | -2 | -1 |
| Texas | 0 | 1 | 2 |
| Oregon | 3 | 4 | 5 |
np.abs(frame)
| | 0 | 1 | 2 |
|---|---|---|---|
| Utah | 6 | 5 | 4 |
| Ohio | 3 | 2 | 1 |
| Texas | 0 | 1 | 2 |
| Oregon | 3 | 4 | 5 |
A custom function f
f = lambda x: x.max()-x.min()
frame.apply(f)
0 9
1 9
2 9
dtype: int64
frame.apply(f, axis=1)
Utah 2
Ohio 2
Texas 2
Oregon 2
dtype: int64
def f(x):
    return pd.Series([x.min(), x.max()],index=['min','max'])
frame.apply(f)
format1 = lambda x: '%.2f' % x
map and applymap
frame.loc['Utah'].map(format1)
0 -6.00
1 -5.00
2 -4.00
Name: Utah, dtype: object
applymap takes its name from the Series map method, which applies a function element-wise
frame.applymap(format1)
| | 0 | 1 | 2 |
|---|---|---|---|
| Utah | -6.00 | -5.00 | -4.00 |
| Ohio | -3.00 | -2.00 | -1.00 |
| Texas | 0.00 | 1.00 | 2.00 |
| Oregon | 3.00 | 4.00 | 5.00 |
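Because pandas 2.1 introduced DataFrame.map as the replacement for applymap, code meant to run on both older and newer releases may need to pick whichever is available. The following is one possible sketch of that; the helper name elementwise is an assumption for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(-6, 6).reshape(4, 3),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def format1(x):
    # Format a single value with two decimal places
    return '%.2f' % x

# Hypothetical compatibility shim: prefer DataFrame.map when the installed
# pandas provides it, otherwise fall back to the older applymap
elementwise = frame.map if hasattr(pd.DataFrame, 'map') else frame.applymap
formatted = elementwise(format1)

print(formatted)
```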
2.2.6 Sorting and Ranking
- sort_index: sorts by label
- sort_values: sorts by value
- rank: assigns a rank to each value, with tied values forming a group
obj = pd.Series(np.arange(4), index = ['d','a','c','b'])
obj
d 0
a 1
c 2
b 3
dtype: int32
frame = pd.DataFrame(np.random.randint(100,size=(2,4)), index=[2,0], columns=['d','a','b','c'])
frame
sort_index
obj.sort_index()
a 1
b 3
c 2
d 0
dtype: int32
frame.sort_index()
frame.sort_index(axis=1)
frame.sort_index(axis=1,ascending=False)
sort_values
obj.sort_values()
d 0
a 1
c 2
b 3
dtype: int32
frame2 = pd.DataFrame(np.random.randint(100,size=(4,4)),index=['4','2','1','3'], columns=['d','b','a','c'])
frame2
| | d | b | a | c |
|---|---|---|---|---|
| 4 | 71 | 57 | 78 | 20 |
| 2 | 89 | 92 | 67 | 70 |
| 1 | 43 | 9 | 23 | 27 |
| 3 | 76 | 41 | 14 | 77 |
frame2.sort_values(by='b')
| | d | b | a | c |
|---|---|---|---|---|
| 1 | 43 | 9 | 23 | 27 |
| 3 | 76 | 41 | 14 | 77 |
| 4 | 71 | 57 | 78 | 20 |
| 2 | 89 | 92 | 67 | 70 |
frame2.sort_values(by=list('ab'))
| | d | b | a | c |
|---|---|---|---|---|
| 3 | 76 | 41 | 14 | 77 |
| 1 | 43 | 9 | 23 | 27 |
| 2 | 89 | 92 | 67 | 70 |
| 4 | 71 | 57 | 78 | 20 |
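The frames above are filled with random integers, so their sort results change on every run. The sketch below uses a small fixed frame (made up for illustration) to show how sorting by several columns resolves ties, including a per-column ascending flag.

```python
import pandas as pd

# Fixed illustrative data, not taken from the random frames above
frame = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})

# Sort by a first; rows with equal a are then ordered by b
by_a_then_b = frame.sort_values(by=['a', 'b'])

# ascending accepts one flag per column: a ascending, b descending
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(list(by_a_then_b.index))
print(list(mixed.index))
```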
rank
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(method='min')
0 6.0
1 1.0
2 6.0
3 4.0
4 3.0
5 2.0
6 4.0
dtype: float64
obj.rank(method='max')
0 7.0
1 1.0
2 7.0
3 5.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(method='dense')
0 5.0
1 1.0
2 5.0
3 4.0
4 3.0
5 2.0
6 4.0
dtype: float64
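Summary: tie-breaking methods for rank
| Method | Description |
|---|---|
| 'average' | Default: assign the average rank to each entry in a group of equal values (this is where the .5 ranks come from; in the example above both 7s get 6.5) |
| 'min' | Use the minimum rank for the whole group (in the example above both 7s get 6) |
| 'max' | Use the maximum rank for the whole group (in the example above both 7s get 7) |
| 'first' | Assign ranks in the order the values appear in the data (in the example above the first 7 gets 6 and the second 7 gets 7) |
| 'dense' | Like 'min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group (in the example above both 7s get 5) |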
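To see all five tie-breaking methods at once, the ranks can be collected into a single DataFrame with one column per method. This is a small sketch using the same Series as above; the name ranks is illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-breaking method, one row per original element
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

Rows 0 and 2 hold the two 7s, so they are the rows where the columns disagree most visibly.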
2.2.7 Axis Indexes with Duplicate Labels
obj = pd.Series(np.arange(5),index=['a','a','b','b','c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int32
is_unique: whether the index labels are unique
obj.index.is_unique
False
Selecting a duplicated label returns every entry that carries it
obj['a']
a 0
a 1
dtype: int32
obj[['a','b']]
a 0
a 1
b 2
b 3
dtype: int32
df = pd.DataFrame(np.random.randint(100,size=(4,3)),index=list('aabb'))
df
| | 0 | 1 | 2 |
|---|---|---|---|
| a | 31 | 31 | 44 |
| a | 90 | 5 | 92 |
| b | 84 | 6 | 40 |
| b | 32 | 32 | 38 |
df.loc['b']
df.loc[['a','b']]
| | 0 | 1 | 2 |
|---|---|---|---|
| a | 31 | 31 | 44 |
| a | 90 | 5 | 92 |
| b | 84 | 6 | 40 |
| b | 32 | 32 | 38 |
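One practical consequence of duplicate labels is that the return type of a selection depends on whether the label repeats. The sketch below checks this on the Series from this section and also shows one way to keep only the first entry per label; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])

dup = obj['a']      # label occurs twice: the result is a Series
single = obj['c']   # label occurs once: the result is a scalar

# Index.duplicated marks repeats; negating the mask keeps the first
# occurrence of each label
first_only = obj[~obj.index.duplicated(keep='first')]

print(type(dup).__name__, single, first_only.tolist())
```

Code that assumes a scalar result can therefore break on data with repeated labels, so checking index.is_unique first may be worthwhile.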