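Overview
Index alignment is an important step in data cleaning. pandas lines up the operands by index label before computing, puts NaN at every position where a label exists on only one side, and lets you fill those NaN values afterwards.
Aligned operations on Series
1. Series align on their row index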
import numpy as np
import pandas as pd
s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))
print('s1: ')
print(s1)
print('')
print('s2: ')
print(s2)
Output:
s1:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
s2:
0 20
1 21
2 22
3 23
4 24
dtype: int64
2. Aligned arithmetic on Series
s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))
print(s1)
print(s2)
print(s1+s2)
Output:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
0 20
1 21
2 22
3 23
4 24
dtype: int64
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
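The example above uses integer labels, which can make alignment look positional. As a minimal sketch with made-up data (the names prices and qty are illustrative only), string labels in a different order show that values are matched by label:

```python
import pandas as pd

# Two Series whose labels overlap only partially and appear in a different order.
prices = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
qty = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Values are matched by label, not by position:
# "b" -> 2.0 * 20 and "c" -> 3.0 * 10.
# "a" and "d" exist on one side only, so they become NaN.
total = prices * qty
print(total)
```

The result index is the union of both indexes, so it is longer than either input.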
Aligned operations on DataFrame
1. DataFrames align on both the row index and the column index
df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
print('df1: ')
print(df1)
print('')
print('df2: ')
print(df2)
Output:
df1:
a b
0 1.0 1.0
1 1.0 1.0
df2:
a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
2. Aligned arithmetic on DataFrames
df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
print('df1: ')
print(df1)
print('')
print('df2: ')
print(df2)
print('df1+df2: ')
print(df1 + df2)
Output:
df1:
a b
0 1.0 1.0
1 1.0 1.0
df2:
a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
df1+df2:
a b c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN
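To see what alignment does before the addition happens, one option is DataFrame.align, which returns both frames reindexed to a common set of labels. This is only a sketch of the idea, reusing the same df1 and df2 as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.ones((3, 3)), columns=["a", "b", "c"])

# align() reindexes both frames to the union of their row and column labels.
left, right = df1.align(df2, join="outer")
print(left)   # df1 padded with NaN in row 2 and in column "c"
print(right)  # df2 already covers every label, so nothing is padded

# Adding the aligned frames leaves NaN wherever df1 had no value.
aligned_sum = left + right
print(aligned_sum)
```

Passing join="inner" instead would keep only the labels that both frames share, which avoids NaN at the cost of dropping rows and columns.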
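Filling unaligned data during an operation
1. fill_value
When you call add, sub, div, or mul, pass fill_value to supply a substitute for values that are missing on one side. The operation then uses that substitute instead of producing NaN. A position that is missing on both sides still yields NaN.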
import pandas as pd
import numpy as np
# df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
# # Build a Series from a dict
# ser_data = {"a": 17.8, "b": 20.1, "c": 16.5,"d":12}
# ser_obj = pd.Series(ser_data)
s1 = pd.Series(range(10, 20), index = range(10))
s2 = pd.Series(range(20, 25), index = range(5))
print(s1)
print(s2)
print(s1.add(s2, fill_value = -1))
df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])
print(df1)
print(df2)
print(df1.sub(df2, fill_value = 2.))
Output:
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
0 20
1 21
2 22
3 23
4 24
dtype: int64
0 30.0
1 32.0
2 34.0
3 36.0
4 38.0
5 14.0
6 15.0
7 16.0
8 17.0
9 18.0
dtype: float64
a b
0 1.0 1.0
1 1.0 1.0
a b c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
a b c
0 0.0 0.0 1.0
1 0.0 0.0 1.0
2 1.0 1.0 1.0
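The outputs above never hit the case where a value is missing on both sides. A small sketch with made-up data illustrates it, and also shows filling whatever NaN remains after the operation:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, np.nan, 30.0], index=["a", "b", "d"])

# fill_value replaces a value that is missing on ONE side before adding:
# "c" -> 3.0 + 0 and "d" -> 0 + 30.0.
# Label "b" is NaN on both sides, so it stays NaN.
result = s1.add(s2, fill_value=0)
print(result)

# Any NaN that remains can be filled after the operation.
filled = result.fillna(0)
print(filled)
```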
Applying functions in pandas
1. NumPy functions can be used directly
df = pd.DataFrame(np.random.randn(5,4) - 1)
print(df)
print(np.abs(df))
Output:
0 1 2 3
0 -0.638228 -0.615340 -2.416771 -0.521187
1 -0.978901 -0.765940 -0.821583 -0.109666
2 -0.182581 -0.820414 -0.497785 1.638130
3 -1.398201 0.893015 -1.109652 -1.740068
4 -0.079365 -0.750413 0.847062 -1.175580
0 1 2 3
0 0.638228 0.615340 2.416771 0.521187
1 0.978901 0.765940 0.821583 0.109666
2 0.182581 0.820414 0.497785 1.638130
3 1.398201 0.893015 1.109652 1.740068
4 0.079365 0.750413 0.847062 1.175580
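Because the random data above changes on every run, here is a small deterministic sketch (the labels x, y, a, b are made up) showing that a NumPy ufunc keeps the index and columns of the DataFrame it is applied to:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 4.0], [9.0, 16.0]], index=["x", "y"], columns=["a", "b"])

# A NumPy ufunc is applied element by element and returns a DataFrame
# with the same index and columns as the input.
roots = np.sqrt(df)
print(roots)
print(type(roots).__name__)
```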
2. Use apply to apply a function to each column or each row
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)
print(df.apply(lambda x: x.max()))
Output:
0 1 2 3
0 -0.672592 -0.917094 -1.698291 -2.683744
1 -1.593442 0.308978 -0.668113 -0.867197
2 -1.023184 -0.406812 -1.993301 -0.516704
3 -0.666674 -0.524327 -2.032358 0.192416
4 -0.466286 -1.319539 -1.643544 -1.137968
0 -0.466286
1 0.308978
2 -0.668113
3 0.192416
dtype: float64
Pay attention to the axis. With the default axis=0, each column is passed to the function.
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)
print(df.apply(lambda x: x.max()))
# With axis=1, each row is passed to the function
print(df.apply(lambda x : x.max(), axis=1))
Output:
0 1 2 3
0 -1.053992 -0.627906 -2.195281 -0.433810
1 -1.838847 0.821711 0.005306 -0.485479
2 -0.194641 -0.608357 0.476059 -0.989364
3 -0.935286 0.370543 -0.316234 -0.482919
4 -0.142188 -2.685907 -0.757193 -0.150942
0 -0.142188
1 0.821711
2 0.476059
3 -0.150942
dtype: float64
0 -0.433810
1 0.821711
2 0.476059
3 0.370543
4 -0.142188
dtype: float64
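The same axis behavior is easier to check by hand on fixed data. In this sketch, value_range is a hypothetical helper (not part of pandas) that receives one column or one row as a Series:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 9]})

def value_range(x):
    # x is a Series: one column when axis=0, one row when axis=1.
    return x.max() - x.min()

col_range = df.apply(value_range)          # axis=0: one result per column
row_range = df.apply(value_range, axis=1)  # axis=1: one result per row
print(col_range)
print(row_range)
```

The result for axis=0 is indexed by column labels, while the result for axis=1 is indexed by row labels.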
3. Use applymap to apply a function to every element
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)
# applymap applies f2 to every element (pandas 2.1+ deprecates it in favor of DataFrame.map)
f2 = lambda x : '%.2f' % x
print(df.applymap(f2))
Output:
0 1 2 3
0 -1.477573 -2.256976 -1.665249 0.381750
1 -1.748229 -0.457566 -1.138169 -1.741856
2 -1.456192 -0.596993 -1.293459 1.057294
3 -0.845528 -0.725874 -2.720255 0.472505
4 -0.927104 -1.748213 -0.382931 0.046957
0 1 2 3
0 -1.48 -2.26 -1.67 0.38
1 -1.75 -0.46 -1.14 -1.74
2 -1.46 -0.60 -1.29 1.06
3 -0.85 -0.73 -2.72 0.47
4 -0.93 -1.75 -0.38 0.05
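Since pandas 2.1 introduced DataFrame.map and deprecated applymap, code that has to run on both older and newer versions can pick whichever method exists. The following is one possible sketch of that approach; fmt and elementwise are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.234, -5.678], "b": [0.5, 2.0]})

def fmt(x):
    # Format a single value with two decimal places.
    return f"{x:.2f}"

# DataFrame.map exists in pandas >= 2.1; older versions only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(fmt)
print(formatted)

# For a single column, Series.map applies the function element by element.
print(df["a"].map(fmt))
```

Note that the formatted result holds strings, so it is meant for display rather than further arithmetic.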
Sorting
1. Sorting by index: sort_index()
Sorting is ascending by default. Pass ascending=False for descending order.
s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(s4)
# Sort by index; sort_index() returns a new Series and leaves s4 unchanged
print(s4.sort_index())
Output:
0 10
2 11
3 12
4 13
3 14
dtype: int64
0 10
2 11
3 12
3 14
4 13
dtype: int64
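With a DataFrame, pay attention to the axis. axis=0 (the default) sorts the row labels and axis=1 sorts the column labels: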
df4 = pd.DataFrame(np.random.randn(3, 5),
index=np.random.randint(3, size=3),
columns=np.random.randint(5, size=5))
print(df4)
df4_isort = df4.sort_index(axis=1, ascending=False)
print(df4_isort) # 4 2 1 1 0
Output:
1 1 4 2 0
0 0.661257 -1.022631 0.337867 -0.680210 0.018720
2 0.486521 -0.617665 -1.566189 1.484633 0.284891
2 -0.902534 2.621820 -0.278090 -0.807439 1.121617
4 2 1 1 0
0 0.337867 -0.680210 0.661257 -1.022631 0.018720
2 -1.566189 1.484633 0.486521 -0.617665 0.284891
2 -0.278090 -0.807439 -0.902534 2.621820 1.121617
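With random labels it is hard to predict the order. A deterministic sketch with made-up labels shows both axes and confirms that sort_index returns a new object instead of modifying the original:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=[2, 0, 1],
    columns=["c", "a", "b"],
)

by_rows = df.sort_index()                         # row labels ascending: 0, 1, 2
by_cols = df.sort_index(axis=1, ascending=False)  # column labels descending: c, b, a
print(by_rows)
print(by_cols)

# The original frame keeps its order.
print(list(df.index))
```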
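2. Sorting by value
sort_values(by='column name') sorts the rows by the values of one column. The column name must be unique; if another column has the same name, pandas raises an error.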
df4 = pd.DataFrame(np.random.randn(3, 5))
print(df4)
# Sort by the values in column 0, largest first
df4_vsort = df4.sort_values(by=0, ascending=False)
print(df4_vsort)
Output:
0 1 2 3 4
0 -0.579405 1.055458 -2.274356 -1.215769 1.582240
1 2.081478 -0.687347 0.854755 -0.011375 -2.779123
2 1.824004 -1.294691 0.940245 1.626087 -0.539030
0 1 2 3 4
1 2.081478 -0.687347 0.854755 -0.011375 -2.779123
2 1.824004 -1.294691 0.940245 1.626087 -0.539030
0 -0.579405 1.055458 -2.274356 -1.215769 1.582240
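As a sketch with made-up data, sort_values also accepts a list of columns with a sort direction for each, and the duplicate-name error mentioned above can be reproduced as follows (the city and score columns are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"city": ["B", "A", "B", "A"], "score": [3, 1, 2, 4]})

# Sort by city ascending, then by score descending within each city.
ordered = df.sort_values(by=["city", "score"], ascending=[True, False])
print(ordered)

# A duplicated column label makes by="x" ambiguous, so pandas refuses to sort.
dup = pd.DataFrame([[2, 1], [1, 2]], columns=["x", "x"])
try:
    dup.sort_values(by="x")
    failed = False
except ValueError as exc:
    failed = True
    print("sort failed:", exc)
```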
Handling missing data
df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
[np.nan, 4., np.nan], [1., 2., 3.]])
print(df_data.head())
Output:
0 1 2
0 -3.094288 -0.914912 2.419605
1 1.000000 2.000000 NaN
2 NaN 4.000000 NaN
3 1.000000 2.000000 3.000000
Detect missing values: isnull()
Drop missing data: dropna() drops the rows (axis=0, the default) or columns (axis=1) that contain NaN
Fill missing data: fillna()
df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
[np.nan, 4., np.nan], [1., 2., 3.]])
print(df_data.head())
# isnull
print(df_data.isnull())
# dropna
print(df_data.dropna())
print(df_data.dropna(axis=1))
# fillna
print(df_data.fillna(-100.))
Output:
0 1 2
0 -0.390745 1.712754 -0.156704
1 1.000000 2.000000 NaN
2 NaN 4.000000 NaN
3 1.000000 2.000000 3.000000
0 1 2
0 False False False
1 False False True
2 True False True
3 False False False
0 1 2
0 -0.390745 1.712754 -0.156704
3 1.000000 2.000000 3.000000
1
0 1.712754
1 2.000000
2 4.000000
3 2.000000
0 1 2
0 -0.390745 1.712754 -0.156704
1 1.000000 2.000000 -100.000000
2 -100.000000 4.000000 -100.000000
3 1.000000 2.000000 3.000000
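Dropping every row that contains a NaN, or filling with one constant, is often too blunt. The following sketch, on made-up data, counts NaN per column, keeps rows that have enough valid values via the thresh parameter, and fills each column with its own mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, np.nan], [np.nan, 4.0, np.nan], [1.0, 2.0, 3.0]],
    columns=["a", "b", "c"],
)

# Count the NaN values in each column.
nan_counts = df.isnull().sum()
print(nan_counts)

# Keep only the rows that have at least 2 non-NaN values.
kept = df.dropna(thresh=2)
print(kept)

# Fill each column's NaN with that column's mean (NaN is skipped by mean()).
filled = df.fillna(df.mean())
print(filled)
```

Which strategy fits depends on the data; filling with the mean is only one common choice, not a general recommendation.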