Summary of Pandas Usage Tips and Techniques
1. Mathematical Calculations and Statistics Basics
(1) Basic parameters: axis and skipna
# Basic parameters: axis, skipna
import numpy as np
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'key1':[4,5,3,np.nan,2],
                   'key2':[1,2,np.nan,4,5],
                   'key3':[1,2,3,'j','k']},
                  index = ['a','b','c','d','e'])
print(df)
print(df['key1'],df['key1'].dtype)
print(df['key2'],df['key2'].dtype)
print(df['key3'],df['key3'].dtype)
print('---------------------')
data = df.mean()
print(data)
# .mean() computes the average; only numeric columns are included
# Computed column-wise by default; since key3 contains strings,
# the result only covers key1 and key2
# (note: pandas 2.0+ raises a TypeError here instead of skipping;
#  use df.mean(numeric_only=True) on newer versions)
# To compute the statistic for a single column, index that column first,
# e.g. just the key2 column:
data2 = df['key2'].mean()
print(data2)
print('---------------------')
# The axis parameter defaults to 0, computing column-wise
# axis=1 computes row-wise instead
data3 = df.mean(axis=1)
print(data3)
print('---------------------')
# As data3 shows, NaN values in the DataFrame are skipped by default
# The skipna parameter controls whether NaN is ignored:
# skipna=True (default) → ignore NaN
# skipna=False → keep NaN; any row/column containing NaN yields NaN
data4 = df.mean(skipna=False)
data5 = df.mean(axis=1,skipna=False)
print(data4)
print(data5)
Output:
key1 key2 key3
a 4.0 1.0 1
b 5.0 2.0 2
c 3.0 NaN 3
d NaN 4.0 j
e 2.0 5.0 k
a 4.0
b 5.0
c 3.0
d NaN
e 2.0
Name: key1, dtype: float64 float64
a 1.0
b 2.0
c NaN
d 4.0
e 5.0
Name: key2, dtype: float64 float64
a 1
b 2
c 3
d j
e k
Name: key3, dtype: object object
---------------------
key1 3.5
key2 3.0
dtype: float64
3.0
---------------------
a 2.5
b 3.5
c 3.0
d 4.0
e 3.5
dtype: float64
---------------------
key1 NaN
key2 NaN
dtype: float64
a 2.5
b 3.5
c NaN
d NaN
e 3.5
dtype: float64
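The skip-non-numeric behavior shown above comes from older pandas: since pandas 2.0, calling `df.mean()` on a DataFrame with an object column such as key3 raises a TypeError. A minimal sketch of the modern equivalent, assuming pandas >= 2.0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': [4, 5, 3, np.nan, 2],
                   'key2': [1, 2, np.nan, 4, 5],
                   'key3': [1, 2, 3, 'j', 'k']},
                  index=['a', 'b', 'c', 'd', 'e'])

# numeric_only=True restricts the computation to numeric columns,
# reproducing the old "silently drop string columns" behavior
data = df.mean(numeric_only=True)
print(data)   # key1 3.5, key2 3.0
```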
(2) Main mathematical calculation methods
--->>> Math methods ①
# Main mathematical calculation methods ①
# Applicable to both Series and DataFrame
df = pd.DataFrame({'key1':np.arange(10),
                   'key2':np.random.rand(10)*10})
print(df)
print('-----------------')
print(df.count(),'→ count: number of non-NaN values\n')
print(df.min(),'→ min: minimum value\n',df['key2'].max(),'→ max: maximum value\n')
print(df.quantile(q=0.75),'→ quantile: quantile; parameter q sets the position\n')
print(df.sum(),'→ sum: total\n')
print(df.mean(),'→ mean: average\n')
print(df.median(),'→ median: arithmetic median (50% quantile)\n')
print(df.std(),'\n',df.var(),'→ std, var: standard deviation and variance\n')
print(df.skew(),'→ skew: sample skewness\n')
print(df.kurt(),'→ kurt: sample kurtosis\n')
Output:
key1 key2
0 0 4.253328
1 1 0.960379
2 2 3.511730
3 3 2.509393
4 4 0.054089
5 5 8.111823
6 6 4.844677
7 7 3.198498
8 8 3.034000
9 9 3.702352
-----------------
key1 10
key2 10
dtype: int64 → count: number of non-NaN values
key1 0.000000
key2 0.054089
dtype: float64 → min: minimum value
8.111822673021239 → max: maximum value
key1 6.750000
key2 4.115584
Name: 0.75, dtype: float64 → quantile: quantile; parameter q sets the position
key1 45.000000
key2 34.180269
dtype: float64 → sum: total
key1 4.500000
key2 3.418027
dtype: float64 → mean: average
key1 4.500000
key2 3.355114
dtype: float64 → median: arithmetic median (50% quantile)
key1 3.027650
key2 2.191696
dtype: float64
key1 9.166667
key2 4.803533
dtype: float64 → std, var: standard deviation and variance
key1 0.000000
key2 0.701767
dtype: float64 → skew: sample skewness
key1 -1.200000
key2 1.858886
dtype: float64 → kurt: sample kurtosis
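Most of the statistics above can also be produced in a single call with `.describe()`, which returns count, mean, std, min, the 25%/50%/75% quantiles, and max for each numeric column. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': np.arange(10),
                   'key2': np.random.rand(10) * 10})

# .describe() bundles count / mean / std / min / quantiles / max per column
summary = df.describe()
print(summary)
print(summary.loc['50%', 'key1'])   # median of key1 → 4.5
```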
--->>> Math methods ②
# Main mathematical calculation methods ②
# Applicable to both Series and DataFrame (2)
df = pd.DataFrame({'key1':np.arange(10),
                   'key2':np.random.rand(10)*10})
print(df)
print('-----------------')
# .cumsum() computes the cumulative sum
df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum: cumulative sum\n')
print('-----------------')
# .cumprod() computes the cumulative product
df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod: cumulative product\n')
print('-----------------')
# .cummax() returns the running maximum: once a value becomes the largest
# seen so far, it is carried forward until a larger one appears
# .cummin() returns the running minimum, analogously
print(df.cummax(),'\n',df.cummin(),'→ cummax, cummin: running maximum and minimum\n')
# The key1 and key2 columns are carried forward in the same way
Output:
key1 key2
0 0 6.782298
1 1 8.826684
2 2 5.330644
3 3 1.284093
4 4 2.040580
5 5 5.194812
6 6 4.981178
7 7 3.467392
8 8 6.802496
9 9 1.212725
-----------------
key1 key2 key1_s key2_s
0 0 6.782298 0 6.782298
1 1 8.826684 1 15.608982
2 2 5.330644 3 20.939626
3 3 1.284093 6 22.223719
4 4 2.040580 10 24.264299
5 5 5.194812 15 29.459111
6 6 4.981178 21 34.440289
7 7 3.467392 28 37.907680
8 8 6.802496 36 44.710176
9 9 1.212725 45 45.922902 → cumsum: cumulative sum
-----------------
key1 key2 key1_s key2_s key1_p key2_p
0 0 6.782298 0 6.782298 0 6.782298
1 1 8.826684 1 15.608982 0 59.865201
2 2 5.330644 3 20.939626 0 319.120053
3 3 1.284093 6 22.223719 0 409.779981
4 4 2.040580 10 24.264299 0 836.188665
5 5 5.194812 15 29.459111 0 4343.843239
6 6 4.981178 21 34.440289 0 21637.455452
7 7 3.467392 28 37.907680 0 75025.529833
8 8 6.802496 36 44.710176 0 510360.865310
9 9 1.212725 45 45.922902 0 618927.582201 → cumprod: cumulative product
-----------------
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 6.782298 0.0 6.782298 0.0 6.782298
1 1.0 8.826684 1.0 15.608982 0.0 59.865201
2 2.0 8.826684 3.0 20.939626 0.0 319.120053
3 3.0 8.826684 6.0 22.223719 0.0 409.779981
4 4.0 8.826684 10.0 24.264299 0.0 836.188665
5 5.0 8.826684 15.0 29.459111 0.0 4343.843239
6 6.0 8.826684 21.0 34.440289 0.0 21637.455452
7 7.0 8.826684 28.0 37.907680 0.0 75025.529833
8 8.0 8.826684 36.0 44.710176 0.0 510360.865310
9 9.0 8.826684 45.0 45.922902 0.0 618927.582201
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 6.782298 0.0 6.782298 0.0 6.782298
1 0.0 6.782298 0.0 6.782298 0.0 6.782298
2 0.0 5.330644 0.0 6.782298 0.0 6.782298
3 0.0 1.284093 0.0 6.782298 0.0 6.782298
4 0.0 1.284093 0.0 6.782298 0.0 6.782298
5 0.0 1.284093 0.0 6.782298 0.0 6.782298
6 0.0 1.284093 0.0 6.782298 0.0 6.782298
7 0.0 1.284093 0.0 6.782298 0.0 6.782298
8 0.0 1.284093 0.0 6.782298 0.0 6.782298
9 0.0 1.212725 0.0 6.782298 0.0 6.782298 → cummax, cummin: running maximum and minimum
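The "carried forward" behavior is easiest to see on a short Series. A minimal sketch:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])
# running maximum: each position holds the largest value seen so far
print(s.cummax().tolist())   # [3, 3, 4, 4, 5]
# running minimum: each position holds the smallest value seen so far
print(s.cummin().tolist())   # [3, 1, 1, 1, 1]
```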
(3) Unique values: .unique()
# Unique values: .unique()
# Create a Series
s = pd.Series(list('asdvasdcfgg'))
print(s)
print('---------------')
# sq holds the unique values; note that .unique() returns
# a NumPy ndarray, not a Series
sq = s.unique()
print(sq,type(sq))
# To turn the unique values back into a Series,
# pass the array of unique values to pd.Series
s2 = pd.Series(sq)
print(s2)
# Sort the array in place
sq.sort()
print(sq)
Output:
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
---------------
['a' 's' 'd' 'v' 'c' 'f' 'g'] <class 'numpy.ndarray'>
0 a
1 s
2 d
3 v
4 c
5 f
6 g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']
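If only the number of distinct values is needed, `.nunique()` returns it directly without materializing the array:

```python
import pandas as pd

s = pd.Series(list('asdvasdcfgg'))
# .nunique() counts distinct values (NaN excluded by default)
print(s.nunique())   # 7 distinct values: a, s, d, v, c, f, g
```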
(4) Counting values: .value_counts()
# Value counts: .value_counts()
s = pd.Series(list('asdvasdcfgg'))
print(s)
print('-----------------')
# Count how many times each value occurs in s
sc = s.value_counts()
print(sc)
sc2 = s.value_counts(sort=False)
print(sc2)
# Returns a new Series giving the count of each distinct value
# sort parameter: sort by count, True by default
print('-----------------')
print(s)
Output:
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
-----------------
a 2
s 2
g 2
d 2
f 1
c 1
v 1
dtype: int64
v 1
d 2
g 2
c 1
f 1
s 2
a 2
dtype: int64
-----------------
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
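To get relative frequencies rather than raw counts, pass `normalize=True`:

```python
import pandas as pd

s = pd.Series(list('asdvasdcfgg'))
# normalize=True divides each count by the total number of values
freq = s.value_counts(normalize=True)
print(freq['a'])   # 2/11, since 'a' occurs twice among 11 values
```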
(5) Membership test: .isin()
# Membership: .isin()
s = pd.Series(np.arange(10,15))
df = pd.DataFrame({'key1':list('asdcbvasd'),
                   'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')
print(s.isin([5,14])) # check, element-wise, whether each value of s is in [5, 14]
print(df.isin(['a','bc','10',8]))
# The candidate values are passed as a list
# Returns a boolean Series or DataFrame of the same shape
Output:
0 10
1 11
2 12
3 13
4 14
dtype: int32
key1 key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
6 a 10
7 s 11
8 d 12
-----
0 False
1 False
2 False
3 False
4 True
dtype: bool
key1 key2
0 True False
1 False False
2 False False
3 False False
4 False True
5 False False
6 True False
7 False False
8 False False
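A common use of the boolean result is row filtering: indexing a DataFrame with a column-wise `.isin()` mask keeps only the matching rows. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('asdcbvasd'),
                   'key2': np.arange(4, 13)})

# keep only the rows whose key1 is 'a' or 's'
sub = df[df['key1'].isin(['a', 's'])]
print(sub)   # rows 0, 1, 6, 7
```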
2. Common Operations on Text Data
Pandas provides a set of string methods that make it easy to operate on each element of an array.
(1) Accessing text data via .str
# Accessing text data via .str automatically excludes missing/NaN values
# Create a Series and a DataFrame
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print('------')
print(df)
print('----------------')
# Use .str to access text data in a Series or DataFrame column
# .str automatically skips NaN values
print(s.str.count('b')) # count occurrences of 'b' in each element of s
print(df['key2'].str.upper()) # convert the elements of column key2 to upper case
# .str also works on column labels: df.columns is an Index object, which supports .str too
df.columns = df.columns.str.upper()
print(df)
Output:
0 A
1 b
2 C
3 bbhello
4 123
5 NaN
6 hj
dtype: object
------
key1 key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
----------------
0 0.0
1 1.0
2 0.0
3 2.0
4