目录
一、数学和统计方法
下表中的大多数方法可用于Series和DataFrame
方法 | 描述 | 方法 | 描述 |
---|---|---|---|
count() | 统计非空值、非NaN的数量 | skew() | 样本偏度 |
sum() | 和 | kurt() | 样本峰度 |
mean() | 平均值 | quantile(q) | 统计分位数,q表示样本所在位置 |
median() | 中值,50%分位数 | apply() | Generic apply |
min() | 最小值 | cov() | Unbiased covariance (binary) |
max() | 最大值 | corr() | Correlation (binary) |
std() | 标准差 | ||
var() | 方差 | ||
cumsum() | 累计和 | cummin() | 累计最小值 |
cumprod() | 累计积 | cummax() | 累计最大值 |
isin(arr) | 判断每个值是否属于数组arr中的元素 | ||
unique() | 得到唯一值数组,仅支持Series | ||
value_counts() | 值计数,仅支持Series |
df = pd.DataFrame({"value1": np.arange(6),
"value2": np.random.rand(6) * 10})
'''
value1 value2
0 0 6.635426
1 1 2.356701
2 2 4.686041
3 3 1.935749
4 4 2.595264
5 5 3.502797
'''
df.count()
'''
value1 6
value2 6
dtype: int64
'''
df.sum()
'''
value1 15.000000
value2 21.711979
dtype: float64
'''
df.mean()
'''
value1 2.500000
value2 3.618663
dtype: float64
'''
df.median()
'''
value1 2.500000
value2 3.049031
dtype: float64
'''
df.min()
'''
value1 0.000000
value2 1.935749
dtype: float64
'''
df['value2'].max() # 6.63542632315
df.std()
'''
value1 1.870829
value2 1.773453
dtype: float64
'''
df.var()
'''
value1 3.500000
value2 3.145134
dtype: float64
'''
df.skew()
'''
value1 0.000000
value2 1.119899
dtype: float64
'''
df.kurt()
'''
value1 -1.20000
value2 0.52777
dtype: float64
'''
df.quantile(q=0.75)
'''
value1 3.75000
value2 4.39023
dtype: float64
'''
df.cov()
'''
value1 value2
value1 3.500000 -1.769775
value2 -1.769775 3.145135
'''
df.corr()
'''
value1 value2
value1 1.000000 -0.533414
value2 -0.533414 1.000000
'''
df.cumsum()
'''
value1 value2
0 0.0 6.635426
1 1.0 8.992127
2 3.0 13.678168
3 6.0 15.613917
4 10.0 18.209181
5 15.0 21.711978
'''
df.cumprod()
'''
value1 value2
0 0.0 6.635426
1 0.0 15.637715
2 0.0 73.278974
3 0.0 141.849701
4 0.0 368.137422
5 0.0 1289.510657
'''
df.cummin()
'''
value1 value2
0 0.0 6.635426
1 0.0 2.356701
2 0.0 2.356701
3 0.0 1.935749
4 0.0 1.935749
5 0.0 1.935749
'''
df.cummax()
'''
value1 value2
0 0.0 6.635426
1 1.0 6.635426
2 2.0 6.635426
3 3.0 6.635426
4 4.0 6.635426
5 5.0 6.635426
'''
只统计数字列,可以通过索引单独统计一列。
skipna参数:是否忽略NaN,默认为True;值为False时,有NaN的列统计结果为NaN。
axis参数:默认为0,以列来计算,axis=1,以行来计算。
df = pd.DataFrame({'key1':[4, 5, 3, np.nan, 2],
'key2':[1, 2, 3, 'j', 'k']},
index = ['a', 'b', 'c', 'd', 'e'])
'''
key1 key2
a 4.0 1
b 5.0 2
c 3.0 3
d NaN j
e 2.0 k
'''
df['key1'].dtype # float64
df['key2'].dtype # object
df.mean()
'''
key1 3.5
dtype: float64
'''
df["key1"].mean() # 3.5
df.mean(skipna = False)
'''
key1 NaN
dtype: float64
'''
df.mean(axis = 1)
'''
a 4.0
b 5.0
c 3.0
d NaN
e 2.0
dtype: float64
'''
isin()
s = pd.Series(np.arange(3,6))
'''
0 3
1 4
2 5
dtype: int32
'''
s.isin([4,10])
'''
0 False
1 True
2 False
dtype: bool
'''
df = pd.DataFrame({'key1': list('asdcbv'),
'key2': np.arange(4, 10)})
'''
key1 key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
'''
df.isin(["s", "sd", "v", 5, "9"])
'''
key1 key2
0 False False
1 True True
2 False False
3 False False
4 False False
5 True False
'''
unique()和value_counts()
s = pd.Series(list("fdsgfdgdgsagaaew"))
s.unique() # 得到唯一值组成的数组
# ['f' 'd' 's' 'g' 'a' 'e' 'w']
# <class 'numpy.ndarray'>
pd.Series(s.unique()) # 通过pd.Series重新变成新的Series
'''
0 f
1 d
2 s
3 g
4 a
5 e
6 w
dtype: object
'''
s.value_counts()
'''
g 4
a 3
d 3
s 2
f 2
w 1
e 1
dtype: int64
'''
s.value_counts(sort = False)
'''
d 3
e 1
g 4
a 3
w 1
f 2
s 2
dtype: int64
'''
二、字符串方法
pandas中操作字符串的方法:
- 通过.str调用字符串方法,基本上和python内建的字符串方法同名;
- 可以对Series使用,Dataframe可以提取Series,如df.columns或df[“key1”]等;
- 自动跳过缺失/NA值。
方法 | 描述 |
---|---|
[i]、.get(i) | 获取字符串(数组也支持)第i个元素,[]中可以是切片获取多个位置的元素 |
len() | 计算字符串长度 |
count() | 对符合规则的进行计数 |
upper()、lower() | 转为大、小写字母 |
capitalize() | 首字母大写 |
strip()、lstrip()、rstrip() | 去掉两端、左、右空格 |
startswith()、endswith() | 以某规则开始、结束 |
replace( , , n=i) | 将匹配到的子串或正则表达式替换为另外的字符串,n为替换个数 |
split( , expand=False/True, n=i)、rsplit() | 基于分隔符切分、逆向切分字符串,expand是否拓展为DataFrame,n限制分割次数 |
s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])
df = pd.DataFrame({"key1": ["A ", "b", "vf sc ", np.nan, " ffj", "123 "],
"key2 ":list("hijkmn")})
s.str.len()
'''
0 1.0
1 1.0
2 1.0
3 7.0
4 3.0
5 NaN
6 2.0
dtype: float64
'''
df["key1"].str.count('f')
'''
0 0.0
1 0.0
2 1.0
3 NaN
4 2.0
5 0.0
Name: key1, dtype: float64
'''
df["key1"].str.startswith('v')
'''
0 False
1 False
2 True
3 NaN
4 False
5 False
Name: key1, dtype: object
'''
df.columns = df.columns.str.rstrip().str.upper() # Index(['KEY1', 'KEY2'], dtype='object')
df["KEY1"] = df["KEY1"].str.strip()
df
'''
KEY1 KEY2
0 A h
1 b i
2 vf sc j
3 NaN k
4 ffj m
5 123 n
'''
[]和get()
s = pd.Series(['ab,b,c', '1,2,3', 'C', ['a,,,c','a,b'], np.nan])
s.str[1]
'''
0 b
1 ,
2 NaN
3 a,b
4 NaN
dtype: object
'''
s.str.get(1)
'''
0 b
1 ,
2 NaN
3 a,b
4 NaN
dtype: object
'''
s.str[0::2]
'''
0 a,,
1 123
2 C
3 [a,,,c]
4 NaN
dtype: object
'''
replace()
s = pd.Series(['A', 'b', 'bbb hello', '123 4', 'hj'])
s.str.replace(" ", "_")
'''
0 A
1 b
2 bbb_hello
3 123_4
4 hj
dtype: object
'''
s.str.replace("b", "C", n = 1)
'''
0 A
1 C
2 Cbb hello
3 123 4
4 hj
dtype: object
'''
split()和rsplit()
s = pd.Series(['ab,b,c', '1,2,3', 'C', ['a,,,c','a,b'], np.nan])
s.str.split(",")
'''
0 [ab, b, c]
1 [1, 2, 3]
2 [C]
3 NaN
4 NaN
dtype: object
'''
s.str.split(",").str[1]
'''
0 b
1 2
2 NaN
3 NaN
4 NaN
dtype: object
'''
s.str.rsplit(",")
'''
0 [ab, b, c]
1 [1, 2, 3]
2 [C]
3 NaN
4 NaN
dtype: object
'''
s.str.split(",", n = 1)
'''
0 [ab, b,c]
1 [1, 2,3]
2 [C]
3 NaN
4 NaN
dtype: object
'''
s.str.split(",", expand = True)
'''
0 1 2
0 ab b c
1 1 2 3
2 C None None
3 NaN None None
4 NaN None None
'''
s.str.split(",", expand = True, n = 1)
'''
0 1
0 ab b,c
1 1 2,3
2 C None
3 NaN None
4 NaN None
'''
s.str.rsplit(",", expand = True, n = 1)
'''
0 1
0 ab,b c
1 1,2 3
2 C None
3 NaN None
4 NaN None
'''
三、合并 merge()、join()
pd.merge(left, right, how=‘inner’, on=None, sort=True,
left_on=None, right_on=None,
left_index=False, right_index=False,
suffixes=(’_x’, ‘_y’), copy=True, indicator=False)
left:第一个df
right:第二个df
on:合并的参考键,多个写成[“key1”, “key2”]数组形式
how:合并方式
- inner,取交集(默认)
- outer,取并集,数据缺失为NaN
- left,按照left为参考合并,数据缺失为NaN
- right,按照right为参考合并,数据缺失为NaN
left_on, right_on, left_index, right_index:当两个DataFrame的键不同时,可以单独设置左键与右键,可以相互结合left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index
sort:按字典顺序通过连接键对结果DataFrame进行排序。默认为False,False时大幅提高性能。
尽量不通过sort排序,有需要时直接用Dataframe的排序方法sort_values和sort_index。
suffixes:当两个DataFrame中包含相同的index且该index不为连接键,给该index加后缀加以区分。
merge()与SQL等关系数据库、Excel的vlookup类似。
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
df1
'''
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
'''
df2
'''
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3
'''
pd.merge(df1, df2, on = "key")
'''
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
'''
df3
'''
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K0
3 A3 B3 K2 K1
'''
df4
'''
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K0
2 C2 D2 K1 K0
3 C3 D3 K2 K0
'''
pd.merge(df3, df4, on = ['key1', 'key2'])
pd.merge(df3, df4, on = ['key1', 'key2'], how = "inner")
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "outer")
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "left")
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "right")
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
'''
df1 = pd.DataFrame({'lkey': list('bbacaab'),
'data1': range(7)},
index = list('abcdefg'))
df2 = pd.DataFrame({'rkey': list('abd'),
'data2': range(2, 5)})
df1
'''
data1 lkey
a 0 b
b 1 b
c 2 a
d 3 c
e 4 a
f 5 a
g 6 b
'''
df2
'''
data2 rkey
0 2 a
1 3 b
2 4 d
'''
pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey')
'''
data1 lkey data2 rkey
0 0 b 3 b
1 1 b 3 b
2 6 b 3 b
3 2 a 2 a
4 4 a 2 a
5 5 a 2 a
'''
pd.merge(df1, df2, left_index = True, right_on = "rkey")
'''
data1 lkey data2 rkey
0 0 b 2 a
1 1 b 3 b
2 3 c 4 d
'''
pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey', sort = True)
'''
data1 lkey data2 rkey
0 2 a 2 a
1 4 a 2 a
2 5 a 2 a
3 0 b 3 b
4 1 b 3 b
5 6 b 3 b
'''
pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey').sort_values("data1")
'''
data1 lkey data2 rkey
0 0 b 3 b
1 1 b 3 b
3 2 a 2 a
4 4 a 2 a
5 5 a 2 a
2 6 b 3 b
'''
join():直接通过索引连接。
df1.join(df2)相当于pd.merge(df1, df2, left_index = True, right_index = True, how = “left”)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'KEY': ['K1', 'K2', 'K3']},
index = ['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index = ['K0', 'K2', 'K3'])
df1
'''
A B KEY
K0 A0 B0 K1
K1 A1 B1 K2
K2 A2 B2 K3
'''
df2
'''
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
'''
df1.join(df2)
# pd.merge(df1, df2, left_index = True, right_index = True, how = "left")
'''
A B KEY C D
K0 A0 B0 K1 C0 D0
K1 A1 B1 K2 NaN NaN
K2 A2 B2 K3 C2 D2
'''
df1.join(df2, how = "inner")
'''
A B KEY C D
K0 A0 B0 K1 C0 D0
K2 A2 B2 K3 C2 D2
'''
df1.join(df2, on = "KEY")
# pd.merge(df1, df2, left_on = "KEY", right_index = True, how = "left")
'''
A B KEY C D
K0 A0 B0 K1 NaN NaN
K1 A1 B1 K2 C2 D2
K2 A2 B2 K3 C3 D3
'''
四、连接与修补 concat()、combine_first()、update()
pd.concat(objs, axis=0, join=‘outer’, join_axes=None, keys=None,
ignore_index=False, levels=None, names=None, verify_integrity=False, copy=True)
objs:待连接的Series或DataFrame组成的数组
axis:连接的方向,默认为0,表示行与行连接;可设置为1,表示列与列连接
join:连接方式,如何处理在其它轴的索引
- outer,并集(默认)
- inner,交集
join_axes:指定连接后的index
keys:作为最外层索引
s1 = pd.Series([1, 2, 3], index = ['a', 'b', 'c'])
'''
a 1
b 2
c 3
dtype: int64
'''
s2 = pd.Series([2, 3, 4], index = ['b', 'c', 'd'])
'''
b 2
c 3
d 4
dtype: int64
'''
pd.concat([s1, s2])
'''
a 1
b 2
c 3
b 2
c 3
d 4
dtype: int64
'''
pd.concat([s1, s2], axis = 1)
'''
0 1
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
'''
pd.concat([s1, s2], axis = 1, join = "inner")
'''
0 1
b 2 2
c 3 3
'''
pd.concat([s1, s2], axis = 1, join_axes = [["a", "c"]])
'''
0 1
a 1 NaN
c 3 3.0
'''
res1 = pd.concat([s1, s2], keys = ["one", "two"]) # <class 'pandas.core.series.Series'>
'''
one a 1
b 2
c 3
two b 2
c 3
d 4
dtype: int64
'''
res1.index
'''
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
'''
res1.loc["one"]
'''
a 1
b 2
c 3
dtype: int64
'''
res1["one"]["c"] # 3
res2 = pd.concat([s1, s2], axis = 1, keys = ["one", "two"]) # <class 'pandas.core.frame.DataFrame'>
'''
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
'''
res2.index # Index(['a', 'b', 'c', 'd'], dtype='object')
combine_first():根据index,df1的空值被df2替代;如果df2的index多于df1,则更新到df1上。
update():相同index位置,df2覆盖df1;如果df2的index多于df1,不会更新到df1。
df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan], [np.nan, 7., np.nan]])
'''
0 1 2
0 NaN 3.0 5.0
1 -4.6 NaN NaN
2 NaN 7.0 NaN
'''
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4], [1., 2., 3.]], index=[1, 2, 3])
'''
0 1 2
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
3 1.0 2.0 3.0
'''
df1.combine_first(df2)
'''
0 1 2
0 NaN 3.0 5.0
1 -4.6 NaN -8.2
2 -5.0 7.0 4.0
3 1.0 2.0 3.0
'''
df1.update(df2) # 返回值为None,更改了df1本身
df1
'''
0 1 2
0 NaN 3.0 5.0
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
'''
五、去重与替换 duplicated()、drop_duplicates()、replace()
duplicated():返回布尔型Series,通过布尔判断,得到不重复的值
drop_duplicates():得到不重复的值,inplace参数是否替换原值,默认False
s = pd.Series([1, 3, 2, 2, 1, 1, 2, 2, 3, 4, 5, 5])
s.duplicated()
'''
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 False
10 False
11 True
dtype: bool
'''
s[s.duplicated() == False] # 通过布尔判断,得到不重复的值
'''
0 1
1 3
2 2
9 4
10 5
dtype: int64
'''
df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
'key2': ['a', 'a', 'b', 'b', 'c']})
df.duplicated() # 判断行与行之间是否重复
'''
0 False
1 True
2 False
3 False
4 False
dtype: bool
'''
s.drop_duplicates()
'''
0 1
1 3
2 2
9 4
10 5
dtype: int64
'''
replace():可传入列表或字典,一次性替换一个值或多个值。
s = pd.Series(list('ascaasd'))
s.replace('a', np.nan) # 一对一
'''
0 NaN
1 s
2 c
3 NaN
4 NaN
5 s
6 d
dtype: object
'''
s.replace(['a', 's'], np.nan) # 多对一
'''
0 NaN
1 NaN
2 c
3 NaN
4 NaN
5 NaN
6 d
dtype: object
'''
s.replace({'a': 'Hello', 's': 123}) # 多对多
'''
0 Hello
1 123
2 c
3 Hello
4 Hello
5 123
6 d
dtype: object
'''