[三] 2 数据分析工具：Pandas -- 运算

最新推荐文章于 2024-06-06 18:07:58 发布

数佳

最新推荐文章于 2024-06-06 18:07:58 发布

阅读量140

点赞数

分类专栏：数据分析文章标签： pandas .str merge concat replace

本文链接：https://blog.csdn.net/yaocong1993/article/details/114596812

版权

数据分析专栏收录该内容

15 篇文章 1 订阅

订阅专栏

一、数学和统计方法

Pandas计算工具

下表中的大多数方法可用于Series和DataFrame

方法	描述	方法	描述
count()	统计非空值、非NaN的数量	skew()	样本偏度
sum()	和	kurt()	样本峰度
mean()	平均值	quantile(q)	统计分位数，q表示样本所在位置
median()	中值，50%分位数	apply()	Generic apply
min()	最小值	cov()	Unbiased covariance (binary)
max()	最大值	corr()	Correlation (binary)
std()	标准差
var()	方差

cumsum()	累计和	cummin()	累计最小值
cumprod()	累计积	cummax()	累计最大值

isin(arr)	判断每个值是否属于数组arr中的元素
unique()	得到唯一值数组，仅支持Series
value_counts()	值计数，仅支持Series

df = pd.DataFrame({"value1": np.arange(6),
                  "value2": np.random.rand(6) * 10})
'''
   value1    value2
0       0  6.635426
1       1  2.356701
2       2  4.686041
3       3  1.935749
4       4  2.595264
5       5  3.502797
'''

df.count()
'''
value1    6
value2    6
dtype: int64
'''
df.sum()
'''
value1    15.000000
value2    21.711979
dtype: float64
'''
df.mean()
'''
value1    2.500000
value2    3.618663
dtype: float64
'''
df.median()
'''
value1    2.500000
value2    3.049031
dtype: float64
'''
df.min()
'''
value1    0.000000
value2    1.935749
dtype: float64
'''
df['value2'].max() # 6.63542632315
df.std()
'''
value1    1.870829
value2    1.773453
dtype: float64
'''
df.var()
'''
value1    3.500000
value2    3.145134
dtype: float64
'''
df.skew()
'''
value1    0.000000
value2    1.119899
dtype: float64
'''
df.kurt()
'''
value1   -1.20000
value2    0.52777
dtype: float64
'''
df.quantile(q=0.75)
'''
value1    3.75000
value2    4.39023
dtype: float64
'''
df.cov()
'''
          value1    value2
value1  3.500000 -1.769775
value2 -1.769775  3.145135
'''
df.corr()
'''
          value1    value2
value1  1.000000 -0.533414
value2 -0.533414  1.000000
'''

df.cumsum()
'''
   value1     value2
0     0.0   6.635426
1     1.0   8.992127
2     3.0  13.678168
3     6.0  15.613917
4    10.0  18.209181
5    15.0  21.711978
'''
df.cumprod()
'''
   value1       value2
0     0.0     6.635426
1     0.0    15.637715
2     0.0    73.278974
3     0.0   141.849701
4     0.0   368.137422
5     0.0  1289.510657
'''
df.cummin()
'''
   value1    value2
0     0.0  6.635426
1     0.0  2.356701
2     0.0  2.356701
3     0.0  1.935749
4     0.0  1.935749
5     0.0  1.935749
'''
df.cummax()
'''
   value1    value2
0     0.0  6.635426
1     1.0  6.635426
2     2.0  6.635426
3     3.0  6.635426
4     4.0  6.635426
5     5.0  6.635426
'''

只统计数字列，可以通过索引单独统计一列。
skipna参数：是否忽略NaN，默认为True；值为False时，有NaN的列统计结果为NaN。
axis参数：默认为0，以列来计算，axis=1，以行来计算。

df = pd.DataFrame({'key1':[4, 5, 3, np.nan, 2],
                   'key2':[1, 2, 3, 'j', 'k']},
                  index = ['a', 'b', 'c', 'd', 'e'])
'''
   key1 key2
a   4.0    1
b   5.0    2
c   3.0    3
d   NaN    j
e   2.0    k
'''
df['key1'].dtype # float64
df['key2'].dtype # object

df.mean()
'''
key1    3.5
dtype: float64
'''
df["key1"].mean() # 3.5
df.mean(skipna = False)
'''
key1   NaN
dtype: float64
'''
df.mean(axis = 1)
'''
a    4.0
b    5.0
c    3.0
d    NaN
e    2.0
dtype: float64
'''

isin()

s = pd.Series(np.arange(3,6))
'''
0    3
1    4
2    5
dtype: int32
'''
s.isin([4,10])
'''
0    False
1     True
2    False
dtype: bool
'''

df = pd.DataFrame({'key1': list('asdcbv'),
                   'key2': np.arange(4, 10)})
'''
  key1  key2
0    a     4
1    s     5
2    d     6
3    c     7
4    b     8
5    v     9
'''
df.isin(["s", "sd", "v", 5, "9"])
'''
    key1   key2
0  False  False
1   True   True
2  False  False
3  False  False
4  False  False
5   True  False
'''

unique()和value_counts()

s = pd.Series(list("fdsgfdgdgsagaaew"))
s.unique() # 得到唯一值组成的数组
# ['f' 'd' 's' 'g' 'a' 'e' 'w']
# <class 'numpy.ndarray'>
pd.Series(s.unique()) # 通过pd.Series重新变成新的Series
'''
0    f
1    d
2    s
3    g
4    a
5    e
6    w
dtype: object
'''

s.value_counts()
'''
g    4
a    3
d    3
s    2
f    2
w    1
e    1
dtype: int64
'''
s.value_counts(sort = False)
'''
d    3
e    1
g    4
a    3
w    1
f    2
s    2
dtype: int64
'''

二、字符串方法

Pandas文本字符串方法

pandas中操作字符串的方法：

通过.str调用字符串方法，基本上和python内建的字符串方法同名；
可以对Series使用，Dataframe可以提取Series，如df.columns或df[“key1”]等；
自动跳过缺失/NA值。

方法	描述
[i]、.get(i)	获取字符串(数组也支持)第i个元素，[]中可以是切片获取多个位置的元素
len()	计算字符串长度
count()	对符合规则的进行计数
upper()、lower()	转为大、小写字母
capitalize()	首字母大写
strip()、lstrip()、rstrip()	去掉两端、左、右空格
startswith()、endswith()	以某规则开始、结束
replace( , , n=i)	将匹配到的子串或正则表达式替换为另外的字符串，n为替换个数
split( , expand=False/True, n=i)、rsplit()	基于分隔符切分、逆向切分字符串，expand是否拓展为DataFrame，n限制分割次数

s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])
df = pd.DataFrame({"key1": ["A ", "b", "vf sc ", np.nan, " ffj", "123 "],
                  "key2  ":list("hijkmn")})
s.str.len()
'''
0    1.0
1    1.0
2    1.0
3    7.0
4    3.0
5    NaN
6    2.0
dtype: float64
'''
df["key1"].str.count('f')
'''
0    0.0
1    0.0
2    1.0
3    NaN
4    2.0
5    0.0
Name: key1, dtype: float64
'''
df["key1"].str.startswith('v')
'''
0    False
1    False
2     True
3      NaN
4    False
5    False
Name: key1, dtype: object
'''
df.columns = df.columns.str.rstrip().str.upper() # Index(['KEY1', 'KEY2'], dtype='object')
df["KEY1"] = df["KEY1"].str.strip()
df
'''
    KEY1 KEY2
0      A    h
1      b    i
2  vf sc    j
3    NaN    k
4    ffj    m
5    123    n
'''

[]和get()

s = pd.Series(['ab,b,c', '1,2,3', 'C', ['a,,,c','a,b'], np.nan])
s.str[1]
'''
0      b
1      ,
2    NaN
3    a,b
4    NaN
dtype: object
'''
s.str.get(1)
'''
0      b
1      ,
2    NaN
3    a,b
4    NaN
dtype: object
'''
s.str[0::2]
'''
0        a,,
1        123
2          C
3    [a,,,c]
4        NaN
dtype: object
'''

replace()

s = pd.Series(['A', 'b', 'bbb hello', '123 4', 'hj'])
s.str.replace(" ", "_")
'''
0            A
1            b
2    bbb_hello
3        123_4
4           hj
dtype: object
'''
s.str.replace("b", "C", n = 1)
'''
0            A
1            C
2    Cbb hello
3        123 4
4           hj
dtype: object
'''

split()和rsplit()

s = pd.Series(['ab,b,c', '1,2,3', 'C', ['a,,,c','a,b'], np.nan])
s.str.split(",")
'''
0    [ab, b, c]
1     [1, 2, 3]
2           [C]
3           NaN
4           NaN
dtype: object
'''
s.str.split(",").str[1]
'''
0      b
1      2
2    NaN
3    NaN
4    NaN
dtype: object
'''
s.str.rsplit(",")
'''
0    [ab, b, c]
1     [1, 2, 3]
2           [C]
3           NaN
4           NaN
dtype: object
'''

s.str.split(",", n = 1)
'''
0    [ab, b,c]
1     [1, 2,3]
2          [C]
3          NaN
4          NaN
dtype: object
'''
s.str.split(",", expand = True)
'''
     0     1     2
0   ab     b     c
1    1     2     3
2    C  None  None
3  NaN  None  None
4  NaN  None  None
'''
s.str.split(",", expand = True, n = 1)
'''
     0     1
0   ab   b,c
1    1   2,3
2    C  None
3  NaN  None
4  NaN  None
'''
s.str.rsplit(",", expand = True, n = 1)
'''
      0     1
0  ab,b     c
1   1,2     3
2     C  None
3   NaN  None
4   NaN  None
'''

三、合并 merge()、join()

合并merge()

pd.merge(left, right, how=‘inner’, on=None, sort=True,
left_on=None, right_on=None,
left_index=False, right_index=False,
suffixes=(’_x’, ‘_y’), copy=True, indicator=False)

left：第一个df
right：第二个df

on：合并的参考键，多个写成[“key1”, “key2”]数组形式

how：合并方式

inner，取交集（默认）
outer，取并集，数据缺失为NaN
left，按照left为参考合并，数据缺失为NaN
right，按照right为参考合并，数据缺失为NaN

left_on, right_on, left_index, right_index：当两个DataFrame的键不同时，可以单独设置左键与右键，可以相互结合left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index

sort：按字典顺序通过连接键对结果DataFrame进行排序。默认为False，False时大幅提高性能。
尽量不通过sort排序，有需要时直接用Dataframe的排序方法sort_values和sort_index。

suffixes：当两个DataFrame中包含相同的index且该index不为连接键，给该index加后缀加以区分。

merge()与SQL等关系数据库、Excel的vlookup类似。

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df1
'''
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3
'''
df2
'''
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3
'''
pd.merge(df1, df2, on = "key")
'''
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
'''

df3
'''
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K2   K1
'''
df4
'''
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0
'''
pd.merge(df3, df4, on = ['key1', 'key2'])
pd.merge(df3, df4, on = ['key1', 'key2'], how = "inner")
'''
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "outer")
'''
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "left")
'''
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A2  B2   K1   K0   C2   D2
4  A3  B3   K2   K1  NaN  NaN
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "right")
'''
     A    B key1 key2   C   D
0   A0   B0   K0   K0  C0  D0
1   A2   B2   K1   K0  C1  D1
2   A2   B2   K1   K0  C2  D2
3  NaN  NaN   K2   K0  C3  D3
'''

df1 = pd.DataFrame({'lkey': list('bbacaab'),
                    'data1': range(7)},
                   index = list('abcdefg'))
df2 = pd.DataFrame({'rkey': list('abd'),
                   'data2': range(2, 5)})
df1
'''
   data1 lkey
a      0    b
b      1    b
c      2    a
d      3    c
e      4    a
f      5    a
g      6    b
'''
df2
'''
   data2 rkey
0      2    a
1      3    b
2      4    d
'''

pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey')
'''
   data1 lkey  data2 rkey
0      0    b      3    b
1      1    b      3    b
2      6    b      3    b
3      2    a      2    a
4      4    a      2    a
5      5    a      2    a
'''
pd.merge(df1, df2, left_index = True, right_on = "rkey")
'''
   data1 lkey  data2 rkey
0      0    b      2    a
1      1    b      3    b
2      3    c      4    d
'''

pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey', sort = True)
'''
   data1 lkey  data2 rkey
0      2    a      2    a
1      4    a      2    a
2      5    a      2    a
3      0    b      3    b
4      1    b      3    b
5      6    b      3    b
'''
pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey').sort_values("data1")
'''
   data1 lkey  data2 rkey
0      0    b      3    b
1      1    b      3    b
3      2    a      2    a
4      4    a      2    a
5      5    a      2    a
2      6    b      3    b
'''

join()：直接通过索引连接。
df1.join(df2)相当于pd.merge(df1, df2, left_index = True, right_index = True, how = “left”)

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'KEY': ['K1', 'K2', 'K3']},
                    index = ['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                    index = ['K0', 'K2', 'K3'])
df1
'''
     A   B KEY
K0  A0  B0  K1
K1  A1  B1  K2
K2  A2  B2  K3
'''
df2
'''
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
'''
df1.join(df2)
# pd.merge(df1, df2, left_index = True, right_index = True, how = "left")
'''
     A   B KEY    C    D
K0  A0  B0  K1   C0   D0
K1  A1  B1  K2  NaN  NaN
K2  A2  B2  K3   C2   D2
'''
df1.join(df2, how = "inner")
'''
     A   B KEY   C   D
K0  A0  B0  K1  C0  D0
K2  A2  B2  K3  C2  D2
'''
df1.join(df2, on = "KEY")
# pd.merge(df1, df2, left_on = "KEY", right_index = True, how = "left")
'''
     A   B KEY    C    D
K0  A0  B0  K1  NaN  NaN
K1  A1  B1  K2   C2   D2
K2  A2  B2  K3   C3   D3
'''

四、连接与修补 concat()、combine_first()、update()

连接concat()

pd.concat(objs, axis=0, join=‘outer’, join_axes=None, keys=None,
ignore_index=False, levels=None, names=None, verify_integrity=False, copy=True)

objs：待连接的Series或DataFrame组成的数组

axis：连接的方向，默认为0，表示行与行连接；可设置为1，表示列与列连接

join：连接方式，如何处理在其它轴的索引

outer，并集（默认）
inner，交集

join_axes：指定连接后的index

keys：作为最外层索引

s1 = pd.Series([1, 2, 3], index = ['a', 'b', 'c'])
'''
a    1
b    2
c    3
dtype: int64
'''
s2 = pd.Series([2, 3, 4], index = ['b', 'c', 'd'])
'''
b    2
c    3
d    4
dtype: int64
'''

pd.concat([s1, s2])
'''
a    1
b    2
c    3
b    2
c    3
d    4
dtype: int64
'''

pd.concat([s1, s2], axis = 1)
'''
     0    1
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
'''

pd.concat([s1, s2], axis = 1, join = "inner")
'''
   0  1
b  2  2
c  3  3
'''

pd.concat([s1, s2], axis = 1, join_axes = [["a", "c"]])
'''
   0    1
a  1  NaN
c  3  3.0
'''

res1 = pd.concat([s1, s2], keys = ["one", "two"]) # <class 'pandas.core.series.Series'>
'''
one  a    1
     b    2
     c    3
two  b    2
     c    3
     d    4
dtype: int64
'''
res1.index
'''
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
'''
res1.loc["one"]
'''
a    1
b    2
c    3
dtype: int64
'''
res1["one"]["c"] # 3

res2 = pd.concat([s1, s2], axis = 1, keys = ["one", "two"]) # <class 'pandas.core.frame.DataFrame'>
'''
   one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
'''
res2.index # Index(['a', 'b', 'c', 'd'], dtype='object')

combine_first()：根据index，df1的空值被df2替代；如果df2的index多于df1，则更新到df1上。

update()：相同index位置，df2覆盖df1；如果df2的index多于df1，不会更新到df1。

df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan], [np.nan, 7., np.nan]])
'''
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN  NaN
2  NaN  7.0  NaN
'''
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4], [1., 2., 3.]], index=[1, 2, 3])
'''
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
3   1.0  2.0  3.0
'''
df1.combine_first(df2)
'''
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0
3  1.0  2.0  3.0
'''

df1.update(df2) # 返回值为None，更改了df1本身
df1
'''
      0    1    2
0   NaN  3.0  5.0
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
'''

五、去重与替换 duplicated()、drop_duplicates()、replace()

duplicated()：返回布尔型Series，通过布尔判断，得到不重复的值
drop_duplicates()：得到不重复的值，inplace参数是否替换原值，默认False

s = pd.Series([1, 3, 2, 2, 1, 1, 2, 2, 3, 4, 5, 5])
s.duplicated()
'''
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9     False
10    False
11     True
dtype: bool
'''
s[s.duplicated() == False] # 通过布尔判断，得到不重复的值
'''
0     1
1     3
2     2
9     4
10    5
dtype: int64
'''

df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
                   'key2': ['a', 'a', 'b', 'b', 'c']})
df.duplicated() # 判断行与行之间是否重复
'''
0    False
1     True
2    False
3    False
4    False
dtype: bool
'''

s.drop_duplicates()
'''
0     1
1     3
2     2
9     4
10    5
dtype: int64
'''

replace()：可传入列表或字典，一次性替换一个值或多个值。

s = pd.Series(list('ascaasd'))
s.replace('a', np.nan) # 一对一
'''
0    NaN
1      s
2      c
3    NaN
4    NaN
5      s
6      d
dtype: object
'''
s.replace(['a', 's'], np.nan) # 多对一
'''
0    NaN
1    NaN
2      c
3    NaN
4    NaN
5    NaN
6      d
dtype: object
'''
s.replace({'a': 'Hello', 's': 123}) # 多对多
'''
0    Hello
1      123
2        c
3    Hello
4    Hello
5      123
6        d
dtype: object
'''