[三] 2 数据分析工具:Pandas -- 运算

一、数学和统计方法

Pandas计算工具

下表中的大多数方法可用于Series和DataFrame

方法描述方法描述
count()统计非空值、非NaN的数量skew()样本偏度
sum()kurt()样本峰度
mean()平均值quantile(q)统计分位数,q表示样本所在位置
median()中值,50%分位数apply()Generic apply
min()最小值cov()Unbiased covariance (binary)
max()最大值corr()Correlation (binary)
std()标准差
var()方差
cumsum()累计和cummin()累计最小值
cumprod()累计积cummax()累计最大值
isin(arr)判断每个值是否属于数组arr中的元素
unique()得到唯一值数组,仅支持Series
value_counts()值计数,仅支持Series
df = pd.DataFrame({"value1": np.arange(6),
                  "value2": np.random.rand(6) * 10})
'''
   value1    value2
0       0  6.635426
1       1  2.356701
2       2  4.686041
3       3  1.935749
4       4  2.595264
5       5  3.502797
'''

df.count()
'''
value1    6
value2    6
dtype: int64
'''
df.sum()
'''
value1    15.000000
value2    21.711979
dtype: float64
'''
df.mean()
'''
value1    2.500000
value2    3.618663
dtype: float64
'''
df.median()
'''
value1    2.500000
value2    3.049031
dtype: float64
'''
df.min()
'''
value1    0.000000
value2    1.935749
dtype: float64
'''
df['value2'].max() # 6.63542632315
df.std()
'''
value1    1.870829
value2    1.773453
dtype: float64
'''
df.var()
'''
value1    3.500000
value2    3.145134
dtype: float64
'''
df.skew()
'''
value1    0.000000
value2    1.119899
dtype: float64
'''
df.kurt()
'''
value1   -1.20000
value2    0.52777
dtype: float64
'''
df.quantile(q=0.75)
'''
value1    3.75000
value2    4.39023
dtype: float64
'''
df.cov()
'''
          value1    value2
value1  3.500000 -1.769775
value2 -1.769775  3.145135
'''
df.corr()
'''
          value1    value2
value1  1.000000 -0.533414
value2 -0.533414  1.000000
'''
df.cumsum()
'''
   value1     value2
0     0.0   6.635426
1     1.0   8.992127
2     3.0  13.678168
3     6.0  15.613917
4    10.0  18.209181
5    15.0  21.711978
'''
df.cumprod()
'''
   value1       value2
0     0.0     6.635426
1     0.0    15.637715
2     0.0    73.278974
3     0.0   141.849701
4     0.0   368.137422
5     0.0  1289.510657
'''
df.cummin()
'''
   value1    value2
0     0.0  6.635426
1     0.0  2.356701
2     0.0  2.356701
3     0.0  1.935749
4     0.0  1.935749
5     0.0  1.935749
'''
df.cummax()
'''
   value1    value2
0     0.0  6.635426
1     1.0  6.635426
2     2.0  6.635426
3     3.0  6.635426
4     4.0  6.635426
5     5.0  6.635426
'''

只统计数字列,可以通过索引单独统计一列。
skipna参数:是否忽略NaN,默认为True;值为False时,有NaN的列统计结果为NaN。
axis参数:默认为0,以列来计算,axis=1,以行来计算。

df = pd.DataFrame({'key1':[4, 5, 3, np.nan, 2],
                   'key2':[1, 2, 3, 'j', 'k']},
                  index = ['a', 'b', 'c', 'd', 'e'])
'''
   key1 key2
a   4.0    1
b   5.0    2
c   3.0    3
d   NaN    j
e   2.0    k
'''
df['key1'].dtype # float64
df['key2'].dtype # object

df.mean()
'''
key1    3.5
dtype: float64
'''
df["key1"].mean() # 3.5
df.mean(skipna = False)
'''
key1   NaN
dtype: float64
'''
df.mean(axis = 1)
'''
a    4.0
b    5.0
c    3.0
d    NaN
e    2.0
dtype: float64
'''

isin()

s = pd.Series(np.arange(3,6))
'''
0    3
1    4
2    5
dtype: int32
'''
s.isin([4,10])
'''
0    False
1     True
2    False
dtype: bool
'''

df = pd.DataFrame({'key1': list('asdcbv'),
                   'key2': np.arange(4, 10)})
'''
  key1  key2
0    a     4
1    s     5
2    d     6
3    c     7
4    b     8
5    v     9
'''
df.isin(["s", "sd", "v", 5, "9"])
'''
    key1   key2
0  False  False
1   True   True
2  False  False
3  False  False
4  False  False
5   True  False
'''

unique()和value_counts()

s = pd.Series(list("fdsgfdgdgsagaaew"))
s.unique() # 得到唯一值组成的数组
# ['f' 'd' 's' 'g' 'a' 'e' 'w']
# <class 'numpy.ndarray'>
pd.Series(s.unique()) # 通过pd.Series重新变成新的Series
'''
0    f
1    d
2    s
3    g
4    a
5    e
6    w
dtype: object
'''

s.value_counts()
'''
g    4
a    3
d    3
s    2
f    2
w    1
e    1
dtype: int64
'''
s.value_counts(sort = False)
'''
d    3
e    1
g    4
a    3
w    1
f    2
s    2
dtype: int64
'''

二、字符串方法

Pandas文本字符串方法

pandas中操作字符串的方法:

  • 通过.str调用字符串方法,基本上和python内建的字符串方法同名;
  • 可以对Series使用,Dataframe可以提取Series,如df.columns或df[“key1”]等;
  • 自动跳过缺失/NA值。
方法描述
[i]、.get(i)获取字符串(数组也支持)第i个元素,[]中可以是切片获取多个位置的元素
len()计算字符串长度
count()对符合规则的进行计数
upper()、lower()转为大、小写字母
capitalize()首字母大写
strip()、lstrip()、rstrip()去掉两端、左、右空格
startswith()、endswith()以某规则开始、结束
replace( , , n=i)将匹配到的子串或正则表达式替换为另外的字符串,n为替换个数
split( , expand=False/True, n=i)、rsplit()基于分隔符切分、逆向切分字符串,expand是否拓展为DataFrame,n限制分割次数
s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])
df = pd.DataFrame({"key1": ["A ", "b", "vf sc ", np.nan, " ffj", "123 "],
                  "key2  ":list("hijkmn")})
s.str.len()
'''
0    1.0
1    1.0
2    1.0
3    7.0
4    3.0
5    NaN
6    2.0
dtype: float64
'''
df["key1"].str.count('f')
'''
0    0.0
1    0.0
2    1.0
3    NaN
4    2.0
5    0.0
Name: key1, dtype: float64
'''
df["key1"].str.startswith('v')
'''
0    False
1    False
2     True
3      NaN
4    False
5    False
Name: key1, dtype: object
'''
df.columns = df.columns.str.rstrip().str.upper() # Index(['KEY1', 'KEY2'], dtype='object')
df["KEY1"] = df["KEY1"].str.strip()
df
'''
    KEY1 KEY2
0      A    h
1      b    i
2  vf sc    j
3    NaN    k
4    ffj    m
5    123    n
'''

[]和get()

s = pd.Series(['ab,b,c', '1,2,3', 'C', ['a,,,c','a,b'], np.nan])
s.str[1]
'''
0      b
1      ,
2    NaN
3    a,b
4    NaN
dtype: object
'''
s.str.get(1)
'''
0      b
1      ,
2    NaN
3    a,b
4    NaN
dtype: object
'''
s.str[0::2]
'''
0        a,,
1        123
2          C
3    [a,,,c]
4        NaN
dtype: object
'''

replace()

s = pd.Series(['A', 'b', 'bbb hello', '123 4', 'hj'])
s.str.replace(" ", "_")
'''
0            A
1            b
2    bbb_hello
3        123_4
4           hj
dtype: object
'''
s.str.replace("b", "C", n = 1)
'''
0            A
1            C
2    Cbb hello
3        123 4
4           hj
dtype: object
'''

split()和rsplit()

s = pd.Series(['ab,b,c', '1,2,3', 'C', ['a,,,c','a,b'], np.nan])
s.str.split(",")
'''
0    [ab, b, c]
1     [1, 2, 3]
2           [C]
3           NaN
4           NaN
dtype: object
'''
s.str.split(",").str[1]
'''
0      b
1      2
2    NaN
3    NaN
4    NaN
dtype: object
'''
s.str.rsplit(",")
'''
0    [ab, b, c]
1     [1, 2, 3]
2           [C]
3           NaN
4           NaN
dtype: object
'''

s.str.split(",", n = 1)
'''
0    [ab, b,c]
1     [1, 2,3]
2          [C]
3          NaN
4          NaN
dtype: object
'''
s.str.split(",", expand = True)
'''
     0     1     2
0   ab     b     c
1    1     2     3
2    C  None  None
3  NaN  None  None
4  NaN  None  None
'''
s.str.split(",", expand = True, n = 1)
'''
     0     1
0   ab   b,c
1    1   2,3
2    C  None
3  NaN  None
4  NaN  None
'''
s.str.rsplit(",", expand = True, n = 1)
'''
      0     1
0  ab,b     c
1   1,2     3
2     C  None
3   NaN  None
4   NaN  None
'''

三、合并 merge()、join()

合并merge()

pd.merge(left, right, how=‘inner’, on=None, sort=True,
left_on=None, right_on=None,
left_index=False, right_index=False,
suffixes=(’_x’, ‘_y’), copy=True, indicator=False)

left:第一个df
right:第二个df

on:合并的参考键,多个写成[“key1”, “key2”]数组形式

how:合并方式

  • inner,取交集(默认)
  • outer,取并集,数据缺失为NaN
  • left,按照left为参考合并,数据缺失为NaN
  • right,按照right为参考合并,数据缺失为NaN

left_on, right_on, left_index, right_index:当两个DataFrame的键不同时,可以单独设置左键与右键,可以相互结合left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index

sort:按字典顺序通过连接键对结果DataFrame进行排序。默认为False,False时大幅提高性能。
尽量不通过sort排序,有需要时直接用Dataframe的排序方法sort_values和sort_index。

suffixes:当两个DataFrame中包含相同的index且该index不为连接键,给该index加后缀加以区分。

merge()与SQL等关系数据库、Excel的vlookup类似。

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df1
'''
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3
'''
df2
'''
    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3
'''
pd.merge(df1, df2, on = "key")
'''
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
'''

df3
'''
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K2   K1
'''
df4
'''
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0
'''
pd.merge(df3, df4, on = ['key1', 'key2'])
pd.merge(df3, df4, on = ['key1', 'key2'], how = "inner")
'''
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "outer")
'''
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "left")
'''
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A2  B2   K1   K0   C2   D2
4  A3  B3   K2   K1  NaN  NaN
'''
pd.merge(df3, df4, on = ['key1', 'key2'], how = "right")
'''
     A    B key1 key2   C   D
0   A0   B0   K0   K0  C0  D0
1   A2   B2   K1   K0  C1  D1
2   A2   B2   K1   K0  C2  D2
3  NaN  NaN   K2   K0  C3  D3
'''
df1 = pd.DataFrame({'lkey': list('bbacaab'),
                    'data1': range(7)},
                   index = list('abcdefg'))
df2 = pd.DataFrame({'rkey': list('abd'),
                   'data2': range(2, 5)})
df1
'''
   data1 lkey
a      0    b
b      1    b
c      2    a
d      3    c
e      4    a
f      5    a
g      6    b
'''
df2
'''
   data2 rkey
0      2    a
1      3    b
2      4    d
'''

pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey')
'''
   data1 lkey  data2 rkey
0      0    b      3    b
1      1    b      3    b
2      6    b      3    b
3      2    a      2    a
4      4    a      2    a
5      5    a      2    a
'''
pd.merge(df1, df2, left_index = True, right_on = "rkey")
'''
   data1 lkey  data2 rkey
0      0    b      2    a
1      1    b      3    b
2      3    c      4    d
'''

pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey', sort = True)
'''
   data1 lkey  data2 rkey
0      2    a      2    a
1      4    a      2    a
2      5    a      2    a
3      0    b      3    b
4      1    b      3    b
5      6    b      3    b
'''
pd.merge(df1, df2, left_on = 'lkey', right_on = 'rkey').sort_values("data1")
'''
   data1 lkey  data2 rkey
0      0    b      3    b
1      1    b      3    b
3      2    a      2    a
4      4    a      2    a
5      5    a      2    a
2      6    b      3    b
'''

join():直接通过索引连接。
df1.join(df2)相当于pd.merge(df1, df2, left_index = True, right_index = True, how = “left”)

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'KEY': ['K1', 'K2', 'K3']},
                    index = ['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                    index = ['K0', 'K2', 'K3'])
df1
'''
     A   B KEY
K0  A0  B0  K1
K1  A1  B1  K2
K2  A2  B2  K3
'''
df2
'''
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
'''
df1.join(df2)
# pd.merge(df1, df2, left_index = True, right_index = True, how = "left")
'''
     A   B KEY    C    D
K0  A0  B0  K1   C0   D0
K1  A1  B1  K2  NaN  NaN
K2  A2  B2  K3   C2   D2
'''
df1.join(df2, how = "inner")
'''
     A   B KEY   C   D
K0  A0  B0  K1  C0  D0
K2  A2  B2  K3  C2  D2
'''
df1.join(df2, on = "KEY")
# pd.merge(df1, df2, left_on = "KEY", right_index = True, how = "left")
'''
     A   B KEY    C    D
K0  A0  B0  K1  NaN  NaN
K1  A1  B1  K2   C2   D2
K2  A2  B2  K3   C3   D3
'''

四、连接与修补 concat()、combine_first()、update()

连接concat()

pd.concat(objs, axis=0, join=‘outer’, join_axes=None, keys=None,
ignore_index=False, levels=None, names=None, verify_integrity=False, copy=True)

objs:待连接的Series或DataFrame组成的数组

axis:连接的方向,默认为0,表示行与行连接;可设置为1,表示列与列连接

join:连接方式,如何处理在其它轴的索引

  • outer,并集(默认)
  • inner,交集

join_axes:指定连接后的index

keys:作为最外层索引

s1 = pd.Series([1, 2, 3], index = ['a', 'b', 'c'])
'''
a    1
b    2
c    3
dtype: int64
'''
s2 = pd.Series([2, 3, 4], index = ['b', 'c', 'd'])
'''
b    2
c    3
d    4
dtype: int64
'''

pd.concat([s1, s2])
'''
a    1
b    2
c    3
b    2
c    3
d    4
dtype: int64
'''

pd.concat([s1, s2], axis = 1)
'''
     0    1
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
'''

pd.concat([s1, s2], axis = 1, join = "inner")
'''
   0  1
b  2  2
c  3  3
'''

pd.concat([s1, s2], axis = 1, join_axes = [["a", "c"]])
'''
   0    1
a  1  NaN
c  3  3.0
'''

res1 = pd.concat([s1, s2], keys = ["one", "two"]) # <class 'pandas.core.series.Series'>
'''
one  a    1
     b    2
     c    3
two  b    2
     c    3
     d    4
dtype: int64
'''
res1.index
'''
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
'''
res1.loc["one"]
'''
a    1
b    2
c    3
dtype: int64
'''
res1["one"]["c"] # 3

res2 = pd.concat([s1, s2], axis = 1, keys = ["one", "two"]) # <class 'pandas.core.frame.DataFrame'>
'''
   one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
'''
res2.index # Index(['a', 'b', 'c', 'd'], dtype='object')

combine_first():根据index,df1的空值被df2替代;如果df2的index多于df1,则更新到df1上。

update():相同index位置,df2覆盖df1;如果df2的index多于df1,不会更新到df1。

df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan], [np.nan, 7., np.nan]])
'''
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN  NaN
2  NaN  7.0  NaN
'''
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4], [1., 2., 3.]], index=[1, 2, 3])
'''
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
3   1.0  2.0  3.0
'''
df1.combine_first(df2)
'''
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0
3  1.0  2.0  3.0
'''

df1.update(df2) # 返回值为None,更改了df1本身
df1
'''
      0    1    2
0   NaN  3.0  5.0
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
'''

五、去重与替换 duplicated()、drop_duplicates()、replace()

duplicated():返回布尔型Series,通过布尔判断,得到不重复的值
drop_duplicates():得到不重复的值,inplace参数是否替换原值,默认False

s = pd.Series([1, 3, 2, 2, 1, 1, 2, 2, 3, 4, 5, 5])
s.duplicated()
'''
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9     False
10    False
11     True
dtype: bool
'''
s[s.duplicated() == False] # 通过布尔判断,得到不重复的值
'''
0     1
1     3
2     2
9     4
10    5
dtype: int64
'''

df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
                   'key2': ['a', 'a', 'b', 'b', 'c']})
df.duplicated() # 判断行与行之间是否重复
'''
0    False
1     True
2    False
3    False
4    False
dtype: bool
'''

s.drop_duplicates()
'''
0     1
1     3
2     2
9     4
10    5
dtype: int64
'''

replace():可传入列表或字典,一次性替换一个值或多个值。

s = pd.Series(list('ascaasd'))
s.replace('a', np.nan) # 一对一
'''
0    NaN
1      s
2      c
3    NaN
4    NaN
5      s
6      d
dtype: object
'''
s.replace(['a', 's'], np.nan) # 多对一
'''
0    NaN
1    NaN
2      c
3    NaN
4    NaN
5    NaN
6      d
dtype: object
'''
s.replace({'a': 'Hello', 's': 123}) # 多对多
'''
0    Hello
1      123
2        c
3    Hello
4    Hello
5      123
6        d
dtype: object
'''
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值