Python Scientific Computing: Advanced Pandas groupby, String, and Index Operations

蜜桃上的小叮当

Article link: https://blog.csdn.net/sinat_31854967/article/details/109062004

Groupby operations

Build a DataFrame to use in the groupby examples

import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                           'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

# Group by the values in column A and count the rows in each group
grouped = df.groupby('A')
grouped.count()

# Group by the values in columns A and B; multiple keys must be passed as a list
grouped = df.groupby(['A','B'])
grouped.count()
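
Because the DataFrame above is filled with random numbers, a small fixed example may make the multi-key result easier to check by hand. This is only a sketch; `df_demo` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical fixed data so the counts can be verified by eye
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# A list of column names groups by every combination of the keys
counts = df_demo.groupby(['A', 'B']).count()
print(counts)
```

The result is indexed by the (A, B) pairs that actually occur in the data, so ('foo', 'two') does not appear.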

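# Group by a custom function
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'a'
    else:
        return 'b'
# axis=1 applies the function to the column labels. This argument is
# deprecated in pandas 2.1 and later, where the equivalent is
# df.T.groupby(get_letter_type).count().T.iloc[0]
grouped = df.groupby(get_letter_type,axis = 1)
grouped.count().iloc[0]
# Output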
a    1
b    3
Name: 0, dtype: int64
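
On pandas versions where the `axis` argument of groupby is unavailable, the same column grouping can be sketched by transposing first. The frame `df_demo` below is an assumption with simple integer values; only the column labels matter here.

```python
import numpy as np
import pandas as pd

def get_letter_type(letter):
    # 'a' for vowel labels, 'b' for everything else
    return 'a' if letter.lower() in 'aeiou' else 'b'

# Hypothetical data: 2 rows, columns A to D
df_demo = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])

# After transposing, the column labels become the index,
# so the function is applied to them by an ordinary groupby
col_counts = df_demo.T.groupby(get_letter_type).count().T
first_row = col_counts.iloc[0]
print(first_row)
```

Only column A has a vowel label, so group 'a' counts one column and group 'b' counts three.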

Index operations in groupby

# Build a Series with repeated index labels
s = pd.Series([1,2,3,1,2,3],[9,8,7,9,8,7])
s
# Output
9    1
8    2
7    3
9    1
8    2
7    3
dtype: int64
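# The level parameter selects the index level to group on. Group keys are sorted by default; pass sort=False to keep their order of first appearance
grouped = s.groupby(level = 0)
# first() and last() return the first and last value in each group (identical for this data)
grouped.first()
grouped.last()
# Output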
7    3
8    2
9    1
dtype: int64

# Sums work the same way, as do averages with mean()
grouped.sum()
# Output
7    6
8    4
9    2
dtype: int64
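
The effect of `sort=False` mentioned above can be seen side by side in a short sketch. The name `s_demo` is illustrative; the data is the same as the Series above.

```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 1, 2, 3], index=[9, 8, 7, 9, 8, 7])

# Default: group keys come back sorted (7, 8, 9)
sorted_sum = s_demo.groupby(level=0).sum()

# sort=False: group keys keep their order of first appearance (9, 8, 7)
unsorted_sum = s_demo.groupby(level=0, sort=False).sum()

print(sorted_sum)
print(unsorted_sum)
```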

# View a single group

df2 = pd.DataFrame({'X':['A','B','A','B'],'Y':[1,2,3,4]})
# Show only the rows in group A
df2.groupby('X').get_group('A')
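
As a small sketch of what `get_group` returns, and of how all groups can be visited in turn, the same data can be used again. The names `df_demo`, `group_a`, and `sizes` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 2, 3, 4]})
grouped = df_demo.groupby('X')

# get_group returns the rows of one group with their original index labels
group_a = grouped.get_group('A')
print(group_a)

# Iterating over a groupby object yields (key, sub-frame) pairs
sizes = {key: len(frame) for key, frame in grouped}
print(sizes)
```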

MultiIndex in groupby

# Build a MultiIndex structure
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays,names = ['first','second'])
s = pd.Series(np.random.randn(8),index = index)
s
# Output
first  second
bar    one      -1.606608
       two      -0.789309
baz    one      -1.440065
       two       2.042331
foo    one       0.291961
       two       0.753495
qux    one       0.893588
       two      -1.004419
dtype: float64

# groupby on a MultiIndex; level can also be given by name, 'first' or 'second'
grouped = s.groupby(level = 0)
grouped.sum()
# Output
first
bar   -2.395917
baz    0.602266
foo    1.045456
qux   -0.110831
dtype: float64
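
Grouping by level name rather than position can be sketched with fixed values so the sums are easy to verify. The shortened index and the values in `s_demo` are assumptions for illustration.

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Group on the outer level by name: bar = 1 + 2, baz = 3 + 4
by_first = s_demo.groupby(level='first').sum()

# Group on the inner level by name: one = 1 + 3, two = 2 + 4
by_second = s_demo.groupby(level='second').sum()

print(by_first)
print(by_second)
```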

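# Sum each group (the string 'sum' is equivalent to passing np.sum, which newer pandas versions warn about)
grouped = df.groupby(['A','B'])
grouped.aggregate('sum')

# as_index=False keeps the group keys as ordinary columns instead of moving them into the index
grouped = df.groupby(['A','B'],as_index = False)
grouped.aggregate('sum')

# Resetting the index afterwards gives the same layout
df.groupby(['A','B']).sum().reset_index()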
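
The two ways of getting the group keys back as columns can be compared directly. This is a sketch on made-up data (`df_demo`), not the random frame used above.

```python
import pandas as pd

# Hypothetical fixed data
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo'],
    'B': ['one', 'one', 'one'],
    'C': [1, 2, 3],
})

# Keys kept as columns at grouping time
flat = df_demo.groupby(['A', 'B'], as_index=False).sum()

# Keys moved into the index, then moved back out
reset = df_demo.groupby(['A', 'B']).sum().reset_index()

print(flat)
print(flat.equals(reset))
```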

Other groupby functions

# size returns the number of rows in each group
grouped = df.groupby(['A','B'])
grouped.size()

# describe reports summary statistics for each group
grouped.describe().head()

# Apply several aggregations to column C at once
grouped = df.groupby('A')
grouped['C'].agg(['sum','mean','std'])
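
When several aggregations are applied together, named aggregation lets the output columns be labelled explicitly. The sketch below uses made-up data, and the column names `total`, `average`, and `spread` are arbitrary choices.

```python
import pandas as pd

# Hypothetical fixed data
df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 6.0],
})

# keyword = aggregation name; each keyword becomes a column in the result
stats = df_demo.groupby('A')['C'].agg(total='sum', average='mean', spread='std')
print(stats)
```

Here `std` is the sample standard deviation (ddof=1), which is the pandas default.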

String operations

Case conversion

import pandas as pd
import numpy as np
s = pd.Series(['A','B','c','hello','HELLO',np.nan])
# Convert to lowercase
s.str.lower()
# Convert to uppercase
s.str.upper()

String length

s.str.len()
# Output (floats, because the missing value is represented as NaN)
0    1.0
1    1.0
2    1.0
3    5.0
4    5.0
5    NaN
dtype: float64

Stripping whitespace from an Index

index = pd.Index(['  Louis','  Cauchy   '])
index
# Output
Index(['  Louis', '  Cauchy   '], dtype='object')

# Strip whitespace from both ends
index.str.strip()
# Output
Index(['Louis', 'Cauchy'], dtype='object')
# Strip whitespace on the left only
index.str.lstrip()
# Strip whitespace on the right only
index.str.rstrip()
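
The three variants differ only in which side they trim, which a short sketch makes visible. The name `index_demo` is illustrative; the labels are the ones used above.

```python
import pandas as pd

index_demo = pd.Index(['  Louis', '  Cauchy   '])

both = index_demo.str.strip()    # leading and trailing whitespace removed
left = index_demo.str.lstrip()   # only leading whitespace removed
right = index_demo.str.rstrip()  # only trailing whitespace removed

print(list(both))
print(list(left))
print(list(right))
```

Whitespace inside a label is never touched by these methods; that is what `str.replace` is for.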

# Replace spaces in DataFrame column names
df = pd.DataFrame(np.random.randn(3,2),columns = ['A a','B b'],index = range(3))
df.columns = df.columns.str.replace(' ','_')
df

Splitting strings

s = pd.Series(['a_b_c','D_E_F','G_H_I'])
s
# Output
0    a_b_c
1    D_E_F
2    G_H_I
dtype: object

# Split on a delimiter
s.str.split('_')
# Output
0    [a, b, c]
1    [D, E, F]
2    [G, H, I]
dtype: object

# Expand the pieces into separate columns
s.str.split('_',expand = True)

# Limit the splitting; n is the maximum number of splits
s.str.split('_',expand = True,n=1)
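
Since the rendered tables are not shown here, the following sketch prints the shapes and first rows instead. `str.rsplit` is included as a related method that splits from the right; the variable names are illustrative.

```python
import pandas as pd

s_demo = pd.Series(['a_b_c', 'D_E_F', 'G_H_I'])

# Every delimiter is used: three columns
full = s_demo.str.split('_', expand=True)

# At most one split, counted from the left: two columns
limited = s_demo.str.split('_', expand=True, n=1)

# At most one split, counted from the right
from_right = s_demo.str.rsplit('_', expand=True, n=1)

print(full.shape, limited.shape)
print(limited.iloc[0].tolist())     # ['a', 'b_c']
print(from_right.iloc[0].tolist())  # ['a_b', 'c']
```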

Containment

s = pd.Series(['A','Ab','Abcd','Abcde','Abcdef',])
s
# Output
0         A
1        Ab
2      Abcd
3     Abcde
4    Abcdef
dtype: object

# Test whether each string contains 'Abc'
s.str.contains('Abc')
# Output
0    False
1    False
2     True
3     True
4     True
dtype: bool
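
Two details of `str.contains` are easy to miss: the pattern is treated as a regular expression by default, and missing values propagate unless `na` is given. The sketch below uses made-up strings (`s_demo`) to show both.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(['Abcd', 'abc', 'a.c', np.nan])

# Default: the pattern is a regular expression, so '.' matches any character
regex_hits = s_demo.str.contains('a.c', na=False)

# regex=False: the pattern is matched literally
literal_hits = s_demo.str.contains('a.c', regex=False, na=False)

print(regex_hits.tolist())    # [False, True, True, False]
print(literal_hits.tolist())  # [False, False, True, False]
```

The match is case-sensitive, which is why 'Abcd' is not a hit in either case.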

Indicator columns from delimited strings: 1 if the token is present, 0 if not

s = pd.Series(['a','a|b','a|c'])
s
# Output
0      a
1    a|b
2    a|c
dtype: object

# Produce a table with one column per token, using '|' as the separator
s.str.get_dummies(sep = '|')
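
The resulting table can be checked with a short sketch that prints the indicator matrix as text. The names `s_demo` and `dummies` are illustrative.

```python
import pandas as pd

s_demo = pd.Series(['a', 'a|b', 'a|c'])

# One column per distinct token; each cell is 1 if the row contains the token
dummies = s_demo.str.get_dummies(sep='|')
print(dummies)
```

Every row contains `a`, while `b` appears only in the second row and `c` only in the third.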

Index operations

Build a simple reversed index

import pandas as pd
import numpy as np
s = pd.Series(np.arange(5),index = np.arange(5)[::-1],dtype='int64')
s
# Output
4    0
3    1
2    2
1    3
0    4
dtype: int64

isin: test whether each value is in a given list

s.isin([0,2,4])
# Output
4     True
3    False
2     True
1    False
0     True
dtype: bool

s[s.isin([0,2,4])]
# Output
4    0
2    2
0    4
dtype: int64

# Select the values greater than 2
s[s>2]
# Output
1    3
0    4
dtype: int64
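
Because this Series has a reversed index, it is a convenient place to contrast testing the values with testing the index labels. The sketch below assumes the same data under the name `s_demo`.

```python
import numpy as np
import pandas as pd

# Values 0..4 with index labels 4..0
s_demo = pd.Series(np.arange(5), index=np.arange(5)[::-1])

# Series.isin tests the values
by_value = s_demo[s_demo.isin([0, 1])]

# Index.isin tests the index labels
by_label = s_demo[s_demo.index.isin([0, 1])]

print(by_value)
print(by_label)
```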

# Build a MultiIndex
s2 = pd.Series(np.arange(6),index = pd.MultiIndex.from_product([[0,1],['a','b','c']]))
s2
# Output
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int32
# Look up specific entries in the MultiIndex
s2.iloc[s2.index.isin([(0,'a'),(1,'c')])]
# Output
0  a    0
1  c    5
dtype: int32
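
`MultiIndex.isin` also accepts a `level` argument, which tests a single level instead of full tuples. A minimal sketch, with `s_demo` standing in for the Series above:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(
    np.arange(6),
    index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]),
)

# Match complete (outer, inner) tuples
exact = s_demo[s_demo.index.isin([(0, 'a'), (1, 'c')])]

# Match on the inner level only, whatever the outer label is
inner = s_demo[s_demo.index.isin(['a', 'c'], level=1)]

print(exact.tolist())  # [0, 5]
print(inner.tolist())  # [0, 2, 3, 5]
```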

The select function

# Build a table indexed by dates
dates = pd.date_range('20201001',periods=8)
df = pd.DataFrame(np.random.randn(8,4),index=dates,columns=['A','B','C','D'])
df

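# Use a lambda to pick the column whose label equals 'A'.
# DataFrame.select was removed in pandas 1.0; .loc with a callable does the same job
df.loc[:, lambda d: d.columns == 'A']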
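
Selecting by a predicate on the labels can also be sketched with plain boolean masks, which work on both axes. The frame `df_demo` and its integer values are assumptions chosen so the result is reproducible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20201001', periods=8)
df_demo = pd.DataFrame(np.arange(32).reshape(8, 4),
                       index=dates, columns=['A', 'B', 'C', 'D'])

# Column predicate: keep the columns whose label equals 'A'
col_mask = [name == 'A' for name in df_demo.columns]
only_a = df_demo.loc[:, col_mask]

# Row predicate: keep the rows dated on or before the 3rd of the month
row_mask = [ts.day <= 3 for ts in df_demo.index]
early = df_demo.loc[row_mask]

print(only_a.columns.tolist())
print(len(early))
```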

The where operation

df.where(df < 0)

# Replace every value that is not < 0 with its negation
df.where(df < 0,-df)

# Replace every value that is not < 0 with a constant
df.where(df < 0,1)
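
The three `where` calls above can be checked on a tiny fixed frame. `mask` is shown as the inverse of `where`; the data in `df_demo` is made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [-1.0, 2.0], 'B': [3.0, -4.0]})

# Keep values where the condition holds, NaN elsewhere
kept = df_demo.where(df_demo < 0)

# Where the condition fails, take the value from -df_demo instead
negated = df_demo.where(df_demo < 0, -df_demo)

# mask replaces where the condition holds, so this equals where(df_demo < 0, 1)
masked = df_demo.mask(df_demo >= 0, 1)

print(kept)
print(negated)
print(masked)
```

In `negated` every entry ends up non-positive, which is the same as `-df_demo.abs()`.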

Querying with query

df = pd.DataFrame(np.random.rand(10,3),columns = list('abc'))
df

# Select the rows where a < b
df.query('(a<b)')

# Select the rows where a < b < c
df.query('(a<b) & (b<c)')
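
A query string is shorthand for a boolean mask, and chained comparisons are accepted as well. The sketch below uses made-up integer data (`df_demo`) so the selected row is predictable.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 5, 2], 'b': [2, 4, 3], 'c': [3, 6, 1]})

# Explicit form with two conditions
explicit = df_demo.query('(a < b) & (b < c)')

# Chained comparison, equivalent to the line above
chained = df_demo.query('a < b < c')

# The same selection written as an ordinary boolean mask
masked = df_demo[(df_demo.a < df_demo.b) & (df_demo.b < df_demo.c)]

print(explicit)
```

Only the first row satisfies 1 < 2 < 3, so all three forms return that single row.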
