Learning pandas: these four areas are all you need

Author: 方寸之间

Tags: learning, python

Article link: https://blog.csdn.net/weixin_43108397/article/details/123147543

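Preface

Once you are comfortable with these four areas of pandas, you can handle many everyday data-processing and statistical-analysis tasks flexibly.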

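I. Querying in pandas

1. The df.loc method:

**loc is one of the most important ways to select data. It is followed by [], which takes two arguments: the first selects rows and the second selects columns, both by label.**

Example:
Create a new DataFrame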

import pandas as pd
data = {'Chinese': [66, 95, 93, 90, 80], 'English': [65, 85, 92, 88, 90], 'Math': [30, 98, 96, 77, 90]}
df = pd.DataFrame(data)

   Chinese  English  Math
0       66       65    30
1       95       85    98
2       93       92    96
3       90       88    77
4       80       90    90

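Select rows 1 and 2 and the Chinese and Math columns. A loc slice is label-based and includes both endpoints, so 1:2 returns two rows.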

df_new = df.loc[1:2, ["Chinese", "Math"]]
print(df_new)

   Chinese  Math
1       95    98
2       93    96
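
Because loc slices include both endpoints, they behave differently from ordinary Python slices and from position-based iloc. The following is a minimal sketch of that difference; the variable names (scores, by_label, by_position) are made up for this illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "Chinese": [66, 95, 93, 90, 80],
    "English": [65, 85, 92, 88, 90],
    "Math": [30, 98, 96, 77, 90],
})

# loc slices by label and includes both endpoints: labels 1 and 2.
by_label = scores.loc[1:2, ["Chinese", "Math"]]

# iloc slices by position and excludes the end, like a Python list slice,
# so 1:3 is needed to reach the same two rows.
# Column positions 0 and 2 are Chinese and Math.
by_position = scores.iloc[1:3, [0, 2]]

print(by_label.equals(by_position))  # True
```

With the default integer index the labels and positions happen to coincide, which is why the two selections match here; after filtering or sorting they generally will not.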

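2. Querying with boolean expressions

Find the students whose Math and English scores are both above 90. Each condition must be wrapped in parentheses, because & binds more tightly than >.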

df2 = df.loc[(df["Math"] > 90) & (df["English"] > 90), :]

   Chinese  English  Math
2       93       92    96
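
It can help to look at the boolean mask on its own before using it to filter. The sketch below builds the mask step by step and, as an alternative, writes the same filter with DataFrame.query; the names mask, via_mask and via_query are illustrative only.

```python
import pandas as pd

scores = pd.DataFrame({
    "Chinese": [66, 95, 93, 90, 80],
    "English": [65, 85, 92, 88, 90],
    "Math": [30, 98, 96, 77, 90],
})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (scores["Math"] > 90) & (scores["English"] > 90)
via_mask = scores.loc[mask]

# query expresses the same filter as a string expression.
via_query = scores.query("Math > 90 and English > 90")

print(mask.tolist())               # [False, False, True, False, False]
print(via_mask.equals(via_query))  # True
```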

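II. Assignment in pandas

1. Direct assignment

Add a total-score column.
df['Chinese'] is a Series, and Series support element-wise arithmetic, so the sum below is itself a Series that is assigned as a new column.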

df['total_score'] = df['Chinese'] + df['English'] + df['Math']

   Chinese  English  Math  total_score
0       66       65    30          161
1       95       85    98          278
2       93       92    96          281
3       90       88    77          255
4       80       90    90          260
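
The same total can be computed without spelling out every column in the expression. This is a small sketch of that variant, assuming the subject columns are collected in a helper list called subjects (a name introduced here, not part of the original example).

```python
import pandas as pd

scores = pd.DataFrame({
    "Chinese": [66, 95, 93, 90, 80],
    "English": [65, 85, 92, 88, 90],
    "Math": [30, 98, 96, 77, 90],
})

# Hypothetical helper: the columns that count toward the total.
subjects = ["Chinese", "English", "Math"]

# sum(axis=1) adds across the selected columns of each row.
scores["total_score"] = scores[subjects].sum(axis=1)

print(scores["total_score"].tolist())  # [161, 278, 281, 255, 260]
```

This form scales better when more subjects are added, since only the list changes.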

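2. The apply method

apply computes a column from an arbitrary function.
Here each student is assigned a grade based on the total score.

def get_type(x):
    # x is one row of df, passed as a Series indexed by column name
    if x['total_score'] > 250:
        return 'excellent'
    elif 200 <= x['total_score'] <= 250:
        return 'good'
    else:
        return 'fail'

# axis=1 applies the function to each row rather than to each column
df.loc[:, 'type'] = df.apply(get_type, axis=1)

   Chinese  English  Math  total_score       type
0       66       65    30          161       fail
1       95       85    98          278  excellent
2       93       92    96          281  excellent
3       90       88    77          255  excellent
4       80       90    90          260  excellent

3. Direct assignment by condition

A different approach that produces the same result:

df.loc[df['total_score'] > 250, 'new_type'] = 'excellent'
df.loc[(df['total_score'] >= 200) & (df['total_score'] <= 250), 'new_type'] = 'good'
df.loc[df['total_score'] < 200, 'new_type'] = 'fail'

   Chinese  English  Math  total_score       type   new_type
0       66       65    30          161       fail       fail
1       95       85    98          278  excellent  excellent
2       93       92    96          281  excellent  excellent
3       90       88    77          255  excellent  excellent
4       80       90    90          260  excellent  excellent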
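
A third way to assign grades, not used in the article, is numpy.select, which evaluates a list of conditions in order and picks the first match. The sketch below is only an illustration: the totals are made-up values chosen so that every grade appears, and the English label strings are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up totals (not the article's data) so that all three grades occur.
total = pd.Series([161, 278, 230, 250, 200], name="total_score")

# Conditions are checked in order; the first one that is True wins.
conditions = [total > 250, total >= 200]
labels = ["excellent", "good"]

# Rows matching no condition (total below 200) get the default label.
grade = pd.Series(np.select(conditions, labels, default="fail"), index=total.index)

print(grade.tolist())  # ['fail', 'excellent', 'good', 'good', 'good']
```

Unlike apply with axis=1, this works on whole columns at once, which tends to matter on large tables.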

III. Statistical functions in pandas

1. describe: summary statistics for all numeric columns

print(df.describe())

         Chinese    English       Math  total_score
count   5.000000   5.000000   5.000000     5.000000
mean   84.800000  84.000000  78.200000   247.000000
std    11.987493  10.931606  28.163807    49.360916
min    66.000000  65.000000  30.000000   161.000000
25%    80.000000  85.000000  77.000000   255.000000
50%    90.000000  88.000000  90.000000   260.000000
75%    93.000000  90.000000  96.000000   278.000000
max    95.000000  92.000000  98.000000   281.000000

2. mean, max, min, and sum of a single Series

Compute the mean, maximum, minimum, and sum, in that order:

print(df['Chinese'].mean(), df['Chinese'].max(), df['Chinese'].min(), df['Chinese'].sum())

84.8 95 66 424
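
Several statistics can also be requested in one call with Series.agg. The sketch below adds the median alongside the mean, since the two are easy to confuse and differ for this column; the variable names are illustrative.

```python
import pandas as pd

chinese = pd.Series([66, 95, 93, 90, 80], name="Chinese")

# agg with a list of names returns a Series indexed by those names.
summary = chinese.agg(["mean", "median", "max", "min", "sum"])

print(summary["mean"])    # 84.8
print(summary["median"])  # 90.0
```

The mean is pulled down by the single low score of 66, while the median is the middle value of the sorted scores.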

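IV. Grouping and aggregation in pandas

1. The merge method: combining two tables

Add a student_id column to df and drop the two grade columns (type and new_type)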

df.loc[:, 'student_id'] = pd.Series([1, 2, 3, 4, 5])
df = df.drop(['new_type', 'type'], axis=1)

   Chinese  English  Math  total_score  student_id
0       66       65    30          161           1
1       95       85    98          278           2
2       93       92    96          281           3
3       90       88    77          255           4
4       80       90    90          260           5

Create a second DataFrame, df2

data = {'student_id': [1, 2, 3, 4, 5], 'sports': [73, 98, 88, 60, 45], 'class': [1, 1, 2, 1, 2]}
df2 = pd.DataFrame(data)

   student_id  sports  class
0           1      73      1
1           2      98      1
2           3      88      2
3           4      60      1
4           5      45      2

Merge the two tables (on='student_id' sets the join key; how='left' performs a left join):

df3 = pd.merge(df, df2, on='student_id', how='left')
print(df3)

   Chinese  English  Math  total_score  student_id  sports  class
0       66       65    30          161           1      73      1
1       95       85    98          278           2      98      1
2       93       92    96          281           3      88      2
3       90       88    77          255           4      60      1
4       80       90    90          260           5      45      2
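
In the example above every student_id appears in both tables, so the choice of how makes no visible difference. The following sketch uses two small made-up tables with partly different keys to show what a left join keeps, and uses the indicator option of merge to mark where each row came from.

```python
import pandas as pd

# Made-up tables: student 1 exists only on the left, student 4 only on the right.
left = pd.DataFrame({"student_id": [1, 2, 3], "Math": [30, 98, 96]})
right = pd.DataFrame({"student_id": [2, 3, 4], "sports": [98, 88, 60]})

# A left join keeps every row of `left`; missing matches become NaN.
left_join = pd.merge(left, right, on="student_id", how="left", indicator=True)

# An inner join keeps only the keys present in both tables.
inner_join = pd.merge(left, right, on="student_id", how="inner")

print(left_join["_merge"].tolist())       # ['left_only', 'both', 'both']
print(inner_join["student_id"].tolist())  # [2, 3]
```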

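2. Grouped statistics with groupby

a) Grouped sums

Using df3, group by class and sum every column. Note that student_id is summed as well, although that total carries no meaning.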

df4 = df3.groupby('class').sum()
print(df4)

       Chinese  English  Math  total_score  student_id  sports
class                                                         
1          251      238   205          694           7     231
2          173      182   186          541           8     133
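
Summing every column also adds up identifiers such as student_id. One way to avoid that, sketched below with a reduced copy of the merged data, is to select the columns of interest on the groupby object before aggregating. The names merged and per_class are illustrative.

```python
import pandas as pd

merged = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "total_score": [161, 278, 281, 255, 260],
    "sports": [73, 98, 88, 60, 45],
    "class": [1, 1, 2, 1, 2],
})

# Select the score columns first so that student_id is not summed.
per_class = merged.groupby("class")[["total_score", "sports"]].sum()

print(per_class.loc[1, "total_score"])  # 694
print(per_class.loc[2, "sports"])       # 133
```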

b) Multi-level index: grouping by several keys, and agg

df5 = df3.groupby(['class', 'student_id']).max()
print(df5)

                  Chinese  English  Math  total_score  sports
class student_id                                             
1     1                66       65    30          161      73
      2                95       85    98          278      98
      4                90       88    77          255      60
2     3                93       92    96          281      88
      5                80       90    90          260      45
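
Grouping by two keys produces a two-level row index, which changes how rows are selected afterwards. This is a minimal sketch of the common access patterns, using a reduced copy of the data; by_class_and_id and flat are names chosen for the example.

```python
import pandas as pd

merged = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "Math": [30, 98, 96, 77, 90],
    "class": [1, 1, 2, 1, 2],
})

by_class_and_id = merged.groupby(["class", "student_id"]).max()

# A tuple selects one (class, student_id) pair.
print(by_class_and_id.loc[(2, 5), "Math"])  # 90

# A single outer label selects every row of that class.
print(by_class_and_id.loc[1].index.tolist())  # [1, 2, 4]

# reset_index turns the index levels back into ordinary columns.
flat = by_class_and_id.reset_index()
print(flat.columns.tolist())  # ['class', 'student_id', 'Math']
```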

# Name the aggregations as strings; no NumPy import is required
df6 = df3.groupby('class').agg({"Chinese": "max", "total_score": ["max", "min"]})
print(df6)

      Chinese total_score     
          max         max  min
class                         
1          95         278  161
2          93         281  260
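
Passing a dict to agg yields two-level column labels, which can be awkward to work with later. pandas also supports named aggregation, where each output column is given as name=(column, function). The sketch below shows that form on a reduced copy of the data; the output column names (chinese_max and so on) are made up for the example.

```python
import pandas as pd

merged = pd.DataFrame({
    "Chinese": [66, 95, 93, 90, 80],
    "total_score": [161, 278, 281, 255, 260],
    "class": [1, 1, 2, 1, 2],
})

# Each keyword becomes one flat output column: name=(source column, function).
summary = merged.groupby("class").agg(
    chinese_max=("Chinese", "max"),
    total_max=("total_score", "max"),
    total_min=("total_score", "min"),
)

print(summary.columns.tolist())          # ['chinese_max', 'total_max', 'total_min']
print(summary.loc[1, "total_min"])       # 161
```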

3. The groupby object

groupby splits the DataFrame by key into several smaller DataFrames. Iterating over the groupby object yields (key, DataFrame) pairs.

df7 = df3.groupby('class')
for name, group_df in df7:
    print('Class:', name)
    print(group_df)

Class: 1
   Chinese  English  Math  total_score  student_id  sports  class
0       66       65    30          161           1      73      1
1       95       85    98          278           2      98      1
3       90       88    77          255           4      60      1
Class: 2
   Chinese  English  Math  total_score  student_id  sports  class
2       93       92    96          281           3      88      2
4       80       90    90          260           5      45      2
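
When only one group is needed, looping over all of them is unnecessary. The sketch below, on a reduced copy of the data, uses get_group to fetch a single group and size to count the rows in each group; the variable names are illustrative.

```python
import pandas as pd

merged = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "total_score": [161, 278, 281, 255, 260],
    "class": [1, 1, 2, 1, 2],
})

grouped = merged.groupby("class")

# get_group returns the sub-DataFrame for one key, keeping the original index.
class_one = grouped.get_group(1)
print(class_one["student_id"].tolist())  # [1, 2, 4]

# size counts the rows in each group.
print(grouped.size().to_dict())  # {1: 3, 2: 2}
```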