Three Power Tools for Looping in pandas

Xiaofei@IDO

Article link: https://blog.csdn.net/nixiang_888/article/details/108559374

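1. Overview

In day-to-day data processing with pandas, you often need to apply the same operation to a single row, several rows (the same holds for columns), or even the whole DataFrame, for example replacing male with 1 and female with 0 in a sex column.

A for loop is the simple, direct way to do this, but it runs slowly. This article introduces three pandas tools for the job: map, apply, and applymap.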
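As a minimal sketch of the replacement described above, the following compares an explicit loop with Series.map. The toy column `sex` and the dictionary `mapping` are made up for illustration and are not part of the article's data set.

```python
import pandas as pd

# Hypothetical toy column (not the article's data set)
sex = pd.Series(["male", "female", "female", "male"])
mapping = {"male": 1, "female": 0}

# Explicit Python loop: look up one element at a time
looped = []
for value in sex:
    looped.append(mapping[value])
looped = pd.Series(looped, index=sex.index)

# The same replacement expressed with Series.map
mapped = sex.map(mapping)

print(mapped.tolist())        # [1, 0, 0, 1]
print(looped.equals(mapped))  # True
```

Both versions give the same Series; the map call simply leaves the iteration to pandas.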

2. Simulated data

import pandas as pd
import numpy as np

boolean = [True, False]
gender = ["Male", "Female"]
color = ["white", "black", "red"]

df = pd.DataFrame(
    {
        "height": np.random.randint(160, 190, 100),
        "weight": np.random.randint(60, 90, 100),
        "smoker": [boolean[x] for x in np.random.randint(0, 2, 100)],
        "gender": [gender[x] for x in np.random.randint(0, 2, 100)],
        "age": np.random.randint(20, 60, 100),
        "color": [color[x] for x in np.random.randint(0, len(color), 100)]
    }
)

3. pandas.Series.map(arg, na_action=None)

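| Description: replaces every value in a Series according to the mapping given by arg

Parameters:

arg: a function, a dictionary, or a Series
na_action: {None, 'ignore'}; if 'ignore', NaN values are propagated as they are and are not passed to the mapping.

Returns:

A Series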
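The dictionary and function forms of arg are covered by the examples in this section. For the third form, here is a small sketch that passes a Series as arg; the name `lookup` is an illustrative assumption. The index of the lookup Series plays the role of the dictionary keys.

```python
import numpy as np
import pandas as pd

s = pd.Series(["cat", "dog", np.nan, "rabbit"])

# Lookup Series: the index holds the keys, the values hold the replacements
lookup = pd.Series(["kitten", "puppy"], index=["cat", "dog"])

result = s.map(lookup)
print(result.tolist())  # ['kitten', 'puppy', nan, nan]
```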

Example 01

# arg can be a dictionary or a Series whose mapping replaces the values; elements without a mapping become NaN
import pandas as pd
import numpy as np

s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])

sm1 = s.map({'cat':'kitten', 'dog':'puppy'})
print(sm1)
# results
0    kitten
1     puppy
2       NaN
3       NaN
dtype: object
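When a dictionary covers only some of the values, the unmapped ones turn into NaN, as the output above shows. If the original value should be kept instead, one possible approach (a sketch, not something the article prescribes) is to fill the gaps from the source Series with fillna; the names `partial` and `kept` are illustrative.

```python
import pandas as pd

s = pd.Series(["cat", "dog", "rabbit"])
partial = {"cat": "kitten", "dog": "puppy"}

# Unmapped entries become NaN first; fillna then restores the original value
kept = s.map(partial).fillna(s)
print(kept.tolist())  # ['kitten', 'puppy', 'rabbit']
```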

Example 02

# arg can also be a function
sm2 = s.map("It is a {}".format)
print(sm2)
# results
0       It is a cat
1       It is a dog
2       It is a nan
3    It is a rabbit
dtype: object

# -------------------- #
sm2 = s.map("It is a {}".format, na_action = "ignore")
print(sm2)
# results
0       It is a cat
1       It is a dog
2               NaN
3    It is a rabbit
dtype: object
# Note the difference made by na_action = 'ignore': the NaN entry is left untouched

Example 03 - pd.DataFrame

# 01 - mapping defined by a dictionary
dic = {'Male': 0, 'Female': 1}
df1 = df.copy()
df1['gender'] = df1['gender'].map(dic)
print(df1.head())
# results
    height  weight  smoker  gender  age  color
0      160      60   False       1   44  white
1      188      88    True       1   52  white
2      189      89   False       1   33  black
3      176      69   False       1   29  white
4      177      81   False       1   34  black

# 02 - mapping defined by a function
def map_gender(x):
    gender = 0 if x == 'Male' else 1
    return gender

df2 = df.copy()
df2['gender'] = df2['gender'].map(map_gender)
print(df2.head())
# results
   height  weight  smoker  gender  age  color
0     189      68    True       1   41    red
1     164      81    True       1   24  black
2     160      84    True       0   57  white
3     180      75    True       1   31  black
4     173      61    True       1   30  black

4. The apply function

4.1 pandas.Series.apply(func, convert_dtype=True, args=(), **kwds)

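| Applies a function, possibly more complex than a plain mapping, to every value in a Series

Parameters:

func: a user-defined function or a numpy function; its first argument is an element of the Series
convert_dtype: when True (the default), the results are converted to a suitable dtype automatically
args: tuple of positional arguments passed to func after the Series value
kwds: keyword arguments passed to func

Returns:

Series or DataFrame

Processing Series values with a single-argument function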

s = pd.Series([20, 21, 12],
              index=['London', 'New York', "Helsinki"])

def square(x):
    return x**2
# user-defined function
print(s.apply(square))
# results
London      400
New York    441
Helsinki    144
dtype: int64
# anonymous (lambda) function
print(s.apply(lambda x : x**2))
# results
London      400
New York    441
Helsinki    144
dtype: int64

Processing Series values with a function taking extra positional arguments

def subtract_custom_value(x, custom_value):
    return x - custom_value

print(s.apply(subtract_custom_value, args=(5,)))
# results
London      15
New York    16
Helsinki     7
dtype: int64

Processing Series values with a function taking keyword arguments

def add_custom_value(x, **kwargs):
    for month in kwargs:
        x += kwargs[month]
    return x

print(s.apply(func=add_custom_value, june=30, july=20, august=25))
# ------------------------------- #
London      95
New York    96
Helsinki    87
dtype: int64
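Positional and keyword arguments can also be combined in one call. The sketch below assumes a hypothetical helper `scale_and_shift` (not from the article) to show how args and a keyword argument reach func together.

```python
import pandas as pd

s = pd.Series([20, 21, 12], index=["London", "New York", "Helsinki"])

def scale_and_shift(x, factor, offset=0):
    # Hypothetical helper: x is one Series element,
    # factor arrives through args, offset through a keyword argument
    return x * factor + offset

result = s.apply(scale_and_shift, args=(2,), offset=1)
print(result.tolist())  # [41, 43, 25]
```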

4.2 pandas.DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)

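| Applies a function along an axis of a DataFrame

Parameters:

func: the function applied to each column or row
axis: 0 applies func to each column, 1 to each row; default 0
raw: boolean; default False

False: each row or column is passed to func as a Series
True: each row or column is passed to func as an ndarray

args: tuple of positional arguments passed to func
kwds: keyword arguments passed to func
result_type: {'expand', 'reduce', 'broadcast', None}; default None

Only takes effect when axis=1
'expand': list-like results are turned into columns
'reduce': returns a Series where possible instead of expanding list-like results; the opposite of 'expand'
'broadcast': results are broadcast to the original shape of the DataFrame, keeping the original index and columns

Returns:

Series or DataFrame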
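To make the raw parameter concrete, this small sketch checks what kind of object func receives with and without raw=True. The frame name `frame` and the lambda are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

# Default (raw=False): each column reaches func as a Series
default_is_array = frame.apply(lambda col: isinstance(col, np.ndarray))

# raw=True: each column reaches func as a plain ndarray
raw_is_array = frame.apply(lambda col: isinstance(col, np.ndarray), raw=True)

print(default_is_array.tolist())  # [False, False]
print(raw_is_array.tolist())      # [True, True]
```

Because an ndarray carries no labels, a function used with raw=True has to work on positions rather than on index or column names.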

df = pd.DataFrame([[4,9]]*3, columns=['A', 'B'])

# Default axis=0: func is applied to each column
print(df.apply(np.sqrt))
# results
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

# axis = 0 --- columns
print(df.apply(np.sum,axis=0))
# results
A    12
B    27
dtype: int64

# axis = 1 --- rows
print(df.apply(np.sum, axis=1))
# results
0    13
1    13
2    13
dtype: int64

Understanding the result_type parameter

# Default None: the returned Series holds list elements
print(df.apply(lambda x : [1, 2], axis=1))
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

# 'expand': spreads each list across the columns of a DataFrame (the result is a DataFrame)
print(df.apply(lambda x : [1, 2], axis=1, result_type='expand'))
   0  1
0  1  2
1  1  2
2  1  2

# 'reduce': the opposite of 'expand'; list-like results stay as elements of a Series
print(df.apply(lambda x : [1, 2], axis=1, result_type='reduce'))
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

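# 'broadcast': keeps the original column and row labels, i.e. the result takes the shape of the original DataFrame.
# Note: what func returns may be smaller than or equal to the DataFrame's shape, but not larger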
df.apply(lambda x: 1, axis = 1, result_type = 'broadcast')
Out[40]: 
   A  B
0  1  1
1  1  1
2  1  1
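A typical use of 'expand' is to compute several values per row and get them back as separate columns. The following is a sketch under that assumption; the column names "total" and "product" are invented for the example.

```python
import pandas as pd

frame = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

# Each row yields a two-element list; 'expand' spreads it over two columns
expanded = frame.apply(
    lambda row: [row["A"] + row["B"], row["A"] * row["B"]],
    axis=1,
    result_type="expand",
)
expanded.columns = ["total", "product"]  # hypothetical column names

print(expanded["total"].tolist())    # [13, 13, 13]
print(expanded["product"].tolist())  # [36, 36, 36]
```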

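Applying it to the simulated data

def apply_age(x,bias):
    return x + bias

# Note: df here is the simulated DataFrame from section 2, not the small A/B frame defined above
df4 = df.copy()
# Each value of df4["age"] is passed as the first argument of apply_age; args supplies the second
df4["age"] = df4["age"].apply(apply_age,args=(-3,))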

df4.head()
Out[47]: 
    height  weight  smoker  gender  age  color
0      179      83    True    Male   17    red
1      178      67    True  Female   38  white
2      177      76    True  Female   33    red
3      175      62   False    Male   46  white
4      168      63   False    Male   38    red


# Compute the body mass index (BMI): weight / height squared (kg/m^2)
def BMI(x):
    # x is a Series holding one row
    weight = x["weight"]
    height = x["height"] / 100
    BMI = weight / (height **2)

    return BMI

df5 = df.copy()
# With axis=1, each row of df5 is passed to BMI as the argument x;
# every row arrives as a Series
df5["BMI"] = df5.apply(BMI,axis=1)
df5.head()
Out[48]: 
    height  weight  smoker  gender  age  color        BMI
0      179      83    True    Male   20    red  25.904310
1      178      67    True  Female   41  white  21.146320
2      177      76    True  Female   36    red  24.258674
3      175      62   False    Male   49  white  20.244898
4      168      63   False    Male   41    red  22.321429

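Summary of apply on a DataFrame:
1. With axis=0 the function is run on each column; with axis=1, on each row.
2. Whether axis=0 or axis=1, each column or row is passed to the function as a Series by default; set raw=True to pass a numpy array instead.
3. The results for the individual Series are combined and returned (the function must return a value for there to be a result).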
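Point 3 can be illustrated with a short sketch: how the per-row results are combined depends on what the function returns. The frame `people` and the key "height_m" are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the simulated data
people = pd.DataFrame({"height": [179, 178], "weight": [83, 67]})

# A scalar per row is combined into a Series
scalar_result = people.apply(
    lambda row: row["weight"] / (row["height"] / 100) ** 2, axis=1
)

# A Series per row is combined into a DataFrame, one column per key
series_result = people.apply(
    lambda row: pd.Series({"height_m": row["height"] / 100, "weight": row["weight"]}),
    axis=1,
)

print(type(scalar_result).__name__)  # Series
print(type(series_result).__name__)  # DataFrame
print(list(series_result.columns))   # ['height_m', 'weight']
```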

5. pandas.DataFrame.applymap(func)

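| Applies func to every element of the DataFrame. Since pandas 2.1, applymap is deprecated in favor of DataFrame.map, which has the same element-wise behavior.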

df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])

df.applymap(lambda x: len(str(x)))
Out[53]: 
   0  1
0  3  4
1  5  5

df.applymap(lambda x: x**2)
Out[54]: 
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
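Since newer pandas releases (2.1 and later) offer DataFrame.map as the element-wise method and deprecate applymap, code that has to run on both old and new versions can pick whichever is available. This is one possible pattern rather than an official recipe; the name `elementwise` is an assumption of the sketch.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2.12], [3.356, 4.567]])

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap
elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap

lengths = elementwise(lambda x: len(str(x)))
print(lengths.values.tolist())  # [[3, 4], [5, 5]]
```

The first column holds floats (1 is stored as 1.0), which is why its string length is 3.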
