实际应用中通常需要基于数据框已有列,计算产生新的数据列。本文通过基于计算三列的结果增加列示例进行讲解。
数据框apply函数
df['all_matching'] = df.apply(lambda x: x.col1 == x.col2 == x.col3, axis = 1)
上述代码创建新all_matching 列,它的值为布尔类型,当三列相等为True,否则为False。
增加计算列示例
首先定义示例数据:
import pandas as pd
#create DataFrame
df = pd.DataFrame({'A': [4, 0, 3, 3, 6, 8, 7, 9, 12],
'B': [4, 2, 3, 5, 6, 4, 7, 7, 12],
'C': [4, 0, 3, 5, 5, 10, 7, 9, 12]})
#view DataFrame
print(df)
# A B C
# 0 4 4 4
# 1 0 2 0
# 2 3 3 3
# 3 3 5 5
# 4 6 6 5
# 5 8 4 10
# 6 7 7 7
# 7 9 7 9
# 8 12 12 12
下面基于A,B,C三列创建新的列all_matching :
#create new column that displays whether or not all column values match
df['all_matching'] = df.apply(lambda x: x.A == x.B == x.C, axis = 1)
#view updated DataFrame
print(df)
# A B C all_matching
# 0 4 4 4 True
# 1 0 2 0 False
# 2 3 3 3 True
# 3 3 5 5 False
# 4 6 6 5 False
# 5 8 4 10 False
# 6 7 7 7 True
# 7 9 7 9 False
# 8 12 12 12 True
计算行均值和汇总
既然apply参数为lambda表达式,当然可以自定义计算逻辑:
df['sum'] = df.apply(lambda x : x.A+x.B+x.C, axis=1)
df['avg'] = df.apply(lambda x : (x.A+x.B+x.C)/3.0, axis=1)
print(df)
# A B C sum avg
# 0 4 4 4 12 4.000000
# 1 0 2 0 2 0.666667
# 2 3 3 3 9 3.000000
# 3 3 5 5 13 4.333333
# 4 6 6 5 17 5.666667
# 5 8 4 10 22 7.333333
# 6 7 7 7 21 7.000000
# 7 9 7 9 25 8.333333
# 8 12 12 12 36 12.000000
按条件计算列
下面示例计算列名称包括Sales的列:
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Kevin', 'Evan', 'Jane', 'Kyra', 'Melissa'],
'January_Sales': [90, 95, 75, 93, 60, 85, 75],
'February_Sales': [95, 95, 75, 65, 50, 85, 100],
'March_Sales': [100, 95, 50, 75, 90, 50, 80],
'Some Random Number': [1,2,3,4,5,6,7]
})
print(df.head())
# 过滤出需要计算的列
sales_columns = [col for col in df.columns if 'Sales' in col]
print(sales_columns)
# Returns: ['January_Sales', 'February_Sales', 'March_Sales']
# 增加计算条件
df['Total Sales'] = df[sales_columns].sum(axis=1)
print(df.head())
返回结果:
Name January_Sales February_Sales March_Sales TotalSales
0 Nik 90 95 100 285
1 Kate 95 95 95 285
2 Kevin 75 75 50 200
3 Evan 93 65 75 233
4 Jane 60 50 90 200
追加汇总行
有时我们需要增加汇总行,下面示例如何增加汇总行:
# 定义元组列表
salary_of_employees = [('Amit', 2000, 2050, 1099, 2134, 2111),
('Rabi', 2122, 3022, 3456, 3111, 2109),
('Abhi', np.NaN, 2334, 2077, np.NaN, 3122),
('Naresh', 3050, 3050, 2010, 2122, 1111),
('Suman', 2023, 2232, 3050, 2123, 1099),
('Viroj', 2050, 2510, np.NaN, 3012, 2122),
('Nabin', 4000, 2000, 2050, np.NaN, 2111)]
# 通过元组列表创建数据框
test = pd.DataFrame(salary_of_employees,
columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May'])
# 设置数据框索引列
test.set_index('Name', inplace=True)
# 计算汇总行作为数据框新的行
total = test.sum()
total.name = 'Total'
# 给数据框追加新行
test = test.append(total.transpose())
print(test)
运行结果:
Jan Feb March April May
Name
Amit 2000.0 2050.0 1099.0 2134.0 2111.0
Rabi 2122.0 3022.0 3456.0 3111.0 2109.0
Abhi NaN 2334.0 2077.0 NaN 3122.0
Naresh 3050.0 3050.0 2010.0 2122.0 1111.0
Suman 2023.0 2232.0 3050.0 2123.0 1099.0
Viroj 2050.0 2510.0 NaN 3012.0 2122.0
Nabin 4000.0 2000.0 2050.0 NaN 2111.0
Total 15245.0 17198.0 13742.0 12502.0 13785.0