和鲸社区-Numpy+Pandas数据处理·闯关-关卡3

STEP1: 根据要求计算下列题目

import pandas as pd
# 读取pandas120数据文件
df = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
df2 = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
df.head()

在这里插入图片描述

1. 将salary列数据转换为最大值与最小值的平均值

# apply + 自定义函数
def fun(x):
    list1 = x.split("k-")
    a = float(list1[0])
    b = float(list1[1].strip("k"))
    res = (b + a)*0.5*1000
    return int(res)
df["salary"] = df["salary"].map(fun)
df.head()

在这里插入图片描述

2. 计算salary列最大最小值之差(极差),设置列名为ptp
计算的是:上一步求出来的salary平均值这一列的最大值与最小值之差,最终算出来这一列的极差其实是同一个数;

import numpy as np

# # 方法一:max(),min()
# arr = df["salary"].max() - df["salary"].min()
# ptp = pd.DataFrame({"ptp":[arr for i in range( len(df) )]} )
# df = pd.concat((df,ptp), axis = 1)

# 方法二:apply + lambda


# 方法三:numpy.ptp()函数
ptp = pd.DataFrame({"ptp":[np.ptp(df["salary"]) for i in range(len(df))] } )
df = pd.concat((df,ptp), axis = 1)

df.head()

/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)

在这里插入图片描述

3. 新增一列根据salary列数值大小划分为三个层次[‘低’, ‘中’, ‘高’],列名命名category;
低:(0, 5000] 中:(5000, 20000] 高:(20000, 50000]

提示:使用pd.cut(df[‘salary’], bins, labels=group_names)

df['category'] = pd.cut(df["salary"], bins = [0, 5000, 20000, 50000], labels = ["低", "中", "高"])
df.head()

#注:pandas.cut用来把一组数据分割成离散的区间

在这里插入图片描述

4. 根据createTime列,拆分两列字段:日期(年-月-日格式)、小时,分别命名date,hour;

arr1 = []
arr2 = []
for i in range(len(df)):
    temp = df.iloc[i][0].to_pydatetime()
    arr1.append(temp.strftime("%Y-%m-%d"))
    arr2.append(int(temp.strftime("%H")))
date_hour = pd.DataFrame({"date":arr1,"hour":arr2})
df = pd.concat((df,date_hour), axis = 1)

df.head()

在这里插入图片描述

5. 统计在2020-03-16这一天,每个小时的平均工资、平均极差(每行数据极差都是41500)、本科和硕士学历个数,薪资层次中高中低的个数,数据框展示的列分别有date,hour,mean_salary,mean_ptp,count_college,count_master,count_low,count_meddle,count_high;并将平均工资保留两位小数,最后按照date,hour升序排序;
⚠️mean_salary 平均工资保留两位小数
⚠️按照date,hour升序排序
⚠️请注意按照要求输出答案

import datetime
# 筛选出2020-03-16这一天的数据
temp = df[df["date"].isin(["2020-03-16"])]

# # 按照date和hour分组统计
example = temp[["date","hour"]].drop_duplicates(["date","hour"]).sort_values("hour",ascending = True)
example.index = [0,1,2]
example

# df4
df4 = temp.sort_values("hour",ascending = True)

mean_salary = df4.groupby(["date","hour"])["salary"].mean().round(2)# mean_salary
mean_ptp = df4.groupby(["date","hour"])["ptp"].mean().round(2)# mean_ptp
count_college = pd.DataFrame(df4.loc[df4.education=="本科"].groupby(["date","hour","education"]).education.count())# count_college
count_master = pd.DataFrame(df4.loc[df4.education=="硕士"].groupby(["date","hour","education"]).education.count())# count_master
count_low = pd.DataFrame(df4.loc[df4.category=="低"].groupby(["date","hour","category"]).category.count())
count_meddle = pd.DataFrame(df4.loc[df4.category=="中"].groupby(["date","hour","category"]).category.count())
count_high = pd.DataFrame(df4.loc[df4.category=="高"].groupby(["date","hour","category"]).category.count())# count_high

#数据框拼接
x = pd.merge(example, mean_salary, on = ["date","hour"])
x = pd.merge(x, mean_ptp,  on = ["date","hour"])
x = pd.merge(x, count_college, how='left', on = ["date","hour"])
x = pd.merge(x, count_master, how='left', on = ["date","hour"])

y = pd.merge(count_low, count_meddle, how='right', on = ["date","hour"])
y = pd.merge(y, count_high, how='right', on = ["date","hour"])

res = pd.merge(x, y, how='left', on = ["date","hour"])
df2 = res

# 将df2的列名修改成题目要求的列名
df2.columns = ['date', 'hour', 'mean_salary', 'mean_ptp', 'count_college', 'count_master', 'count_low', 'count_meddle', 'count_high']

# 将含有nan的列数据类型转为int
df2["count_master"] = df2["count_master"].fillna(0)
df2["count_master"] = df2["count_master"].astype("int")
df2["count_master"] = df2["count_master"].astype("str")
df2["count_low"] = df2["count_low"].fillna(0)
df2["count_low"] = df2["count_low"].astype("int")
df2["count_low"] = df2["count_low"].astype("str")

df2.head()

在这里插入图片描述

6. 将三列数据合并成一列,并设置列名为answer,同时设置索引列为id

data = pd.concat([df2.iloc[:,0],df2.iloc[:,1],df2.iloc[:,2],df2.iloc[:,3],df2.iloc[:,4],df2.iloc[:,5],df2.iloc[:,6],df2.iloc[:,7],df2.iloc[:,8]])
df3 = pd.DataFrame(data, columns=['answer'])
df3['id'] = range(len(df3))
df3 = df3[['id', 'answer']]
df3

在这里插入图片描述

STEP2: 将结果保存为 csv 文件 保存文件到本地

df3.to_csv('answer_3.csv', index=False, encoding='utf-8-sig')
  • 5
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值