和鲸社区-Numpy+Pandas数据处理·闯关-关卡3

最新推荐文章于 2024-04-19 22:00:38 发布

二喵来杯Java

最新推荐文章于 2024-04-19 22:00:38 发布

阅读量1.6k

点赞数 5

分类专栏： Python科学计算文章标签： pandas numpy python

本文链接：https://blog.csdn.net/qq_54206777/article/details/127971769

版权

Python科学计算专栏收录该内容

4 篇文章 2 订阅

订阅专栏

STEP1: 根据要求计算下列题目

import pandas as pd
# 读取pandas120数据文件
df = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
df2 = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
df.head()

在这里插入图片描述

1. 将salary列数据转换为最大值与最小值的平均值

# apply + 自定义函数
def fun(x):
    list1 = x.split("k-")
    a = float(list1[0])
    b = float(list1[1].strip("k"))
    res = (b + a)*0.5*1000
    return int(res)
df["salary"] = df["salary"].map(fun)
df.head()

在这里插入图片描述

2. 计算salary列最大最小值之差（极差），设置列名为ptp
计算的是：上一步求出来的salary平均值这一列的最大值与最小值之差，最终算出来这一列的极差其实是同一个数；

import numpy as np

# # 方法一：max()，min()
# arr = df["salary"].max() - df["salary"].min()
# ptp = pd.DataFrame({"ptp":[arr for i in range( len(df) )]} )
# df = pd.concat((df,ptp), axis = 1)

# 方法二：apply + lambda


# 方法三：numpy.ptp()函数
ptp = pd.DataFrame({"ptp":[np.ptp(df["salary"]) for i in range(len(df))] } )
df = pd.concat((df,ptp), axis = 1)

df.head()

/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
return ptp(axis=axis, out=out, **kwargs)

在这里插入图片描述

3. 新增一列根据salary列数值大小划分为三个层次[‘低’, ‘中’, ‘高’]，列名命名category；
低：(0, 5000] 中：(5000, 20000] 高：(20000, 50000]

提示：使用pd.cut(df[‘salary’], bins, labels=group_names)

df['category'] = pd.cut(df["salary"], bins = [0, 5000, 20000, 50000], labels = ["低", "中", "高"])
df.head()

#注：pandas.cut用来把一组数据分割成离散的区间

在这里插入图片描述

4. 根据createTime列，拆分两列字段:日期(年-月-日格式)、小时，分别命名date，hour；

arr1 = []
arr2 = []
for i in range(len(df)):
    temp = df.iloc[i][0].to_pydatetime()
    arr1.append(temp.strftime("%Y-%m-%d"))
    arr2.append(int(temp.strftime("%H")))
date_hour = pd.DataFrame({"date":arr1,"hour":arr2})
df = pd.concat((df,date_hour), axis = 1)

df.head()

在这里插入图片描述

5. 统计在2020-03-16这一天，每个小时的平均工资、平均极差(每行数据极差都是41500)、本科和硕士学历个数，薪资层次中高中低的个数，数据框展示的列分别有date，hour，mean_salary，mean_ptp，count_college，count_master，count_low，count_meddle，count_high；并将平均工资保留两位小数，最后按照date，hour升序排序；
⚠️mean_salary 平均工资保留两位小数
⚠️按照date，hour升序排序
⚠️请注意按照要求输出答案

import datetime
# 筛选出2020-03-16这一天的数据
temp = df[df["date"].isin(["2020-03-16"])]

# # 按照date和hour分组统计
example = temp[["date","hour"]].drop_duplicates(["date","hour"]).sort_values("hour",ascending = True)
example.index = [0,1,2]
example

# df4
df4 = temp.sort_values("hour",ascending = True)

mean_salary = df4.groupby(["date","hour"])["salary"].mean().round(2)# mean_salary
mean_ptp = df4.groupby(["date","hour"])["ptp"].mean().round(2)# mean_ptp
count_college = pd.DataFrame(df4.loc[df4.education=="本科"].groupby(["date","hour","education"]).education.count())# count_college
count_master = pd.DataFrame(df4.loc[df4.education=="硕士"].groupby(["date","hour","education"]).education.count())# count_master
count_low = pd.DataFrame(df4.loc[df4.category=="低"].groupby(["date","hour","category"]).category.count())
count_meddle = pd.DataFrame(df4.loc[df4.category=="中"].groupby(["date","hour","category"]).category.count())
count_high = pd.DataFrame(df4.loc[df4.category=="高"].groupby(["date","hour","category"]).category.count())# count_high

#数据框拼接
x = pd.merge(example, mean_salary, on = ["date","hour"])
x = pd.merge(x, mean_ptp,  on = ["date","hour"])
x = pd.merge(x, count_college, how='left', on = ["date","hour"])
x = pd.merge(x, count_master, how='left', on = ["date","hour"])

y = pd.merge(count_low, count_meddle, how='right', on = ["date","hour"])
y = pd.merge(y, count_high, how='right', on = ["date","hour"])

res = pd.merge(x, y, how='left', on = ["date","hour"])
df2 = res

# 将df2的列名修改成题目要求的列名
df2.columns = ['date', 'hour', 'mean_salary', 'mean_ptp', 'count_college', 'count_master', 'count_low', 'count_meddle', 'count_high']

# 将含有nan的列数据类型转为int
df2["count_master"] = df2["count_master"].fillna(0)
df2["count_master"] = df2["count_master"].astype("int")
df2["count_master"] = df2["count_master"].astype("str")
df2["count_low"] = df2["count_low"].fillna(0)
df2["count_low"] = df2["count_low"].astype("int")
df2["count_low"] = df2["count_low"].astype("str")

df2.head()

在这里插入图片描述

6. 将三列数据合并成一列，并设置列名为answer，同时设置索引列为id

data = pd.concat([df2.iloc[:,0],df2.iloc[:,1],df2.iloc[:,2],df2.iloc[:,3],df2.iloc[:,4],df2.iloc[:,5],df2.iloc[:,6],df2.iloc[:,7],df2.iloc[:,8]])
df3 = pd.DataFrame(data, columns=['answer'])
df3['id'] = range(len(df3))
df3 = df3[['id', 'answer']]
df3

在这里插入图片描述

STEP2: 将结果保存为 csv 文件保存文件到本地

df3.to_csv('answer_3.csv', index=False, encoding='utf-8-sig')

二喵来杯Java

关注

5
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
和鲸社区-Numpy+Pandas数据处理·闯关-关卡3

和鲸社区-Numpy+Pandas数据处理·闯关-关卡3
复制链接

扫一扫

专栏目录

和鲸社区-Numpy+Pandas数据处理·闯关-关卡3

“相关推荐”对你有帮助么？