[Practice: pandas for Beginners] Heywhale (和鲸) Community Numpy+Pandas Data Processing Challenge, Level 3: Data Processing, Merging, and Grouping

盛迪嘉白

Published 2024-04-20 17:01:47


Tags: pandas numpy

Article link: https://blog.csdn.net/xiaoqian19/article/details/138004138


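This article shows how to work with Excel data in Pandas: computing statistics on the salary column, handling dates and times, and computing grouped statistics under a filter condition, as a practical example of data processing in Python.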


Solve the following tasks as required.

import numpy as np
import pandas as pd

# Read the pandas120 data file
df = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')

df.head()

1. Convert each value in the salary column to the average of its maximum and minimum

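# str.extract + regular expression: pull the upper and lower bounds out of strings of the form '<min>k-<max>k'
# Raw strings avoid invalid-escape warnings for \d; expand=False returns a Series instead of a one-column DataFrame
df['max_salary'] = df['salary'].str.extract(r'\d+k-(\d+)k', expand=False).astype(int) * 1000
df['min_salary'] = df['salary'].str.extract(r'(\d+)k-\d+k', expand=False).astype(int) * 1000
df['salary'] = ((df['max_salary'] + df['min_salary']) / 2).astype(int)

df.head()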
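
As a rough check of the extraction step, here is a minimal sketch on made-up data. The three salary strings are hypothetical and only mimic the `<min>k-<max>k` format that the regular expression expects. It uses named groups so a single `str.extract` call returns both bounds, which is a variation on the solution above and not what the original code does.

```python
import pandas as pd

# Hypothetical sample values in the '<min>k-<max>k' format (not taken from pandas120.xlsx)
sample = pd.DataFrame({'salary': ['20k-35k', '8k-10k', '3k-5k']})

# Named groups become column names, so one extract call yields both bounds
bounds = (
    sample['salary']
    .str.extract(r'(?P<min_salary>\d+)k-(?P<max_salary>\d+)k')
    .astype(int) * 1000
)

# Midpoint of each range, truncated to an integer as in the solution above
sample['salary'] = ((bounds['min_salary'] + bounds['max_salary']) / 2).astype(int)
print(sample['salary'].tolist())  # [27500, 9000, 4000]
```

Rows that do not match the pattern would come back as NaN and make `astype(int)` fail, so this sketch assumes every row follows the format.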

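2. Compute the range (maximum minus minimum) of the salary column and store it in a column named ptp

The range here is taken over the averaged salary column from the previous step, so it is a single number, and every row of the ptp column holds that same value.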

# Method 1: max() and min()
#df['ptp'] = df['salary'].max() - df['salary'].min()
# Method 2: apply + lambda (same result, but max and min are recomputed for every row)
#df['ptp'] = df['salary'].apply(lambda x: df['salary'].max() - df['salary'].min())
# Method 3: numpy.ptp()
df['ptp'] = np.ptp(df['salary'])
df.head()
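
The equivalence of the methods can be sketched on a few made-up numbers. The values below are hypothetical; the point is only that `max() - min()` and `np.ptp` agree, and that assigning the resulting scalar to a column repeats it on every row.

```python
import numpy as np
import pandas as pd

# Hypothetical salary values, not taken from the exercise data
demo = pd.DataFrame({'salary': [27500, 9000, 4000]})

ptp_minmax = demo['salary'].max() - demo['salary'].min()
ptp_numpy = np.ptp(demo['salary'].to_numpy())
print(ptp_minmax, ptp_numpy)  # 23500 23500

# Assigning a scalar broadcasts it to every row of the new column
demo['ptp'] = ptp_numpy
print(demo['ptp'].tolist())  # [23500, 23500, 23500]
```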

3. Add a column named category that bins the salary values into three levels, ['低', '中', '高'] (low, medium, high):

低 (low): (0, 5000]
中 (medium): (5000, 20000]
高 (high): (20000, 50000]

Hint: use pd.cut(df['salary'], bins, labels=group_names)

bins = [0, 5000, 20000, 50000]
group_names = ['低', '中', '高']
df['category'] = pd.cut(df['salary'], bins=bins, labels=group_names)
df
# Note: pandas.cut splits a set of values into discrete intervals (right-closed by default)
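
To see how the interval edges behave, the following sketch runs `pd.cut` on a handful of hypothetical boundary values. With the default `right=True`, each bin includes its right edge and excludes its left edge, which matches the `(0, 5000]` notation in the task; values outside all bins become NaN.

```python
import pandas as pd

bins = [0, 5000, 20000, 50000]
group_names = ['低', '中', '高']

# Hypothetical values placed on and next to the bin edges
edges = pd.Series([5000, 5001, 20000, 50000])
print(pd.cut(edges, bins=bins, labels=group_names).tolist())
# ['低', '中', '中', '高']

# 0 sits on the excluded left edge and 60000 is above the last bin: both become NaN
outside = pd.cut(pd.Series([0, 60000]), bins=bins, labels=group_names)
print(outside.isna().tolist())  # [True, True]
```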

4. Split the createTime column into two fields, the date (in year-month-day format) and the hour, named date and hour respectively

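# Convert the createTime column to datetime
df['createTime'] = pd.to_datetime(df['createTime'])

# Wrong!
#df['date']=df['createTime'].dt.strftime('%Y-%m-%d')  # mind the letter case!
#df['hour']=df['createTime'].dt.strftime('%H:%M:%S').str.slice(0,2).str.lstrip('0')
# .str.slice(0, 2) takes the first two characters (the hour part); .str.lstrip('0') was meant to
# strip the leading zero, since otherwise the final check reports "3回答错误" (a wrong-answer error).
# The result is still a string, though, and the hour '00' is reduced to an empty string.

df['date'] = df['createTime'].dt.strftime('%Y-%m-%d')  # mind the letter case: %Y, %m, %d
df['hour'] = df['createTime'].dt.hour  # integer hour, 0-23

df.head()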
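
The difference between the two ways of getting the hour can be illustrated on a few hypothetical timestamps. This is only a sketch of why the string route is fragile: it returns strings, and stripping zeros from `'00'` leaves nothing, while `.dt.hour` returns plain integers.

```python
import pandas as pd

# Hypothetical timestamps, including one just after midnight
ts = pd.to_datetime(pd.Series([
    '2020-03-16 00:15:00',
    '2020-03-16 09:30:00',
    '2020-03-16 11:05:42',
]))

date = ts.dt.strftime('%Y-%m-%d')                 # strings in year-month-day format
hour = ts.dt.hour                                 # integers
hour_str = ts.dt.strftime('%H').str.lstrip('0')   # strings, with '00' reduced to ''

print(hour.tolist())      # [0, 9, 11]
print(hour_str.tolist())  # ['', '9', '11']
```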

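5. For the day 2020-03-16, compute for each hour the mean salary, the mean range (the range in every row is 41500), the number of rows with a bachelor's (本科) and with a master's (硕士) degree, and the number of rows in the high, medium, and low salary levels. The resulting data frame must show the columns date, hour, mean_salary, mean_ptp, count_college, count_master, count_low, count_meddle, count_high. Round the mean salary to two decimal places, and finally sort by date and hour in ascending order.

⚠️ Round mean_salary (the mean salary) to two decimal places
⚠️ Sort by date and hour in ascending order
⚠️ Make sure the output follows the required format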

# Keep only the rows from 2020-03-16
df = df[df['createTime'].dt.date == pd.to_datetime('2020-03-16').date()]

# One-hot encode education and category
df_dummies = pd.get_dummies(df, columns=['education', 'category'])

# Group by date and hour and aggregate
# (groupby sorts the group keys in ascending order by default)
df2 = df_dummies.groupby(['date', 'hour']).agg({
    'salary': 'mean',
    'ptp': 'mean',
    'education_本科': 'sum',
    'education_硕士': 'sum',
    'category_低': 'sum',
    'category_中': 'sum',
    'category_高': 'sum'
}).reset_index()


# Round the mean salary to two decimal places
df2['salary'] = df2['salary'].round(2)

# Rename the columns of df2 to the names the task requires
df2.columns = ['date', 'hour', 'mean_salary', 'mean_ptp', 'count_college', 'count_master', 'count_low', 'count_meddle', 'count_high']
df2.head()
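
The dummy-then-aggregate pattern above can be tried on a small hypothetical table. The sketch below is a variation and not the author's code: it uses pandas named aggregation so the output columns get their final names directly, and it sorts explicitly. The four rows in `toy` are invented for illustration; `count_meddle` keeps the spelling the task requires.

```python
import pandas as pd

# Hypothetical rows for a single day (not taken from pandas120.xlsx)
toy = pd.DataFrame({
    'date': ['2020-03-16'] * 4,
    'hour': [11, 10, 10, 11],
    'salary': [30000, 9000, 4000, 15000],
    'ptp': [41500] * 4,
    'education': ['本科', '硕士', '本科', '本科'],
    # A categorical keeps all three levels as dummy columns even if one is unused
    'category': pd.Categorical(['高', '中', '低', '中'], categories=['低', '中', '高']),
})

dummies = pd.get_dummies(toy, columns=['education', 'category'])

summary = (
    dummies.groupby(['date', 'hour'], as_index=False)
    .agg(
        mean_salary=('salary', 'mean'),
        mean_ptp=('ptp', 'mean'),
        count_college=('education_本科', 'sum'),
        count_master=('education_硕士', 'sum'),
        count_low=('category_低', 'sum'),
        count_meddle=('category_中', 'sum'),
        count_high=('category_高', 'sum'),
    )
    .round({'mean_salary': 2})
    .sort_values(['date', 'hour'])
)
print(summary)
```

Named aggregation avoids the separate column-renaming step, at the cost of a longer `agg` call; either form should produce the same table.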

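6. Merge the columns into a single column named answer, and set the index column to id (the task statement says three columns; the code below stacks all nine columns of df2)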

# Stack every column of df2 end to end, in column order, into one Series
data = pd.concat([df2[col] for col in df2.columns], ignore_index=True)

# Build the answer table with a running id column
df3 = pd.DataFrame({'id': range(len(data)), 'answer': data})

df3
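
The stacking order matters for the submitted answer, so here is a small sketch on a hypothetical two-row, three-column table. Concatenating the columns one after another gives a column-major layout: all values of the first column, then all values of the second, and so on.

```python
import pandas as pd

# Hypothetical stand-in for df2 with only three columns
small = pd.DataFrame({
    'date': ['2020-03-16', '2020-03-16'],
    'hour': [10, 11],
    'mean_salary': [6500.0, 22500.0],
})

# Column-major flattening: date values first, then hour, then mean_salary
answer = pd.concat([small[col] for col in small.columns], ignore_index=True)
out = pd.DataFrame({'id': range(len(answer)), 'answer': answer})
print(out)
```

Because the stacked values mix strings, integers, and floats, the answer column ends up with the generic object dtype.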
