Pandas练习题 (二)

import numpy as np
import pandas as pd

Ex1:公司员工数据集

1. 分别只使用queryloc选出年龄不超过四十岁且工作部门为DairyBakery的男性。

# 自己写的
df.query('(age<=40)&((department == "Dairy")|(department == "Bakery"))')
df.loc[(df.age<= 40) & ((df.department == 'Dairy')|( df.department == 'Bakery'))]
df.loc[df.age<= 40 & df.department.isin(['Dairy','Bakery'])]
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
# 答案
dpt = ['Dairy', 'Bakery']
df.query("(age <= 40)&(department == @dpt)&(gender=='M')").head(3)
df.loc[(df.age<=40)&df.department.isin(dpt)&(df.gender=='M')].head(3)
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender

== 注意@的使用 ==

对于query中的字符串,如果要引用外部变量,只需在变量名前加@符号。

2. 选出员工ID号 为奇数所在行的第1、第3和倒数第2列。

df.iloc[(df.EmployeeID%2==1).values,[1,3,-2]]
birthdate_keycity_namejob_title
11/3/1957VancouverVP Stores

3. 按照以下步骤进行索引操作:

* 把后三列设为索引后交换内外两层
# 自己写的
df.set_index(['department','job_title','gender'])
df.swaplevel(0,2,axis=0)
---------------------------------------------------------------------------


TypeError: Can only swap levels on a hierarchical axis.

== 注意 这里报错的原因是因为设定index后没有保存数据 ==
正确的:

dfsw = df.set_index(['department','job_title','gender'])
dfsw.swaplevel(0,2,axis=0)
EmployeeIDbirthdate_keyagecity_name
genderjob_titledepartment
MCEOExecutive13181/3/195461Vancouver
FVP StoresExecutive13191/3/195758Vancouver
Legal CounselExecutive13201/2/195560Vancouver
#  答案
df_op = df.copy()
df_op = df_op.set_index(df_op.columns[-3:].tolist()).swaplevel(0,2,axis=0)
df_op
# df_op = df_op.set_index(list(df_op.columns[-3:])).swaplevel(0,2,axis=0)
EmployeeIDbirthdate_keyagecity_name
genderjob_titledepartment
MCEOExecutive13181/3/195461Vancouver
FVP StoresExecutive13191/3/195758Vancouver
Legal CounselExecutive13201/2/195560Vancouver
* 恢复中间层索引
df_op.reset_index()
genderjob_titledepartmentEmployeeIDbirthdate_keyagecity_name
0MCEOExecutive13181/3/195461Vancouver
1FVP StoresExecutive13191/3/195758Vancouver
2FLegal CounselExecutive13201/2/195560Vancouver
df_op = df_op.reset_index(level=2)
# 注意这样的用法
* 修改外层索引名为Gender
df_op = df_op.rename_axis(index={'gender':'Gender'})
df_op
departmentEmployeeIDbirthdate_keyagecity_name
Genderjob_title
MCEOExecutive13181/3/195461Vancouver
FVP StoresExecutive13191/3/195758Vancouver
Legal CounselExecutive13201/2/195560Vancouver
* 用下划线合并两层行索引
# 自己写的
df_temp = df.set_index(df.columns[-3:].tolist()).swaplevel(0,2,axis=0)
df_temp
df_index = df_temp.index.map(lambda x: (x[0]+'_'+x[1]+'_'+x[2]))
df_temp.index = df_index
df_temp
EmployeeIDbirthdate_keyagecity_name
M_CEO_Executive13181/3/195461Vancouver
F_VP Stores_Executive13191/3/195758Vancouver
F_Legal Counsel_Executive13201/2/195560Vancouver
df_op.index = df_op.index.map(lambda x:'_'.join(x))
df_op
departmentEmployeeIDbirthdate_keyagecity_name
M_CEOExecutive13181/3/195461Vancouver
F_VP StoresExecutive13191/3/195758Vancouver
F_Legal CounselExecutive13201/2/195560Vancouver

== 注意 str.join()的用法

* 把行索引拆分为原状态
df_temp = df_op.copy()
# 自己写的
df_temp.reset_index()
indexdepartmentEmployeeIDbirthdate_keyagecity_name
0M_CEOExecutive13181/3/195461Vancouver
1F_VP StoresExecutive13191/3/195758Vancouver
2F_Legal CounselExecutive13201/2/195560Vancouver
# 答案
df_temp = df_op.copy()
df_temp.index = df_temp.index.map(lambda x: tuple(x.split('_')))
df_op = df_temp

== 注意 ==
str.split()的用法

* 修改索引名为原表名称
df_temp = df_op.copy()
df_temp = df_temp.rename_axis(index = ['Gender','job_title'])

df_op = df_op.rename_axis(index=['gender', 'job_title'])
df_op.shape
df_temp = df_op.copy()
df_temp
departmentEmployeeIDbirthdate_keyagecity_name
genderjob_title
MCEOExecutive13181/3/195461Vancouver
FVP StoresExecutive13191/3/195758Vancouver
Legal CounselExecutive13201/2/195560Vancouver
* 恢复默认索引并将列保持为原表的相对位置
# 自己写的
df_existed = pd.DataFrame(index = np.arange(df_op.shape[0]),columns=['EmployeeID','birthdate_key','age','city_name','department','job_title','gender'])
df_temp.reindex_like (df_existed)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

== 注意 这里报错的原因 ==

使用 reindex_like必须保证index的列不能改变,但是可以改变列的行数。

函数名作用
set_index设置某一列为index
reset_indexset_index作用相反
reindex增加index的行数,去除某列
reindex_likereindex作用相同
df_existed = pd.DataFrame(index = np.arange(df_op.shape[0]),columns=['EmployeeID','birthdate_key','age','city_name','department','job_title','gender'])
df_temp.reset_index().reindex_like (df_existed)
EmployeeIDbirthdate_keyagecity_namedepartmentjob_titlegender
013181/3/195461VancouverExecutiveCEOM
113191/3/195758VancouverExecutiveVP StoresF
213201/2/195560VancouverExecutiveLegal CounselF
# 答案
df.columns  #Index(['EmployeeID', 'birthdate_key', 'age', 'city_name', 'department','job_title', 'gender'],dtype='object')
df_op = df_op.reset_index().reindex(df.columns, axis=1)
df_op
df_op.equals(df)
True

注意 .equals 的用法

Ex2:巧克力数据集

现有一份关于巧克力评价的数据集:

data = [['A. Morin',2016,'63%','France',3.75],['A. Morin',2015,'70%','France',2.75],['A. Morin',2015,'70%','France',3.00]]
df = pd.DataFrame(data = data,index = np.arange(3),columns=['Company','Review Date','Cocoa Percent','Company Location','Rating'])
# df.to_csv('.//chocolate.csv')
df
CompanyReview DateCocoa PercentCompany LocationRating
0A. Morin201663%France3.75
1A. Morin201570%France2.75
2A. Morin201570%France3.00

1. 把列索引名中的\n替换为空格。

df.columns
new_columns = [' '.join(i.split('\n')) for i in df.columns]
df.columns = new_columns

2. 巧克力Rating评分为1至5,每0.25分一档,请选出2.75分及以下且可可含量Cocoa Percent高于中位数的样本。

df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x: float(x[:,-1])/100)

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

TypeError: string indices must be integers

== 注意 错误的原因 ==
float(x[:,-1]这里错了 ,应该是 float(x[:-1]或者`float(x[0:-1]’

df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x: float(x[0:-1])/100)
df.query(' Rating<= 2.75 &  Cocoa Percent >= Cocoa Percent.median')
SyntaxError: Python keyword not valid identifier in numexpr query

== 注意 ==
再query中对于有 空格 的属性,不能直接用,要加上`Cocoa Percent`来使用

不是单引号 !!!

df.query('( Rating<= 2.75 )& (`Cocoa Percent`>=`Cocoa Percent`.median())')
CompanyReview DateCocoa PercentCompany LocationRating
1A. Morin20150.7France2.75

3. 将Review DateCompany Location设为索引后,选出Review Date在2012年之后且Company Location不属于France, Canada, Amsterdam, Belgium的样本。

df.set_index(['Review Date','Company Location'])
CompanyCocoa PercentRating
Review DateCompany Location
2016FranceA. Morin0.633.75
2015FranceA. Morin0.702.75
FranceA. Morin0.703.00
Location = ['France', 'Canada', 'Amsterdam', 'Belgium']
df.query('(`Review Date`>2012) & (!`Company Location`==@Location)')
SyntaxError: invalid syntax
idx = pd.IndexSlice
exclude = ['France', 'Canada', 'Amsterdam', 'Belgium']
res = df.set_index(['Review Date', 'Company Location']).sort_index(level=0)
res.loc[idx[2012:,~res.index.get_level_values(1).isin(exclude)],:].head(3)

== 注意 ==
索引是不能用 query的方法查找的。
get_level_values
sort_index
了解这些方法

  • 3
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 3
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值