这十套练习,教你如何用Pandas做数据分析

目录

练习题索引

对应的数据集文件路径查看

练习1-开始了解你的数据

探索Chipotle快餐数据

步骤1 导入必要的库

步骤2 从如下地址导入数据集

步骤3 将数据集存入一个名为chipo的数据框内

步骤4 查看前10行内容

步骤6 数据集中有多少个列(columns)

步骤7 打印出全部的列名称

步骤8 数据集的索引是怎样的

步骤9 被下单数最多商品(item)是什么?

步骤10 在item_name这一列中,一共有多少种商品被下单?

步骤11 在choice_description中,下单次数最多的商品是什么?

步骤12 一共有多少商品被下单?

步骤13 将item_price转换为浮点数

步骤14 在该数据集对应的时期内,收入(revenue)是多少

步骤15 在该数据集对应的时期内,一共有多少订单?

步骤16 每一单(order)对应的平均总价是多少?

步骤17 一共有多少种不同的商品被售出?

练习2-数据过滤与排序

探索2012欧洲杯数据

步骤1 - 导入必要的库

步骤2 - 从以下地址导入数据集

步骤3 - 将数据集命名为euro12

步骤4 只选取 Goals 这一列

步骤5 有多少球队参与了2012欧洲杯?

步骤6 该数据集中一共有多少列(columns)?

步骤7 将数据集中的列Team, Yellow Cards和Red Cards单独存为一个名叫discipline的数据框

步骤8 对数据框discipline按照先Red Cards再Yellow Cards进行排序

步骤9 计算每个球队拿到的黄牌数的平均值

步骤10 找到进球数Goals超过6的球队数据

步骤11 选取以字母G开头的球队数据

步骤12 选取前7列

步骤13 选取除了最后3列之外的全部列

步骤14 找到英格兰(England)、意大利(Italy)和俄罗斯(Russia)的射正率(Shooting Accuracy)

练习3-数据分组

探索酒类消费数据

步骤1 导入必要的库

步骤2 从以下地址导入数据

步骤3 将数据框命名为drinks

步骤4 哪个大陆(continent)平均消耗的啤酒(beer)更多?

步骤5 打印出每个大陆(continent)的红酒消耗(wine_servings)的描述性统计值

步骤6 打印出每个大陆每种酒类别的消耗平均值

步骤7 打印出每个大陆每种酒类别的消耗中位数

步骤8 打印出每个大陆对spirit饮品消耗的平均值,最大值和最小值

练习4-Apply函数

探索1960 - 2014 美国犯罪数据

步骤1 导入必要的库

步骤2 从以下地址导入数据集

步骤3 将数据框命名为crime

步骤4 每一列(column)的数据类型是什么样的?

步骤5 将Year的数据类型转换为 datetime64

步骤6 将列Year设置为数据框的索引

步骤7 删除名为Total的列

步骤8 按照Year对数据框进行分组并求和

步骤9 何时是美国历史上生存最危险的年代?

练习5-合并

探索虚拟姓名数据

步骤1 导入必要的库

步骤2 按照如下的元数据内容创建数据框

步骤3 将上述的数据框分别命名为data1, data2, data3

步骤4 将data1和data2两个数据框按照行的维度进行合并,命名为all_data

步骤5 将data1和data2两个数据框按照列的维度进行合并,命名为all_data_col

步骤6 打印data3

步骤7 按照subject_id的值对all_data和data3作合并

步骤8 对data1和data2按照subject_id作连接

步骤9 找到 data1 和 data2 合并之后的所有匹配结果

练习6-统计

探索风速数据

步骤1 导入必要的库

步骤2 从以下地址导入数据

步骤3 将数据作存储并且设置前三列为合适的索引

步骤4 2061年?我们真的有这一年的数据?创建一个函数并用它去修复这个bug

步骤5 将日期设为索引,注意数据类型,应该是datetime64[ns]

步骤6 对应每一个location,一共有多少数据值缺失

步骤7 对应每一个location,一共有多少完整的数据值

步骤8 对于全体数据,计算风速的平均值

步骤9 创建一个名为loc_stats的数据框去计算并存储每个location的风速最小值,最大值,平均值和标准差

步骤10 创建一个名为day_stats的数据框去计算并存储所有location的风速最小值,最大值,平均值和标准差

步骤11 对于每一个location,计算一月份的平均风速

步骤12 对于数据记录按照年为频率取样

步骤13 对于数据记录按照月为频率取样

练习7-可视化

探索泰坦尼克灾难数据

步骤1 导入必要的库

步骤2 从以下地址导入数据

步骤3 将数据框命名为titanic

步骤4 将PassengerId设置为索引

步骤5 绘制一个展示男女乘客比例的扇形图

步骤6 绘制一个展示船票Fare, 与乘客年龄和性别的散点图

步骤7 有多少人生还?

步骤8 绘制一个展示船票价格的直方图

练习8-创建数据框

探索Pokemon数据

步骤1 导入必要的库

步骤2 创建一个数据字典

步骤3 将数据字典存为一个名叫pokemon的数据框中

步骤4 数据框的列排序是字母顺序,请重新修改为name, type, hp, evolution, pokedex这个顺序

步骤5 添加一个列place

步骤6 查看每个列的数据类型

练习9-时间序列

探索Apple公司股价数据

步骤1 导入必要的库

步骤2 数据集地址

步骤3 读取数据并存为一个名叫apple的数据框

步骤4 查看每一列的数据类型

步骤5 将Date这个列转换为datetime类型

步骤6 将Date设置为索引

步骤7 有重复的日期吗?

步骤8 将index设置为升序

步骤9 找到每个月的最后一个交易日(business day)

步骤10 数据集中最早的日期和最晚的日期相差多少天?

步骤11 在数据中一共有多少个月?

步骤12 按照时间顺序可视化Adj Close值

练习10-删除数据

探索Iris纸鸢花数据

步骤1 导入必要的库

步骤2 数据集地址

步骤3 将数据集存成变量iris

步骤4 创建数据框的列名称

步骤5 数据框中有缺失值吗?

步骤6 将列petal_length的第10到19行设置为缺失值

步骤7 将缺失值全部替换为1.0

步骤8 删除列class

步骤9 将数据框前三行设置为缺失值

步骤10 删除有缺失值的行

步骤11 重新设置索引

结语


Pandas是入门Python做数据分析所必须要掌握的一个库。本文内容由科赛网翻译整理自Github,建议读者完成科赛网 从零上手Python关键代码Pandas基础命令速查表 教程学习的之后,点击本篇Notebook右上角的 Fork 按钮对本教程代码进行调试学习。

转载本文请联系 科赛网 取得授权科赛网 是聚合数据人才和行业问题的在线社区,率先打造国内首款K-Lab 在线数据分析协作平台,为数据工作者的学习与工作带来全新的体验。

练习题索引

点击习题编号即可跳转至习题内容。

习题编号内容相应数据集
练习1 - 开始了解你的数据探索Chipotle快餐数据chipotle.tsv
练习2 - 数据过滤与排序探索2012欧洲杯数据Euro2012_stats.csv
练习3 - 数据分组探索酒类消费数据drinks.csv
练习4 -Apply函数探索1960 - 2014 美国犯罪数据US_Crime_Rates_1960_2014.csv
练习5 - 合并探索虚拟姓名数据练习中手动内置的数据
练习6 - 统计探索风速数据wind.data
练习7 - 可视化探索泰坦尼克灾难数据train.csv
练习8 - 创建数据框探索Pokemon数据练习中手动内置的数据
练习9 - 时间序列探索Apple公司股价数据Apple_stock.csv
练习10 - 删除数据探索Iris纸鸢花数据iris.csv

对应的数据集文件路径查看

ls ../input/pandas_exercise/exercise_data/
Apple_stock.csv  drinks.csv          second_cars_info.csv          wechart.csv
cars.csv         Euro2012_stats.csv  train.csv                     wind.data
chipotle.tsv     iris.csv            US_Crime_Rates_1960_2014.csv

练习1-开始了解你的数据

探索Chipotle快餐数据

image description

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import pandas as pd

步骤2 从如下地址导入数据集

# 运行以下代码
path1 = "../input/pandas_exercise/exercise_data/chipotle.tsv"    # chipotle.tsv

步骤3 将数据集存入一个名为chipo的数据框内

# 运行以下代码
chipo = pd.read_csv(path1, sep = '\t')

步骤4 查看前10行内容

# 运行以下代码
chipo.head(10)
 order_idquantityitem_namechoice_descriptionitem_price
011Chips and Fresh Tomato SalsaNaN$2.39
111Izze[Clementine]$3.39
211Nantucket Nectar[Apple]$3.39
311Chips and Tomatillo-Green Chili SalsaNaN$2.39
422Chicken Bowl[Tomatillo-Red Chili Salsa (Hot), [Black Beans...$16.98
531Chicken Bowl[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...$10.98
631Side of ChipsNaN$1.69
741Steak Burrito[Tomatillo Red Chili Salsa, [Fajita Vegetables...$11.75
841Steak Soft Tacos[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...$9.25
951Steak Burrito[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...$9.25

步骤6 数据集中有多少个列(columns)

# 运行以下代码
chipo.shape[1]
5

步骤7 打印出全部的列名称

# 运行以下代码
chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

步骤8 数据集的索引是怎样的

# 运行以下代码
chipo.index
RangeIndex(start=0, stop=4622, step=1)

步骤9 被下单数最多商品(item)是什么?

# 运行以下代码,做了修正
c = chipo[['item_name','quantity']].groupby(['item_name'],as_index=False).agg({'quantity':sum})
c.sort_values(['quantity'],ascending=False,inplace=True)
c.head()
 item_namequantity
17Chicken Bowl761
18Chicken Burrito591
25Chips and Guacamole506
39Steak Burrito386
10Canned Soft Drink351

步骤10 在item_name这一列中,一共有多少种商品被下单?

# 运行以下代码
chipo['item_name'].nunique()
50

步骤11 在choice_description中,下单次数最多的商品是什么?

# 运行以下代码,存在一些小问题
chipo['choice_description'].value_counts().head()
[Diet Coke]                                                                          134
[Coke]                                                                               123
[Sprite]                                                                              77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]     40
Name: choice_description, dtype: int64

步骤12 一共有多少商品被下单?

# 运行以下代码
total_items_orders = chipo['quantity'].sum()
total_items_orders
4972

步骤13 将item_price转换为浮点数

# 运行以下代码
dollarizer = lambda x: float(x[1:-1])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)

步骤14 在该数据集对应的时期内,收入(revenue)是多少

# 运行以下代码,已经做更正
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()
39237.02

步骤15 在该数据集对应的时期内,一共有多少订单?

# 运行以下代码
chipo['order_id'].nunique()
1834

步骤16 每一单(order)对应的平均总价是多少?

# 运行以下代码,已经做过更正
chipo[['order_id','sub_total']].groupby(by=['order_id']
).agg({'sub_total':'sum'})['sub_total'].mean()
21.39423118865867

步骤17 一共有多少种不同的商品被售出?

# 运行以下代码
chipo['item_name'].nunique()
50

返回练习题索引

练习2-数据过滤与排序

探索2012欧洲杯数据

image description

返回练习题索引

步骤1 - 导入必要的库

# 运行以下代码
import pandas as pd

步骤2 - 从以下地址导入数据集

# 运行以下代码
path2 = "../input/pandas_exercise/exercise_data/Euro2012_stats.csv"      # Euro2012_stats.csv

步骤3 - 将数据集命名为euro12

# 运行以下代码
euro12 = pd.read_csv(path2)
euro12
 TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Saves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed CardsSubs onSubs offPlayers Used
0Croatia4131251.9%16.0%32000...1381.3%41622909916
1Czech Republic4131841.9%12.9%39000...960.1%5373870111119
2Denmark4101050.0%20.0%27100...1066.7%25388407715
3England5111850.0%17.2%40000...2288.1%4345650111116
4France3222437.9%6.5%65100...654.6%3651560111119
5Germany10323247.8%15.6%80210...1062.6%63491240151517
6Greece581830.7%19.2%32111...1365.1%67481291121220
7Italy6344543.0%7.5%110200...2074.1%1018916160181819
8Netherlands2123625.0%4.1%60200...1270.6%35303507715
9Poland2152339.4%5.2%48000...666.7%48563717717
10Portugal6224234.3%9.3%82600...1071.5%739010120141416
11Republic of Ireland171236.8%5.2%28000...1765.4%43511161101017
12Russia593122.5%12.5%59200...1077.0%34434607716
13Spain12423355.9%16.0%100010...1593.8%1028319110171718
14Sweden5171947.2%13.8%39300...861.6%35517709918
15Ukraine272621.2%6.0%38000...1376.5%48314509918

16 rows × 35 columns

步骤4 只选取 Goals 这一列

# 运行以下代码
euro12.Goals
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

步骤5 有多少球队参与了2012欧洲杯?

# 运行以下代码
euro12.shape[0]
16

步骤6 该数据集中一共有多少列(columns)?

# 运行以下代码
euro12.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team                          16 non-null object
Goals                         16 non-null int64
Shots on target               16 non-null int64
Shots off target              16 non-null int64
Shooting Accuracy             16 non-null object
% Goals-to-shots              16 non-null object
Total shots (inc. Blocked)    16 non-null int64
Hit Woodwork                  16 non-null int64
Penalty goals                 16 non-null int64
Penalties not scored          16 non-null int64
Headed goals                  16 non-null int64
Passes                        16 non-null int64
Passes completed              16 non-null int64
Passing Accuracy              16 non-null object
Touches                       16 non-null int64
Crosses                       16 non-null int64
Dribbles                      16 non-null int64
Corners Taken                 16 non-null int64
Tackles                       16 non-null int64
Clearances                    16 non-null int64
Interceptions                 16 non-null int64
Clearances off line           15 non-null float64
Clean Sheets                  16 non-null int64
Blocks                        16 non-null int64
Goals conceded                16 non-null int64
Saves made                    16 non-null int64
Saves-to-shots ratio          16 non-null object
Fouls Won                     16 non-null int64
Fouls Conceded                16 non-null int64
Offsides                      16 non-null int64
Yellow Cards                  16 non-null int64
Red Cards                     16 non-null int64
Subs on                       16 non-null int64
Subs off                      16 non-null int64
Players Used                  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB

步骤7 将数据集中的列Team, Yellow Cards和Red Cards单独存为一个名叫discipline的数据框

# 运行以下代码
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline
 TeamYellow CardsRed Cards
0Croatia90
1Czech Republic70
2Denmark40
3England50
4France60
5Germany40
6Greece91
7Italy160
8Netherlands50
9Poland71
10Portugal120
11Republic of Ireland61
12Russia60
13Spain110
14Sweden70
15Ukraine50

步骤8 对数据框discipline按照先Red Cards再Yellow Cards进行排序

# 运行以下代码
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)
 TeamYellow CardsRed Cards
6Greece91
9Poland71
11Republic of Ireland61
7Italy160
10Portugal120
13Spain110
0Croatia90
1Czech Republic70
14Sweden70
4France60
12Russia60
3England50
8Netherlands50
15Ukraine50
2Denmark40
5Germany40

步骤9 计算每个球队拿到的黄牌数的平均值

# 运行以下代码
round(discipline['Yellow Cards'].mean())
7.0

步骤10 找到进球数Goals超过6的球队数据

# 运行以下代码
euro12[euro12.Goals > 6]
 TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Saves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed CardsSubs onSubs offPlayers Used
5Germany10323247.8%15.6%80210...1062.6%63491240151517
13Spain12423355.9%16.0%100010...1593.8%1028319110171718

2 rows × 35 columns

步骤11 选取以字母G开头的球队数据

# 运行以下代码
euro12[euro12.Team.str.startswith('G')]
 TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Saves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed CardsSubs onSubs offPlayers Used
5Germany10323247.8%15.6%80210...1062.6%63491240151517
6Greece581830.7%19.2%32111...1365.1%67481291121220

2 rows × 35 columns

步骤12 选取前7列

# 运行以下代码
euro12.iloc[: , 0:7]
 TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)
0Croatia4131251.9%16.0%32
1Czech Republic4131841.9%12.9%39
2Denmark4101050.0%20.0%27
3England5111850.0%17.2%40
4France3222437.9%6.5%65
5Germany10323247.8%15.6%80
6Greece581830.7%19.2%32
7Italy6344543.0%7.5%110
8Netherlands2123625.0%4.1%60
9Poland2152339.4%5.2%48
10Portugal6224234.3%9.3%82
11Republic of Ireland171236.8%5.2%28
12Russia593122.5%12.5%59
13Spain12423355.9%16.0%100
14Sweden5171947.2%13.8%39
15Ukraine272621.2%6.0%38

步骤13 选取除了最后3列之外的全部列

# 运行以下代码
euro12.iloc[: , :-3]
 TeamGoalsShots on targetShots off targetShooting Accuracy% Goals-to-shotsTotal shots (inc. Blocked)Hit WoodworkPenalty goalsPenalties not scored...Clean SheetsBlocksGoals concededSaves madeSaves-to-shots ratioFouls WonFouls ConcededOffsidesYellow CardsRed Cards
0Croatia4131251.9%16.0%32000...01031381.3%4162290
1Czech Republic4131841.9%12.9%39000...1106960.1%5373870
2Denmark4101050.0%20.0%27100...11051066.7%2538840
3England5111850.0%17.2%40000...22932288.1%4345650
4France3222437.9%6.5%65100...175654.6%3651560
5Germany10323247.8%15.6%80210...11161062.6%63491240
6Greece581830.7%19.2%32111...12371365.1%67481291
7Italy6344543.0%7.5%110200...21872074.1%1018916160
8Netherlands2123625.0%4.1%60200...0951270.6%3530350
9Poland2152339.4%5.2%48000...083666.7%4856371
10Portugal6224234.3%9.3%82600...21141071.5%739010120
11Republic of Ireland171236.8%5.2%28000...02391765.4%43511161
12Russia593122.5%12.5%59200...0831077.0%3443460
13Spain12423355.9%16.0%100010...5811593.8%1028319110
14Sweden5171947.2%13.8%39300...1125861.6%3551770
15Ukraine272621.2%6.0%38000...0441376.5%4831450

16 rows × 32 columns

步骤14 找到英格兰(England)、意大利(Italy)和俄罗斯(Russia)的射正率(Shooting Accuracy)

# 运行以下代码
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
 TeamShooting Accuracy
3England50.0%
7Italy43.0%
12Russia22.5%

返回练习题索引

练习3-数据分组

探索酒类消费数据

image description

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import pandas as pd

步骤2 从以下地址导入数据

# 运行以下代码
path3 ='../input/pandas_exercise/exercise_data/drinks.csv'    #'drinks.csv'

步骤3 将数据框命名为drinks

# 运行以下代码
drinks = pd.read_csv(path3)
drinks.head()
 countrybeer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcoholcontinent
0Afghanistan0000.0AS
1Albania89132544.9EU
2Algeria250140.7AF
3Andorra24513831212.4EU
4Angola21757455.9AF

步骤4 哪个大陆(continent)平均消耗的啤酒(beer)更多?

# 运行以下代码
drinks.groupby('continent').beer_servings.mean()
continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64

步骤5 打印出每个大陆(continent)的红酒消耗(wine_servings)的描述性统计值

# 运行以下代码
drinks.groupby('continent').wine_servings.describe()
 countmeanstdmin25%50%75%max
continent        
AF53.016.26415138.8464190.01.02.013.00233.0
AS44.09.06818221.6670340.00.01.08.00123.0
EU45.0142.22222297.4217380.059.0128.0195.00370.0
OC16.035.62500064.5557900.01.08.523.25212.0
SA12.062.41666788.6201891.03.012.098.50221.0

步骤6 打印出每个大陆每种酒类别的消耗平均值

# 运行以下代码
drinks.groupby('continent').mean()
 beer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
continent    
AF61.47169816.33962316.2641513.007547
AS37.04545560.8409099.0681822.170455
EU193.777778132.555556142.2222228.617778
OC89.68750058.43750035.6250003.381250
SA175.083333114.75000062.4166676.308333

步骤7 打印出每个大陆每种酒类别的消耗中位数

# 运行以下代码
drinks.groupby('continent').median()
 beer_servingsspirit_servingswine_servingstotal_litres_of_pure_alcohol
continent    
AF32.03.02.02.30
AS17.516.01.01.20
EU219.0122.0128.010.00
OC52.537.08.51.75
SA162.5108.512.06.85

步骤8 打印出每个大陆对spirit饮品消耗的平均值,最大值和最小值

# 运行以下代码
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
 meanminmax
continent   
AF16.3396230152
AS60.8409090326
EU132.5555560373
OC58.4375000254
SA114.75000025302

练习4-Apply函数

探索1960 - 2014 美国犯罪数据

image description

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import numpy as np
import pandas as pd

步骤2 从以下地址导入数据集

# 运行以下代码
path4 = '../input/pandas_exercise/exercise_data/US_Crime_Rates_1960_2014.csv'    # "US_Crime_Rates_1960_2014.csv"

步骤3 将数据框命名为crime

# 运行以下代码
crime = pd.read_csv(path4)
crime.head()
 YearPopulationTotalViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
01960179323175338420028846030957009110171901078401543209121001855400328200
11961182992000348800028939031986008740172201066701567609496001913000336000
21962185771000375220030151034507008530175501108601645709943002089600366800
319631884830004109500316970379250086401765011647017421010864002297800408300
419641911410004564600364220420040093602142013039020305012132002514400472800

步骤4 每一列(column)的数据类型是什么样的?

# 运行以下代码
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.2 KB

注意到了吗,Year的数据类型为 int64,但是pandas有一个不同的数据类型去处理时间序列(time series),我们现在来看看。

步骤5 将Year的数据类型转换为 datetime64

# 运行以下代码
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.2 KB

步骤6 将列Year设置为数据框的索引

# 运行以下代码
crime = crime.set_index('Year', drop = True)
crime.head()
 PopulationTotalViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year           
1960-01-01179323175338420028846030957009110171901078401543209121001855400328200
1961-01-01182992000348800028939031986008740172201066701567609496001913000336000
1962-01-01185771000375220030151034507008530175501108601645709943002089600366800
1963-01-011884830004109500316970379250086401765011647017421010864002297800408300
1964-01-011911410004564600364220420040093602142013039020305012132002514400472800

步骤7 删除名为Total的列

# 运行以下代码
del crime['Total']
crime.head()
 PopulationViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year          
1960-01-0117932317528846030957009110171901078401543209121001855400328200
1961-01-0118299200028939031986008740172201066701567609496001913000336000
1962-01-0118577100030151034507008530175501108601645709943002089600366800
1963-01-01188483000316970379250086401765011647017421010864002297800408300
1964-01-01191141000364220420040093602142013039020305012132002514400472800
crime.resample('10AS').sum()
 PopulationViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year          
1960-01-0119150531754134930451609001061802367201633510215852013321100265477005292100
1970-01-0121211932989607930913838001922305545704159020470212028486000531578009739900
1980-01-0123713700691407432811704890020643986563953831097619130330734947204025311935411
1990-01-01261282525817527048119053499211664998827574893010568963267500157767936614624418
2000-01-0129479691171396805610094436916306892249942303668652124215651766797029111412834
2010-01-011570146307607201744095950728674210591749809376414210125170304016983569080
2020-01-010000000000

步骤8 按照Year对数据框进行分组并求和

*注意Population这一列,若直接对其求和,是不正确的**

# 更多关于 .resample 的介绍
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# 更多关于 Offset Aliases的介绍 
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# 运行以下代码
crimes = crime.resample('10AS').sum() # resample a time series per decades


# 用resample去得到“Population”列的最大值
population = crime['Population'].resample('10AS').max()

# 更新 "Population" 
crimes['Population'] = population

crimes
 PopulationViolentPropertyMurderForcible_RapeRobberyAggravated_assaultBurglaryLarceny_TheftVehicle_Theft
Year          
1960-01-01201385000.04134930451609001061802367201633510215852013321100265477005292100
1970-01-01220099000.09607930913838001922305545704159020470212028486000531578009739900
1980-01-01248239000.01407432811704890020643986563953831097619130330734947204025311935411
1990-01-01272690813.017527048119053499211664998827574893010568963267500157767936614624418
2000-01-01307006550.01396805610094436916306892249942303668652124215651766797029111412834
2010-01-01318857056.0607201744095950728674210591749809376414210125170304016983569080
2020-01-01NaN000000000

步骤9 何时是美国历史上生存最危险的年代?

# 运行以下代码
crime.idxmax(0)
Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]

返回练习题索引

练习5-合并

探索虚拟姓名数据

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import numpy as np
import pandas as pd

步骤2 按照如下的元数据内容创建数据框

# 运行以下代码
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

步骤3 将上述的数据框分别命名为data1, data2, data3

# 运行以下代码
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])

步骤4 将data1data2两个数据框按照行的维度进行合并,命名为all_data

# 运行以下代码
all_data = pd.concat([data1, data2])
all_data
 subject_idfirst_namelast_name
01AlexAnderson
12AmyAckerman
23AllenAli
34AliceAoni
45AyoungAtiches
04BillyBonder
15BrianBlack
26BranBalwner
37BryceBrice
48BettyBtisan

步骤5 将data1data2两个数据框按照列的维度进行合并,命名为all_data_col

# 运行以下代码
all_data_col = pd.concat([data1, data2], axis = 1)
all_data_col
 subject_idfirst_namelast_namesubject_idfirst_namelast_name
01AlexAnderson4BillyBonder
12AmyAckerman5BrianBlack
23AllenAli6BranBalwner
34AliceAoni7BryceBrice
45AyoungAtiches8BettyBtisan

步骤6 打印data3

# 运行以下代码
data3
 subject_idtest_id
0151
1215
2315
3461
4516
5714
6815
791
81061
91116

步骤7 按照subject_id的值对all_datadata3作合并

# 运行以下代码
pd.merge(all_data, data3, on='subject_id')
 subject_idfirst_namelast_nametest_id
01AlexAnderson51
12AmyAckerman15
23AllenAli15
34AliceAoni61
44BillyBonder61
55AyoungAtiches16
65BrianBlack16
77BryceBrice14
88BettyBtisan15

步骤8 对data1data2按照subject_id作连接

# 运行以下代码
pd.merge(data1, data2, on='subject_id', how='inner')
 subject_idfirst_name_xlast_name_xfirst_name_ylast_name_y
04AliceAoniBillyBonder
15AyoungAtichesBrianBlack

步骤9 找到 data1data2 合并之后的所有匹配结果

# 运行以下代码
pd.merge(data1, data2, on='subject_id', how='outer')
 subject_idfirst_name_xlast_name_xfirst_name_ylast_name_y
01AlexAndersonNaNNaN
12AmyAckermanNaNNaN
23AllenAliNaNNaN
34AliceAoniBillyBonder
45AyoungAtichesBrianBlack
56NaNNaNBranBalwner
67NaNNaNBryceBrice
78NaNNaNBettyBtisan

返回练习题索引

练习6-统计

探索风速数据

image description

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import pandas as pd
import datetime

步骤2 从以下地址导入数据

import pandas as pd
# 运行以下代码
path6 = "../input/pandas_exercise/exercise_data/wind.data"  # wind.data

步骤3 将数据作存储并且设置前三列为合适的索引

import datetime
# 运行以下代码
data = pd.read_table(path6, sep = "\s+", parse_dates = [[0,1,2]]) 
data.head()
 Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
02061-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.04
12061-01-0214.71NaN10.836.5012.627.6711.5010.049.799.6717.5413.83
22061-01-0318.5016.8812.3310.1311.176.1711.25NaN8.507.6712.7512.71
32061-01-0410.586.6311.754.584.542.888.631.795.835.885.4610.88
42061-01-0513.3313.2511.426.1710.718.2111.926.5410.9210.3412.9211.83

步骤4 2061年?我们真的有这一年的数据?创建一个函数并用它去修复这个bug

# 运行以下代码
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)

# data.info()
data.head()
 Yr_Mo_DyRPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
01961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.04
11961-01-0214.71NaN10.836.5012.627.6711.5010.049.799.6717.5413.83
21961-01-0318.5016.8812.3310.1311.176.1711.25NaN8.507.6712.7512.71
31961-01-0410.586.6311.754.584.542.888.631.795.835.885.4610.88
41961-01-0513.3313.2511.426.1710.718.2111.926.5410.9210.3412.9211.83

步骤5 将日期设为索引,注意数据类型,应该是datetime64[ns]

# 运行以下代码
# transform Yr_Mo_Dy it to date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')

data.head()
# data.info()
 RPTVALROSKILSHABIRDUBCLAMULCLOBELMAL
Yr_Mo_Dy            
1961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.04
1961-01-0214.71NaN10.836.5012.627.6711.5010.049.799.6717.5413.83
1961-01-0318.5016.8812.3310.1311.176.1711.25NaN8.507.6712.7512.71
1961-01-0410.586.6311.754.584.542.888.631.795.835.885.4610.88
1961-01-0513.3313.2511.426.1710.718.2111.926.5410.9210.3412.9211.83

步骤6 对应每一个location,一共有多少数据值缺失

# 运行以下代码
data.isnull().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

步骤7 对应每一个location,一共有多少完整的数据值

# 运行以下代码
data.shape[0] - data.isnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

步骤8 对于全体数据,计算风速的平均值

# 运行以下代码
data.mean().mean()
10.227982360836924

步骤9 创建一个名为loc_stats的数据框去计算并存储每个location的风速最小值,最大值,平均值和标准差

# 运行以下代码
loc_stats = pd.DataFrame()

loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max 
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations

loc_stats
 minmaxmeanstd
RPT0.6735.8012.3629875.618413
VAL0.2133.3710.6443145.267356
ROS1.5033.8411.6605265.008450
KIL0.0028.466.3064683.605811
SHA0.1337.5410.4558344.936125
BIR0.0026.167.0922543.968683
DUB0.0030.379.7973434.977555
CLA0.0031.088.4950534.499449
MUL0.0025.888.4935904.166872
CLO0.0428.218.7073324.503954
BEL0.1342.3813.1210075.835037
MAL0.6742.5415.5990796.699794

步骤10 创建一个名为day_stats的数据框去计算并存储所有location的风速最小值,最大值,平均值和标准差

# 运行以下代码
# create the dataframe
day_stats = pd.DataFrame()

# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max 
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()
 minmaxmeanstd
Yr_Mo_Dy    
1961-01-019.2918.5013.0181822.808875
1961-01-026.5017.5411.3363643.188994
1961-01-036.1718.5011.6418183.681912
1961-01-041.7911.756.6191673.198126
1961-01-056.1713.3310.6300002.445356

步骤11 对于每一个location,计算一月份的平均风速

注意,1961年的1月和1962年的1月应该区别对待

# 运行以下代码
# creates a new column 'date' and gets the values from the index
data['date'] = data.index

# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)

# gets all value from the month 1 and assign to janyary_winds
january_winds = data.query('month == 1')

# gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,'RPT':"MAL"].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64

步骤12 对于数据记录按照年为频率取样

# 运行以下代码
data.query('month == 1 and day == 1')
 RPTVALROSKILSHABIRDUBCLAMULCLOBELMALdatemonthyearday
Yr_Mo_Dy                
1961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.041961-01-01119611
1962-01-019.293.4211.543.502.211.9610.412.793.545.174.387.921962-01-01119621
1963-01-0115.5913.6219.798.3812.2510.0023.4515.7113.5914.3717.5834.131963-01-01119631
1964-01-0125.8022.1318.2113.2521.2914.7914.1219.5813.2516.7528.9621.001964-01-01119641
1965-01-019.5411.929.004.386.085.2110.256.085.718.6312.0417.411965-01-01119651
1966-01-0122.0421.5017.0812.7522.1715.5921.7918.1216.6617.8328.3323.791966-01-01119661
1967-01-016.464.466.503.216.673.7911.383.837.719.0810.6720.911967-01-01119671
1968-01-0130.0417.8816.2516.2521.7912.5418.1616.6218.7517.6222.2527.291968-01-01119681
1969-01-016.131.635.411.082.541.008.502.424.586.349.1716.711969-01-01119691
1970-01-019.592.9611.793.426.134.089.004.467.293.507.3313.001970-01-01119701
1971-01-013.710.794.710.171.421.044.630.751.541.084.219.541971-01-01119711
1972-01-019.293.6314.544.256.754.4213.005.3310.048.548.7119.171972-01-01119721
1973-01-0116.5015.9214.627.418.2911.2113.547.7910.4610.7913.379.711973-01-01119731
1974-01-0123.2116.5416.089.7515.8311.469.5413.5413.8316.6617.2125.291974-01-01119741
1975-01-0114.0413.5411.295.4612.585.588.128.969.295.177.7111.631975-01-01119751
1976-01-0118.3417.6714.838.0016.6210.1313.179.0413.135.7511.3814.961976-01-01119761
1977-01-0120.0411.9220.259.139.298.0410.755.889.009.0014.8825.701977-01-01119771
1978-01-018.337.127.713.548.507.5014.7110.0011.8310.0015.0920.461978-01-01119781

步骤13 对于数据记录按照月为频率取样

# 运行以下代码
data.query('day == 1')
 RPTVALROSKILSHABIRDUBCLAMULCLOBELMALdatemonthyearday
Yr_Mo_Dy                
1961-01-0115.0414.9613.179.29NaN9.8713.6710.2510.8312.5818.5015.041961-01-01119611
1961-02-0114.2515.129.045.8812.087.1710.173.636.505.509.178.001961-02-01219611
1961-03-0112.6713.1311.796.429.798.5410.2513.29NaN12.2120.62NaN1961-03-01319611
1961-04-018.386.348.336.759.339.5411.678.2111.216.4611.967.171961-04-01419611
1961-05-0115.8713.8815.379.7913.4610.179.9614.049.759.9218.6311.121961-05-01519611
1961-06-0115.929.5912.048.7911.546.049.758.299.3310.3410.6712.121961-06-01619611
1961-07-017.216.837.714.428.464.796.716.005.797.966.968.711961-07-01719611
1961-08-019.595.095.544.638.295.254.215.255.375.418.389.081961-08-01819611
1961-09-015.581.134.963.044.252.254.632.713.676.004.795.411961-09-01919611
1961-10-0114.2512.877.878.0013.007.755.839.007.085.2911.794.041961-10-011019611
1961-11-0113.2113.1314.338.5412.1710.2113.0812.1710.9213.5420.1720.041961-11-011119611
1961-12-019.677.758.003.966.002.757.252.505.585.587.7911.171961-12-011219611
1962-01-019.293.4211.543.502.211.9610.412.793.545.174.387.921962-01-01119621
1962-02-0119.1213.9612.2110.5815.7110.6315.7111.0813.1712.6217.6722.711962-02-01219621
1962-03-018.214.839.004.836.002.217.961.874.083.924.085.411962-03-01319621
1962-04-0114.3312.2511.8710.3714.9211.0019.7911.6714.0915.4616.6223.581962-04-01419621
1962-05-019.629.543.583.338.753.752.252.581.672.377.293.251962-05-01519621
1962-06-015.886.298.675.215.004.255.915.414.799.255.2510.711962-06-01619621
1962-07-018.674.176.926.718.175.6611.179.388.7511.1210.2517.081962-07-01719621
1962-08-014.585.376.042.297.873.714.462.584.004.797.217.461962-08-01819621
1962-09-0110.0012.0810.969.259.297.627.418.757.679.6214.5811.921962-09-01919621
1962-10-0114.587.8319.2110.0811.548.3813.2910.638.2112.9218.0518.121962-10-011019621
1962-11-0116.8813.2516.008.9613.4611.4610.4610.1710.3713.2114.8315.161962-11-011119621
1962-12-0118.3815.4111.756.7912.218.048.4210.835.669.0811.5011.501962-12-011219621
1963-01-0115.5913.6219.798.3812.2510.0023.4515.7113.5914.3717.5834.131963-01-01119631
1963-02-0115.417.6224.6711.429.218.1714.047.547.5410.0810.1717.671963-02-01219631
1963-03-0116.7519.6717.678.8719.0815.3716.2114.2911.299.2119.9219.791963-03-01319631
1963-04-0110.549.5912.467.339.469.5911.7911.879.7910.7113.3718.211963-04-01419631
1963-05-0118.7914.1713.5911.6314.1711.9614.4612.4612.8713.9615.2921.621963-05-01519631
1963-06-0113.376.8712.008.5010.049.4210.9212.9611.7911.0410.9213.671963-06-01619631
...................................................
1976-07-018.501.756.582.132.752.215.372.045.884.504.9610.631976-07-01719761
1976-08-0113.008.388.635.8312.928.2513.009.4210.5811.3414.2120.251976-08-01819761
1976-09-0111.8711.007.386.877.758.3310.346.4610.179.2912.7519.551976-09-01919761
1976-10-0110.966.7110.414.637.585.045.045.546.503.926.795.001976-10-011019761
1976-11-0113.9615.6710.296.4612.799.0810.009.6710.2111.6323.0921.961976-11-011119761
1976-12-0113.4616.429.214.5410.758.6710.884.838.795.918.8313.671976-12-011219761
1977-01-0120.0411.9220.259.139.298.0410.755.889.009.0014.8825.701977-01-01119771
1977-02-0111.839.7111.004.258.588.716.175.668.297.5811.7116.501977-02-01219771
1977-03-018.6314.8310.293.756.638.795.008.127.876.4213.5413.671977-03-01319771
1977-04-0121.6716.0017.3313.5920.8315.9625.6217.6219.4120.6724.3730.091977-04-01419771
1977-05-016.427.128.673.584.584.006.756.133.334.5019.2112.381977-05-01519771
1977-06-017.085.259.712.832.213.505.291.422.000.925.215.631977-06-01619771
1977-07-0115.4116.2917.086.2511.8311.8312.2910.5810.417.2117.377.831977-07-01719771
1977-08-014.332.964.422.330.961.084.961.872.332.0410.509.831977-08-01819771
1977-09-0117.3716.3316.838.5814.4611.8315.0913.9213.2913.8823.2925.171977-09-01919771
1977-10-0116.7515.3412.259.4216.3811.3818.5013.9214.0914.4622.3429.671977-10-011019771
1977-11-0116.7111.5412.174.178.547.1711.126.468.256.2111.0415.631977-11-011119771
1977-12-0113.3710.9212.422.375.796.138.967.386.295.718.5412.421977-12-011219771
1978-01-018.337.127.713.548.507.5014.7110.0011.8310.0015.0920.461978-01-01119781
1978-02-0127.2524.2118.1617.4627.5418.0520.9625.0420.0417.5027.7121.121978-02-01219781
1978-03-0115.046.2116.047.876.426.6712.298.0010.589.335.4117.001978-03-01319781
1978-04-013.427.582.711.383.462.082.674.754.831.677.3313.671978-04-01419781
1978-05-0110.5412.219.085.2911.0010.0811.1713.7511.8711.7912.8727.161978-05-01519781
1978-06-0110.3711.426.466.0411.257.506.465.967.795.465.5010.411978-06-01619781
1978-07-0112.4610.6311.176.7512.929.0412.429.6212.088.0414.0416.171978-07-01719781
1978-08-0119.3315.0920.178.8312.6210.419.3312.339.509.9215.7518.001978-08-01819781
1978-09-018.426.139.875.253.215.717.253.507.336.507.6215.961978-09-01919781
1978-10-019.506.8310.503.886.134.584.216.506.386.5410.6314.091978-10-011019781
1978-11-0113.5916.7511.257.0811.048.338.1711.2910.7511.2523.1325.001978-11-011119781
1978-12-0121.2916.2924.0412.7918.2119.2921.5417.2116.7117.8317.7525.701978-12-011219781

216 rows × 16 columns

返回练习题索引

练习7-可视化

探索泰坦尼克灾难数据

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

步骤2 从以下地址导入数据

# 运行以下代码
path7 = '../input/pandas_exercise/exercise_data/train.csv'  # train.csv

步骤3 将数据框命名为titanic

# 运行以下代码
titanic = pd.read_csv(path7)
titanic.head()
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

步骤4 将PassengerId设置为索引

# 运行以下代码
titanic.set_index('PassengerId').head()
 SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId           
103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen, Mr. William Henrymale35.0003734508.0500NaNS

步骤5 绘制一个展示男女乘客比例的扇形图

# 运行以下代码
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    
    # with the labels being officer names
    labels = ['Males', 'Females'],
    
    # with no shadows
    shadow = False,
    
    # with colors
    colors = ['blue','red'],
    
    # with one slide exploded out
    explode = (0.15 , 0),
    
    # with the start angle at 90%
    startangle = 90,
    
    # with the percent listed as a fraction
    autopct = '%1.1f%%'
    )

# View the plot drop above
plt.axis('equal')

# Set labels
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()

步骤6 绘制一个展示船票Fare, 与乘客年龄和性别的散点图

# 运行以下代码
# creates the plot using
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)

# set title
lm.set(title = 'Fare x Age')

# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
(-5, 85)

步骤7 有多少人生还?

# 运行以下代码
titanic.Survived.sum()
342

步骤8 绘制一个展示船票价格的直方图

# 运行以下代码
# sort the values from the top to the least value and slice the first 5 items
df = titanic.Fare.sort_values(ascending = False)
df

# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal

# create the plot
plt.hist(df, bins = binsVal)

# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Payed Histrogram')

# show the plot
plt.show()

返回练习题索引

练习8-创建数据框

探索Pokemon数据

image description

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import pandas as pd

步骤2 创建一个数据字典

# 运行以下代码
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']                        
            }

步骤3 将数据字典存为一个名叫pokemon的数据框中

# 运行以下代码
pokemon = pd.DataFrame(raw_data)
pokemon.head()
 evolutionhpnamepokedextype
0Ivysaur45Bulbasauryesgrass
1Charmeleon39Charmandernofire
2Wartortle44Squirtleyeswater
3Metapod45Caterpienobug

步骤4 数据框的列排序是字母顺序,请重新修改为name, type, hp, evolution, pokedex这个顺序

# 运行以下代码
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon
 nametypehpevolutionpokedex
0Bulbasaurgrass45Ivysauryes
1Charmanderfire39Charmeleonno
2Squirtlewater44Wartortleyes
3Caterpiebug45Metapodno

步骤5 添加一个列place

# 运行以下代码
pokemon['place'] = ['park','street','lake','forest']
pokemon
 nametypehpevolutionpokedexplace
0Bulbasaurgrass45Ivysauryespark
1Charmanderfire39Charmeleonnostreet
2Squirtlewater44Wartortleyeslake
3Caterpiebug45Metapodnoforest

步骤6 查看每个列的数据类型

# 运行以下代码
pokemon.dtypes
name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object

返回练习题索引

练习9-时间序列

探索Apple公司股价数据

image description

返回练习题索引

步骤1 导入必要的库

# 运行以下代码
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

%matplotlib inline

步骤2 数据集地址

# 运行以下代码
path9 = '../input/pandas_exercise/exercise_data/Apple_stock.csv'   # Apple_stock.csv

步骤3 读取数据并存为一个名叫apple的数据框

# 运行以下代码
apple = pd.read_csv(path9)
apple.head()
 DateOpenHighLowCloseVolumeAdj Close
02014-07-0896.2796.8093.9295.356513000095.35
12014-07-0794.1495.9994.1095.975630540095.97
22014-07-0393.6794.1093.2094.032289180094.03
32014-07-0293.8794.0693.0993.482842090093.48
42014-07-0193.5294.0793.1393.523817020093.52

步骤4 查看每一列的数据类型

# 运行以下代码
apple.dtypes
Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object

步骤5 将Date这个列转换为datetime类型

# 运行以下代码
apple.Date = pd.to_datetime(apple.Date)
apple['Date'].head()
0   2014-07-08
1   2014-07-07
2   2014-07-03
3   2014-07-02
4   2014-07-01
Name: Date, dtype: datetime64[ns]

步骤6 将Date设置为索引

# 运行以下代码
apple = apple.set_index('Date')
apple.head()
 OpenHighLowCloseVolumeAdj Close
Date      
2014-07-0896.2796.8093.9295.356513000095.35
2014-07-0794.1495.9994.1095.975630540095.97
2014-07-0393.6794.1093.2094.032289180094.03
2014-07-0293.8794.0693.0993.482842090093.48
2014-07-0193.5294.0793.1393.523817020093.52

步骤7 有重复的日期吗?

# 运行以下代码
apple.index.is_unique
True

步骤8 将index设置为升序

# 运行以下代码
apple.sort_index(ascending = True).head()
 OpenHighLowCloseVolumeAdj Close
Date      
1980-12-1228.7528.8728.7528.751172584000.45
1980-12-1527.3827.3827.2527.25439712000.42
1980-12-1625.3725.3725.2525.25264320000.39
1980-12-1725.8726.0025.8725.87216104000.40
1980-12-1826.6326.7526.6326.63183624000.41

步骤9 找到每个月的最后一个交易日(business day)

# 运行以下代码
apple_month = apple.resample('BM')
apple_month.head()
/opt/conda/lib/python3.5/site-packages/ipykernel_launcher.py:3: FutureWarning: 
.resample() is now a deferred operation
You called head(...) on this deferred object which materialized it into a dataframe
by implicitly taking the mean.  Use .resample(...).mean() instead
  This is separate from the ipykernel package so we can avoid doing imports until
 OpenHighLowCloseVolumeAdj Close
Date      
1980-12-3130.48153830.56769230.44307730.4430772.586252e+070.473077
1981-01-3031.75476231.82666731.65476231.6547627.249867e+060.493810
1981-02-2726.48000026.57210526.40789526.4078954.231832e+060.411053
1981-03-3124.93772725.01681824.83636424.8363647.962691e+060.387727
1981-04-3027.28666727.36809527.22714327.2271436.392000e+060.423333

步骤10 数据集中最早的日期和最晚的日期相差多少天?

# 运行以下代码
(apple.index.max() - apple.index.min()).days
12261

步骤11 在数据中一共有多少个月?

# 运行以下代码
apple_months = apple.resample('BM').mean()
len(apple_months.index)
404

步骤12 按照时间顺序可视化Adj Close

# 运行以下代码
# makes the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title = "Apple Stock")

# changes the size of the graph
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)

返回练习题索引

练习10-删除数据

探索Iris纸鸢花数据

步骤1 导入必要的库

# 运行以下代码
import pandas as pd

步骤2 数据集地址

# 运行以下代码
path10 ='../input/pandas_exercise/exercise_data/iris.csv'   # iris.csv

步骤3 将数据集存成变量iris

# 运行以下代码
iris = pd.read_csv(path10)
iris.head()
 5.13.51.40.2Iris-setosa
04.93.01.40.2Iris-setosa
14.73.21.30.2Iris-setosa
24.63.11.50.2Iris-setosa
35.03.61.40.2Iris-setosa
45.43.91.70.4Iris-setosa

步骤4 创建数据框的列名称

iris = pd.read_csv(path10,names = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])
iris.head()
 sepal_lengthsepal_widthpetal_lengthpetal_widthclass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa

步骤5 数据框中有缺失值吗?

# 运行以下代码
pd.isnull(iris).sum()
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64

步骤6 将列petal_length的第10到19行设置为缺失值

# 运行以下代码
iris.iloc[10:20,2:3] = np.nan
iris.head(20)
 sepal_lengthsepal_widthpetal_lengthpetal_widthclass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
55.43.91.70.4Iris-setosa
64.63.41.40.3Iris-setosa
75.03.41.50.2Iris-setosa
84.42.91.40.2Iris-setosa
94.93.11.50.1Iris-setosa
105.43.7NaN0.2Iris-setosa
114.83.4NaN0.2Iris-setosa
124.83.0NaN0.1Iris-setosa
134.33.0NaN0.1Iris-setosa
145.84.0NaN0.2Iris-setosa
155.74.4NaN0.4Iris-setosa
165.43.9NaN0.4Iris-setosa
175.13.5NaN0.3Iris-setosa
185.73.8NaN0.3Iris-setosa
195.13.8NaN0.3Iris-setosa

步骤7 将缺失值全部替换为1.0

# 运行以下代码
iris.petal_length.fillna(1, inplace = True)
iris
 sepal_lengthsepal_widthpetal_lengthpetal_widthclass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
55.43.91.70.4Iris-setosa
64.63.41.40.3Iris-setosa
75.03.41.50.2Iris-setosa
84.42.91.40.2Iris-setosa
94.93.11.50.1Iris-setosa
105.43.71.00.2Iris-setosa
114.83.41.00.2Iris-setosa
124.83.01.00.1Iris-setosa
134.33.01.00.1Iris-setosa
145.84.01.00.2Iris-setosa
155.74.41.00.4Iris-setosa
165.43.91.00.4Iris-setosa
175.13.51.00.3Iris-setosa
185.73.81.00.3Iris-setosa
195.13.81.00.3Iris-setosa
205.43.41.70.2Iris-setosa
215.13.71.50.4Iris-setosa
224.63.61.00.2Iris-setosa
235.13.31.70.5Iris-setosa
244.83.41.90.2Iris-setosa
255.03.01.60.2Iris-setosa
265.03.41.60.4Iris-setosa
275.23.51.50.2Iris-setosa
285.23.41.40.2Iris-setosa
294.73.21.60.2Iris-setosa
..................
1206.93.25.72.3Iris-virginica
1215.62.84.92.0Iris-virginica
1227.72.86.72.0Iris-virginica
1236.32.74.91.8Iris-virginica
1246.73.35.72.1Iris-virginica
1257.23.26.01.8Iris-virginica
1266.22.84.81.8Iris-virginica
1276.13.04.91.8Iris-virginica
1286.42.85.62.1Iris-virginica
1297.23.05.81.6Iris-virginica
1307.42.86.11.9Iris-virginica
1317.93.86.42.0Iris-virginica
1326.42.85.62.2Iris-virginica
1336.32.85.11.5Iris-virginica
1346.12.65.61.4Iris-virginica
1357.73.06.12.3Iris-virginica
1366.33.45.62.4Iris-virginica
1376.43.15.51.8Iris-virginica
1386.03.04.81.8Iris-virginica
1396.93.15.42.1Iris-virginica
1406.73.15.62.4Iris-virginica
1416.93.15.12.3Iris-virginica
1425.82.75.11.9Iris-virginica
1436.83.25.92.3Iris-virginica
1446.73.35.72.5Iris-virginica
1456.73.05.22.3Iris-virginica
1466.32.55.01.9Iris-virginica
1476.53.05.22.0Iris-virginica
1486.23.45.42.3Iris-virginica
1495.93.05.11.8Iris-virginica

150 rows × 5 columns

步骤8 删除列class

# 运行以下代码
del iris['class']
iris.head()
 sepal_lengthsepal_widthpetal_lengthpetal_width
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2

步骤9 将数据框前三行设置为缺失值

# 运行以下代码
iris.iloc[0:3 ,:] = np.nan
iris.head()
 sepal_lengthsepal_widthpetal_lengthpetal_width
0NaNNaNNaNNaN
1NaNNaNNaNNaN
2NaNNaNNaNNaN
34.63.11.50.2
45.03.61.40.2

步骤10 删除有缺失值的行

# 运行以下代码
iris = iris.dropna(how='any')
iris.head()
 sepal_lengthsepal_widthpetal_lengthpetal_width
34.63.11.50.2
45.03.61.40.2
55.43.91.70.4
64.63.41.40.3
75.03.41.50.2

步骤11 重新设置索引

# 运行以下代码
iris = iris.reset_index(drop = True)
iris.head()
 sepal_lengthsepal_widthpetal_lengthpetal_width
04.63.11.50.2
15.03.61.40.2
25.43.91.70.4
34.63.41.40.3
45.03.41.50.2

返回练习题索引

结语

恭喜你已经完成了这10套题目的练习。欢迎查看科赛网用户贡献的科赛项目以及科赛数据集获取更多优秀学习内容。

  • 7
    点赞
  • 65
    收藏
    觉得还不错? 一键收藏
  • 4
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值