Work has been busy lately, so my Python studies and these notes sat idle for quite a while. Setting a flag here: one exercise every two days, with the whole section finished by early November.
- Exercise 1 - Getting to know your data (completed 2021-11-02)
- Exercise 2 - Filtering and sorting (completed 2021-11-02)
- Exercise 3 - Grouping (completed 2021-11-02)
- Exercise 4 - The apply function (completed 2021-11-03)
- Exercise 5 - Merging (completed 2021-11-03)
- Exercise 6 - Statistics (completed 2021-11-03)
- Exercise 7 - Visualization (completed 2021-11-05)
- Exercise 8 - Creating DataFrames (completed 2021-11-04)
- Exercise 9 - Time series (completed 2021-11-05)
- Exercise 10 - Deleting data (completed 2021-11-04)
This section is now complete. Some problems I solved on my own; plenty of others stumped me. I plan to write up a short summary based on these ten exercises.
My current take: if you are just starting out with Python data analysis, reading a very basic book first will likely make these exercises far more productive.
Downloading the exercises
- Data download link: data file address
- Save the files from the link above under the appropriate path
Exercise | Topic | Dataset |
---|---|---|
Exercise 1 - Getting to know your data | Explore the Chipotle fast-food data | chipotle.tsv |
Exercise 2 - Filtering and sorting | Explore the Euro 2012 data | Euro2012_stats.csv |
Exercise 3 - Grouping | Explore the alcohol consumption data | drinks.csv |
Exercise 4 - The apply function | Explore the 1960-2014 US crime data | US_Crime_Rates_1960_2014.csv |
Exercise 5 - Merging | Explore fictitious name data | data built by hand in the exercise |
Exercise 6 - Statistics | Explore the wind speed data | wind.data |
Exercise 7 - Visualization | Explore the Titanic disaster data | train.csv |
Exercise 8 - Creating DataFrames | Explore the Pokemon data | data built by hand in the exercise |
Exercise 9 - Time series | Explore the Apple stock price data | Apple_stock.csv |
Exercise 10 - Deleting data | Explore the Iris flower data | iris.csv |
Browsing the datasets
1. Import the os module
import os
2. Check the current working directory
os.getcwd()
- Output
'D:\\PythonFlie\\python\\pandas'
3. List the files under the current path
os.listdir()
- Output
['.ipynb_checkpoints',
'pandas_exercise',
'Pandas基础命令速查表0922.ipynb',
'测试数据.csv',
'测试数据.xlsx',
'这十套练习,教你如何用Pandas做数据分析0929.ipynb']
4. pandas_exercise is the folder holding the data; change into it and inspect the files
os.chdir("D:\\PythonFlie\\python\\pandas\\pandas_exercise")
print(os.getcwd()) # confirm we are in the right path
print(os.listdir()) # the data turns out to live under exercise_data, so change the path again
os.chdir("D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data")
print(os.getcwd()) # confirm we are in the right path
os.listdir() # list the files under this path
- Output
D:\PythonFlie\python\pandas\pandas_exercise
['exercise_data']
D:\PythonFlie\python\pandas\pandas_exercise\exercise_data
['Apple_stock.csv',
'cars.csv',
'chipotle.tsv',
'drinks.csv',
'Euro2012_stats.csv',
'iris.csv',
'second_cars_info.csv',
'train.csv',
'US_Crime_Rates_1960_2014.csv',
'wechart.csv',
'wind.data']
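Instead of changing the working directory back and forth with os.chdir, pathlib can build the paths once and reuse them. A minimal, self-contained sketch; the directory and file names below are made up for illustration:

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with two fake data files, then list the CSVs.
# Path objects can be passed straight to pandas readers, no chdir needed.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "exercise_data"
    data_dir.mkdir()
    (data_dir / "iris.csv").write_text("a,b\n1,2\n")
    (data_dir / "wind.data").write_text("x y\n")
    csv_files = sorted(p.name for p in data_dir.glob("*.csv"))
print(csv_files)  # ['iris.csv']
```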
Exercise 1 - Getting to know your data
- Explore the Chipotle fast-food data in chipotle.tsv
1. Import the necessary libraries
import pandas as pd
2. Get the dataset
path1 = "D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data\\chipotle.tsv" # chipotle.tsv
3. Read the dataset into a DataFrame named chipo
chipo = pd.read_csv(path1, sep = '\t')
4. Look at the first 10 rows
chipo.head(10)
- Output
5. How many columns does the dataset have?
print(chipo.shape) # rows and columns of the dataset
print(chipo.shape[1]) # number of columns
- Output
(4622, 5)
5
6. Print all column names
chipo.columns
- Output
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
7. What does the index look like?
chipo.index
- Output
RangeIndex(start=0, stop=4622, step=1)
8. Which item was ordered the most?
# take the item_name and quantity columns, group by item_name, and sum quantity
c = chipo[['item_name','quantity']].groupby(['item_name']).agg({'quantity':sum})
# sort the quantity column in descending order
c.sort_values(['quantity'],ascending=False,inplace=True)
# inspect the top five
c.head()
9. How many different items were ordered (item_name)?
# take the item_name column, drop duplicates, then count
chipo['item_name'].drop_duplicates().count()
- Output
50
# reference answer
chipo['item_name'].nunique()
- Output
50
10. Which item in choice_description was ordered most often?
# idea: take item_name and order_id, then count order_id
chipo[['item_name',"choice_description","order_id"]].groupby(['item_name',"choice_description"]).aggregate({'order_id':"count"}).sort_values("order_id",ascending=False).head(1)
- Output
# reference answer
chipo['choice_description'].value_counts().head()
- Output
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
11. How many items were ordered in total?
chipo['quantity'].sum()
- Output
4972
# reference answer
total_items_orders = chipo['quantity'].sum()
total_items_orders
- Output
4972
12. Convert item_price to a float
# this step is crucial: the later calculations depend on it
dollarizer = lambda x: float(x[1:])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
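The same conversion can also be done without apply, using the vectorized .str accessor. A small sketch on a toy frame that mimics chipotle's "$2.39"-style price strings:

```python
import pandas as pd

# Toy stand-in for chipotle's item_price column.
chipo = pd.DataFrame({"item_price": ["$2.39", "$16.98", "$1.09"]})

# Strip the leading "$" with the vectorized .str accessor, then cast to float;
# equivalent to apply(lambda x: float(x[1:])) but avoids a Python-level loop.
chipo["item_price"] = chipo["item_price"].str.lstrip("$").astype(float)
print(chipo["item_price"].sum())  # 20.46
```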
13. How much revenue was made over the dataset's period?
chipo["总价"] = chipo['quantity']*chipo['item_price'] # "总价" = line-item total
chipo["总价"].sum()
- Output
39237.02
# reference answer
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()
- Output
39237.02
14. How many orders were there over the period?
chipo["order_id"].drop_duplicates().count()
- Output
1834
# reference answer
chipo['order_id'].nunique()
- Output
1834
15. What is the average total per order?
# note: this averages line-item prices within each order, which is not what the reference answer computes
chipo.groupby(["order_id"]).agg({"item_price":"mean"})
- Output
# reference answer
chipo[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
- Output
21.394231188658654
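The gap between the two answers in step 15 comes from averaging line-item prices versus averaging per-order totals. A toy sketch of the reference approach, with made-up order data: sum per order first, then take the mean across orders:

```python
import pandas as pd

# Two toy orders with their line totals (like sub_total above).
df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sub_total": [10.0, 5.0, 9.0],
})

# First total each order, then average across orders: (15 + 9) / 2.
per_order = df.groupby("order_id")["sub_total"].sum()
print(per_order.mean())  # 12.0
```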
16. How many different items were sold?
chipo['item_name'].nunique()
- Output
50
Exercise 2 - Filtering and sorting
- Explore the Euro 2012 data
1. Import the necessary libraries
import numpy as np
import pandas as pd
2. Read the dataset from the address below and name it euro12
euro12 = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\Euro2012_stats.csv", sep = ',')
3. Inspect and get to know the data
euro12.info()
# 35 columns in total:
#Team - team name
#Goals - goals scored
#Shots on target - shots on target
#Shots off target - shots off target
#Shooting Accuracy - shooting accuracy
#% Goals-to-shots - goals-to-shots ratio
#Total shots (inc. Blocked) - total shots, including blocked ones
#Hit Woodwork - shots that hit the woodwork
#Penalty goals - penalty goals
#Penalties not scored - penalties missed
#Headed goals - headed goals
#Passes - passes attempted
#Passes completed - passes completed
#Passing Accuracy - passing accuracy
#Touches - touches of the ball
#Crosses - crosses
#Dribbles - dribbles
#Corners Taken - corners taken
#Tackles - tackles
#Clearances - clearances
#Interceptions - interceptions
#Clearances off line - clearances off the goal line
#Clean Sheets - clean sheets (matches without conceding)
#Blocks - blocks
#Goals conceded - goals conceded
#Saves made - saves made
#Saves-to-shots ratio - saves-to-shots ratio
#Fouls Won - fouls won
#Fouls Conceded - fouls conceded
#Offsides - offsides
#Yellow Cards - yellow cards
#Red Cards - red cards
#Subs on - substitutions on
#Subs off - substitutions off
#Players Used - players used
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Team 16 non-null object
1 Goals 16 non-null int64
2 Shots on target 16 non-null int64
3 Shots off target 16 non-null int64
4 Shooting Accuracy 16 non-null object
5 % Goals-to-shots 16 non-null object
6 Total shots (inc. Blocked) 16 non-null int64
7 Hit Woodwork 16 non-null int64
8 Penalty goals 16 non-null int64
9 Penalties not scored 16 non-null int64
10 Headed goals 16 non-null int64
11 Passes 16 non-null int64
12 Passes completed 16 non-null int64
13 Passing Accuracy 16 non-null object
14 Touches 16 non-null int64
15 Crosses 16 non-null int64
16 Dribbles 16 non-null int64
17 Corners Taken 16 non-null int64
18 Tackles 16 non-null int64
19 Clearances 16 non-null int64
20 Interceptions 16 non-null int64
21 Clearances off line 15 non-null float64
22 Clean Sheets 16 non-null int64
23 Blocks 16 non-null int64
24 Goals conceded 16 non-null int64
25 Saves made 16 non-null int64
26 Saves-to-shots ratio 16 non-null object
27 Fouls Won 16 non-null int64
28 Fouls Conceded 16 non-null int64
29 Offsides 16 non-null int64
30 Yellow Cards 16 non-null int64
31 Red Cards 16 non-null int64
32 Subs on 16 non-null int64
33 Subs off 16 non-null int64
34 Players Used 16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
4. Select only the Goals column
euro12["Goals"]
- Output
0 4
1 4
2 4
3 5
4 3
5 10
6 5
7 6
8 2
9 2
10 6
11 1
12 5
13 12
14 5
15 2
Name: Goals, dtype: int64
# reference answer
euro12.Goals
- Output
0 4
1 4
2 4
3 5
4 3
5 10
6 5
7 6
8 2
9 2
10 6
11 1
12 5
13 12
14 5
15 2
Name: Goals, dtype: int64
5. How many teams took part in Euro 2012?
euro12["Team"].count()
- Output
16
# reference answer
euro12.shape[0]
- Output
16
6. How many columns does the dataset have?
euro12.shape[1]
- Output
35
# reference answer
euro12.info()
7. Store the columns Team, Yellow Cards and Red Cards in a DataFrame named discipline
discipline = euro12[["Team","Yellow Cards","Red Cards"]]
discipline
- Output
8. Sort discipline by Red Cards first, then Yellow Cards
discipline.sort_values(by = ["Red Cards","Yellow Cards"])
- Output
# reference answer
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)
9. Compute the mean number of yellow cards per team
discipline["Yellow Cards"].mean()
- Output
7.4375
# reference answer
round(discipline['Yellow Cards'].mean())
- Output
7
10. Find the teams with more than 6 goals
euro12[euro12["Goals"] > 6]
# reference answer
euro12[euro12.Goals > 6]
11. Select the teams whose names start with the letter G
euro12[euro12.Team.str.startswith('G')]
12. Select the first 7 columns
euro12.head(7) # note: head(7) returns the first 7 rows, not columns
# reference answer
euro12.iloc[: , 0:7]
13. Select all columns except the last 3
euro12.iloc[: , :-3]
14. Find the Shooting Accuracy of England, Italy and Russia
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
- Output
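Boolean filtering with isin, as in step 14, can be sketched on a toy frame (the values below are made up):

```python
import pandas as pd

# Made-up stand-in for euro12: keep only rows whose Team is in a given list.
df = pd.DataFrame({"Team": ["England", "Italy", "Spain"],
                   "Shooting Accuracy": ["50.0%", "43.0%", "55.9%"]})

# isin builds a boolean mask; teams absent from the frame (Russia here) are
# simply not matched rather than causing an error.
mask = df["Team"].isin(["England", "Italy", "Russia"])
selected = df.loc[mask, "Team"].tolist()
print(selected)  # ['England', 'Italy']
```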
Exercise 3 - Grouping
- Explore the alcohol consumption data
1. Import the necessary libraries
import pandas as pd
2. Read the data from the address below into a DataFrame named drinks
drinks = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\drinks.csv")
drinks
- Output
3. Inspect the data
drinks.info()
#country - country
#beer_servings - beer servings
#spirit_servings - spirit (liquor) servings
#wine_servings - wine servings
#total_litres_of_pure_alcohol - total litres of pure alcohol
#continent - continent
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 193 non-null object
1 beer_servings 193 non-null int64
2 spirit_servings 193 non-null int64
3 wine_servings 193 non-null int64
4 total_litres_of_pure_alcohol 193 non-null float64
5 continent 170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
4. Which continent drinks the most beer on average?
drinks.groupby("continent").aggregate({"beer_servings":"mean"}).sort_values(by = "beer_servings",ascending = False)
- Output
beer_servings
continent
EU 193.777778
SA 175.083333
OC 89.687500
AF 61.471698
AS 37.045455
# reference answer
drinks.groupby('continent').beer_servings.mean()
- Output
continent
AF 61.471698
AS 37.045455
EU 193.777778
OC 89.687500
SA 175.083333
Name: beer_servings, dtype: float64
5. Print descriptive statistics of wine consumption (wine_servings) for each continent
drinks.groupby("continent").describe()["wine_servings"]
- Output
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
# reference answer
drinks.groupby('continent').wine_servings.describe()
- Output
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
6. Print the mean consumption of every drink type for each continent
drinks.groupby("continent").aggregate({"beer_servings":"mean","spirit_servings":"mean","wine_servings":"mean"})
- Output
beer_servings spirit_servings wine_servings
continent
AF 61.471698 16.339623 16.264151
AS 37.045455 60.840909 9.068182
EU 193.777778 132.555556 142.222222
OC 89.687500 58.437500 35.625000
SA 175.083333 114.750000 62.416667
# reference answer
drinks.groupby('continent').mean()
- Output
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.471698 16.339623 16.264151 3.007547
AS 37.045455 60.840909 9.068182 2.170455
EU 193.777778 132.555556 142.222222 8.617778
OC 89.687500 58.437500 35.625000 3.381250
SA 175.083333 114.750000 62.416667 6.308333
7. Print the median consumption of every drink type for each continent
drinks.groupby("continent").aggregate({"beer_servings":"quantile","spirit_servings":"quantile","wine_servings":"quantile"}) # quantile defaults to q=0.5, i.e. the median
- Output
beer_servings spirit_servings wine_servings
continent
AF 32.0 3.0 2.0
AS 17.5 16.0 1.0
EU 219.0 122.0 128.0
OC 52.5 37.0 8.5
SA 162.5 108.5 12.0
# reference answer
drinks.groupby('continent').median()
- Output
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85
8. Print the mean, max and min spirit consumption for each continent
drinks.groupby("continent").describe()["spirit_servings"][["mean","max","min"]]
- Output
mean max min
continent
AF 16.339623 152.0 0.0
AS 60.840909 326.0 0.0
EU 132.555556 373.0 0.0
OC 58.437500 254.0 0.0
SA 114.750000 302.0 25.0
# reference answer
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
- Output
mean min max
continent
AF 16.339623 0 152
AS 60.840909 0 326
EU 132.555556 0 373
OC 58.437500 0 254
SA 114.750000 25 302
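The list form of agg used in the reference answer generalizes to any set of reductions. A toy sketch with made-up numbers:

```python
import pandas as pd

# agg with a list of reduction names yields one output column per reduction.
df = pd.DataFrame({"continent": ["AF", "AF", "EU"],
                   "spirit_servings": [0, 152, 373]})

out = df.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
print(out.loc["AF", "mean"])  # 76.0, i.e. (0 + 152) / 2
```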
Exercise 4 - The apply function
- Explore the 1960-2014 US crime data
1. Import the necessary libraries
import pandas as pd
2. Read the dataset from the address below into a DataFrame named crime
crime = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\US_Crime_Rates_1960_2014.csv")
3. Get to know the data
crime.head()
#Year - year
#Population - population
#Total - total offenses
#Violent - violent crime
#Property - property crime
#Murder - murder
#Forcible_Rape - forcible rape
#Robbery - robbery
#Aggravated_assault - aggravated assault
#Burglary - burglary
#Larceny_Theft - larceny-theft
#Vehicle_Theft - vehicle theft
- Output
4. What is the data type of each column?
crime.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null int64
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: int64(12)
memory usage: 5.3 KB
5. Convert Year to datetime64
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null datetime64[ns]
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
6. Set the Year column as the index of the DataFrame
crime.set_index("Year",inplace = True)
crime.head()
- Output
# reference answer
crime = crime.set_index('Year', drop = True)
crime.head()
7. Delete the column named Total
crime.drop(columns = ["Total"],inplace = True)
# reference answer
del crime['Total']
crime.head()
8. Group the DataFrame by Year and sum
crime.groupby("Year").sum().head()
- Output
# I did not fully understand this part at first: the reference answer actually aggregates per decade, not per year
# reference answer
# more on .resample:
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# more on offset aliases:
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# run the following code
crimes = crime.resample('10AS').sum() # resample the time series per decade
# use resample to take the max of the "Population" column
population = crime['Population'].resample('10AS').max()
# update "Population" (summing a population headcount across years would double-count people)
crimes['Population'] = population
crimes
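The decade resampling above can be sketched on a tiny series. '10AS' bins are anchored at the start of the first year (newer pandas spells this alias '10YS'):

```python
import pandas as pd

# Three yearly observations: 1960 and 1961 fall in the first ten-year bin,
# 1970 opens the next one, so the sums come out as [1+2, 4].
s = pd.Series([1, 2, 4], index=pd.to_datetime(["1960", "1961", "1970"]))
decades = s.resample("10AS").sum()
print(decades.tolist())  # [3, 4]
```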
9. When was the most dangerous time to live in US history?
crime.idxmax(0) # index label (year) of each column's maximum
- Output
Population 2014
Total 1991
Violent 1992
Property 1991
Murder 1991
Forcible_Rape 1992
Robbery 1991
Aggravated_assault 1993
Burglary 1980
Larceny_Theft 1991
Vehicle_Theft 1991
dtype: int64
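Because Year is the index, idxmax returns index labels; with a datetime index those labels are Timestamps, so the year can be read off directly. A toy sketch with a made-up subset of the Violent column:

```python
import pandas as pd

# idxmax returns the index label of the maximum, here a Timestamp.
df = pd.DataFrame(
    {"Violent": [288460, 1932274, 1926017]},
    index=pd.to_datetime(["1960", "1992", "1993"]),
)
peak = df["Violent"].idxmax()
print(peak.year)  # 1992
```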
Exercise 5 - Merging
- Explore fictitious name data
1. Import the necessary libraries
import pandas as pd
2. Create the raw data as dictionaries, as follows
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
3. Build DataFrames named data1, data2 and data3 from the dictionaries above
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
print(data1)
print("----------")
print(data2)
print("----------")
print(data3)
- Output
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
----------
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
----------
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
4. Concatenate data1 and data2 along the row axis and name the result all_data
all_data = pd.concat([data1,data2]) # note: the original indexes are kept, so the row labels repeat
all_data
- Output
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
5. Concatenate data1 and data2 along the column axis and name the result all_data_col
all_data_col = pd.concat([data1, data2], axis = 1)
all_data_col
- Output
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
all_data_col = pd.merge(data1, data2,left_index = True,right_index = True) # merging on the index gives a similar result, with _x/_y suffixes
all_data_col
- Output
subject_id_x first_name_x last_name_x subject_id_y first_name_y last_name_y
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
6. Print data3
data3
- Output
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
7. Merge all_data and data3 on the values of subject_id
pd.merge(all_data,data3,on = "subject_id")
- Output
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
8. Join data1 and data2 on subject_id
pd.merge(data1,data2,on = "subject_id")
- Output
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
9. Find all matches between data1 and data2 after merging
pd.merge(data1,data2,how = "outer",on = "subject_id")
- Output
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan
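When auditing an outer merge like the one above, merge's indicator flag labels where each row came from. A toy sketch using a made-up subset of the subject_ids:

```python
import pandas as pd

# indicator=True adds a _merge column telling whether each row came from the
# left frame only, the right frame only, or both.
d1 = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
d2 = pd.DataFrame({"subject_id": ["4", "6"], "first_name": ["Billy", "Bran"]})
out = pd.merge(d1, d2, on="subject_id", how="outer", indicator=True)
print(out["_merge"].tolist())  # ['left_only', 'both', 'right_only']
```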
Exercise 6 - Statistics
- Explore the wind speed data
1. Import the necessary libraries
import pandas as pd
import time
import datetime
import dateutil
2. Read the data from the address below
3. Store the data and build a proper index from the first three columns
data = pd.read_table(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\wind.data",sep = "\s+",parse_dates = [[0,1,2]])
data.head()
# specifying parse_dates makes reading the file noticeably slower
# infer_datetime_format=True can cut the date-parsing time significantly
# keep_date_col=True/False controls whether the columns parsed into the date are kept (True) or dropped (False)
- Output
4. The year 2061? Do we really have data from that year? Write a function to fix this bug
# subtract 100 years from the parsed dates
# note: this subtracts 100 years from every row; the reference answer below only fixes years that parsed into the future
data["Yr_Mo_Dy"] = data["Yr_Mo_Dy"].apply(lambda x :x - dateutil.relativedelta.relativedelta(years=100))
data.head()
- Output
# reference answer
# run the following code
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
# apply fix_century on the column and replace the values with the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
5. Set the date as the index; watch the dtype, it should be datetime64[ns]
data.set_index("Yr_Mo_Dy",drop = True,inplace = True)
data.head()
- Output
# reference answer
# transform Yr_Mo_Dy to the datetime64 type
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# data.info()
6. How many values are missing for each location?
data.info()
- Output
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6574 entries, 1961-01-01 to 1978-12-31
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RPT 6568 non-null float64
1 VAL 6571 non-null float64
2 ROS 6572 non-null float64
3 KIL 6569 non-null float64
4 SHA 6572 non-null float64
5 BIR 6574 non-null float64
6 DUB 6571 non-null float64
7 CLA 6572 non-null float64
8 MUL 6571 non-null float64
9 CLO 6573 non-null float64
10 BEL 6574 non-null float64
11 MAL 6570 non-null float64
dtypes: float64(12)
memory usage: 667.7 KB
# reference answer
data.isnull().sum()
- Output
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
7. How many complete (non-missing) values are there for each location?
data.shape[0] - data.isnull().sum()
- Output
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
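Counting complete values can also be done directly with notna, the complement of isnull. A toy sketch with made-up wind readings:

```python
import numpy as np
import pandas as pd

# One missing value in RPT, none in VAL.
df = pd.DataFrame({"RPT": [15.04, np.nan, 10.83],
                   "VAL": [14.96, 8.5, 6.5]})

# Summing notna() counts complete values per column without
# subtracting from shape[0].
counts = df.notna().sum()
print(counts.tolist())  # [2, 3]
```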
8. Compute the mean wind speed over all the data
data.mean().mean()
- Output
10.227982360836924
9. Create a DataFrame named loc_stats holding the min, max, mean and standard deviation of the wind speed at each location
loc_stats = data.aggregate(["min","max","mean","std"])
loc_stats
- Output
# reference answer
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations
loc_stats
- Output
10. Create a DataFrame named day_stats holding the min, max, mean and standard deviation of the wind speed across all locations for each day
day_stats = data.aggregate(["min","max","mean","std"],axis=1)
day_stats.head()
- Output
min max mean std
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
# reference answer
# create the dataframe
day_stats = pd.DataFrame()
# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations
day_stats.head()
- Output
min max mean std
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
11. For each location, compute the mean wind speed in January
- Note: January 1961 and January 1962 are both counted as January
# creates a new column 'date' and gets the values from the index
data['date'] = data.index
# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# gets all values from month 1 and assigns them to january_winds
january_winds = data.query('month == 1')
# gets the mean from january_winds, using .loc so the month, year and day helper columns are excluded
january_winds.loc[:,'RPT':"MAL"].mean()
- Output
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
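With a DatetimeIndex, the helper month/year/day columns are not strictly needed: the index exposes .month directly. A toy sketch with made-up readings:

```python
import pandas as pd

# Filter on data.index.month instead of creating helper columns.
idx = pd.to_datetime(["1961-01-01", "1961-02-01", "1962-01-01"])
df = pd.DataFrame({"RPT": [15.0, 10.0, 13.0]}, index=idx)

jan_mean = df.loc[df.index.month == 1, "RPT"].mean()
print(jan_mean)  # 14.0, the mean of the two January readings
```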
12. Sample the records at a yearly frequency
data.query('month == 1 and day == 1')
- Output
13. Sample the records at a monthly frequency
data.query('day == 1')
- Output
Exercise 7 - Visualization
- Explore the Titanic disaster data
1. Import the necessary libraries
import pandas as pd
2. Read the data from the address below
3. Name the DataFrame titanic
titanic = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\train.csv")
titanic.head()
#PassengerId - passenger id
#Survived - whether the passenger survived
#Pclass - ticket class
#Name - name
#Sex - sex
#Age - age
#SibSp - number of siblings/spouses aboard
#Parch - number of parents/children aboard
#Ticket - ticket number
#Fare - fare paid
#Cabin - cabin number
#Embarked - port of embarkation
- Output
4. Set PassengerId as the index
titanic.set_index("PassengerId",inplace = True)
titanic.head()
- Output
5. Draw a pie chart showing the proportion of male and female passengers
import matplotlib.pyplot as plt
male_ct = titanic[titanic["Sex"] == "male"].shape[0] # row count; positional lookups like .count()[1] are deprecated
female_ct = titanic[titanic["Sex"] == "female"].shape[0]
x = [male_ct,female_ct]
plt.pie(x,
labels = ["male","female"],
explode = (0.1 , 0),
startangle = 90,
autopct = '%1.1f%%')
plt.axis('equal')
plt.title("Sex Proportion")
plt.tight_layout()
plt.show()
- Output
# reference answer
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# Create a pie chart
plt.pie(
# using proportions
proportions,
# with the labels being officer names
labels = ['Males', 'Females'],
# with no shadows
shadow = False,
# with colors
colors = ['blue','red'],
# with one slide exploded out
explode = (0.15 , 0),
# with the start angle at 90%
startangle = 90,
# with the percent listed as a fraction
autopct = '%1.1f%%')
# View the plot drop above
plt.axis('equal')
# Set labels
plt.title("Sex Proportion")
# View the plot
plt.tight_layout()
plt.show()
- Output
6. Draw a scatter plot of ticket fare (Fare) against passenger age, colored by sex
import seaborn as sns
# creates the plot using
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)
# set title
lm.set(title = 'Fare x Age')
# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
- Output
7. How many passengers survived?
titanic.query("Survived == 1").Survived.count()
- Output
342
# reference answer
titanic.Survived.sum()
- Output
342
8. Draw a histogram of the ticket fares
fare = titanic["Fare"]
plt.hist(fare,
bins = 20)
plt.axis('tight')
plt.tight_layout()
plt.show()
- Output
# reference answer
# sort the values from the top to the least value and slice the first 5 items
import numpy as np
df = titanic.Fare.sort_values(ascending = False)
df
# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal
# create the plot
plt.hist(df, bins = binsVal)
# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Payed Histrogram')
# show the plot
plt.show()
- Output
Exercise 8 - Creating DataFrames
- Explore the Pokemon data
1. Import the necessary libraries
import pandas as pd
2. Create a data dictionary
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
"evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
"type": ['grass', 'fire', 'water', 'bug'],
"hp": [45, 39, 44, 45],
"pokedex": ['yes', 'no','yes','no']
}
3. Store the dictionary in a DataFrame named pokemon
pokemon = pd.DataFrame(raw_data)
pokemon.head()
- Output
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no
4. The columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon
- Output
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
5. Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
- Output
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest
6. Check the data type of each column
pokemon.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 type 4 non-null object
2 hp 4 non-null int64
3 evolution 4 non-null object
4 pokedex 4 non-null object
5 place 4 non-null object
dtypes: int64(1), object(5)
memory usage: 320.0+ bytes
# reference answer
pokemon.dtypes
- Output
name object
type object
hp int64
evolution object
pokedex object
place object
dtype: object
Exercise 9 - Time series
- Explore the Apple stock price data
1. Import the necessary libraries
import pandas as pd
2. Dataset address
3. Read the data into a DataFrame named apple
apple = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\Apple_stock.csv")
apple.head()
- Output
4. Check the data type of each column
apple.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 8465 non-null object
1 Open 8465 non-null float64
2 High 8465 non-null float64
3 Low 8465 non-null float64
4 Close 8465 non-null float64
5 Volume 8465 non-null int64
6 Adj Close 8465 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 463.1+ KB
5. Convert the Date column to the datetime type
apple["Date"] = apple["Date"].apply(lambda x : pd.to_datetime(x)) # pd.to_datetime(apple["Date"]) does this in one vectorized call
apple.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 8465 non-null datetime64[ns]
1 Open 8465 non-null float64
2 High 8465 non-null float64
3 Low 8465 non-null float64
4 Close 8465 non-null float64
5 Volume 8465 non-null int64
6 Adj Close 8465 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 463.1 KB
6. Set Date as the index
apple.set_index("Date",drop = True,inplace = True)
apple.head()
- Output
# reference answer
apple = apple.set_index('Date')
apple.head()
7. Are there any duplicated dates?
apple.groupby("Date").count().sort_values(by = "Date",ascending = False)
- Output
# reference answer
apple.index.is_unique
- Output
True
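index.duplicated() answers the same question as index.is_unique and also shows where any repeats sit. A toy sketch with a deliberately duplicated date:

```python
import pandas as pd

# is_unique flags the repeated date; duplicated() locates and counts it.
idx = pd.to_datetime(["1984-07-02", "1984-07-03", "1984-07-03"])
s = pd.Series([1.0, 2.0, 3.0], index=idx)
print(s.index.is_unique)           # False
print(s.index.duplicated().sum())  # 1
```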
8. Sort the index in ascending order
apple.sort_values(by = "Date",ascending = True)
- Output
# reference answer
apple.sort_index(ascending = True).head()
- Output
9. Find the last business day of each month
apple_month = apple.resample('BM')
apple_month.agg("mean")
- Output
10. How many days apart are the earliest and the latest dates in the dataset?
(apple.index.max() - apple.index.min()).days
- Output
12261
11. How many months are there in the data?
apple_months = apple.resample('BM').mean()
len(apple_months.index)
- Output
404
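Counting months without resampling: convert the DatetimeIndex to a monthly PeriodIndex and count the distinct periods. A toy sketch:

```python
import pandas as pd

# Two July dates and one August date span two distinct calendar months.
idx = pd.to_datetime(["1984-07-02", "1984-07-31", "1984-08-01"])
n_months = idx.to_period("M").nunique()
print(n_months)  # 2
```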
13. Plot the Adj Close values in chronological order
# make the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title = "Apple Stock")
# change the size of the figure
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)
- Output
Exercise 10 - Deleting data
- Explore the Iris flower data
1. Import the necessary libraries
import pandas as pd
import numpy as np
2. Dataset address
3. Read the dataset into a variable named iris
iris = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\iris.csv")
iris.head()
- Output
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
4. Create column names for the DataFrame
#iris.columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class']
iris.rename(columns={'5.1':'sepal_length','3.5':'sepal_width','1.4':'petal_length','0.2':'petal_width','Iris-setosa':'class'},inplace = True)
# note: the file has no header row, so the first record was read in as the column names
iris.head()
- Output
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5. Are there missing values in the DataFrame?
iris.isnull().sum()
- Output
5.1 0
3.5 0
1.4 0
0.2 0
Iris-setosa 0
dtype: int64
6. Set rows 10 to 19 of the column petal_length to missing
iris["petal_length"][9:19] = np.nan # chained assignment; prefer iris.loc[9:18, "petal_length"] = np.nan
# reference answer
iris.iloc[10:20,2:3] = np.nan
iris.head(20)
- Output
7. Replace all missing values with 1.0
iris.fillna(1.0,inplace = True)
# reference answer
iris.petal_length.fillna(1, inplace = True)
iris.head(20)
- Output
8. Delete the column class
iris.drop(columns="class",inplace = True)
# reference answer
del iris['class']
iris.head()
- Output
sepal_length sepal_width petal_length petal_width
0 4.9 3.0 1.4 0.2
1 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
9. Set the first three rows of the DataFrame to missing
iris[:3] = np.nan
iris.head(20)
# reference answer
iris.iloc[0:3 ,:] = np.nan
iris.head()
- Output
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
10. Delete the rows that contain missing values
iris.dropna(inplace=True)
# reference answer
iris = iris.dropna(how='any')
iris.head()
- Output
sepal_length sepal_width petal_length petal_width
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
5 4.6 3.4 1.4 0.3
6 5.0 3.4 1.5 0.2
7 4.4 2.9 1.4 0.2
11. Reset the index
iris.reset_index() # without drop=True the old index becomes a new column, and the result is not assigned back
- Output
# reference answer
iris = iris.reset_index(drop = True)
iris.head()
- Output
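The difference between the two reset_index calls is the drop flag. A toy sketch showing that drop=True discards the old row labels instead of keeping them as a column:

```python
import numpy as np
import pandas as pd

# dropna leaves a gap in the row labels (index starts at 1);
# reset_index(drop=True) renumbers from 0 and throws the old labels away.
df = pd.DataFrame({"sepal_length": [np.nan, 5.0, 5.4]}).dropna()
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1]
```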