Work has been busy lately, so my Python studies and these notes sat idle for quite a while. Setting a flag here: one exercise every two days, with the whole section finished by early November.
- Exercise 1 - Getting to know your data (completed 2021-11-02)
- Exercise 2 - Filtering and sorting (completed 2021-11-02)
- Exercise 3 - Grouping (completed 2021-11-02)
- Exercise 4 - The apply function (completed 2021-11-03)
- Exercise 5 - Merging (completed 2021-11-03)
- Exercise 6 - Statistics (completed 2021-11-03)
- Exercise 7 - Visualization (completed 2021-11-05)
- Exercise 8 - Creating DataFrames (completed 2021-11-04)
- Exercise 9 - Time series (completed 2021-11-05)
- Exercise 10 - Deleting data (completed 2021-11-04)
This section is now complete. Some problems I solved on my own; plenty of others stumped me. I plan to write up a short summary based on these ten exercises.
My current take: if you are just starting out with Python data analysis, reading a very basic book first will likely make these exercises far more productive.
Downloading the exercises
- Data download link: data file address
- Save the files from the link above under the appropriate path
Exercise | Topic | Dataset |
---|---|---|
Exercise 1 - Getting to know your data | Explore the Chipotle fast-food data | chipotle.tsv |
Exercise 2 - Filtering and sorting | Explore the Euro 2012 data | Euro2012_stats.csv |
Exercise 3 - Grouping | Explore the alcohol consumption data | drinks.csv |
Exercise 4 - The apply function | Explore the 1960-2014 US crime data | US_Crime_Rates_1960_2014.csv |
Exercise 5 - Merging | Explore fictitious name data | data built by hand in the exercise |
Exercise 6 - Statistics | Explore the wind speed data | wind.data |
Exercise 7 - Visualization | Explore the Titanic disaster data | train.csv |
Exercise 8 - Creating DataFrames | Explore the Pokemon data | data built by hand in the exercise |
Exercise 9 - Time series | Explore the Apple stock price data | Apple_stock.csv |
Exercise 10 - Deleting data | Explore the Iris flower data | iris.csv |
Browsing the datasets
1. Import the os module
import os
2. Check the current working directory
os.getcwd()
- Output
'D:\\PythonFlie\\python\\pandas'
3. List the files under the current path
os.listdir()
- Output
['.ipynb_checkpoints',
'pandas_exercise',
'Pandas基础命令速查表0922.ipynb',
'测试数据.csv',
'测试数据.xlsx',
'这十套练习,教你如何用Pandas做数据分析0929.ipynb']
4. pandas_exercise is the folder holding the data; change into it and inspect the files
os.chdir("D:\\PythonFlie\\python\\pandas\\pandas_exercise")
print(os.getcwd()) # confirm we are in the right path
print(os.listdir()) # the data turns out to live under exercise_data, so change the path again
os.chdir("D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data")
print(os.getcwd()) # confirm we are in the right path
os.listdir() # list the files under this path
- Output
D:\PythonFlie\python\pandas\pandas_exercise
['exercise_data']
D:\PythonFlie\python\pandas\pandas_exercise\exercise_data
['Apple_stock.csv',
'cars.csv',
'chipotle.tsv',
'drinks.csv',
'Euro2012_stats.csv',
'iris.csv',
'second_cars_info.csv',
'train.csv',
'US_Crime_Rates_1960_2014.csv',
'wechart.csv',
'wind.data']
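Instead of changing the working directory back and forth with os.chdir, pathlib can build the paths once and reuse them. A minimal, self-contained sketch; the directory and file names below are made up for illustration:

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with two fake data files, then list the CSVs.
# Path objects can be passed straight to pandas readers, no chdir needed.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "exercise_data"
    data_dir.mkdir()
    (data_dir / "iris.csv").write_text("a,b\n1,2\n")
    (data_dir / "wind.data").write_text("x y\n")
    csv_files = sorted(p.name for p in data_dir.glob("*.csv"))
print(csv_files)  # ['iris.csv']
```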
Exercise 1 - Getting to know your data
- Explore the Chipotle fast-food data in chipotle.tsv
1. Import the necessary libraries
import pandas as pd
2. Get the dataset
path1 = "D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data\\chipotle.tsv" # chipotle.tsv
3. Read the dataset into a DataFrame named chipo
chipo = pd.read_csv(path1, sep = '\t')
4. Look at the first 10 rows
chipo.head(10)
- Output
5. How many columns does the dataset have?
print(chipo.shape) # rows and columns of the dataset
print(chipo.shape[1]) # number of columns
- Output
(4622, 5)
5
6. Print all column names
chipo.columns
- Output
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
7. What does the index look like?
chipo.index
- Output
RangeIndex(start=0, stop=4622, step=1)
8. Which item was ordered the most?
# take the item_name and quantity columns, group by item_name, and sum quantity
c = chipo[['item_name','quantity']].groupby(['item_name']).agg({'quantity':sum})
# sort the quantity column in descending order
c.sort_values(['quantity'],ascending=False,inplace=True)
# inspect the top five
c.head()
9. How many different items were ordered (item_name)?
# take the item_name column, drop duplicates, then count
chipo['item_name'].drop_duplicates().count()
- Output
50
# reference answer
chipo['item_name'].nunique()
- Output
50
10. Which item in choice_description was ordered most often?
# idea: take item_name and order_id, then count order_id
chipo[['item_name',"choice_description","order_id"]].groupby(['item_name',"choice_description"]).aggregate({'order_id':"count"}).sort_values("order_id",ascending=False).head(1)
- Output
# reference answer
chipo['choice_description'].value_counts().head()
- Output
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
11. How many items were ordered in total?
chipo['quantity'].sum()
- Output
4972
# reference answer
total_items_orders = chipo['quantity'].sum()
total_items_orders
- Output
4972
12. Convert item_price to a float
# this step is crucial: the later calculations depend on it
dollarizer = lambda x: float(x[1:])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
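The same conversion can also be done without apply, using the vectorized .str accessor. A small sketch on a toy frame that mimics chipotle's "$2.39"-style price strings:

```python
import pandas as pd

# Toy stand-in for chipotle's item_price column.
chipo = pd.DataFrame({"item_price": ["$2.39", "$16.98", "$1.09"]})

# Strip the leading "$" with the vectorized .str accessor, then cast to float;
# equivalent to apply(lambda x: float(x[1:])) but avoids a Python-level loop.
chipo["item_price"] = chipo["item_price"].str.lstrip("$").astype(float)
print(chipo["item_price"].sum())  # 20.46
```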
13. How much revenue was made over the dataset's period?
chipo["总价"] = chipo['quantity']*chipo['item_price'] # "总价" = line-item total
chipo["总价"].sum()
- Output
39237.02
# reference answer
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()
- Output
39237.02
14. How many orders were there over the period?
chipo["order_id"].drop_duplicates().count()
- Output
1834
# reference answer
chipo['order_id'].nunique()
- Output
1834
15. What is the average total per order?
# note: this averages line-item prices within each order, which is not what the reference answer computes
chipo.groupby(["order_id"]).agg({"item_price":"mean"})
- Output
# reference answer
chipo[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
- Output
21.394231188658654
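The gap between the two answers in step 15 comes from averaging line-item prices versus averaging per-order totals. A toy sketch of the reference approach, with made-up order data: sum per order first, then take the mean across orders:

```python
import pandas as pd

# Two toy orders with their line totals (like sub_total above).
df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sub_total": [10.0, 5.0, 9.0],
})

# First total each order, then average across orders: (15 + 9) / 2.
per_order = df.groupby("order_id")["sub_total"].sum()
print(per_order.mean())  # 12.0
```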
16. How many different items were sold?
chipo['item_name'].nunique()
- Output
50
Exercise 2 - Filtering and sorting
- Explore the Euro 2012 data
1. Import the necessary libraries
import numpy as np
import pandas as pd
2. Read the dataset from the address below and name it euro12
euro12 = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\Euro2012_stats.csv", sep = ',')
3. Inspect and get to know the data
euro12.info()
# 35 columns in total:
#Team - team name
#Goals - goals scored
#Shots on target - shots on target
#Shots off target - shots off target
#Shooting Accuracy - shooting accuracy
#% Goals-to-shots - goals-to-shots ratio
#Total shots (inc. Blocked) - total shots, including blocked ones
#Hit Woodwork - shots that hit the woodwork
#Penalty goals - penalty goals
#Penalties not scored - penalties missed
#Headed goals - headed goals
#Passes - passes attempted
#Passes completed - passes completed
#Passing Accuracy - passing accuracy
#Touches - touches of the ball
#Crosses - crosses
#Dribbles - dribbles
#Corners Taken - corners taken
#Tackles - tackles
#Clearances - clearances
#Interceptions - interceptions
#Clearances off line - clearances off the goal line
#Clean Sheets - clean sheets (matches without conceding)
#Blocks - blocks
#Goals conceded - goals conceded
#Saves made - saves made
#Saves-to-shots ratio - saves-to-shots ratio
#Fouls Won - fouls won
#Fouls Conceded - fouls conceded
#Offsides - offsides
#Yellow Cards - yellow cards
#Red Cards - red cards
#Subs on - substitutions on
#Subs off - substitutions off
#Players Used - players used
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Team 16 non-null object
1 Goals 16 non-null int64
2 Shots on target 16 non-null int64
3 Shots off target 16 non-null int64
4 Shooting Accuracy 16 non-null object
5 % Goals-to-shots 16 non-null object
6 Total shots (inc. Blocked) 16 non-null int64
7 Hit Woodwork 16 non-null int64
8 Penalty goals 16 non-null int64
9 Penalties not scored 16 non-null int64
10 Headed goals 16 non-null int64
11 Passes 16 non-null int64
12 Passes completed 16 non-null int64
13 Passing Accuracy 16 non-null object
14 Touches 16 non-null int64
15 Crosses 16 non-null int64
16 Dribbles 16 non-null int64
17 Corners Taken 16 non-null int64
18 Tackles 16 non-null int64
19 Clearances 16 non-null int64
20 Interceptions 16 non-null int64
21 Clearances off line 15 non-null float64
22 Clean Sheets 16 non-null int64
23 Blocks 16 non-null int64
24 Goals conceded 16 non-null int64
25 Saves made 16 non-null int64
26 Saves-to-shots ratio 16 non-null object
27 Fouls Won 16 non-null int64
28 Fouls Conceded 16 non-null int64
29 Offsides 16 non-null int64
30 Yellow Cards 16 non-null int64
31 Red Cards 16 non-null int64
32 Subs on 16 non-null int64
33 Subs off 16 non-null int64
34 Players Used 16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
4. Select only the Goals column
euro12["Goals"]
- Output
0 4
1 4
2 4
3 5
4 3
5 10
6 5
7 6
8 2
9 2
10 6
11 1
12 5
13 12
14 5
15 2
Name: Goals, dtype: int64
# reference answer
euro12.Goals
- Output
0 4
1 4
2 4
3 5
4 3
5 10
6 5
7 6
8 2
9 2
10 6
11 1
12 5
13 12
14 5
15 2
Name: Goals, dtype: int64
5. How many teams took part in Euro 2012?
euro12["Team"].count()
- Output
16
# reference answer
euro12.shape[0]
- Output
16
6. How many columns does the dataset have?
euro12.shape[1]
- Output
35
# reference answer
euro12.info()
7. Store the columns Team, Yellow Cards and Red Cards in a DataFrame named discipline
discipline = euro12[["Team","Yellow Cards","Red Cards"]]
discipline
- Output
8. Sort discipline by Red Cards first, then Yellow Cards
discipline.sort_values(by = ["Red Cards","Yellow Cards"])
- Output
# reference answer
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)
9. Compute the mean number of yellow cards per team
discipline["Yellow Cards"].mean()
- Output
7.4375
# reference answer
round(discipline['Yellow Cards'].mean())
- Output
7
10. Find the teams with more than 6 goals
euro12[euro12["Goals"] > 6]
# reference answer
euro12[euro12.Goals > 6]
11. Select the teams whose names start with the letter G
euro12[euro12.Team.str.startswith('G')]
12. Select the first 7 columns
euro12.head(7) # note: head(7) returns the first 7 rows, not columns
# reference answer
euro12.iloc[: , 0:7]
13. Select all columns except the last 3
euro12.iloc[: , :-3]
14. Find the Shooting Accuracy of England, Italy and Russia
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
- Output
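Boolean filtering with isin, as in step 14, can be sketched on a toy frame (the values below are made up):

```python
import pandas as pd

# Made-up stand-in for euro12: keep only rows whose Team is in a given list.
df = pd.DataFrame({"Team": ["England", "Italy", "Spain"],
                   "Shooting Accuracy": ["50.0%", "43.0%", "55.9%"]})

# isin builds a boolean mask; teams absent from the frame (Russia here) are
# simply not matched rather than causing an error.
mask = df["Team"].isin(["England", "Italy", "Russia"])
selected = df.loc[mask, "Team"].tolist()
print(selected)  # ['England', 'Italy']
```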
Exercise 3 - Grouping
- Explore the alcohol consumption data
1. Import the necessary libraries
import pandas as pd
2. Read the data from the address below into a DataFrame named drinks
drinks = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\drinks.csv")
drinks
- Output
3. Inspect the data
drinks.info()
#country - country
#beer_servings - beer servings
#spirit_servings - spirit (liquor) servings
#wine_servings - wine servings
#total_litres_of_pure_alcohol - total litres of pure alcohol
#continent - continent
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 193 non-null object
1 beer_servings 193 non-null int64
2 spirit_servings 193 non-null int64
3 wine_servings 193 non-null int64
4 total_litres_of_pure_alcohol 193 non-null float64
5 continent 170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
4. Which continent drinks the most beer on average?
drinks.groupby("continent").aggregate({"beer_servings":"mean"}).sort_values(by = "beer_servings",ascending = False)
- Output
beer_servings
continent
EU 193.777778
SA 175.083333
OC 89.687500
AF 61.471698
AS 37.045455
# reference answer
drinks.groupby('continent').beer_servings.mean()
- Output
continent
AF 61.471698
AS 37.045455
EU 193.777778
OC 89.687500
SA 175.083333
Name: beer_servings, dtype: float64
5. Print descriptive statistics of wine consumption (wine_servings) for each continent
drinks.groupby("continent").describe()["wine_servings"]
- Output
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
# reference answer
drinks.groupby('continent').wine_servings.describe()
- Output
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
6. Print the mean consumption of every drink type for each continent
drinks.groupby("continent").aggregate({"beer_servings":"mean","spirit_servings":"mean","wine_servings":"mean"})
- Output
beer_servings spirit_servings wine_servings
continent
AF 61.471698 16.339623 16.264151
AS 37.045455 60.840909 9.068182
EU 193.777778 132.555556 142.222222
OC 89.687500 58.437500 35.625000
SA 175.083333 114.750000 62.416667
# reference answer
drinks.groupby('continent').mean()
- Output
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.471698 16.339623 16.264151 3.007547
AS 37.045455 60.840909 9.068182 2.170455
EU 193.777778 132.555556 142.222222 8.617778
OC 89.687500 58.437500 35.625000 3.381250
SA 175.083333 114.750000 62.416667 6.308333
7. Print the median consumption of every drink type for each continent
drinks.groupby("continent").aggregate({"beer_servings":"quantile","spirit_servings":"quantile","wine_servings":"quantile"}) # quantile defaults to q=0.5, i.e. the median
- Output
beer_servings spirit_servings wine_servings
continent
AF 32.0 3.0 2.0
AS 17.5 16.0 1.0
EU 219.0 122.0 128.0
OC 52.5 37.0 8.5
SA 162.5 108.5 12.0
# reference answer
drinks.groupby('continent').median()
- Output
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85
8. Print the mean, max and min spirit consumption for each continent
drinks.groupby("continent").describe()["spirit_servings"][["mean","max","min"]]
- Output
mean max min
continent
AF 16.339623 152.0 0.0
AS 60.840909 326.0 0.0
EU 132.555556 373.0 0.0
OC 58.437500 254.0 0.0
SA 114.750000 302.0 25.0
# reference answer
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
- Output
mean min max
continent
AF 16.339623 0 152
AS 60.840909 0 326
EU 132.555556 0 373
OC 58.437500 0 254
SA 114.750000 25 302
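The list form of agg used in the reference answer generalizes to any set of reductions. A toy sketch with made-up numbers:

```python
import pandas as pd

# agg with a list of reduction names yields one output column per reduction.
df = pd.DataFrame({"continent": ["AF", "AF", "EU"],
                   "spirit_servings": [0, 152, 373]})

out = df.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
print(out.loc["AF", "mean"])  # 76.0, i.e. (0 + 152) / 2
```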
Exercise 4 - The apply function
- Explore the 1960-2014 US crime data
1. Import the necessary libraries
import pandas as pd
2. Read the dataset from the address below into a DataFrame named crime
crime = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\US_Crime_Rates_1960_2014.csv")
3. Get to know the data
crime.head()
#Year - year
#Population - population
#Total - total offenses
#Violent - violent crime
#Property - property crime
#Murder - murder
#Forcible_Rape - forcible rape
#Robbery - robbery
#Aggravated_assault - aggravated assault
#Burglary - burglary
#Larceny_Theft - larceny-theft
#Vehicle_Theft - vehicle theft
- Output
4. What is the data type of each column?
crime.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null int64
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: int64(12)
memory usage: 5.3 KB
5. Convert Year to datetime64
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null datetime64[ns]
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
6. Set the Year column as the index of the DataFrame
crime.set_index("Year",inplace = True)
crime.head()
- Output
# reference answer
crime = crime.set_index('Year', drop = True)
crime.head()
7. Delete the column named Total
crime.drop(columns = ["Total"],inplace = True)
# reference answer
del crime['Total']
crime.head()
8. Group the DataFrame by Year and sum
crime.groupby("Year").sum().head()
- Output
# I did not fully understand this part at first: the reference answer actually aggregates per decade, not per year
# reference answer
# more on .resample:
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# more on offset aliases:
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# run the following code
crimes = crime.resample('10AS').sum() # resample the time series per decade
# use resample to take the max of the "Population" column
population = crime['Population'].resample('10AS').max()
# update "Population" (summing a population headcount across years would double-count people)
crimes['Population'] = population
crimes
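The decade resampling above can be sketched on a tiny series. '10AS' bins are anchored at the start of the first year (newer pandas spells this alias '10YS'):

```python
import pandas as pd

# Three yearly observations: 1960 and 1961 fall in the first ten-year bin,
# 1970 opens the next one, so the sums come out as [1+2, 4].
s = pd.Series([1, 2, 4], index=pd.to_datetime(["1960", "1961", "1970"]))
decades = s.resample("10AS").sum()
print(decades.tolist())  # [3, 4]
```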
9. When was the most dangerous time to live in US history?
crime.idxmax(0) # index label (year) of each column's maximum
- Output
Population 2014
Total 1991
Violent 1992
Property 1991
Murder 1991
Forcible_Rape 1992
Robbery 1991
Aggravated_assault 1993
Burglary 1980
Larceny_Theft 1991
Vehicle_Theft 1991
dtype: int64
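Because Year is the index, idxmax returns index labels; with a datetime index those labels are Timestamps, so the year can be read off directly. A toy sketch with a made-up subset of the Violent column:

```python
import pandas as pd

# idxmax returns the index label of the maximum, here a Timestamp.
df = pd.DataFrame(
    {"Violent": [288460, 1932274, 1926017]},
    index=pd.to_datetime(["1960", "1992", "1993"]),
)
peak = df["Violent"].idxmax()
print(peak.year)  # 1992
```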
Exercise 5 - Merging
- Explore fictitious name data
1. Import the necessary libraries
import pandas as pd
2. Create the raw data as dictionaries, as follows
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
3. Build DataFrames named data1, data2 and data3 from the dictionaries above
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
print(data1)
print("----------")
print(data2)
print("----------")
print(data3)
- Output
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
----------
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
----------
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
4. Concatenate data1 and data2 along the row axis and name the result all_data
all_data = pd.concat([data1,data2]) # note: the original indexes are kept, so the row labels repeat
all_data
- Output
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
5. Concatenate data1 and data2 along the column axis and name the result all_data_col
all_data_col = pd.concat([data1, data2], axis = 1)
all_data_col
- Output
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
all_data_col = pd.merge(data1, data2,left_index = True,right_index = True) # merging on the index gives a similar result, with _x/_y suffixes
all_data_col
- Output
subject_id_x first_name_x last_name_x subject_id_y first_name_y last_name_y
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
6. Print data3
data3
- Output
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
7. Merge all_data and data3 on the values of subject_id
pd.merge(all_data,data3,on = "subject_id")
- Output
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
8. Join data1 and data2 on subject_id
pd.merge(data1,data2,on = "subject_id")
- Output
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
9. Find all matches between data1 and data2 after merging
pd.merge(data1,data2,how = "outer",on = "subject_id")
- Output
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan
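When auditing an outer merge like the one above, merge's indicator flag labels where each row came from. A toy sketch using a made-up subset of the subject_ids:

```python
import pandas as pd

# indicator=True adds a _merge column telling whether each row came from the
# left frame only, the right frame only, or both.
d1 = pd.DataFrame({"subject_id": ["1", "4"], "first_name": ["Alex", "Alice"]})
d2 = pd.DataFrame({"subject_id": ["4", "6"], "first_name": ["Billy", "Bran"]})
out = pd.merge(d1, d2, on="subject_id", how="outer", indicator=True)
print(out["_merge"].tolist())  # ['left_only', 'both', 'right_only']
```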
Exercise 6 - Statistics
- Explore the wind speed data
1. Import the necessary libraries
import pandas as pd
import time
import datetime
import dateutil
2. Read the data from the address below
3. Store the data and build a proper index from the first three columns
data = pd.read_table(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\wind.data",sep = "\s+",parse_dates = [[0,1,2]])
data.head()
# specifying parse_dates makes reading the file noticeably slower
# infer_datetime_format=True can cut the date-parsing time significantly
# keep_date_col=True/False controls whether the columns parsed into the date are kept (True) or dropped (False)
- Output
4. The year 2061? Do we really have data from that year? Write a function to fix this bug
# subtract 100 years from the parsed dates
# note: this subtracts 100 years from every row; the reference answer below only fixes years that parsed into the future
data["Yr_Mo_Dy"] = data["Yr_Mo_Dy"].apply(lambda x :x - dateutil.relativedelta.relativedelta(years=100))
data.head()
- Output
# reference answer
# run the following code
def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)
# apply fix_century on the column and replace the values with the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
5. Set the date as the index; watch the dtype, it should be datetime64[ns]
data.set_index("Yr_Mo_Dy",drop = True,inplace = True)
data.head()
- Output
# reference answer
# transform Yr_Mo_Dy to the datetime64 type
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# data.info()
6. How many values are missing for each location?
data.info()
- Output
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6574 entries, 1961-01-01 to 1978-12-31
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RPT 6568 non-null float64
1 VAL 6571 non-null float64
2 ROS 6572 non-null float64
3 KIL 6569 non-null float64
4 SHA 6572 non-null float64
5 BIR 6574 non-null float64
6 DUB 6571 non-null float64
7 CLA 6572 non-null float64
8 MUL 6571 non-null float64
9 CLO 6573 non-null float64
10 BEL 6574 non-null float64
11 MAL 6570 non-null float64
dtypes: float64(12)
memory usage: 667.7 KB
# reference answer
data.isnull().sum()
- Output
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
7. How many complete (non-missing) values are there for each location?
data.shape[0] - data.isnull().sum()
- Output
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
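Counting complete values can also be done directly with notna, the complement of isnull. A toy sketch with made-up wind readings:

```python
import numpy as np
import pandas as pd

# One missing value in RPT, none in VAL.
df = pd.DataFrame({"RPT": [15.04, np.nan, 10.83],
                   "VAL": [14.96, 8.5, 6.5]})

# Summing notna() counts complete values per column without
# subtracting from shape[0].
counts = df.notna().sum()
print(counts.tolist())  # [2, 3]
```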
8. Compute the mean wind speed over all the data
data.mean().mean()
- Output
10.227982360836924
9. Create a DataFrame named loc_stats holding the min, max, mean and standard deviation of the wind speed at each location
loc_stats = data.aggregate(["min","max","mean","std"])
loc_stats
- Output
# reference answer
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations
loc_stats
- Output
10. Create a DataFrame named day_stats holding the min, max, mean and standard deviation of the wind speed across all locations for each day
day_stats = data.aggregate(["min","max","mean","std"],axis=1)
day_stats.head()
- Output
min max mean std
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
# reference answer
# create the dataframe
day_stats = pd.DataFrame()
# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations
day_stats.head()
- Output
min max mean std
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
11. For each location, compute the mean wind speed in January
- Note: January 1961 and January 1962 are both counted as January
# creates a new column 'date' and gets the values from the index
data['date'] = data.index
# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# gets all values from month 1 and assigns them to january_winds
january_winds = data.query('month == 1')
# gets the mean from january_winds, using .loc so the month, year and day helper columns are excluded
january_winds.loc[:,'RPT':"MAL"].mean()
- Output
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
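With a DatetimeIndex, the helper month/year/day columns are not strictly needed: the index exposes .month directly. A toy sketch with made-up readings:

```python
import pandas as pd

# Filter on data.index.month instead of creating helper columns.
idx = pd.to_datetime(["1961-01-01", "1961-02-01", "1962-01-01"])
df = pd.DataFrame({"RPT": [15.0, 10.0, 13.0]}, index=idx)

jan_mean = df.loc[df.index.month == 1, "RPT"].mean()
print(jan_mean)  # 14.0, the mean of the two January readings
```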
12. Sample the records at a yearly frequency
data.query('month == 1 and day == 1')
- Output
13. Sample the records at a monthly frequency
data.query('day == 1')
- Output
Exercise 7 - Visualization
- Explore the Titanic disaster data
1. Import the necessary libraries
import pandas as pd
2. Read the data from the address below
3. Name the DataFrame titanic
titanic = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\train.csv")
titanic.head()
#PassengerId - passenger id
#Survived - whether the passenger survived
#Pclass - ticket class
#Name - name
#Sex - sex
#Age - age
#SibSp - number of siblings/spouses aboard
#Parch - number of parents/children aboard
#Ticket - ticket number
#Fare - fare paid
#Cabin - cabin number
#Embarked - port of embarkation
- Output
4. Set PassengerId as the index
titanic.set_index("PassengerId",inplace = True)
titanic.head()
- Output
5. Draw a pie chart showing the proportion of male and female passengers
import matplotlib.pyplot as plt
male_ct = titanic[titanic["Sex"] == "male"].shape[0] # row count; positional lookups like .count()[1] are deprecated
female_ct = titanic[titanic["Sex"] == "female"].shape[0]
x = [male_ct,female_ct]
plt.pie(x,
labels = ["male","female"],
explode = (0.1 , 0),
startangle = 90,
autopct = '%1.1f%%')
plt.axis('equal')
plt.title("Sex Proportion")
plt.tight_layout()
plt.show()
- Output
# reference answer
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# Create a pie chart
plt.pie(
# using proportions
proportions,
# with the labels being officer names
labels = ['Males', 'Females'],
# with no shadows
shadow = False,
# with colors
colors = ['blue','red'],
# with one slide exploded out
explode = (0.15 , 0),
# with the start angle at 90%
startangle = 90,
# with the percent listed as a fraction
autopct = '%1.1f%%')
# View the plot drop above
plt.axis('equal')
# Set labels
plt.title("Sex Proportion")
# View the plot
plt.tight_layout()
plt.show()
- Output
6. Draw a scatter plot of ticket fare (Fare) against passenger age, colored by sex
import seaborn as sns
# creates the plot using
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)
# set title
lm.set(title = 'Fare x Age')
# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
- Output
7. How many passengers survived?
titanic.query("Survived == 1").Survived.count()
- Output
342
# reference answer
titanic.Survived.sum()
- Output
342
8. Draw a histogram of the ticket fares
fare = titanic["Fare"]
plt.hist(fare,
bins = 20)
plt.axis('tight')
plt.tight_layout()
plt.show()
- Output
# reference answer
# sort the values from the top to the least value and slice the first 5 items
import numpy as np
df = titanic.Fare.sort_values(ascending = False)
df
# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal
# create the plot
plt.hist(df, bins = binsVal)
# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Payed Histrogram')
# show the plot
plt.show()
- Output
Exercise 8 - Creating DataFrames
- Explore the Pokemon data
1. Import the necessary libraries
import pandas as pd
2. Create a data dictionary
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
"evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
"type": ['grass', 'fire', 'water', 'bug'],
"hp": [45, 39, 44, 45],
"pokedex": ['yes', 'no','yes','no']
}
3. Store the dictionary in a DataFrame named pokemon
pokemon = pd.DataFrame(raw_data)
pokemon.head()
- Output
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no
4. The columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon
- Output
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
5. Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
- Output
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest
6. Check the data type of each column
pokemon.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 type 4 non-null object
2 hp 4 non-null int64
3 evolution 4 non-null object
4 pokedex 4 non-null object
5 place 4 non-null object
dtypes: int64(1), object(5)
memory usage: 320.0+ bytes
# reference answer
pokemon.dtypes
- Output
name object
type object
hp int64
evolution object
pokedex object
place object
dtype: object
Exercise 9 - Time series
- Explore the Apple stock price data
1. Import the necessary libraries
import pandas as pd
2. Dataset address
3. Read the data into a DataFrame named apple
apple = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\Apple_stock.csv")
apple.head()
- Output
4. Check the data type of each column
apple.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 8465 non-null object
1 Open 8465 non-null float64
2 High 8465 non-null float64
3 Low 8465 non-null float64
4 Close 8465 non-null float64
5 Volume 8465 non-null int64
6 Adj Close 8465 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 463.1+ KB
5. Convert the Date column to the datetime type
apple["Date"] = apple["Date"].apply(lambda x : pd.to_datetime(x)) # pd.to_datetime(apple["Date"]) does this in one vectorized call
apple.info()
- Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 8465 non-null datetime64[ns]
1 Open 8465 non-null float64
2 High 8465 non-null float64
3 Low 8465 non-null float64
4 Close 8465 non-null float64
5 Volume 8465 non-null int64
6 Adj Close 8465 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 463.1 KB
6. Set Date as the index
apple.set_index("Date",drop = True,inplace = True)
apple.head()
- Output
# reference answer
apple = apple.set_index('Date')
apple.head()
7. Are there any duplicated dates?
apple.groupby("Date").count().sort_values(by = "Date",ascending = False)
- Output
# reference answer
apple.index.is_unique
- Output
True
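index.duplicated() answers the same question as index.is_unique and also shows where any repeats sit. A toy sketch with a deliberately duplicated date:

```python
import pandas as pd

# is_unique flags the repeated date; duplicated() locates and counts it.
idx = pd.to_datetime(["1984-07-02", "1984-07-03", "1984-07-03"])
s = pd.Series([1.0, 2.0, 3.0], index=idx)
print(s.index.is_unique)           # False
print(s.index.duplicated().sum())  # 1
```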
8. Sort the index in ascending order
apple.sort_values(by = "Date",ascending = True)
- Output
# reference answer
apple.sort_index(ascending = True).head()
- Output
9. Find the last business day of each month
apple_month = apple.resample('BM')
apple_month.agg("mean")
- Output
10. How many days apart are the earliest and the latest dates in the dataset?
(apple.index.max() - apple.index.min()).days
- Output
12261
11. How many months are there in the data?
apple_months = apple.resample('BM').mean()
len(apple_months.index)
- Output
404
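Counting months without resampling: convert the DatetimeIndex to a monthly PeriodIndex and count the distinct periods. A toy sketch:

```python
import pandas as pd

# Two July dates and one August date span two distinct calendar months.
idx = pd.to_datetime(["1984-07-02", "1984-07-31", "1984-08-01"])
n_months = idx.to_period("M").nunique()
print(n_months)  # 2
```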
13. Plot the Adj Close values in chronological order
# make the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title = "Apple Stock")
# change the size of the figure
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)
- Output
Exercise 10 - Deleting data
- Explore the Iris flower data
1. Import the necessary libraries
import pandas as pd
import numpy as np
2. Dataset address
3. Read the dataset into a variable named iris
iris = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\iris.csv")
iris.head()
- Output
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
4. Create column names for the DataFrame
#iris.columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class']
iris.rename(columns={'5.1':'sepal_length','3.5':'sepal_width','1.4':'petal_length','0.2':'petal_width','Iris-setosa':'class'},inplace = True)
# note: the file has no header row, so the first record was read in as the column names
iris.head()
- Output
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5. Are there missing values in the DataFrame?
iris.isnull().sum()
- Output
5.1 0
3.5 0
1.4 0
0.2 0
Iris-setosa 0
dtype: int64
6. Set rows 10 to 19 of the column petal_length to missing
iris["petal_length"][9:19] = np.nan # chained assignment; prefer iris.loc[9:18, "petal_length"] = np.nan
# reference answer
iris.iloc[10:20,2:3] = np.nan
iris.head(20)
- Output
7. Replace all missing values with 1.0
iris.fillna(1.0,inplace = True)
# reference answer
iris.petal_length.fillna(1, inplace = True)
iris.head(20)
- Output
8. Delete the column class
iris.drop(columns="class",inplace = True)
# reference answer
del iris['class']
iris.head()
- Output
sepal_length sepal_width petal_length petal_width
0 4.9 3.0 1.4 0.2
1 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
9. Set the first three rows of the DataFrame to missing
iris[:3] = np.nan
iris.head(20)
# reference answer
iris.iloc[0:3 ,:] = np.nan
iris.head()
- Output
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
10. Delete the rows that contain missing values
iris.dropna(inplace=True)
# reference answer
iris = iris.dropna(how='any')
iris.head()
- Output
sepal_length sepal_width petal_length petal_width
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
5 4.6 3.4 1.4 0.3
6 5.0 3.4 1.5 0.2
7 4.4 2.9 1.4 0.2
11. Reset the index
iris.reset_index() # without drop=True the old index becomes a new column, and the result is not assigned back
- Output
# reference answer
iris = iris.reset_index(drop = True)
iris.head()
- Output
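The difference between the two reset_index calls is the drop flag. A toy sketch showing that drop=True discards the old row labels instead of keeping them as a column:

```python
import numpy as np
import pandas as pd

# dropna leaves a gap in the row labels (index starts at 1);
# reset_index(drop=True) renumbers from 0 and throws the old labels away.
df = pd.DataFrame({"sepal_length": [np.nan, 5.0, 5.4]}).dropna()
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1]
```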