These Ten Exercise Sets Teach You How to Use Pandas for Data Analysis

Latest recommended article published on 2024-07-22 21:10:20

curd_boy

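Category: Data Analysis and Data Mining. Tags: These Ten Exercise Sets Teach You How to Use Pandas for Data Analysis, Pandas, Pandas exercises, Pandas data analysis practice problems (ten sets)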

Article link: https://blog.csdn.net/weixin_43746433/article/details/90454463

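Pandas is a library that anyone starting out with data analysis in Python needs to master. This post collects ten exercise sets that help readers get hands-on with Python code and work through exploring each dataset.
Dataset download: https://github.com/Rango-2017/Pandas_exercises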

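Exercise 1 - Getting to know your data
Explore the Chipotle fast-food data
– Load the dataset into a DataFrame named chipo
– Show the first 10 rows
– How many columns does the dataset have?
– Print the names of all the columns
– How is the dataset indexed?
– Which item was ordered the most?
– How many distinct items appear in the item_name column?
– Which choice_description was ordered the most?
– How many items were ordered in total?
– Convert item_price to a float
– What was the revenue for the period covered by the dataset?
– How many orders were placed in that period?
– What is the average total per order?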

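import pandas as pd

# Load the tab-separated file into a DataFrame named chipo
chipo = pd.read_csv('../data/chipotle.tsv', sep='\t')
print(chipo.head(10))
print(chipo.shape)  # (rows, columns)
print('columns:', chipo.columns)
chipo.info()  # info() prints its report itself and returns None
print('index\n', chipo.index)
# Most frequent item_name (this counts order lines, not the quantity column)
print(chipo['item_name'].value_counts().head(1))
print(chipo['item_name'].nunique())  # number of distinct items
print(chipo['choice_description'].value_counts().head())
print('Total number of items ordered:', chipo['quantity'].sum())

# Convert item_price to float: strip surrounding whitespace and the leading '$'
dollarizer = lambda x: float(x.strip().lstrip('$'))
chipo.item_price = chipo.item_price.apply(dollarizer)
print('item_price as float:\n', chipo.item_price.head())

# Revenue for the period: the sum of item_price over all rows
print('Revenue:', chipo.item_price.sum())
# Number of orders in the period
print('Number of orders:', chipo['order_id'].nunique())
# Average total per order: sum the prices within each order, then average
print('Average total per order:', chipo.groupby('order_id')['item_price'].sum().mean())
print('Number of distinct items sold:', chipo['item_name'].nunique())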
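As a rough cross-check of the price conversion and the per-order average, the sketch below runs the same steps on a tiny made-up table. The rows and the name `sample` are illustrative assumptions, not values from chipotle.tsv, and the conversion uses vectorized string methods rather than a lambda.

```python
import pandas as pd

# Hypothetical sample rows shaped like chipotle.tsv (values are made up).
sample = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "quantity": [1, 2, 1, 1, 1],
    "item_name": ["Chips", "Izze", "Chicken Bowl", "Chips", "Chicken Bowl"],
    "item_price": ["$2.00 ", "$6.00 ", "$8.00 ", "$2.00 ", "$8.00 "],
})

# Strip whitespace and the leading dollar sign, then convert to float.
sample["item_price"] = (
    sample["item_price"].str.strip().str.lstrip("$").astype(float)
)

revenue = sample["item_price"].sum()
n_orders = sample["order_id"].nunique()
# Sum the prices within each order first, then average over the orders.
mean_order_total = sample.groupby("order_id")["item_price"].sum().mean()

print(revenue, n_orders, mean_order_total)
```

Selecting the item_price column before calling sum() keeps text columns such as item_name out of the aggregation.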

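Exercise 2 - Filtering and sorting
Explore the Euro 2012 data
– Name the dataset euro12
– Select only the Goals column
– How many teams took part in Euro 2012?
– How many columns does the dataset have?
– Store the columns Team, Yellow Cards and Red Cards in a separate DataFrame named discipline
– Sort discipline by Red Cards first, then by Yellow Cards
– Compute the mean number of yellow cards per team
– Find the teams that scored more than 6 Goals
– Select the teams whose names start with the letter G
– Select the first 7 columns
– Select all columns except the last 3
– Find the Shooting Accuracy of England, Italy and Russia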

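import pandas as pd

euro12 = pd.read_csv('../data/Euro2012.csv')
print(euro12.shape)  # (number of teams, number of columns)
print('columns:', euro12.columns)
euro12.info()  # info() prints its report itself and returns None

# Keep only the team name and the card counts
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
print(discipline)
# Sort by Red Cards first, then by Yellow Cards (both descending)
print(discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False))
# Mean number of yellow cards per team, rounded to the nearest integer
print('Mean yellow cards per team:', round(discipline['Yellow Cards'].mean()))
print('Teams with more than 6 goals:\n', euro12[euro12['Goals'] > 6])
# Teams whose names start with G
print(euro12[euro12['Team'].str.startswith('G')])
# Shooting accuracy of England, Italy and Russia
print(euro12.loc[euro12['Team'].isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']])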
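The column-selection tasks in the list (the Goals column, the number of teams, the first 7 columns, everything but the last 3) can be sketched with plain indexing and `iloc`. The frame below is a made-up stand-in for Euro2012.csv with only five columns, so the "first 7" slice is shortened to the first 2; all numbers are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for Euro2012.csv (made-up numbers, fewer columns).
euro12 = pd.DataFrame({
    "Team": ["Germany", "Greece", "Spain", "Italy"],
    "Goals": [10, 5, 12, 6],
    "Yellow Cards": [4, 9, 11, 16],
    "Red Cards": [0, 1, 0, 0],
    "Shooting Accuracy": ["47.8%", "30.7%", "55.9%", "43.0%"],
})

goals = euro12["Goals"]                    # a single column, returned as a Series
n_teams = euro12["Team"].nunique()         # number of distinct teams
first_two = euro12.iloc[:, :2]             # first 2 columns (use :7 on the real file)
all_but_last_three = euro12.iloc[:, :-3]   # every column except the last 3
high_scorers = euro12[euro12["Goals"] > 6]  # boolean mask keeps matching rows

print(n_teams)
print(list(all_but_last_three.columns))
```

`iloc` selects by position, so the slices keep working even when the column names are not known in advance.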

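Exercise 3 - Grouping
Explore the alcohol consumption data
– Name the DataFrame drinks
– Which continent drinks the most beer on average?
– Print descriptive statistics of wine consumption (wine_servings) for each continent
– Print the mean consumption of each drink category for each continent
– Print the median consumption of each drink category for each continent
– Print the mean, max and min spirit consumption for each continent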

import pandas as pd

drinks = pd.read_csv('../data/drinks.csv')
print(drinks.shape)
#print(drinks.values)
print(drinks.columns)
print(drinks.index)
# Which continent drinks the most beer on average?
print(drinks.groupby('continent')['beer_servings'].mean().sort_values(ascending=False).head(1))
# Descriptive statistics of wine_servings for each continent
print(drinks.groupby('continent').wine_servings.describe())
# Median consumption of each drink category for each continent
# (numeric_only=True skips the text column country)
print(drinks.groupby(['continent']).median(numeric_only=True))
# Mean, max and min of spirit_servings for each continent
print(drinks.groupby(['continent']).spirit_servings.agg(['mean', 'max', 'min']))
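A small made-up table makes it easier to see what a per-continent mean of every drink category returns, and why restricting the aggregation to numeric columns matters. The rows below are illustrative assumptions, not values from drinks.csv.

```python
import pandas as pd

# Hypothetical rows shaped like drinks.csv (values are made up).
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [100, 200, 50, 150],
    "spirit_servings": [10, 30, 20, 40],
    "wine_servings": [5, 15, 25, 35],
    "continent": ["EU", "EU", "AS", "AS"],
})

# numeric_only=True leaves the text column "country" out of the mean.
means = drinks.groupby("continent").mean(numeric_only=True)
# Continent label with the highest average beer consumption.
top_beer = means["beer_servings"].idxmax()

print(means)
print(top_beer)
```

`idxmax()` returns the group label directly, which avoids sorting the whole result just to read off the first row.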

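Exercise 4 - The apply function
Explore the 1960 - 2014 US crime data
– Name the DataFrame crime
– What is the data type of each column?
– Convert the data type of Year to datetime64
– Set the Year column as the index of the DataFrame
– Delete the column named Total
– Group the DataFrame by Year (in decades) and sum the values
– Which decade was the most dangerous to live in in US history?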

import pandas as pd

crime = pd.read_csv('../data/US_Crime_Rates_1960_2014.csv')
print(crime.shape)
print(crime.index)
print(crime.head())
print(crime.columns)
print(crime.dtypes)  # data type of each column
# Convert the Year column to datetime64
crime.Year = pd.to_datetime(crime.Year, format='%Y')
print(crime['Year'].head())
# Use Year as the index
crime = crime.set_index('Year', drop=True)
print(crime.head())
# Drop the Total column
crime = crime.drop(labels='Total', axis=1)
print(crime.shape)
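The decade grouping asked for in the task list can be sketched by labelling each row with the first year of its decade and grouping on that label. The data below is made up, and taking the maximum of Population (rather than summing it) is an assumption on my part, since a population level does not add up across years the way incident counts do.

```python
import pandas as pd

# Hypothetical yearly counts (made-up numbers) with a datetime index.
years = pd.to_datetime([str(y) for y in range(1960, 1980)], format="%Y")
crime = pd.DataFrame(
    {"Population": range(100, 120), "Violent": [1] * 10 + [3] * 10},
    index=years,
)

# Label every row with the first year of its decade: 1960, 1970, ...
decade = (crime.index.year // 10) * 10

# Counts are summed per decade; Population is a level, so take its maximum.
by_decade = crime.groupby(decade).sum()
by_decade["Population"] = crime["Population"].groupby(decade).max()

# Decade with the highest total of violent incidents.
most_dangerous = by_decade["Violent"].idxmax()

print(by_decade)
print(most_dangerous)
```

Grouping on an integer decade label avoids frequency aliases, whose spelling has changed between pandas releases.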

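Exercise 5 - Merging
Explore made-up name data
– Create the DataFrames
– Name the DataFrames data1, data2 and data3
– Combine data1 and data2 along the rows and name the result all_data
– Combine data1 and data2 along the columns and name the result all_data_col
– Print data3
– Merge all_data and data3 on the values of subject_id
– Join data1 and data2 on subject_id
– Find all matching results after merging data1 and data2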

import numpy as np
import pandas as pd
raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
print(raw_data_1)
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
print(data1)
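# Stack data1 and data2 vertically (row-wise)
all_data = pd.concat([data1, data2])
print(all_data)

# Place data1 and data2 side by side (column-wise)
all_data_col = pd.concat([data1, data2], axis=1)
print(all_data_col)

print(data3)
# Merge all_data with data3 on subject_id
print(pd.merge(all_data, data3, on='subject_id'))
# Inner join: only the subject_id values present in both data1 and data2
print(pd.merge(data1, data2, on='subject_id', how='inner'))
# Outer join: every subject_id value from either DataFrame
print(pd.merge(data1, data2, on='subject_id', how='outer'))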
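To see how the inner and outer joins differ, it can help to ask `merge` to label where each row came from. The sketch below uses two small made-up frames (`left` and `right` are hypothetical names) and the `indicator=True` option, which adds a `_merge` column.

```python
import pandas as pd

# Two hypothetical frames that share only some subject_id values.
left = pd.DataFrame({
    "subject_id": ["1", "2", "3"],
    "first_name": ["Alex", "Amy", "Allen"],
})
right = pd.DataFrame({
    "subject_id": ["2", "3", "4"],
    "test_id": [15, 61, 16],
})

# Inner join keeps only keys found in both frames.
inner = pd.merge(left, right, on="subject_id", how="inner")
# Outer join keeps every key; _merge records left_only, right_only or both.
outer = pd.merge(left, right, on="subject_id", how="outer", indicator=True)

print(inner)
print(outer)
```

Rows marked `left_only` or `right_only` are exactly the ones the inner join drops.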

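Exercise 6 - Statistics
Explore the wind speed data
– Load the data and turn the first three columns into a proper date index
– The year 2061? Do we really have data for that year? Write a function and use it to fix this bug
– Set the date as the index; mind the data type, which should be datetime64[ns]
– How many values are missing for each location?
– How many non-missing values are there for each location?
– Compute the mean wind speed over the whole dataset
– Create a DataFrame named loc_stats that stores the min, max, mean and standard deviation of wind speed for each location
– Create a DataFrame named day_stats that stores the min, max, mean and standard deviation of wind speed across all locations for each day
– For each location, compute the mean wind speed in January
– Sample the records at a yearly frequency
– Sample the records at a monthly frequency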

import pandas as pd
import datetime

# Columns 0-2 (Yr, Mo, Dy) are combined into a single date column, Yr_Mo_Dy.
# Note: combining columns through a nested parse_dates list is deprecated in recent pandas releases.
df = pd.read_csv('../data/wind.csv', sep=r'\s+', parse_dates=[[0, 1, 2]])
print(df.shape)
print(df.values)
print(df.columns)

# Two-digit years were parsed into the wrong century (2061 instead of 1961)
def fix_century(x):
    year = x.year - 100 if x.year > 1999 else x.year
    return datetime.date(year, x.month, x.day)
df['Yr_Mo_Dy'] = df['Yr_Mo_Dy'].apply(fix_century)
# Set the date as the index; the dtype should be datetime64[ns]
df['Yr_Mo_Dy'] = pd.to_datetime(df['Yr_Mo_Dy'])
print(df.head())
df = df.set_index('Yr_Mo_Dy')
print('\n', df.head())

print('Missing values per location:\n', df.isnull().sum())
# Non-missing values per location
print(df.notnull().sum())

# Mean wind speed over all values (stacking first weights every reading equally)
print(df.stack().mean())

# loc_stats: min, max, mean and standard deviation of wind speed for each location
loc_stats = pd.DataFrame()
loc_stats['min'] = df.min()
loc_stats['max'] = df.max()
loc_stats['mean'] = df.mean()
loc_stats['std'] = df.std()
print(loc_stats)


# day_stats: min, max, mean and standard deviation across all locations for each day
day_stats = pd.DataFrame()
day_stats['min'] = df.min(axis=1)
day_stats['max'] = df.max(axis=1)
day_stats['mean'] = df.mean(axis=1)
day_stats['std'] = df.std(axis=1)
print(day_stats.head())

# Mean wind speed in January for each location
df['date'] = df.index
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
print('df.year:', df.year.head())

january_winds = df[df.month == 1]
print(january_winds.loc[:, 'RPT':'MAL'].mean())

# Yearly sampling: keep the first day of every year
print(df.query('month == 1 and day == 1').head())
# Monthly sampling: keep the first day of every month
print(df.query('day == 1').head())
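Another reading of "sample at a yearly or monthly frequency" is to aggregate every period into one row instead of picking its first day. The sketch below does that with `groupby` on made-up daily data; the frame `wind` and its values are illustrative assumptions, with two station columns named after the real ones.

```python
import pandas as pd

# Hypothetical daily wind speeds for two stations over two years.
idx = pd.date_range("1961-01-01", "1962-12-31", freq="D")
wind = pd.DataFrame({"RPT": 10.0, "VAL": 5.0}, index=idx)
# Make the second year differ so the yearly means are distinguishable.
wind.loc[wind.index.year == 1962, "RPT"] = 20.0

# One row per calendar year: the mean of all days in that year.
yearly = wind.groupby(wind.index.year).mean()
# One row per calendar month: group on a monthly period label.
monthly = wind.groupby(wind.index.to_period("M")).mean()
# Mean over every January in the data, per station.
january = wind[wind.index.month == 1].mean()

print(yearly)
print(monthly.head())
print(january)
```

Grouping on `index.year` or a monthly period label gives the same kind of result as `resample` while avoiding frequency alias strings.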

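Exercise 7 - Visualization
Explore the Titanic disaster data
– Name the DataFrame titanic
– Set PassengerId as the index
– Draw a pie chart showing the proportion of male and female passengers
– Draw a scatter plot of Fare against passenger age, coloured by sex
– How many people survived?
– Draw a histogram of ticket fares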

import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import numpy as np

# Load the data into a DataFrame named titanic
titanic = pd.read_csv('../data/train.csv')
print(titanic.shape)
print(titanic.columns)
print(titanic.head())
# Set PassengerId as the index
titanic = titanic.set_index('PassengerId')
print(titanic.head())

# Pie chart of the proportion of male and female passengers
Male = (titanic.Sex == 'male').sum()
Female = (titanic.Sex == 'female').sum()
proportions = [Male, Female]
plt.pie(proportions, labels=['Male', 'Female'], shadow=True,
        autopct='%1.1f%%', startangle=90, explode=(0.15, 0))
plt.axis('scaled')
plt.title('Sex Proportion')
plt.tight_layout()  # adjust the layout so the chart fills the figure area
plt.show()

# Scatter plot of Fare against Age, coloured by Sex (requires seaborn)
#lm = sns.lmplot(x='Age',y='Fare', data=titanic,hue='Sex',fit_reg=False)
#lm.set(title='Fare x Age')

# How many people survived? Survived is 1 for survivors and 0 otherwise
print(titanic.Survived.sum())

# Histogram of ticket fares
fares = titanic.Fare.sort_values(ascending=False)
print(fares, fares.shape)

plt.hist(fares, bins=np.arange(0, 600, 10))
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
plt.show()

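Exercise 8 - Creating DataFrames
Explore the Pokemon data

– Create a data dictionary
– Store the dictionary in a DataFrame named pokemon
– The column order may not be the one wanted (older pandas versions sorted dictionary keys alphabetically); reorder the columns as name, type, hp, evolution, pokedex
– Add a column place with the values ['park', 'street', 'lake', 'forest']
– Check the data type of each column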

import pandas as pd
# Create a data dictionary
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']
            }
pokemon = pd.DataFrame(raw_data)
print(pokemon.columns)
# Reorder the columns by selecting them in the desired order
pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
print(pokemon.columns)

# Add a column place with the values ['park', 'street', 'lake', 'forest']
pokemon['place'] = ['park', 'street', 'lake', 'forest']

# Check the data type of each column
print(pokemon.dtypes)

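Exercise 9 - Time series
Explore the Apple stock price data
– Read the data into a DataFrame named apple
– Check the data type of each column
– Convert the Date column to a datetime type
– Set Date as the index
– Are there any duplicate dates?
– Sort the index in ascending order
– Find the last business day of each month
– How many days lie between the earliest and the latest date in the dataset?
– How many months are there in the data?
– Plot the Adj Close values in chronological order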

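import pandas as pd
import matplotlib.pyplot as plt

apple = pd.read_csv('../data/appl_1980_2014.csv')
print(apple.shape)
print(apple.columns)
print(apple.head())
print(apple.dtypes)  # data type of each column
# Convert Date to datetime and use it as the index
apple.Date = pd.to_datetime(apple.Date)
apple = apple.set_index('Date')
# Are there duplicate dates? True means every date is unique
print(apple.index.is_unique)

# Sort the index in ascending order
apple = apple.sort_index(ascending=True)
print(apple.head())

# Resample to business-month-end frequency; each row holds that month's mean
# ('BM' is spelled 'BME' in newer pandas releases)
apple_month = apple.resample('BM').mean()
print('Monthly means labelled by the last business day of each month:\n', apple_month.head())
print('Days between the earliest and the latest date:',
(apple.index.max() - apple.index.min()).days)

# How many months are there in the data?
print(len(apple_month))

# Plot Adj Close in chronological order
apple['Adj Close'].plot(title='Apple Stock').get_figure().set_size_inches(9, 5)
plt.show()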
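Note that `resample(...).mean()` yields monthly averages labelled with a month-end date rather than the rows actually traded on the last business day. If the rows themselves are wanted, one possible approach is to group the index by calendar month and take the latest date in each group. The sketch below uses a made-up series on business days; the values are illustrative only.

```python
import pandas as pd

# Hypothetical price series on business days only (values are made up).
idx = pd.bdate_range("2014-01-01", "2014-03-31")
apple = pd.DataFrame({"Adj Close": range(len(idx))}, index=idx)

# Turn the index into a Series so it can be grouped by calendar month.
dates = apple.index.to_series()
last_trading_day = dates.groupby(dates.dt.to_period("M")).max()

# Rows of the original frame that fall on those dates.
month_end_rows = apple[apple.index.isin(last_trading_day)]
# Number of distinct calendar months covered by the data.
n_months = apple.index.to_period("M").nunique()

print(last_trading_day)
print(n_months)
```

Because the dates come from the data itself, a month whose final weekday was a market holiday still resolves to a day that really appears in the index.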

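Exercise 10 - Deleting data
Explore the Iris flower data
– Store the dataset in a variable named iris
– Set the column names of the DataFrame to ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
– Are there any missing values in the DataFrame?
– Set rows 10 through 19 of the petal_length column to missing values
– Replace all missing petal_length values with 1.0
– Delete the column class
– Set the first three rows of the DataFrame to missing values
– Delete the rows that contain missing values
– Reset the index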
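import pandas as pd
import numpy as np

# Read the data into a DataFrame named iris.
# header=None assumes the file has no header row (as in the UCI iris.data file);
# without it, the first observation would be consumed as column names.
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris = pd.read_csv('../data/iris.data', header=None, names=cols)
print(iris.shape, '\n')
# Are there any missing values?
print(iris.isnull().sum())
# Set labels 10 through 19 of petal_length to NaN (.loc includes both endpoints)
iris.loc[10:19, 'petal_length'] = np.nan
print(iris.petal_length.loc[10:19])

# Replace the missing petal_length values with 1.0
iris['petal_length'] = iris['petal_length'].fillna(1.0)
print(iris.petal_length.loc[10:19])
# Delete the column class
iris = iris.drop(labels='class', axis=1)
print(iris.shape)

# Set the first three rows of the DataFrame to missing values
iris.loc[0:2, :] = np.nan
# Delete the rows that contain missing values
iris = iris.dropna(how='any')
print(iris.index, '\n', iris.head())
# Reset the index; with drop=True the old index is not kept as a new column
iris = iris.reset_index(drop=True)
print(iris.head())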
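The missing-value steps can be checked on a small made-up table. The sketch below assumes a default integer index and illustrative measurements; it assigns through a single `.loc` call and reassigns the result of `fillna`, which avoids chained assignment on a column selected from the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements (made-up values) with a default 0..29 index.
iris = pd.DataFrame({
    "sepal_length": np.linspace(4.0, 7.0, 30),
    "petal_length": np.linspace(1.0, 6.0, 30),
})

# .loc slices by label and includes both endpoints: labels 10..19 are 10 rows.
iris.loc[10:19, "petal_length"] = np.nan
n_missing = iris["petal_length"].isnull().sum()

# Reassign the filled column instead of calling fillna(inplace=True) on it.
iris["petal_length"] = iris["petal_length"].fillna(1.0)

# Blank out the first three rows, drop them, and renumber from 0.
iris.loc[0:2, :] = np.nan
iris = iris.dropna(how="any").reset_index(drop=True)

print(n_missing)
print(iris.head())
```

After the reset, the row that originally carried label 10 sits at label 7, because the three dropped rows no longer occupy positions in the index.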