Python数据分析习题（基于pandas&numpy模块）（下）

最新推荐文章于 2024-04-15 18:23:39 发布

O_Oreo

最新推荐文章于 2024-04-15 18:23:39 发布

阅读量3.3k

点赞数 2

分类专栏： Python练习文章标签： python 数据分析 pandas numpy

本文链接：https://blog.csdn.net/weixin_46960971/article/details/113815814

版权

Python练习专栏收录该内容

2 篇文章 1 订阅

订阅专栏

练习题目录

习题编号内容相应数据集

练习6 - 统计探索风速数据 wind.data
练习7 - 可视化探索泰坦尼克灾难数据 train.csv
练习8 - 创建数据框探索Pokemon数据练习中手动内置的数据
练习9 - 时间序列探索Apple公司股价数据 Apple_stock.csv
练习10 - 删除数据探索Iris纸鸢花数据 iris.csv

练习6-统计

探索风速数据

步骤1 导入必要的库

import pandas as pd
import datetime

步骤2 从以下地址导入数据

path6 = "./exercise_data/wind.data"

步骤3 将数据作存储并且设置前三列为合适的索引

import datetime

data = pd.read_table(path6, sep = "\s+", parse_dates = [[0,1,2]]) 
data.head()

	Yr_Mo_Dy	RPT	VAL	ROS	KIL	SHA	BIR	DUB	CLA	MUL	CLO	BEL	MAL
0	2061-01-01	15.04	14.96	13.17	9.29	NaN	9.87	13.67	10.25	10.83	12.58	18.50	15.04
1	2061-01-02	14.71	NaN	10.83	6.50	12.62	7.67	11.50	10.04	9.79	9.67	17.54	13.83
2	2061-01-03	18.50	16.88	12.33	10.13	11.17	6.17	11.25	NaN	8.50	7.67	12.75	12.71
3	2061-01-04	10.58	6.63	11.75	4.58	4.54	2.88	8.63	1.79	5.83	5.88	5.46	10.88
4	2061-01-05	13.33	13.25	11.42	6.17	10.71	8.21	11.92	6.54	10.92	10.34	12.92	11.83

步骤4 2061年？我们真的有这一年的数据？创建一个函数并用它去修复这个bug

def fix_century(x):
    year = x.year - 100 if x.year > 1989 else x.year
    return datetime.date(year, x.month, x.day)

# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)

# data.info()
data.head()

	Yr_Mo_Dy	RPT	VAL	ROS	KIL	SHA	BIR	DUB	CLA	MUL	CLO	BEL	MAL
0	1961-01-01	15.04	14.96	13.17	9.29	NaN	9.87	13.67	10.25	10.83	12.58	18.50	15.04
1	1961-01-02	14.71	NaN	10.83	6.50	12.62	7.67	11.50	10.04	9.79	9.67	17.54	13.83
2	1961-01-03	18.50	16.88	12.33	10.13	11.17	6.17	11.25	NaN	8.50	7.67	12.75	12.71
3	1961-01-04	10.58	6.63	11.75	4.58	4.54	2.88	8.63	1.79	5.83	5.88	5.46	10.88
4	1961-01-05	13.33	13.25	11.42	6.17	10.71	8.21	11.92	6.54	10.92	10.34	12.92	11.83

步骤5 将日期设为索引，注意数据类型，应该是datetime64[ns]

data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')

data.head()
# data.info()

	RPT	VAL	ROS	KIL	SHA	BIR	DUB	CLA	MUL	CLO	BEL	MAL
Yr_Mo_Dy
1961-01-01	15.04	14.96	13.17	9.29	NaN	9.87	13.67	10.25	10.83	12.58	18.50	15.04
1961-01-02	14.71	NaN	10.83	6.50	12.62	7.67	11.50	10.04	9.79	9.67	17.54	13.83
1961-01-03	18.50	16.88	12.33	10.13	11.17	6.17	11.25	NaN	8.50	7.67	12.75	12.71
1961-01-04	10.58	6.63	11.75	4.58	4.54	2.88	8.63	1.79	5.83	5.88	5.46	10.88
1961-01-05	13.33	13.25	11.42	6.17	10.71	8.21	11.92	6.54	10.92	10.34	12.92	11.83

步骤6 对应每一个location，一共有多少数据值缺失

data.isnull().sum()

RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

步骤7 对应每一个location，一共有多少完整的数据值

data.shape[0] - data.isnull().sum()

RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

步骤8 对于全体数据，计算风速的平均值

data.mean().mean()

10.227982360836924

步骤9 创建一个名为loc_stats的数据框去计算并存储每个location的风速最小值，最大值，平均值和标准差

loc_stats = pd.DataFrame()

loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max 
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations

loc_stats

	min	max	mean	std
RPT	0.67	35.80	12.362987	5.618413
VAL	0.21	33.37	10.644314	5.267356
ROS	1.50	33.84	11.660526	5.008450
KIL	0.00	28.46	6.306468	3.605811
SHA	0.13	37.54	10.455834	4.936125
BIR	0.00	26.16	7.092254	3.968683
DUB	0.00	30.37	9.797343	4.977555
CLA	0.00	31.08	8.495053	4.499449
MUL	0.00	25.88	8.493590	4.166872
CLO	0.04	28.21	8.707332	4.503954
BEL	0.13	42.38	13.121007	5.835037
MAL	0.67	42.54	15.599079	6.699794

步骤10 创建一个名为day_stats的数据框去计算并存储所有location的风速最小值，最大值，平均值和标准差

day_stats = pd.DataFrame()

# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max 
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()

	min	max	mean	std
Yr_Mo_Dy
1961-01-01	9.29	18.50	13.018182	2.808875
1961-01-02	6.50	17.54	11.336364	3.188994
1961-01-03	6.17	18.50	11.641818	3.681912
1961-01-04	1.79	11.75	6.619167	3.198126
1961-01-05	6.17	13.33	10.630000	2.445356

步骤11 对于每一个location，计算一月份的平均风速

注意，1961年的1月和1962年的1月应该区别对待

data['date'] = data.index

# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)

# gets all value from the month 1 and assign to janyary_winds
january_winds = data.query('month == 1')

# gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,'RPT':"MAL"].mean()

RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64

步骤12 对于数据记录按照年为频率取4样

data.query('month == 1 and day == 1')

	RPT	VAL	ROS	KIL	SHA	BIR	DUB	CLA	MUL	CLO	BEL	MAL	date	month	year	day
Yr_Mo_Dy
1961-01-01	15.04	14.96	13.17	9.29	NaN	9.87	13.67	10.25	10.83	12.58	18.50	15.04	1961-01-01	1	1961	1
1962-01-01	9.29	3.42	11.54	3.50	2.21	1.96	10.41	2.79	3.54	5.17	4.38	7.92	1962-01-01	1	1962	1
1963-01-01	15.59	13.62	19.79	8.38	12.25	10.00	23.45	15.71	13.59	14.37	17.58	34.13	1963-01-01	1	1963	1
1964-01-01	25.80	22.13	18.21	13.25	21.29	14.79	14.12	19.58	13.25	16.75	28.96	21.00	1964-01-01	1	1964	1
1965-01-01	9.54	11.92	9.00	4.38	6.08	5.21	10.25	6.08	5.71	8.63	12.04	17.41	1965-01-01	1	1965	1
1966-01-01	22.04	21.50	17.08	12.75	22.17	15.59	21.79	18.12	16.66	17.83	28.33	23.79	1966-01-01	1	1966	1
1967-01-01	6.46	4.46	6.50	3.21	6.67	3.79	11.38	3.83	7.71	9.08	10.67	20.91	1967-01-01	1	1967	1
1968-01-01	30.04	17.88	16.25	16.25	21.79	12.54	18.16	16.62	18.75	17.62	22.25	27.29	1968-01-01	1	1968	1
1969-01-01	6.13	1.63	5.41	1.08	2.54	1.00	8.50	2.42	4.58	6.34	9.17	16.71	1969-01-01	1	1969	1
1970-01-01	9.59	2.96	11.79	3.42	6.13	4.08	9.00	4.46	7.29	3.50	7.33	13.00	1970-01-01	1	1970	1
1971-01-01	3.71	0.79	4.71	0.17	1.42	1.04	4.63	0.75	1.54	1.08	4.21	9.54	1971-01-01	1	1971	1
1972-01-01	9.29	3.63	14.54	4.25	6.75	4.42	13.00	5.33	10.04	8.54	8.71	19.17	1972-01-01	1	1972	1
1973-01-01	16.50	15.92	14.62	7.41	8.29	11.21	13.54	7.79	10.46	10.79	13.37	9.71	1973-01-01	1	1973	1
1974-01-01	23.21	16.54	16.08	9.75	15.83	11.46	9.54	13.54	13.83	16.66	17.21	25.29	1974-01-01	1	1974	1
1975-01-01	14.04	13.54	11.29	5.46	12.58	5.58	8.12	8.96	9.29	5.17	7.71	11.63	1975-01-01	1	1975	1
1976-01-01	18.34	17.67	14.83	8.00	16.62	10.13	13.17	9.04	13.13	5.75	11.38	14.96	1976-01-01	1	1976	1
1977-01-01	20.04	11.92	20.25	9.13	9.29	8.04	10.75	5.88	9.00	9.00	14.88	25.70	1977-01-01	1	1977	1
1978-01-01	8.33	7.12	7.71	3.54	8.50	7.50	14.71	10.00	11.83	10.00	15.09	20.46	1978-01-01	1	1978	1

步骤13 对于数据记录按照月为频率取样

data.query('day == 1')

	RPT	VAL	ROS	KIL	SHA	BIR	DUB	CLA	MUL	CLO	BEL	MAL	date	month	year	day
Yr_Mo_Dy
1961-01-01	15.04	14.96	13.17	9.29	NaN	9.87	13.67	10.25	10.83	12.58	18.50	15.04	1961-01-01	1	1961	1
1961-02-01	14.25	15.12	9.04	5.88	12.08	7.17	10.17	3.63	6.50	5.50	9.17	8.00	1961-02-01	2	1961	1
1961-03-01	12.67	13.13	11.79	6.42	9.79	8.54	10.25	13.29	NaN	12.21	20.62	NaN	1961-03-01	3	1961	1
1961-04-01	8.38	6.34	8.33	6.75	9.33	9.54	11.67	8.21	11.21	6.46	11.96	7.17	1961-04-01	4	1961	1
1961-05-01	15.87	13.88	15.37	9.79	13.46	10.17	9.96	14.04	9.75	9.92	18.63	11.12	1961-05-01	5	1961	1
1961-06-01	15.92	9.59	12.04	8.79	11.54	6.04	9.75	8.29	9.33	10.34	10.67	12.12	1961-06-01	6	1961	1
1961-07-01	7.21	6.83	7.71	4.42	8.46	4.79	6.71	6.00	5.79	7.96	6.96	8.71	1961-07-01	7	1961	1
1961-08-01	9.59	5.09	5.54	4.63	8.29	5.25	4.21	5.25	5.37	5.41	8.38	9.08	1961-08-01	8	1961	1
1961-09-01	5.58	1.13	4.96	3.04	4.25	2.25	4.63	2.71	3.67	6.00	4.79	5.41	1961-09-01	9	1961	1
1961-10-01	14.25	12.87	7.87	8.00	13.00	7.75	5.83	9.00	7.08	5.29	11.79	4.04	1961-10-01	10	1961	1
1961-11-01	13.21	13.13	14.33	8.54	12.17	10.21	13.08	12.17	10.92	13.54	20.17	20.04	1961-11-01	11	1961	1
1961-12-01	9.67	7.75	8.00	3.96	6.00	2.75	7.25	2.50	5.58	5.58	7.79	11.17	1961-12-01	12	1961	1
1962-01-01	9.29	3.42	11.54	3.50	2.21	1.96	10.41	2.79	3.54	5.17	4.38	7.92	1962-01-01	1	1962	1
1962-02-01	19.12	13.96	12.21	10.58	15.71	10.63	15.71	11.08	13.17	12.62	17.67	22.71	1962-02-01	2	1962	1
1962-03-01	8.21	4.83	9.00	4.83	6.00	2.21	7.96	1.87	4.08	3.92	4.08	5.41	1962-03-01	3	1962	1
1962-04-01	14.33	12.25	11.87	10.37	14.92	11.00	19.79	11.67	14.09	15.46	16.62	23.58	1962-04-01	4	1962	1
1962-05-01	9.62	9.54	3.58	3.33	8.75	3.75	2.25	2.58	1.67	2.37	7.29	3.25	1962-05-01	5	1962	1
1962-06-01	5.88	6.29	8.67	5.21	5.00	4.25	5.91	5.41	4.79	9.25	5.25	10.71	1962-06-01	6	1962	1
1962-07-01	8.67	4.17	6.92	6.71	8.17	5.66	11.17	9.38	8.75	11.12	10.25	17.08	1962-07-01	7	1962	1
1962-08-01	4.58	5.37	6.04	2.29	7.87	3.71	4.46	2.58	4.00	4.79	7.21	7.46	1962-08-01	8	1962	1
1962-09-01	10.00	12.08	10.96	9.25	9.29	7.62	7.41	8.75	7.67	9.62	14.58	11.92	1962-09-01	9	1962	1
1962-10-01	14.58	7.83	19.21	10.08	11.54	8.38	13.29	10.63	8.21	12.92	18.05	18.12	1962-10-01	10	1962	1
1962-11-01	16.88	13.25	16.00	8.96	13.46	11.46	10.46	10.17	10.37	13.21	14.83	15.16	1962-11-01	11	1962	1
1962-12-01	18.38	15.41	11.75	6.79	12.21	8.04	8.42	10.83	5.66	9.08	11.50	11.50	1962-12-01	12	1962	1
1963-01-01	15.59	13.62	19.79	8.38	12.25	10.00	23.45	15.71	13.59	14.37	17.58	34.13	1963-01-01	1	1963	1
1963-02-01	15.41	7.62	24.67	11.42	9.21	8.17	14.04	7.54	7.54	10.08	10.17	17.67	1963-02-01	2	1963	1
1963-03-01	16.75	19.67	17.67	8.87	19.08	15.37	16.21	14.29	11.29	9.21	19.92	19.79	1963-03-01	3	1963	1
1963-04-01	10.54	9.59	12.46	7.33	9.46	9.59	11.79	11.87	9.79	10.71	13.37	18.21	1963-04-01	4	1963	1
1963-05-01	18.79	14.17	13.59	11.63	14.17	11.96	14.46	12.46	12.87	13.96	15.29	21.62	1963-05-01	5	1963	1
1963-06-01	13.37	6.87	12.00	8.50	10.04	9.42	10.92	12.96	11.79	11.04	10.92	13.67	1963-06-01	6	1963	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1976-07-01	8.50	1.75	6.58	2.13	2.75	2.21	5.37	2.04	5.88	4.50	4.96	10.63	1976-07-01	7	1976	1
1976-08-01	13.00	8.38	8.63	5.83	12.92	8.25	13.00	9.42	10.58	11.34	14.21	20.25	1976-08-01	8	1976	1
1976-09-01	11.87	11.00	7.38	6.87	7.75	8.33	10.34	6.46	10.17	9.29	12.75	19.55	1976-09-01	9	1976	1
1976-10-01	10.96	6.71	10.41	4.63	7.58	5.04	5.04	5.54	6.50	3.92	6.79	5.00	1976-10-01	10	1976	1
1976-11-01	13.96	15.67	10.29	6.46	12.79	9.08	10.00	9.67	10.21	11.63	23.09	21.96	1976-11-01	11	1976	1
1976-12-01	13.46	16.42	9.21	4.54	10.75	8.67	10.88	4.83	8.79	5.91	8.83	13.67	1976-12-01	12	1976	1
1977-01-01	20.04	11.92	20.25	9.13	9.29	8.04	10.75	5.88	9.00	9.00	14.88	25.70	1977-01-01	1	1977	1
1977-02-01	11.83	9.71	11.00	4.25	8.58	8.71	6.17	5.66	8.29	7.58	11.71	16.50	1977-02-01	2	1977	1
1977-03-01	8.63	14.83	10.29	3.75	6.63	8.79	5.00	8.12	7.87	6.42	13.54	13.67	1977-03-01	3	1977	1
1977-04-01	21.67	16.00	17.33	13.59	20.83	15.96	25.62	17.62	19.41	20.67	24.37	30.09	1977-04-01	4	1977	1
1977-05-01	6.42	7.12	8.67	3.58	4.58	4.00	6.75	6.13	3.33	4.50	19.21	12.38	1977-05-01	5	1977	1
1977-06-01	7.08	5.25	9.71	2.83	2.21	3.50	5.29	1.42	2.00	0.92	5.21	5.63	1977-06-01	6	1977	1
1977-07-01	15.41	16.29	17.08	6.25	11.83	11.83	12.29	10.58	10.41	7.21	17.37	7.83	1977-07-01	7	1977	1
1977-08-01	4.33	2.96	4.42	2.33	0.96	1.08	4.96	1.87	2.33	2.04	10.50	9.83	1977-08-01	8	1977	1
1977-09-01	17.37	16.33	16.83	8.58	14.46	11.83	15.09	13.92	13.29	13.88	23.29	25.17	1977-09-01	9	1977	1
1977-10-01	16.75	15.34	12.25	9.42	16.38	11.38	18.50	13.92	14.09	14.46	22.34	29.67	1977-10-01	10	1977	1
1977-11-01	16.71	11.54	12.17	4.17	8.54	7.17	11.12	6.46	8.25	6.21	11.04	15.63	1977-11-01	11	1977	1
1977-12-01	13.37	10.92	12.42	2.37	5.79	6.13	8.96	7.38	6.29	5.71	8.54	12.42	1977-12-01	12	1977	1
1978-01-01	8.33	7.12	7.71	3.54	8.50	7.50	14.71	10.00	11.83	10.00	15.09	20.46	1978-01-01	1	1978	1
1978-02-01	27.25	24.21	18.16	17.46	27.54	18.05	20.96	25.04	20.04	17.50	27.71	21.12	1978-02-01	2	1978	1
1978-03-01	15.04	6.21	16.04	7.87	6.42	6.67	12.29	8.00	10.58	9.33	5.41	17.00	1978-03-01	3	1978	1
1978-04-01	3.42	7.58	2.71	1.38	3.46	2.08	2.67	4.75	4.83	1.67	7.33	13.67	1978-04-01	4	1978	1
1978-05-01	10.54	12.21	9.08	5.29	11.00	10.08	11.17	13.75	11.87	11.79	12.87	27.16	1978-05-01	5	1978	1
1978-06-01	10.37	11.42	6.46	6.04	11.25	7.50	6.46	5.96	7.79	5.46	5.50	10.41	1978-06-01	6	1978	1
1978-07-01	12.46	10.63	11.17	6.75	12.92	9.04	12.42	9.62	12.08	8.04	14.04	16.17	1978-07-01	7	1978	1
1978-08-01	19.33	15.09	20.17	8.83	12.62	10.41	9.33	12.33	9.50	9.92	15.75	18.00	1978-08-01	8	1978	1
1978-09-01	8.42	6.13	9.87	5.25	3.21	5.71	7.25	3.50	7.33	6.50	7.62	15.96	1978-09-01	9	1978	1
1978-10-01	9.50	6.83	10.50	3.88	6.13	4.58	4.21	6.50	6.38	6.54	10.63	14.09	1978-10-01	10	1978	1
1978-11-01	13.59	16.75	11.25	7.08	11.04	8.33	8.17	11.29	10.75	11.25	23.13	25.00	1978-11-01	11	1978	1
1978-12-01	21.29	16.29	24.04	12.79	18.21	19.29	21.54	17.21	16.71	17.83	17.75	25.70	1978-12-01	12	1978	1

216 rows × 16 columns

练习7-可视化

探索泰坦尼克灾难数据

步骤1 导入必要的库

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

步骤2 从以下地址导入数据

path7 = './exercise_data/train.csv'

步骤3 将数据框命名为titanic

titanic = pd.read_csv(path7)
titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

步骤4 将PassengerId设置为索引

titanic.set_index('PassengerId').head()

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

步骤5 绘制一个展示男女乘客比例的扇形图

males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    
    # with the labels being officer names
    labels = ['Males', 'Females'],
    
    # with no shadows
    shadow = False,
    
    # with colors
    colors = ['blue','red'],
    
    # with one slide exploded out
    explode = (0.15 , 0),
    
    # with the start angle at 90%
    startangle = 90,
    
    # with the percent listed as a fraction
    autopct = '%1.1f%%'
    )

# View the plot drop above
plt.axis('equal')

# Set labels
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()

在这里插入图片描述

步骤6 绘制一个展示船票Fare, 与乘客年龄和性别的散点图

lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)

# set title
lm.set(title = 'Fare x Age')

# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)

在这里插入图片描述

步骤7 有多少人生还？

titanic.Survived.sum()

步骤8 绘制一个展示船票价格的直方图

df = titanic.Fare.sort_values(ascending = False)
df

# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal

# create the plot
plt.hist(df, bins = binsVal)

# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Payed Histrogram')

# show the plot
plt.show()

在这里插入图片描述

练习8-创建数据框

探索Pokemon数据

步骤1 导入必要的库

import pandas as pd

步骤2 创建一个数据字典

raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']                        
            }

步骤3 将数据字典存为一个名叫pokemon的数据框中

pokemon = pd.DataFrame(raw_data)
pokemon.head()

	evolution	hp	name	pokedex	type
0	Ivysaur	45	Bulbasaur	yes	grass
1	Charmeleon	39	Charmander	no	fire
2	Wartortle	44	Squirtle	yes	water
3	Metapod	45	Caterpie	no	bug

步骤4 数据框的列排序是字母顺序，请重新修改为name, type, hp, evolution, pokedex这个顺序

pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon

	name	type	hp	evolution	pokedex
0	Bulbasaur	grass	45	Ivysaur	yes
1	Charmander	fire	39	Charmeleon	no
2	Squirtle	water	44	Wartortle	yes
3	Caterpie	bug	45	Metapod	no

步骤5 添加一个列place

pokemon['place'] = ['park','street','lake','forest']
pokemon

	name	type	hp	evolution	pokedex	place
0	Bulbasaur	grass	45	Ivysaur	yes	park
1	Charmander	fire	39	Charmeleon	no	street
2	Squirtle	water	44	Wartortle	yes	lake
3	Caterpie	bug	45	Metapod	no	forest

步骤6 查看每个列的数据类型

pokemon.dtypes

name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object

练习9-时间序列

探索Apple公司股价数据

步骤1 导入必要的库

import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

%matplotlib inline

步骤2 数据集地址

path9 = './exercise_data/Apple_stock.csv'

步骤3 读取数据并存为一个名叫apple的数据框

apple = pd.read_csv(path9)
apple.head()

	Date	Open	High	Low	Close	Volume	Adj Close
0	2014-07-08	96.27	96.80	93.92	95.35	65130000	95.35
1	2014-07-07	94.14	95.99	94.10	95.97	56305400	95.97
2	2014-07-03	93.67	94.10	93.20	94.03	22891800	94.03
3	2014-07-02	93.87	94.06	93.09	93.48	28420900	93.48
4	2014-07-01	93.52	94.07	93.13	93.52	38170200	93.52

步骤4 查看每一列的数据类型

apple.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object

步骤5 将Date这个列转换为datetime类型

apple.Date = pd.to_datetime(apple.Date)
apple['Date'].head()

0   2014-07-08
1   2014-07-07
2   2014-07-03
3   2014-07-02
4   2014-07-01
Name: Date, dtype: datetime64[ns]

步骤6 将Date设置为索引

apple = apple.set_index('Date')
apple.head()

	Open	High	Low	Close	Volume	Adj Close
Date
2014-07-08	96.27	96.80	93.92	95.35	65130000	95.35
2014-07-07	94.14	95.99	94.10	95.97	56305400	95.97
2014-07-03	93.67	94.10	93.20	94.03	22891800	94.03
2014-07-02	93.87	94.06	93.09	93.48	28420900	93.48
2014-07-01	93.52	94.07	93.13	93.52	38170200	93.52

步骤7 有重复的日期吗？

apple.index.is_unique

True

步骤8 将index设置为升序

apple.sort_index(ascending = True).head()

	Open	High	Low	Close	Volume	Adj Close
Date
1980-12-12	28.75	28.87	28.75	28.75	117258400	0.45
1980-12-15	27.38	27.38	27.25	27.25	43971200	0.42
1980-12-16	25.37	25.37	25.25	25.25	26432000	0.39
1980-12-17	25.87	26.00	25.87	25.87	21610400	0.40
1980-12-18	26.63	26.75	26.63	26.63	18362400	0.41

步骤9 找到每个月的最后一个交易日(business day)

apple_month = apple.resample('BM').mean()
apple_month

	Open	High	Low	Close	Volume	Adj Close
Date
1980-12-31	30.481538	30.567692	30.443077	30.443077	2.586252e+07	0.473077
1981-01-30	31.754762	31.826667	31.654762	31.654762	7.249867e+06	0.493810
1981-02-27	26.480000	26.572105	26.407895	26.407895	4.231832e+06	0.411053
1981-03-31	24.937727	25.016818	24.836364	24.836364	7.962691e+06	0.387727
1981-04-30	27.286667	27.368095	27.227143	27.227143	6.392000e+06	0.423333
...	...	...	...	...	...	...
2014-03-31	533.593333	536.453810	530.070952	533.214286	5.954403e+07	75.750000
2014-04-30	540.081905	544.349048	536.262381	541.074286	7.660787e+07	76.867143
2014-05-30	601.301905	606.372857	598.332857	603.195714	6.828177e+07	86.058571
2014-06-30	222.360000	224.084286	220.735714	222.658095	5.745506e+07	91.885714
2014-07-31	94.294000	95.004000	93.488000	94.470000	4.218366e+07	94.470000

404 rows × 6 columns

步骤10 数据集中最早的日期和最晚的日期相差多少天？

(apple.index.max() - apple.index.min()).days

步骤11 在数据中一共有多少个月？

apple_months = apple.resample('BM').mean()
len(apple_months.index)

步骤12 按照时间顺序可视化Adj Close值

appl_open = apple['Adj Close'].plot(title = "Apple Stock")

# changes the size of the graph
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)

在这里插入图片描述

练习10-删除数据

探索Iris纸鸢花数据

步骤1 导入必要的库

import pandas as pd

步骤2 数据集地址

path10 ='./exercise_data/iris.csv'

步骤3 将数据集存成变量iris

iris = pd.read_csv(path10)
iris.head()

	5.1	3.5	1.4	0.2	Iris-setosa
0	4.9	3.0	1.4	0.2	Iris-setosa
1	4.7	3.2	1.3	0.2	Iris-setosa
2	4.6	3.1	1.5	0.2	Iris-setosa
3	5.0	3.6	1.4	0.2	Iris-setosa
4	5.4	3.9	1.7	0.4	Iris-setosa

步骤4 创建数据框的列名称

iris = pd.read_csv(path10,names = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])
iris.head()

	sepal_length	sepal_width	petal_length	petal_width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

步骤5 数据框中有缺失值吗？

pd.isnull(iris).sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64

步骤6 将列petal_length的第10到19行设置为缺失值

iris.iloc[10:20,2:3] = np.nan
iris.head(20)

	sepal_length	sepal_width	petal_length	petal_width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
5	5.4	3.9	1.7	0.4	Iris-setosa
6	4.6	3.4	1.4	0.3	Iris-setosa
7	5.0	3.4	1.5	0.2	Iris-setosa
8	4.4	2.9	1.4	0.2	Iris-setosa
9	4.9	3.1	1.5	0.1	Iris-setosa
10	5.4	3.7	NaN	0.2	Iris-setosa
11	4.8	3.4	NaN	0.2	Iris-setosa
12	4.8	3.0	NaN	0.1	Iris-setosa
13	4.3	3.0	NaN	0.1	Iris-setosa
14	5.8	4.0	NaN	0.2	Iris-setosa
15	5.7	4.4	NaN	0.4	Iris-setosa
16	5.4	3.9	NaN	0.4	Iris-setosa
17	5.1	3.5	NaN	0.3	Iris-setosa
18	5.7	3.8	NaN	0.3	Iris-setosa
19	5.1	3.8	NaN	0.3	Iris-setosa

步骤7 将缺失值全部替换为1.0

iris.petal_length.fillna(1, inplace = True)
iris

	sepal_length	sepal_width	petal_length	petal_width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
5	5.4	3.9	1.7	0.4	Iris-setosa
6	4.6	3.4	1.4	0.3	Iris-setosa
7	5.0	3.4	1.5	0.2	Iris-setosa
8	4.4	2.9	1.4	0.2	Iris-setosa
9	4.9	3.1	1.5	0.1	Iris-setosa
10	5.4	3.7	1.0	0.2	Iris-setosa
11	4.8	3.4	1.0	0.2	Iris-setosa
12	4.8	3.0	1.0	0.1	Iris-setosa
13	4.3	3.0	1.0	0.1	Iris-setosa
14	5.8	4.0	1.0	0.2	Iris-setosa
15	5.7	4.4	1.0	0.4	Iris-setosa
16	5.4	3.9	1.0	0.4	Iris-setosa
17	5.1	3.5	1.0	0.3	Iris-setosa
18	5.7	3.8	1.0	0.3	Iris-setosa
19	5.1	3.8	1.0	0.3	Iris-setosa
20	5.4	3.4	1.7	0.2	Iris-setosa
21	5.1	3.7	1.5	0.4	Iris-setosa
22	4.6	3.6	1.0	0.2	Iris-setosa
23	5.1	3.3	1.7	0.5	Iris-setosa
24	4.8	3.4	1.9	0.2	Iris-setosa
25	5.0	3.0	1.6	0.2	Iris-setosa
26	5.0	3.4	1.6	0.4	Iris-setosa
27	5.2	3.5	1.5	0.2	Iris-setosa
28	5.2	3.4	1.4	0.2	Iris-setosa
29	4.7	3.2	1.6	0.2	Iris-setosa
...	...	...	...	...	...
120	6.9	3.2	5.7	2.3	Iris-virginica
121	5.6	2.8	4.9	2.0	Iris-virginica
122	7.7	2.8	6.7	2.0	Iris-virginica
123	6.3	2.7	4.9	1.8	Iris-virginica
124	6.7	3.3	5.7	2.1	Iris-virginica
125	7.2	3.2	6.0	1.8	Iris-virginica
126	6.2	2.8	4.8	1.8	Iris-virginica
127	6.1	3.0	4.9	1.8	Iris-virginica
128	6.4	2.8	5.6	2.1	Iris-virginica
129	7.2	3.0	5.8	1.6	Iris-virginica
130	7.4	2.8	6.1	1.9	Iris-virginica
131	7.9	3.8	6.4	2.0	Iris-virginica
132	6.4	2.8	5.6	2.2	Iris-virginica
133	6.3	2.8	5.1	1.5	Iris-virginica
134	6.1	2.6	5.6	1.4	Iris-virginica
135	7.7	3.0	6.1	2.3	Iris-virginica
136	6.3	3.4	5.6	2.4	Iris-virginica
137	6.4	3.1	5.5	1.8	Iris-virginica
138	6.0	3.0	4.8	1.8	Iris-virginica
139	6.9	3.1	5.4	2.1	Iris-virginica
140	6.7	3.1	5.6	2.4	Iris-virginica
141	6.9	3.1	5.1	2.3	Iris-virginica
142	5.8	2.7	5.1	1.9	Iris-virginica
143	6.8	3.2	5.9	2.3	Iris-virginica
144	6.7	3.3	5.7	2.5	Iris-virginica
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

150 rows × 5 columns

步骤8 删除列class

del iris['class']
iris.head()

	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

步骤9 将数据框前三行设置为缺失值

iris.iloc[0:3 ,:] = np.nan
iris.head()

	sepal_length	sepal_width	petal_length	petal_width
0	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

步骤10 删除有缺失值的行

iris = iris.dropna(how='any')
iris.head()

	sepal_length	sepal_width	petal_length	petal_width
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2
5	5.4	3.9	1.7	0.4
6	4.6	3.4	1.4	0.3
7	5.0	3.4	1.5	0.2

步骤11 重新设置索引

iris = iris.reset_index(drop = True)
iris.head()

	sepal_length	sepal_width	petal_length	petal_width
0	4.6	3.1	1.5	0.2
1	5.0	3.6	1.4	0.2
2	5.4	3.9	1.7	0.4
3	4.6	3.4	1.4	0.3
4	5.0	3.4	1.5	0.2

O_Oreo

关注

2
点赞
踩
39

收藏

觉得还不错? 一键收藏
0
评论
Python数据分析习题（基于pandas&numpy模块）（下）

练习题目录习题编号内容相应数据集练习6 - 统计探索风速数据 wind.data练习7 - 可视化探索泰坦尼克灾难数据 train.csv练习8 - 创建数据框探索Pokemon数据练习中手动内置的数据练习9 - 时间序列探索Apple公司股价数据 Apple_stock.csv练习10 - 删除数据探索Iris纸鸢花数据 iris.csv练习6-统计探索风速数据步骤1 导入必要的库import pandas as pdimport datetime步骤2 从以
复制链接

扫一扫