练习6-统计
探索风速数据
步骤1 导入必要的库
运行以下代码
import pandas as pd
import datetime
步骤2 从以下地址导入数据
import pandas as pd
运行以下代码
path6 = “…/input/pandas_exercise/pandas_exercise/exercise_data/wind.data” # wind.data
步骤3 将数据作存储并且设置前三列为合适的索引
import datetime
运行以下代码
data = pd.read_table(path6, sep = “\s+”, parse_dates = [[0,1,2]])
data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
步骤4 2061年?我们真的有这一年的数据?创建一个函数并用它去修复这个bug
运行以下代码
def fix_century(x):
year = x.year - 100 if x.year > 1989 else x.year
return datetime.date(year, x.month, x.day)
apply the function fix_century on the column and replace the values to the right ones
data[‘Yr_Mo_Dy’] = data[‘Yr_Mo_Dy’].apply(fix_century)
data.info()
data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
步骤5 将日期设为索引,注意数据类型,应该是datetime64[ns]
运行以下代码
transform Yr_Mo_Dy it to date type datetime64
data[“Yr_Mo_Dy”] = pd.to_datetime(data[“Yr_Mo_Dy”])
set ‘Yr_Mo_Dy’ as the index
data = data.set_index(‘Yr_Mo_Dy’)
data.head()
data.info()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
步骤6 对应每一个location,一共有多少数据值缺失
运行以下代码
data.isnull().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
步骤7 对应每一个location,一共有多少完整的数据值
运行以下代码
data.shape[0] - data.isnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
步骤8 对于全体数据,计算风速的平均值
运行以下代码
data.mean().mean()
10.227982360836924
步骤9 创建一个名为loc_stats的数据框去计算并存储每个location的风速最小值,最大值,平均值和标准差
运行以下代码
loc_stats = pd.DataFrame()
loc_stats[‘min’] = data.min() # min
loc_stats[‘max’] = data.max() # max
loc_stats[‘mean’] = data.mean() # mean
loc_stats[‘std’] = data.std() # standard deviations
loc_stats
min max mean std
RPT 0.67 35.80 12.362987 5.618413
VAL 0.21 33.37 10.644314 5.267356
ROS 1.50 33.84 11.660526 5.008450
KIL 0.00 28.46 6.306468 3.605811
SHA 0.13 37.54 10.455834 4.936125
BIR 0.00 26.16 7.092254 3.968683
DUB 0.00 30.37 9.797343 4.977555
CLA 0.00 31.08 8.495053 4.499449
MUL 0.00 25.88 8.493590 4.166872
CLO 0.04 28.21 8.707332 4.503954
BEL 0.13 42.38 13.121007 5.835037
MAL 0.67 42.54 15.599079 6.699794
步骤10 创建一个名为day_stats的数据框去计算并存储所有location的风速最小值,最大值,平均值和标准差
运行以下代码
create the dataframe
day_stats = pd.DataFrame()
this time we determine axis equals to one so it gets each row.
day_stats[‘min’] = data.min(axis = 1) # min
day_stats[‘max’] = data.max(axis = 1) # max
day_stats[‘mean’] = data.mean(axis = 1) # mean
day_stats[‘std’] = data.std(axis = 1) # standard deviations
day_stats.head()
min max mean std
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
步骤11 对于每一个location,计算一月份的平均风速
注意,1961年的1月和1962年的1月应该区别对待
运行以下代码
creates a new column ‘date’ and gets the values from the index
data[‘date’] = data.index
creates a column for each value from date
data[‘month’] = data[‘date’].apply(lambda date: date.month)
data[‘year’] = data[‘date’].apply(lambda date: date.year)
data[‘day’] = data[‘date’].apply(lambda date: date.day)
gets all value from the month 1 and assign to janyary_winds
january_winds = data.query(‘month == 1’)
gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,‘RPT’:“MAL”].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
步骤12 对于数据记录按照年为频率取样
运行以下代码
data.query(‘month == 1 and day == 1’)
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date month year day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1 1961 1
1962-01-01 9.29 3.42 11.54 3.50 2.21 1.96 10.41 2.79 3.54 5.17 4.38 7.92 1962-01-01 1 1962 1
1963-01-01 15.59 13.62 19.79 8.38 12.25 10.00 23.45 15.71 13.59 14.37 17.58 34.13 1963-01-01 1 1963 1
1964-01-01 25.80 22.13 18.21 13.25 21.29 14.79 14.12 19.58 13.25 16.75 28.96 21.00 1964-01-01 1 1964 1
1965-01-01 9.54 11.92 9.00 4.38 6.08 5.21 10.25 6.08 5.71 8.63 12.04 17.41 1965-01-01 1 1965 1
1966-01-01 22.04 21.50 17.08 12.75 22.17 15.59 21.79 18.12 16.66 17.83 28.33 23.79 1966-01-01 1 1966 1
1967-01-01 6.46 4.46 6.50 3.21 6.67 3.79 11.38 3.83 7.71 9.08 10.67 20.91 1967-01-01 1 1967 1
1968-01-01 30.04 17.88 16.25 16.25 21.79 12.54 18.16 16.62 18.75 17.62 22.25 27.29 1968-01-01 1 1968 1
1969-01-01 6.13 1.63 5.41 1.08 2.54 1.00 8.50 2.42 4.58 6.34 9.17 16.71 1969-01-01 1 1969 1
1970-01-01 9.59 2.96 11.79 3.42 6.13 4.08 9.00 4.46 7.29 3.50 7.33 13.00 1970-01-01 1 1970 1
1971-01-01 3.71 0.79 4.71 0.17 1.42 1.04 4.63 0.75 1.54 1.08 4.21 9.54 1971-01-01 1 1971 1
1972-01-01 9.29 3.63 14.54 4.25 6.75 4.42 13.00 5.33 10.04 8.54 8.71 19.17 1972-01-01 1 1972 1
1973-01-01 16.50 15.92 14.62 7.41 8.29 11.21 13.54 7.79 10.46 10.79 13.37 9.71 1973-01-01 1 1973 1
1974-01-01 23.21 16.54 16.08 9.75 15.83 11.46 9.54 13.54 13.83 16.66 17.21 25.29 1974-01-01 1 1974 1
1975-01-01 14.04 13.54 11.29 5.46 12.58 5.58 8.12 8.96 9.29 5.17 7.71 11.63 1975-01-01 1 1975 1
1976-01-01 18.34 17.67 14.83 8.00 16.62 10.13 13.17 9.04 13.13 5.75 11.38 14.96 1976-01-01 1 1976 1
1977-01-01 20.04 11.92 20.25 9.13 9.29 8.04 10.75 5.88 9.00 9.00 14.88 25.70 1977-01-01 1 1977 1
1978-01-01 8.33 7.12 7.71 3.54 8.50 7.50 14.71 10.00 11.83 10.00 15.09 20.46 1978-01-01 1 1978 1
步骤13 对于数据记录按照月为频率取样
运行以下代码
data.query(‘day == 1’)
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date month year day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1 1961 1
1961-02-01 14.25 15.12 9.04 5.88 12.08 7.17 10.17 3.63 6.50 5.50 9.17 8.00 1961-02-01 2 1961 1
1961-03-01 12.67 13.13 11.79 6.42 9.79 8.54 10.25 13.29 NaN 12.21 20.62 NaN 1961-03-01 3 1961 1
1961-04-01 8.38 6.34 8.33 6.75 9.33 9.54 11.67 8.21 11.21 6.46 11.96 7.17 1961-04-01 4 1961 1
1961-05-01 15.87 13.88 15.37 9.79 13.46 10.17 9.96 14.04 9.75 9.92 18.63 11.12 1961-05-01 5 1961 1
1961-06-01 15.92 9.59 12.04 8.79 11.54 6.04 9.75 8.29 9.33 10.34 10.67 12.12 1961-06-01 6 1961 1
1961-07-01 7.21 6.83 7.71 4.42 8.46 4.79 6.71 6.00 5.79 7.96 6.96 8.71 1961-07-01 7 1961 1
1961-08-01 9.59 5.09 5.54 4.63 8.29 5.25 4.21 5.25 5.37 5.41 8.38 9.08 1961-08-01 8 1961 1
1961-09-01 5.58 1.13 4.96 3.04 4.25 2.25 4.63 2.71 3.67 6.00 4.79 5.41 1961-09-01 9 1961 1
1961-10-01 14.25 12.87 7.87 8.00 13.00 7.75 5.83 9.00 7.08 5.29 11.79 4.04 1961-10-01 10 1961 1
1961-11-01 13.21 13.13 14.33 8.54 12.17 10.21 13.08 12.17 10.92 13.54 20.17 20.04 1961-11-01 11 1961 1
1961-12-01 9.67 7.75 8.00 3.96 6.00 2.75 7.25 2.50 5.58 5.58 7.79 11.17 1961-12-01 12 1961 1
1962-01-01 9.29 3.42 11.54 3.50 2.21 1.96 10.41 2.79 3.54 5.17 4.38 7.92 1962-01-01 1 1962 1
1962-02-01 19.12 13.96 12.21 10.58 15.71 10.63 15.71 11.08 13.17 12.62 17.67 22.71 1962-02-01 2 1962 1
1962-03-01 8.21 4.83 9.00 4.83 6.00 2.21 7.96 1.87 4.08 3.92 4.08 5.41 1962-03-01 3 1962 1
1962-04-01 14.33 12.25 11.87 10.37 14.92 11.00 19.79 11.67 14.09 15.46 16.62 23.58 1962-04-01 4 1962 1
1962-05-01 9.62 9.54 3.58 3.33 8.75 3.75 2.25 2.58 1.67 2.37 7.29 3.25 1962-05-01 5 1962 1
1962-06-01 5.88 6.29 8.67 5.21 5.00 4.25 5.91 5.41 4.79 9.25 5.25 10.71 1962-06-01 6 1962 1
1962-07-01 8.67 4.17 6.92 6.71 8.17 5.66 11.17 9.38 8.75 11.12 10.25 17.08 1962-07-01 7 1962 1
1962-08-01 4.58 5.37 6.04 2.29 7.87 3.71 4.46 2.58 4.00 4.79 7.21 7.46 1962-08-01 8 1962 1
1962-09-01 10.00 12.08 10.96 9.25 9.29 7.62 7.41 8.75 7.67 9.62 14.58 11.92 1962-09-01 9 1962 1
1962-10-01 14.58 7.83 19.21 10.08 11.54 8.38 13.29 10.63 8.21 12.92 18.05 18.12 1962-10-01 10 1962 1
1962-11-01 16.88 13.25 16.00 8.96 13.46 11.46 10.46 10.17 10.37 13.21 14.83 15.16 1962-11-01 11 1962 1
1962-12-01 18.38 15.41 11.75 6.79 12.21 8.04 8.42 10.83 5.66 9.08 11.50 11.50 1962-12-01 12 1962 1
1963-01-01 15.59 13.62 19.79 8.38 12.25 10.00 23.45 15.71 13.59 14.37 17.58 34.13 1963-01-01 1 1963 1
1963-02-01 15.41 7.62 24.67 11.42 9.21 8.17 14.04 7.54 7.54 10.08 10.17 17.67 1963-02-01 2 1963 1
1963-03-01 16.75 19.67 17.67 8.87 19.08 15.37 16.21 14.29 11.29 9.21 19.92 19.79 1963-03-01 3 1963 1
1963-04-01 10.54 9.59 12.46 7.33 9.46 9.59 11.79 11.87 9.79 10.71 13.37 18.21 1963-04-01 4 1963 1
1963-05-01 18.79 14.17 13.59 11.63 14.17 11.96 14.46 12.46 12.87 13.96 15.29 21.62 1963-05-01 5 1963 1
1963-06-01 13.37 6.87 12.00 8.50 10.04 9.42 10.92 12.96 11.79 11.04 10.92 13.67 1963-06-01 6 1963 1
… … … … … … … … … … … … … … … … …
1976-07-01 8.50 1.75 6.58 2.13 2.75 2.21 5.37 2.04 5.88 4.50 4.96 10.63 1976-07-01 7 1976 1
1976-08-01 13.00 8.38 8.63 5.83 12.92 8.25 13.00 9.42 10.58 11.34 14.21 20.25 1976-08-01 8 1976 1
1976-09-01 11.87 11.00 7.38 6.87 7.75 8.33 10.34 6.46 10.17 9.29 12.75 19.55 1976-09-01 9 1976 1
1976-10-01 10.96 6.71 10.41 4.63 7.58 5.04 5.04 5.54 6.50 3.92 6.79 5.00 1976-10-01 10 1976 1
1976-11-01 13.96 15.67 10.29 6.46 12.79 9.08 10.00 9.67 10.21 11.63 23.09 21.96 1976-11-01 11 1976 1
1976-12-01 13.46 16.42 9.21 4.54 10.75 8.67 10.88 4.83 8.79 5.91 8.83 13.67 1976-12-01 12 1976 1
1977-01-01 20.04 11.92 20.25 9.13 9.29 8.04 10.75 5.88 9.00 9.00 14.88 25.70 1977-01-01 1 1977 1
1977-02-01 11.83 9.71 11.00 4.25 8.58 8.71 6.17 5.66 8.29 7.58 11.71 16.50 1977-02-01 2 1977 1
1977-03-01 8.63 14.83 10.29 3.75 6.63 8.79 5.00 8.12 7.87 6.42 13.54 13.67 1977-03-01 3 1977 1
1977-04-01 21.67 16.00 17.33 13.59 20.83 15.96 25.62 17.62 19.41 20.67 24.37 30.09 1977-04-01 4 1977 1
1977-05-01 6.42 7.12 8.67 3.58 4.58 4.00 6.75 6.13 3.33 4.50 19.21 12.38 1977-05-01 5 1977 1
1977-06-01 7.08 5.25 9.71 2.83 2.21 3.50 5.29 1.42 2.00 0.92 5.21 5.63 1977-06-01 6 1977 1
1977-07-01 15.41 16.29 17.08 6.25 11.83 11.83 12.29 10.58 10.41 7.21 17.37 7.83 1977-07-01 7 1977 1
1977-08-01 4.33 2.96 4.42 2.33 0.96 1.08 4.96 1.87 2.33 2.04 10.50 9.83 1977-08-01 8 1977 1
1977-09-01 17.37 16.33 16.83 8.58 14.46 11.83 15.09 13.92 13.29 13.88 23.29 25.17 1977-09-01 9 1977 1
1977-10-01 16.75 15.34 12.25 9.42 16.38 11.38 18.50 13.92 14.09 14.46 22.34 29.67 1977-10-01 10 1977 1
1977-11-01 16.71 11.54 12.17 4.17 8.54 7.17 11.12 6.46 8.25 6.21 11.04 15.63 1977-11-01 11 1977 1
1977-12-01 13.37 10.92 12.42 2.37 5.79 6.13 8.96 7.38 6.29 5.71 8.54 12.42 1977-12-01 12 1977 1
1978-01-01 8.33 7.12 7.71 3.54 8.50 7.50 14.71 10.00 11.83 10.00 15.09 20.46 1978-01-01 1 1978 1
1978-02-01 27.25 24.21 18.16 17.46 27.54 18.05 20.96 25.04 20.04 17.50 27.71 21.12 1978-02-01 2 1978 1
1978-03-01 15.04 6.21 16.04 7.87 6.42 6.67 12.29 8.00 10.58 9.33 5.41 17.00 1978-03-01 3 1978 1
1978-04-01 3.42 7.58 2.71 1.38 3.46 2.08 2.67 4.75 4.83 1.67 7.33 13.67 1978-04-01 4 1978 1
1978-05-01 10.54 12.21 9.08 5.29 11.00 10.08 11.17 13.75 11.87 11.79 12.87 27.16 1978-05-01 5 1978 1
1978-06-01 10.37 11.42 6.46 6.04 11.25 7.50 6.46 5.96 7.79 5.46 5.50 10.41 1978-06-01 6 1978 1
1978-07-01 12.46 10.63 11.17 6.75 12.92 9.04 12.42 9.62 12.08 8.04 14.04 16.17 1978-07-01 7 1978 1
1978-08-01 19.33 15.09 20.17 8.83 12.62 10.41 9.33 12.33 9.50 9.92 15.75 18.00 1978-08-01 8 1978 1
1978-09-01 8.42 6.13 9.87 5.25 3.21 5.71 7.25 3.50 7.33 6.50 7.62 15.96 1978-09-01 9 1978 1
1978-10-01 9.50 6.83 10.50 3.88 6.13 4.58 4.21 6.50 6.38 6.54 10.63 14.09 1978-10-01 10 1978 1
1978-11-01 13.59 16.75 11.25 7.08 11.04 8.33 8.17 11.29 10.75 11.25 23.13 25.00 1978-11-01 11 1978 1
1978-12-01 21.29 16.29 24.04 12.79 18.21 19.29 21.54 17.21 16.71 17.83 17.75 25.70 1978-12-01 12 1978 1
216 rows × 16 columns