Bike Sharing: EDA and Model Selection
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from datetime import datetime
import os
import warnings
warnings.filterwarnings(action = 'ignore')
Kaggle-competition-bike-sharing-demand
EDA
Data items
- Numerical (use directly):
  - temp: actual temperature
  - atemp: "feels like" temperature
  - humidity: relative humidity
  - windspeed: wind speed
  - casual: number of bikes rented by unregistered users
  - registered: number of bikes rented by registered users
  - count: total number of bikes rented
- Time series:
  - datetime: split into separate year, month, day, hour, and day-of-week columns
- Categorical (create dummies):
  - season: 1: spring; 2: summer; 3: autumn; 4: winter
  - holiday: whether the day is a holiday. 0: no; 1: yes
  - workingday: whether the day is a working day. 0: no; 1: yes
  - weather: 1: sunny; 2: cloudy; 3: light rain or snow; 4: severe weather
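The preparation steps named in the list above (splitting `datetime` into parts and creating dummies for the categorical columns) could look roughly like the sketch below. The helper name `prepare_features` and the two-row `sample` frame are made up for illustration; only the column names come from the dataset. `holiday` and `workingday` are already 0/1 flags, so this sketch encodes only `season` and `weather`.

```python
import pandas as pd

# Tiny illustrative frame: the values are invented, only the column names are real.
sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00"],
    "season": [1, 3],
    "weather": [1, 2],
})

def prepare_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: split `datetime` and one-hot encode season/weather."""
    out = frame.copy()
    ts = pd.to_datetime(out["datetime"])  # object dtype -> datetime64
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    # One indicator column per category value actually present in the data.
    out = pd.get_dummies(out, columns=["season", "weather"])
    return out.drop(columns=["datetime"])

features = prepare_features(sample)
print(features.columns.tolist())
```

Because `get_dummies` only creates columns for values it sees, applying it separately to train and test data can yield different column sets; encoding both together or reindexing afterwards is one way to keep them aligned.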
Goal: predict bike rental demand for the bike sharing program in Washington, D.C. by combining historical usage patterns with weather data.
Import data
# import data
df = pd.read_csv('train.csv')
print(df.head(),'\n','df.shape: {}'.format(df.shape))
datetime season holiday workingday weather temp atemp \
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395
humidity windspeed casual registered count
0 81 0.0 3 13 16
1 80 0.0 8 32 40
2 80 0.0 5 27 32
3 75 0.0 3 10 13
4 75 0.0 0 1 1
df.shape: (10886, 12)
df.describe()
statistic | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.00000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 |
mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.886460 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.82000 | 0.760000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 13.94000 | 16.665000 | 47.000000 | 7.001500 | 4.000000 | 36.000000 | 42.000000 |
50% | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 20.50000 | 24.240000 | 62.000000 | 12.998000 | 17.000000 | 118.000000 | 145.000000 |
75% | 4.000000 | 0.000000 | 1.000000 | 2.000000 | 26.24000 | 31.060000 | 77.000000 | 16.997900 | 49.000000 | 222.000000 | 284.000000 |
max | 4.000000 | 1.000000 | 1.000000 | 4.000000 | 41.00000 | 45.455000 | 100.000000 | 56.996900 | 367.000000 | 886.000000 | 977.000000 |
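The summary is consistent with `count` being the sum of `casual` and `registered`: the means add up (36.021955 + 155.552177 = 191.574132). A minimal sketch of a row-level check follows, using the five rows printed by `df.head()` earlier; the helper name `count_mismatches` is hypothetical.

```python
import pandas as pd

# casual / registered / count values copied from the df.head() output.
head_rows = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

def count_mismatches(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows where count != casual + registered."""
    expected = frame["casual"] + frame["registered"]
    return frame[frame["count"] != expected]

mismatches = count_mismatches(head_rows)
print(len(mismatches))  # 0 for these five rows
```

Running the same helper on the full `df` would show whether the relationship holds for every row, not just the first five.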
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int64
2 holiday 10886 non-null int64
3 workingday 10886 non-null int64
4 weather 10886 non-null int64
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int64
8 windspeed 10886 non-null float64
9 casual 10886 non-null int64
10 registered 10886 non-null int64
11 count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
Missing value analysis
df.isnull().sum()
datetime 0
season 0
holiday 0
workingday 0
weather 0
temp 0
atemp 0
humidity 0
windspeed 0
casual 0
registered 0
count 0
dtype: int64
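Every column reports zero missing values here. For datasets where that is not the case, a count plus percentage view can be easier to read than raw counts. The sketch below is one way to build it; `missing_summary` and the `demo` frame are illustrative names and data, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing count and percentage per column."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)

# Invented demo data with one deliberately missing temperature.
demo = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02, 9.84],
    "humidity": [81, 80, 80, 75],
})
print(missing_summary(demo))
```

Note that `isnull()` only detects NaN/None; it does not flag placeholder values that a data source might use in place of a missing reading.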
# String labels for the first 100 timestamps (the trailing dot that caused a SyntaxError is removed)
labels = df[:100]['datetime'].astype('str')
labels
plt.style