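1. Project summary
Purpose: practice in data analysis
Data source: Kaggle
Source code, dataset, and field descriptions (Baidu Cloud link):
URL: https://pan.baidu.com/s/1UD5HD69bNEsX2EkjaQ1IPg
Extraction code: 8gd8
Analysis goals:
- Basic exploratory analysis: which states have the most accidents, when accidents are most likely to happen, and what the weather was like at the time, plus visualizations that give an overall picture of accidents in the United States in 2017.
- Predict accident severity with XGBoost and examine which factors are most strongly related to severity.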
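The first goal largely comes down to a few group-by counts. A minimal sketch, assuming a DataFrame with the `State` and `Start_Time` columns of this dataset; the helper name `accident_counts` and the toy rows are made up for illustration and are not taken from the real data:

```python
import pandas as pd

def accident_counts(df):
    """Return accident counts per state and per hour of day."""
    start = pd.to_datetime(df['Start_Time'])
    # value_counts sorts descending, so the state with most accidents comes first
    by_state = df['State'].value_counts()
    # accidents per hour of day, ordered by hour
    by_hour = start.dt.hour.value_counts().sort_index()
    return by_state, by_hour

# Toy data for illustration only
toy = pd.DataFrame({
    'State': ['CA', 'CA', 'TX', 'FL', 'CA'],
    'Start_Time': ['2017-01-01 08:10:00', '2017-01-02 08:45:00',
                   '2017-03-05 17:30:00', '2017-06-09 08:05:00',
                   '2017-07-11 17:55:00'],
})
by_state, by_hour = accident_counts(toy)
print(by_state.head())
print(by_hour)
```

The same pattern extends to other groupings, for example `start.dt.dayofweek` for the day of the week.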
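2. Data processing (for analysis only; preprocessing for modeling comes later)
The original dataset (US_Accidents_Dec19.csv) has 49 columns and about 3 million rows of traffic accidents from 2016 to 2019. Because of hardware and time constraints, only accidents from 2017 are analyzed (see the source files for details).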
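# Keep only the 2017 accidents
import pandas as pd
data = pd.read_csv('./US_Accidents_Dec19.csv')
datacopy = data.copy()
datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
# .dt.year is vectorized and faster than apply(lambda x: x.year)
datacopy['year'] = datacopy['Start_Time'].dt.year
data1 = datacopy[datacopy['year'] == 2017]
# The index is written as well; it appears as the 'Unnamed: 0' column when the file is read back
data1.to_csv('./USaccident2017.csv')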
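Since hardware limits are the reason for subsetting, the filter can also be applied while reading, so the full file never has to sit in memory at once. A sketch under that assumption, using the `chunksize` option of `pd.read_csv`; the helper name `filter_year` and the in-memory sample rows are illustrative stand-ins for the real CSV:

```python
import io
import pandas as pd

def filter_year(csv_source, year, chunksize=100_000):
    """Read a CSV in chunks and keep only rows whose Start_Time falls in `year`."""
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        start = pd.to_datetime(chunk['Start_Time'])
        kept.append(chunk[start.dt.year == year])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for US_Accidents_Dec19.csv (illustrative rows only)
sample_csv = io.StringIO(
    "ID,Start_Time,State\n"
    "A-1,2016-12-31 23:50:00,CA\n"
    "A-2,2017-01-01 00:17:36,CA\n"
    "A-3,2017-06-15 12:00:00,TX\n"
    "A-4,2018-02-01 09:30:00,FL\n"
)
subset = filter_year(sample_csv, 2017, chunksize=2)
print(subset)
```

With the real file, the call would be `filter_year('./US_Accidents_Dec19.csv', 2017)`.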
Start the analysis on USaccident2017.csv
Import the required packages
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import folium
import pandas as pd
import webbrowser
from pyecharts import options as opts
from pyecharts.charts import Page, Pie, Bar, Line, Scatter
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
import xgboost as xgb
data = pd.read_csv('./USaccident2017.csv')
data.shape #(717483, 51)
data.head()
(index) | Unnamed: 0 | ID | Source | TMC | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Description | Number | Street | Side | City | County | State | Zipcode | Country | Timezone | Airport_Code | Weather_Timestamp | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Direction | Wind_Speed(mph) | Precipitation(in) | Weather_Condition | Amenity | Bump | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9206 | A-9207 | MapQuest | 201.0 | 3 | 2017-01-01 00:17:36 | 2017-01-01 00:47:12 | 37.925392 | -122.320595 | NaN | NaN | 0.01 | Accident on I-80 Westbound at Exit 15 Cutting ... | NaN | I-80 E | R | El Cerrito | Contra Costa | CA | 94530 | US | US/Pacific | KCCR | 2017-01-01 00:53:00 | 44.1 | 40.8 | 79.0 | 29.91 | 10.0 | WSW | 5.8 | NaN | Partly Cloudy | False | False | False | False | False | False | False | False | False | False | False | True | False | Night | Night | Night | Night | 2017 |
1 | 9207 | A-9208 | MapQuest | 201.0 | 3 | 2017-01-01 00:26:08 | 2017-01-01 01:16:06 | 37.878185 | -122.307175 | NaN | NaN | 0.01 | Accident on I-580 Southbound at Exit 12 I-80 I... | NaN | I-580 W | R | Berkeley | Alameda | CA | 94710 | US | US/Pacific | KOAK | 2017-01-01 00:53:00 | 51.1 | NaN | 83.0 | 29.97 | 10.0 | West | 11.5 | NaN | Overcast | False | False | True | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
2 | 9208 | A-9209 | MapQuest | 201.0 | 2 | 2017-01-01 00:53:41 | 2017-01-01 01:22:35 | 38.014820 | -121.640579 | NaN | NaN | 0.00 | Accident on Taylor Rd Southbound at Bethel Isl... | 2998.0 | Taylor Ln | R | Oakley | Contra Costa | CA | 94561 | US | US/Pacific | KCCR | 2017-01-01 00:53:00 | 44.1 | 40.8 | 79.0 | 29.91 | 10.0 | WSW | 5.8 | NaN | Partly Cloudy | False | False | False | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
3 | 9209 | A-9210 | MapQuest | 241.0 | 3 | 2017-01-01 01:18:51 | 2017-01-01 01:48:01 | 37.912056 | -122.323982 | NaN | NaN | 0.01 | Lane blocked and queueing traffic due to accid... | NaN | Bayview Ave | R | Richmond | Contra Costa | CA | 94804 | US | US/Pacific | KCCR | 2017-01-01 01:11:00 | 44.1 | 42.5 | 82.0 | 29.95 | 9.0 | SW | 3.5 | NaN | Mostly Cloudy | False | False | False | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
4 | 9210 | A-9211 | MapQuest | 222.0 | 3 | 2017-01-01 01:20:12 | 2017-01-01 01:49:47 | 37.925392 | -122.320595 | NaN | NaN | 0.01 | Queueing traffic due to accident on I-80 Westb... | NaN | I-80 E | R | El Cerrito | Contra Costa | CA | 94530 | US | US/Pacific | KCCR | 2017-01-01 01:11:00 | 44.1 | 42.5 | 82.0 | 29.95 | 9.0 | SW | 3.5 | NaN | Mostly Cloudy | False | False | False | False | False | False | False | False | False | False | False | True | False | Night | Night | Night | Night | 2017 |
Field descriptions
https://www.jianshu.com/p/9e597dc8ae71
# Check for missing values (show only columns that have any)
missing = data.isnull().sum()
missing[missing != 0]
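Raw counts are easier to judge next to the share of rows they represent, which helps when deciding whether to drop a column, drop rows, or fill values. A small sketch; the helper name `missing_summary` and the toy frame are illustrative, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(2),
    })

# Toy frame for illustration only
toy = pd.DataFrame({
    'Temperature(F)': [44.1, np.nan, 51.1, np.nan],
    'City': ['Oakley', 'Berkeley', None, 'Richmond'],
    'Severity': [3, 3, 2, 3],
})
summary = missing_summary(toy)
print(summary)
```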
# Handle missing values
# Drop columns that are irrelevant or not analyzed
deletelist= ['Unnamed: 0', 'ID','TMC', 'End_Lat', 'End_Lng', 'Airport_Code','Weather_Timestamp','Wind_Chill(F)',
'Civil_Twilight', 'Nautical_Twilight',
'Astronomical_Twilight', 'year','Number']
data1 = data.drop(deletelist, axis=1)
# Drop rows with missing values in these columns
data1 = data1.dropna(axis = 0,subset=['City','Zipcode','Timezone','Sunrise_Sunset'])
# Fill temperature, humidity, pressure, and visibility with the column mean
data1['Temperature(F)'] = data1['Temperature(F)'].fillna(data1['Temperature(F)'].mean())
data1['Humidity(%)'] = data1['Humidity(%)'].fillna(data1['Humidity(%)'].mean())
data1['Pressure(in)'] = data1['Pressure(in)'].fillna(data1['Pressure(in)'].mean())
data1['Visibility(mi)'] = data1['Visibility(mi)'].fillna(data1['Visibility(mi)'].mean())
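The four fills above follow the same pattern, so they can also be written as a loop over column names. A minimal sketch; the helper name `fill_with_mean` and the toy values are made up for illustration:

```python
import numpy as np
import pandas as pd

def fill_with_mean(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's mean."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mean())
    return out

# Toy values for illustration only
toy = pd.DataFrame({
    'Temperature(F)': [40.0, np.nan, 50.0],
    'Humidity(%)': [80.0, 90.0, np.nan],
})
filled = fill_with_mean(toy, ['Temperature(F)', 'Humidity(%)'])
print(filled)
```

The mean is sensitive to outliers; swapping `.mean()` for `.median()` is a common alternative when a column is heavily skewed.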
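# Fill wind speed by nearest-neighbor interpolation (requires SciPy)
# order only applies to the 'polynomial' and 'spline' methods, so it is omitted here
data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest')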
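`interpolate(method='nearest')` depends on SciPy and can leave NaNs at the edges of the column, since it only fills between known values. For comparison, here is a SciPy-free sketch of the same idea using `reindex(method='nearest')`, which also fills the edges. The helper name `fill_nearest` and the toy values are illustrative, and the sketch assumes a unique, sorted index:

```python
import numpy as np
import pandas as pd

def fill_nearest(series):
    """Fill NaNs with the value at the nearest non-missing index label."""
    known = series.dropna()
    # reindex(method='nearest') requires a monotonic index on `known`
    return known.reindex(series.index, method='nearest')

# Toy wind speeds for illustration only
wind = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
wind_filled = fill_nearest(wind)
print(wind_filled.tolist())
```

On the real data this would be `data1['Wind_Speed(mph)'] = fill_nearest(data1['Wind_Speed(mph)'])`. Whether "nearest row in file order" is a meaningful neighbor for wind speed is a separate modeling question.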
# Fill weather condition and wind direction with the mode
data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].