Car Accident Severity

1. Introduction

Car accidents, or on-road collisions, are something we witness daily in the news. The number of vehicles on the road today is far greater than it was 10 years ago.

The predictive analysis performed here aims at analyzing the “Severity” of an accident/collision based on road conditions, lighting conditions, the area of the collision, the number of people involved, and many other such factors. Knowing the severity of such a collision beforehand will lead to prevention and prompt action.

2. Data

All the collision data used in this analysis was taken from ArcGIS; it was provided by the Seattle Police Department and compiled from traffic records. The data covers collisions that took place in the city of Seattle from 2004 to the present.

Listed below are the features available in the raw data:

[Image: Feature List]

There are 38 data columns in the dataset in total, including the 3 target-related columns. We will keep various aspects in mind while deciding the importance of a particular column and the transformation it may need before we feed it to the model.

Some of the given data columns are features related to, or identifying, one particular accident, and thus are not very useful for our predictive analysis. These features include:

SDOTCOLNUM, Coordinates, LOCATION, INCDTTM, INCDATE, REPORTNO, COLDETKEY, INCKEY, OBJECTID.

Some columns are descriptions of a given code. ST_COLDESC, SDOT_COLDESC, and EXCEPTRSNDESC are description columns for codes that are already specified in the dataset.

There are also data columns with abundant missing data. EXCEPTRSNCODE, EXCEPTRSNDESC, PEDROWNOTGRNT, SPEEDING, INATTENTIONIND, and INTKEY each have more than 50% of their values missing. Although a few of these columns could be crucial indicators of collision severity, using them with so many missing rows would be misleading, and these categorical values are very difficult to fill in.

The columns mentioned in the three categories above will not be used in the model we are going to build. Most of the remaining columns are categorical and will require one-hot or label encoding before we can use them as features for our model.
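As a rough sketch of this column-filtering step, assuming pandas and a hypothetical file name (the coordinate columns keep whatever names the export uses):

```python
import pandas as pd

# Load the raw collision data (the file name here is hypothetical).
df = pd.read_csv("Collisions.csv")

# 1. Identifiers tied to one particular accident; the coordinate
#    columns, under whatever names the export uses, belong here too.
identifiers = ["SDOTCOLNUM", "LOCATION", "INCDTTM", "INCDATE",
               "REPORTNO", "COLDETKEY", "INCKEY", "OBJECTID"]

# 2. Free-text descriptions of codes that already exist as columns.
descriptions = ["ST_COLDESC", "SDOT_COLDESC", "EXCEPTRSNDESC"]

# 3. Columns with more than 50% of their values missing
#    (EXCEPTRSNDESC already appears in the list above).
mostly_missing = ["EXCEPTRSNCODE", "PEDROWNOTGRNT", "SPEEDING",
                  "INATTENTIONIND", "INTKEY"]

df = df.drop(columns=identifiers + descriptions + mostly_missing,
             errors="ignore")  # tolerate names absent from this export
```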

3. Methodology

3.1 Exploratory Data Analysis

The first part of the process is to explore the data and understand how each data column is distributed.

Most of our data columns are categorical, and we need to know how each category affects the severity of the accident.

[Image: Frequency of Property Damage Only Collision and Injury Collision with respect to the collision type feature]
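A plot like the one above can be produced with a simple cross-tabulation. This is a minimal sketch assuming the collision-type and severity-description columns are named COLLISIONTYPE and SEVERITYDESC, and that df is the DataFrame loaded earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Count collisions of each severity class within each collision type.
pd.crosstab(df["COLLISIONTYPE"], df["SEVERITYDESC"]).plot(kind="bar")
plt.ylabel("number of collisions")
plt.tight_layout()
plt.show()
```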

[Image: Class distribution of the ‘Matched’ and ‘Unmatched’ categories of the status variable]

There are low-cardinality categorical variables with 6–7 categories, moderate-cardinality categorical variables with 40–70 categories, and a very high-cardinality categorical variable with 1,500+ categories.

3.2 Feature Engineering

Almost all the variables (except the features that count the number of people, vehicles, etc.) are nominal features, i.e., features whose categories are only labelled without any order of precedence. The preferred encoding for these categories is one-hot encoding. However, one-hot encoding would generate around 1,500 data columns for just one high-cardinality categorical variable, which would be very expensive to work with.

We can get over this hurdle by using feature hashing. Feature hashing is an encoding technique used to encode high-cardinality features by hashing them. With it, we can pull the number of encoded data columns down to 32–64, even for variables with more than 1,500 categories.
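A minimal sketch of this hashing step, assuming scikit-learn's FeatureHasher; the placeholder HIGH_CARD_COL stands in for the 1,500+-category variable, which is not named here:

```python
from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Hash one high-cardinality categorical column down to 32 numeric columns.
hasher = FeatureHasher(n_features=32, input_type="string")
tokens = [[str(v)] for v in df["HIGH_CARD_COL"]]  # one token per row
hashed = hasher.transform(tokens)                 # scipy sparse matrix

hashed_df = pd.DataFrame(hashed.toarray(), index=df.index,
                         columns=[f"hash_{i}" for i in range(32)])
```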

The distribution of missing data in the training set was found to be:

[Image: Distribution of all missing data in the training set]

As the class proportions are not much affected by dropping these data rows, we will proceed to do so.
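A quick way to verify that, assuming the target column is named SEVERITYCODE, is to compare the class proportions before and after the drop:

```python
# Compare class proportions before and after dropping incomplete rows.
before = df["SEVERITYCODE"].value_counts(normalize=True)
df = df.dropna()
after = df["SEVERITYCODE"].value_counts(normalize=True)
print(before, after, sep="\n")  # the proportions should barely move
```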

After feature hashing and one-hot encoding, we obtain 208 feature columns in total. We use a Random Forest to get the feature importances, eliminating the 40 least important features, and a correlation matrix to detect correlations above 90%.
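A sketch of both pruning steps, assuming the encoded training data lives in the DataFrame X_train with labels y_train (the Random Forest hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Rank the 208 encoded features and drop the 40 least important.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
ranked = X_train.columns[np.argsort(rf.feature_importances_)]
X_train = X_train.drop(columns=ranked[:40])

# Drop one feature from every pair correlated above 0.9.
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_train = X_train.drop(columns=to_drop)
```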

After removing the least important and highly correlated features, we are left with 160 features to train the model with.

3.3 Modelling

As was clear from the above analysis, we have a skewed dataset. This resulted in a low recall on class 2 and, as a result, a low F1 score.

To solve this problem, we used SMOTE to oversample the rare class and generated the cross-validation score again. While oversampling, we have to keep in mind that it should be done within each iteration of cross-validation, not on the whole training set.
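One way to guarantee that, assuming the imbalanced-learn library, is to put SMOTE inside an imblearn pipeline, so that cross_val_score re-applies it on each training fold only (the scoring metric here is an assumption):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# The sampler runs only on the training portion of each fold;
# the validation folds are never oversampled.
pipe = Pipeline([("smote", SMOTE(random_state=42)),
                 ("clf", XGBClassifier())])
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1_macro")
print(scores.mean())
```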

As a result, we observed that although the recall on class 2 and the F1 score increased a little, accuracy decreased. Considering the increase in computational expense due to the larger dataset, oversampling did not prove worth the effort in this case.

We started with the XGBoost classifier and plotted the learning curve to see whether the model was overfitting the training data. We observed that the converged training and validation errors were close to each other, which means we can use high-variance algorithms such as Random Forest, XGBoost, and Support Vector Machines, and can also keep the large number of features we are using.
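A learning curve like the one described can be drawn with scikit-learn; the train-size grid and metric below are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from xgboost import XGBClassifier

sizes, train_scores, val_scores = learning_curve(
    XGBClassifier(), X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy")

# Converging, nearly overlapping curves indicate little overfitting.
plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - val_scores.mean(axis=1), label="validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
```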

[Image: Cross-validation results for both algorithms]

As expected, we got the best performance from the XGBoost classifier. We will further try hyperparameter tuning to improve performance.
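As a hedged sketch of what that tuning could look like (the actual parameters and ranges searched are not given in the text):

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# A small, hypothetical search space for illustration only.
param_dist = {"max_depth": [3, 5, 7],
              "learning_rate": [0.01, 0.1, 0.3],
              "n_estimators": [100, 300, 500]}
search = RandomizedSearchCV(XGBClassifier(), param_dist, n_iter=10,
                            cv=5, scoring="f1_macro", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```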

4. Results

For the final prediction we have to preprocess the whole test dataset. While encoding the feature columns, we made sure that the one-hot encodings are the same as in the training set, and that the feature hasher transformer applied to the test data is the one fitted on the training data.
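One common way to keep the one-hot encodings consistent, assuming pandas encoding and a raw test DataFrame test_df, is to reindex the test columns against the training columns:

```python
import pandas as pd

# Categories unseen during training are dropped; categories missing
# from the test set become all-zero columns.
X_test = pd.get_dummies(test_df)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
# The same FeatureHasher configuration is then applied to the test
# rows, which keeps the hashed columns consistent by construction.
```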

The following are the final results on the test data:

[Image: Final Evaluation on Test data]

5. Discussion

Many more analyses and methodologies can be added to this project as future work. We have not used the coordinates; those coordinates could reveal unforeseen clusters that could greatly improve the study.

Other encoding techniques could be used in place of feature hashing, or feature hashing with a different feature count could be tried. The performance of these changes can be evaluated using cross-validation.

6. Conclusion

The results are satisfactory, but expectations were much higher. A lot of improvement can still be made on class 2 predictions. Overall, considerable improvement over the basic model can be observed.

Translated from: https://medium.com/@aryampatel2001/car-accident-severity-an-8ecfce7d47b
