Kaggle Competition: San Francisco Crime Classification, Notes from Participating

陈宸-研究僧


Article link: https://blog.csdn.net/qq_35883464/article/details/90550725


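3.1 Statistics by 'year' and 'month'

3.2 Statistics by 'DayOfWeek' and 'hour'

(1) Chi-squared test: tests the association between a categorical independent variable and a categorical dependent variable

(2) Pearson correlation coefficient

4.2 Mean Decrease Impurity

4.3 Mean Decrease Accuracy

6.1 Adding new features: the crime rate at each location, and the rate of each crime category at each location

6.2 Whether to standardize the features HourZn, MonthZn, X, Y with preprocessing.scale()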

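Data download: Kaggle San Francisco Crime Classification (San Francisco Crime Classification.zip)

All code shown below consists of key excerpts included for illustration only.

Full code and related papers: https://github.com/455125158/kaggle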

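1. Project Overview

Kaggle:

Background: From 1934 to 1963, San Francisco was infamous for housing notorious criminals on Alcatraz. Today the city is better known for its tech scene, but its crime rate remains high.

Goal: The competition dataset provides nearly 12 years of crime reports from across San Francisco. The task is to predict the category of each crime, which has practical value for crime control and police deployment.

Time span of the dataset: 2003-01-01 to 2015-05-13

The data comes in two parts: a training set (22 MB) and a test set (18.75 MB).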

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. From Sunset to SOMA, and Marina to Excelsior, this dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods.

This dataset was featured in our completed playground competition entitled San Francisco Crime Classification. The goals of the competition were to:

predict the category of crime that occurred, given the time and location
visualize the city and crimes (see Mapping and Visualizing Violent Crime for inspiration)

2. Data Preprocessing

2.1 Features

This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set. There are 9 variables:

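Dates - timestamp of the crime incident
Category - category of the crime incident (the target variable to predict)
Descript - detailed description of the crime incident (training set only); not used
DayOfWeek - day of the week
PdDistrict - name of the police department district that responded
Resolution - how the crime incident was resolved (training set only); not used
Address - approximate street address of the crime incident
X - longitude
Y - latitude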
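The later sections group incidents by year, month and hour, all of which have to be derived from the Dates column. Below is a minimal sketch of that step in plain Python. It assumes the timestamps look like 2015-05-13 23:53:00, and the helper name split_timestamp is hypothetical rather than taken from the original code. On a DataFrame, pd.to_datetime together with the .dt accessor would do the same job.

```python
from datetime import datetime

# Assumed timestamp layout of the Dates column; adjust if the CSV differs.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

def split_timestamp(date_str):
    """Split one Dates value into the year, month and hour used later on."""
    ts = datetime.strptime(date_str, DATE_FORMAT)
    return {"year": ts.year, "month": ts.month, "hour": ts.hour}

print(split_timestamp("2015-05-13 23:53:00"))
# {'year': 2015, 'month': 5, 'hour': 23}
```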

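First, look at the correlations with a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric columns of train
plt.figure(figsize=(20, 10))
cor = train.corr(numeric_only=True)  # numeric_only requires pandas >= 1.5
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

The plot shows that, apart from the strong correlation between X and Y, no other pair of features is strongly correlated, so X and Y can be treated together as a single feature.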
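Each cell of such a heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. As a rough illustration of what that number measures, here is a hand-written version. The function name pearson and the toy coordinates are made up for this sketch and do not come from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without the 1/n factors (they cancel out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy coordinates (made up for illustration, not taken from the dataset)
lon = [-122.42, -122.41, -122.40, -122.39]
lat = [37.77, 37.78, 37.79, 37.80]
print(round(pearson(lon, lat), 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates none.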

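2.3 Encoding Non-Numeric Features

(1) Encoding the Category column

First, look at the data:

# Horizontal bar chart of incident counts per crime category
plt.figure(figsize=(14, 10))
plt.title('Crime counts by category')
plt.ylabel('Crime category')
plt.xlabel('Number of incidents')

train.groupby('Category').size().sort_values(ascending=True).plot(kind='barh')

plt.show()

The chart shows that the Category distribution is highly imbalanced: some crime types are very frequent, such as LARCENY/THEFT and OTHER OFFENSES, while others are rare, such as TREA and SEX OFFENSES NON FORCIBLE.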
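To put numbers on the imbalance rather than reading it off the chart, the share of each category can be computed directly. The sketch below uses only the standard library. The helper category_shares and the toy labels are hypothetical, and on the real DataFrame train['Category'].value_counts(normalize=True) gives the same information.

```python
from collections import Counter

def category_shares(labels):
    """Return (category, share) pairs sorted from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(name, count / total) for name, count in counts.most_common()]

# Toy labels (made up for illustration; the real column has many more rows)
sample = ["LARCENY/THEFT"] * 6 + ["OTHER OFFENSES"] * 3 + ["TREA"]
for name, share in category_shares(sample):
    print(f"{name}: {share:.0%}")
```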

Encode with one-hot encoding, then normalize:

# One-hot encode DayOfWeek into wday_* indicator columns
dummy_dayofweek = pd.get_dummies(data['DayOfWeek'], prefix='wday')
data = data.join(dummy_dayofweek)

(2) Encoding the DayOfWeek column

Again, look at the data first:

# Horizontal bar chart of incident counts per day of the week
plt.figure(figsize=(14, 10))
plt.title('Day of week')

train.groupby('DayOfWeek').size().sort_values(ascending=True).plot(kind='barh')

plt.show()

The distribution is fairly even. Encode the DayOfWeek column as numbers:

weekdays = {'Monday': 0., 'Tuesday': 1., 'Wednesday': 2., 'Thursday': 3., 'Friday': 4., 'Saturday': 5., 'Sunday': 6.}
Week = pd.DataFrame([float(weekdays[w]) for w in data.DayOfWeek])  # day-of-week code

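(3) Encoding the PdDistrict column

The number of incidents handled varies across police districts, and so does the resolution rate. Readers may want to explore this angle themselves; this feature is processed further later on.

# districts: dict mapping each PdDistrict name to a numeric code (not shown in this excerpt)
PdDistrict_Num = pd.DataFrame([float(districts[t]) for t in data.PdDistrict])  # district code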
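The snippet above relies on a districts lookup that does not appear in the excerpt. One possible way to build it is sketched below, assuming codes are simply assigned in alphabetical order. The author's actual mapping may differ, and build_code_map is a hypothetical helper.

```python
def build_code_map(names):
    """Map each distinct name to a float code, in alphabetical order."""
    return {name: float(code) for code, name in enumerate(sorted(set(names)))}

# Toy district values (a few names for illustration only)
pd_district = ["NORTHERN", "MISSION", "NORTHERN", "CENTRAL"]
districts = build_code_map(pd_district)
codes = [districts[name] for name in pd_district]
print(districts)  # {'CENTRAL': 0.0, 'MISSION': 1.0, 'NORTHERN': 2.0}
print(codes)      # [2.0, 1.0, 2.0, 0.0]
```

Integer-style codes like these impose an arbitrary ordering on the districts, which tree-based models tolerate better than linear ones. One-hot encoding is the usual alternative when that ordering is a concern.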

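3. Feature Analysis

3.1 Statistics by 'year' and 'month'

Crime counts vary seasonally: summer and winter see fewer incidents than spring and autumn. We can therefore add a "season" feature column.

def getMonthZn(month):
    """Map a month number (1-12) to a season code."""
    if month < 3 or month >= 12: return 1  # winter
    if 3 <= month < 6: return 2            # spring
    if 6 <= month < 9: return 3            # summer
    if 9 <= month < 12: return 4           # autumn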
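As a quick check of the season rule, the sketch below applies it to every calendar month. The function is restated so that the block runs on its own. On a DataFrame the same mapping could be applied with Series.apply to a month column, where the column name is an assumption since the excerpt does not show it.

```python
def getMonthZn(month):
    """Same season rule as above: 1=winter, 2=spring, 3=summer, 4=autumn."""
    if month < 3 or month >= 12:
        return 1
    if month < 6:
        return 2
    if month < 9:
        return 3
    return 4

# Season code for each calendar month, January through December
seasons = [getMonthZn(m) for m in range(1, 13)]
print(seasons)  # [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
```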
