kaggle上对旧金山城市的犯罪案件进行分类,属于多分类问题,提供的数据特征包含时间、地点、描述等。
导入数据和包
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time as systime
import datetime as dt
import string
import seaborn as sns
import matplotlib.colors as colors
%matplotlib inline
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates 878049 non-null object
Category 878049 non-null object
Descript 878049 non-null object
DayOfWeek 878049 non-null object
PdDistrict 878049 non-null object
Resolution 878049 non-null object
Address 878049 non-null object
X 878049 non-null float64
Y 878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB
train.shape
(878049, 9)
train.head(3)
Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y | |
---|---|---|---|---|---|---|---|---|---|
0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
test.shape
(884262, 7)
test.head(3)
Id | Dates | DayOfWeek | PdDistrict | Address | X | Y | |
---|---|---|---|---|---|---|---|
0 | 0 | 2015-05-10 23:59:00 | Sunday | BAYVIEW | 2000 Block of THOMAS AV | -122.399588 | 37.735051 |
1 | 1 | 2015-05-10 23:51:00 | Sunday | BAYVIEW | 3RD ST / REVERE AV | -122.391523 | 37.732432 |
2 | 2 | 2015-05-10 23:50:00 | Sunday | NORTHERN | 2000 Block of GOUGH ST | -122.426002 | 37.792212 |
数据分析
train.isnull().sum()
Dates 0
Category 0
Descript 0
DayOfWeek 0
PdDistrict 0
Resolution 0
Address 0
X 0
Y 0
dtype: int64
Category
#可以通过groupby().size()方法返回分组后的统计结果
cate_group = train.groupby(by='Category').size()
cate_group
#Series类型
Category
ARSON 1513
ASSAULT 76876
BAD CHECKS 406
BRIBERY 289
BURGLARY 36755
DISORDERLY CONDUCT 4320
DRIVING UNDER THE INFLUENCE 2268
DRUG/NARCOTIC 53971
DRUNKENNESS 4280
EMBEZZLEMENT 1166
EXTORTION 256
FAMILY OFFENSES 491
FORGERY/COUNTERFEITING 10609
FRAUD 16679
GAMBLING 146
KIDNAPPING 2341
LARCENY/THEFT 174900
LIQUOR LAWS 1903
LOITERING 1225
MISSING PERSON 25989
NON-CRIMINAL 92304
OTHER OFFENSES 126182
PORNOGRAPHY/OBSCENE MAT 22
PROSTITUTION 7484
RECOVERED VEHICLE 3138
ROBBERY 23000
RUNAWAY 1946
SECONDARY CODES 9985
SEX OFFENSES FORCIBLE 4388
SEX OFFENSES NON FORCIBLE 148
STOLEN PROPERTY 4540
SUICIDE 508
SUSPICIOUS OCC 31414
TREA 6
TRESPASS 7326
VANDALISM 44725
VEHICLE THEFT 53781
WARRANTS 42214
WEAPON LAWS 8555
dtype: int64
#目标分类共有多少种类型
cat_num = len(cate_group.index)
cat_num
39
cate_group.index = cate_group.index.map(string.capwords)
cate_group.sort_values(ascending=False,inplace=True)
cate_group.plot(kind='bar',logy=True,figsize=(15,10),color=sns.color_palette('coolwarm',cat_num))
plt.title('No. of Crime types',fontsize=20)
plt.show()
注:请确保是最新的pandas版本,这样在画图的时候传入color或者colormap参数会得到不同颜色的柱状,我最初是0.20.0的版本,传入color参数得到的柱状颜色都是一样的,头疼了好一阵,最后在StackOverflow上找到相关问题,更新pandas版本到0.23.0后就解决了。
从上图看到,虽然数量下降曲线较平缓,但由于纵坐标是指数级的,可见数量较多的犯罪类型占比较大,从中可知该城市的主要犯罪类型集中在排名靠前的几类中。
sum(cate_group)#总共犯罪案件数量
878049
top6 = list(cate_group.index[:6])
top15 = list(cate_group.index[:15])
total = sum(cate_group)
topsum = 0
for i in top6:
topsum = cate_group[i]+topsum
print('Top6 crimes about:'+str(100*topsum/total)+'%'+' in total')
topsum=0
for i in top15:
topsum+=cate_group[i]
print('Top15 crimes about:'+str(100*topsum/total)+'%'+' in total')
Top6 crimes about:65.8293557649% in total
Top15 crimes about:93.3187100037% in total
果然,60%以上的犯罪类型集中在前6种,90%以上的是前15种
PdDistrict
dis_group = train.groupby(by='PdDistrict').size()
print(len(dis_group))
dis_group
10
PdDistrict
BAYVIEW 89431
CENTRAL 85460
INGLESIDE 78845
MISSION 119908
NORTHERN 105296
PARK 49313
RICHMOND 45209
SOUTHERN 157182
TARAVAL 65596
TENDERLOIN 81809
dtype: int64
dis_group = dis_group/sum(dis_group)
dis_group.index = dis_group.index.map(string.capwords)
dis_group.sort_values(ascending=True,inplace=True)
dis_group.plot(kind='barh',figsize=(15,10),fontsize=10,color=sns.color_palette(