Kaggle: San Francisco Crime Classification

This Kaggle competition asks you to classify crime incidents in the city of San Francisco. It is a multiclass classification problem; the provided features include the time, location, and a description of each incident.

Importing the packages and data

#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time as systime
import datetime as dt
import string
import seaborn as sns
import matplotlib.colors as colors
%matplotlib inline
 
 
#competition CSVs from Kaggle, assumed to be in the working directory
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
 
 
train.info()
 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB

 
 
train.shape
 
 
(878049, 9)
train.head(3)
 
 
                 Dates        Category                  Descript  DayOfWeek PdDistrict      Resolution                    Address           X          Y
0  2015-05-13 23:53:00        WARRANTS            WARRANT ARREST  Wednesday   NORTHERN  ARREST, BOOKED         OAK ST / LAGUNA ST -122.425892  37.774599
1  2015-05-13 23:53:00  OTHER OFFENSES  TRAFFIC VIOLATION ARREST  Wednesday   NORTHERN  ARREST, BOOKED         OAK ST / LAGUNA ST -122.425892  37.774599
2  2015-05-13 23:33:00  OTHER OFFENSES  TRAFFIC VIOLATION ARREST  Wednesday   NORTHERN  ARREST, BOOKED  VANNESS AV / GREENWICH ST -122.424363  37.800414
test.shape
 
 
(884262, 7)
test.head(3)
 
 
   Id                Dates DayOfWeek PdDistrict                  Address           X          Y
0   0  2015-05-10 23:59:00    Sunday    BAYVIEW  2000 Block of THOMAS AV -122.399588  37.735051
1   1  2015-05-10 23:51:00    Sunday    BAYVIEW       3RD ST / REVERE AV -122.391523  37.732432
2   2  2015-05-10 23:50:00    Sunday   NORTHERN   2000 Block of GOUGH ST -122.426002  37.792212

Data analysis

train.isnull().sum()
 
 
Dates         0
Category      0
Descript      0
DayOfWeek     0
PdDistrict    0
Resolution    0
Address       0
X             0
Y             0
dtype: int64

 
 

Category

#groupby().size() returns the per-group counts
cate_group = train.groupby(by='Category').size()
cate_group
#the result is a Series
 
 

Category
ARSON 1513
ASSAULT 76876
BAD CHECKS 406
BRIBERY 289
BURGLARY 36755
DISORDERLY CONDUCT 4320
DRIVING UNDER THE INFLUENCE 2268
DRUG/NARCOTIC 53971
DRUNKENNESS 4280
EMBEZZLEMENT 1166
EXTORTION 256
FAMILY OFFENSES 491
FORGERY/COUNTERFEITING 10609
FRAUD 16679
GAMBLING 146
KIDNAPPING 2341
LARCENY/THEFT 174900
LIQUOR LAWS 1903
LOITERING 1225
MISSING PERSON 25989
NON-CRIMINAL 92304
OTHER OFFENSES 126182
PORNOGRAPHY/OBSCENE MAT 22
PROSTITUTION 7484
RECOVERED VEHICLE 3138
ROBBERY 23000
RUNAWAY 1946
SECONDARY CODES 9985
SEX OFFENSES FORCIBLE 4388
SEX OFFENSES NON FORCIBLE 148
STOLEN PROPERTY 4540
SUICIDE 508
SUSPICIOUS OCC 31414
TREA 6
TRESPASS 7326
VANDALISM 44725
VEHICLE THEFT 53781
WARRANTS 42214
WEAPON LAWS 8555
dtype: int64

#number of distinct target categories
cat_num = len(cate_group.index)
cat_num
 
 
39
cate_group.index = cate_group.index.map(string.capwords)
cate_group.sort_values(ascending=False,inplace=True)
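string.capwords capitalizes the first letter of each whitespace-separated word and lowercases the rest, which makes the all-caps category names easier to read on the axis; for example:

string.capwords('LARCENY/THEFT')
#'Larceny/theft'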
 
 
cate_group.plot(kind='bar',logy=True,figsize=(15,10),color=sns.color_palette('coolwarm',cat_num))
plt.title('No. of Crime types',fontsize=20)
plt.show()
 
 

png

Note: make sure you are running a recent pandas version, so that passing the color or colormap argument to the plot call actually produces differently colored bars. I was originally on 0.20.0, where passing color gave identically colored bars; after some head-scratching I found the issue on StackOverflow, and upgrading pandas to 0.23.0 solved it.
Although the bars decline fairly gently, the y-axis is log-scaled, so the most frequent crime types account for a disproportionately large share: the city's crime is concentrated in the top-ranked categories.

sum(cate_group)  #total number of recorded incidents
 
 
878049
top6 = list(cate_group.index[:6])
top15 = list(cate_group.index[:15])
total = sum(cate_group)

topsum = 0
for i in top6:
    topsum = cate_group[i]+topsum
print('Top6 crimes about:'+str(100*topsum/total)+'%'+' in total')

topsum=0
for i in top15:
    topsum+=cate_group[i]
print('Top15 crimes about:'+str(100*topsum/total)+'%'+' in total')
 
 

Top6 crimes about:65.8293557649% in total
Top15 crimes about:93.3187100037% in total

Sure enough: the top 6 categories cover more than 60% of all incidents, and the top 15 cover more than 90%.
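As an aside, the same percentages can be read off a cumulative sum instead of the loops above; a minimal equivalent sketch, assuming cate_group is still sorted in descending order:

#cumulative share of incidents covered by the top-k categories
share = cate_group.cumsum()/cate_group.sum()
print(share.iloc[5])   #top 6, ~0.658
print(share.iloc[14])  #top 15, ~0.933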

PdDistrict

dis_group = train.groupby(by='PdDistrict').size()
print(len(dis_group))
dis_group
 
 
10

 
 

PdDistrict
BAYVIEW 89431
CENTRAL 85460
INGLESIDE 78845
MISSION 119908
NORTHERN 105296
PARK 49313
RICHMOND 45209
SOUTHERN 157182
TARAVAL 65596
TENDERLOIN 81809
dtype: int64

dis_group = dis_group/sum(dis_group)  #convert counts to shares
 
 
dis_group.index = dis_group.index.map(string.capwords)
dis_group.sort_values(ascending=True,inplace=True)
dis_group.plot(kind='barh',figsize=(15,10),fontsize=10,color=sns.color_palette('coolwarm',10))
plt.title('Frequency of crimes by district',fontsize=20)
plt.show()
 
 

png

The differences between districts are substantial: Southern has the highest crime rate, while Richmond looks the safest.

year/month/day

#convert the object column to datetime
train['date'] = pd.to_datetime(train['Dates'])
 
 
train.head(1)
 
 
                 Dates  Category        Descript  DayOfWeek PdDistrict      Resolution             Address           X          Y                date
0  2015-05-13 23:53:00  WARRANTS  WARRANT ARREST  Wednesday   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599 2015-05-13 23:53:00
train['year'] = train.date.dt.year
train['month'] = train.date.dt.month
train['day'] = train.date.dt.day
train['hour'] = train.date.dt.hour
 
 
plt.figure(figsize=(8,19))

year_group = train.groupby('year').size()
plt.subplot(311)
plt.plot(year_group.index[:-1],year_group[:-1],'ks-')  #drop 2015, which only runs to May
plt.xlabel('year')

month_group = train.groupby('month').size()
plt.subplot(312)
plt.plot(month_group,'ks-')
plt.xlabel('month')

day_group = train.groupby('day').size()
plt.subplot(313)
plt.plot(day_group,'ks-')
plt.xlabel('day')

plt.show()
 
 

png

Before 2010, crime counts in SF were broadly declining; after 2010 they rose sharply. Within a year, May and October are the peak months, and there is a slight uptick at the beginning and end of each month.

Day of week

week_group = train.groupby(['DayOfWeek','hour']).size()  #group by two keys
week_group = week_group.unstack()  #pivot the inner index level into columns

week_group.T.plot(figsize=(12,8))  #transpose so hours run along the x-axis
plt.xlabel('hour of day',size=15)
plt.ylabel('Number of crimes',size=15)
plt.show()
 
 

png

Incidents peak around 12:00 and 18:00 and fall off markedly after midnight; on Friday and Saturday nights the rate after 20:00 is higher than on other days.

When and where the top crimes occur

Here we analyze the 6 most frequent crime categories:

hour

tmp = train[train['Category'].map(string.capwords).isin(top6)]
tmp_group = tmp.groupby(['Category','hour']).size()
tmp_group = tmp_group.unstack()
tmp_group.T.plot(figsize=(12,6),style='o-')
plt.show()
 
 

png

The timing is consistent with the analysis above: larceny/theft peaks at 12:00 and 18:00, while assault shows no decline after 18:00.

PdDistrict

tmp2 = tmp.groupby(['Category','PdDistrict']).size()
tmp2.unstack()
 
 
PdDistrict      BAYVIEW  CENTRAL  INGLESIDE  MISSION  NORTHERN  PARK  RICHMOND  SOUTHERN  TARAVAL  TENDERLOIN
Category
ASSAULT            9857     6977       8533    11149      8318  3515      3202     12183     5463        7679
DRUG/NARCOTIC      4498     1805       2373     8757      4511  2573       999      9228     1531       17696
LARCENY/THEFT     10119    25060      10236    18223     28630  9146      9893     41845    11845        9903
NON-CRIMINAL       6099    10940       6853    12372     10240  5925      5744     19745     6919        7467
OTHER OFFENSES    17053     8901      13203    19330     12233  6184      5632     21308     8614       13724
VEHICLE THEFT      7219     4210       8960     7148      6291  3963      4117      4725     6142        1006
tmp2.unstack().T.plot(kind='bar',figsize=(12,6),rot=45)
plt.show()
 
 

png

Southern, the district with the highest crime rate, has the most theft and assault incidents but relatively few vehicle thefts, so it may be a poorer area; in the safer districts, Park and Richmond, drug and assault cases make up a noticeably smaller share.

DayOfWeek

tmp3 = tmp.groupby(['Category','DayOfWeek']).size()
tmp3 = tmp3.unstack()
 
 
tmp3.sum(axis=1)[0]  #row total for ASSAULT, as a sanity check
 
 
76876
tmp3.iloc[0]
 
 

DayOfWeek
Friday 11160
Monday 10560
Saturday 11995
Sunday 12082
Thursday 10246
Tuesday 10280
Wednesday 10553
Name: ASSAULT, dtype: int64

#normalize each row so it shows the day-of-week distribution within the category
for i in range(6):
    tmp3.iloc[i] = tmp3.iloc[i]/tmp3.sum(axis=1)[i]
tmp3
 
 
DayOfWeek         Friday    Monday  Saturday    Sunday  Thursday   Tuesday  Wednesday
Category
ASSAULT         0.145169  0.137364  0.156030  0.157162  0.133280  0.133722   0.137273
DRUG/NARCOTIC   0.137481  0.144948  0.118397  0.113820  0.156640  0.157010   0.171703
LARCENY/THEFT   0.154969  0.134763  0.155615  0.138079  0.139594  0.136975   0.140006
NON-CRIMINAL    0.151499  0.139268  0.151749  0.140546  0.138878  0.138001   0.140059
OTHER OFFENSES  0.147311  0.140963  0.135748  0.122498  0.146312  0.149062   0.158105
VEHICLE THEFT   0.160149  0.137818  0.150964  0.139529  0.138636  0.135048   0.137855
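Side note: the row-by-row loop above can be replaced with a single vectorized call; an equivalent sketch, applied to the raw counts:

#divide every row by its own row sum in one step
tmp3 = tmp3.div(tmp3.sum(axis=1), axis=0)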
wkm = {
    'Monday':0,
    'Tuesday':1,
    'Wednesday':2,
    'Thursday':3,
    'Friday':4,
    'Saturday':5,
    'Sunday':6
}
tmp3.columns = tmp3.columns.map(wkm)
 
 
tmp3 = tmp3.sort_index(axis=1)  #order the columns Monday(0) through Sunday(6); replaces the deprecated .ix
tmp3
 
 
DayOfWeek              0         1         2         3         4         5         6
Category
ASSAULT         0.137364  0.133722  0.137273  0.133280  0.145169  0.156030  0.157162
DRUG/NARCOTIC   0.144948  0.157010  0.171703  0.156640  0.137481  0.118397  0.113820
LARCENY/THEFT   0.134763  0.136975  0.140006  0.139594  0.154969  0.155615  0.138079
NON-CRIMINAL    0.139268  0.138001  0.140059  0.138878  0.151499  0.151749  0.140546
OTHER OFFENSES  0.140963  0.149062  0.158105  0.146312  0.147311  0.135748  0.122498
VEHICLE THEFT   0.137818  0.135048  0.137855  0.138636  0.160149  0.150964  0.139529
tmp3.T.plot(figsize=(12,6),style='o-')
plt.xlabel("weekday",size=20)
#plt.axes.set_xticks([])
plt.xticks([0,1,2,3,4,5,6],['Mon','Tue','Wed','Thur','Fri','Sat','Sun'])
plt.show()
 
 

png

The outlier is drug-related crime, which peaks on Wednesday and drops sharply at the weekend; most other categories, except other offenses, increase on Friday and Saturday.

month

mon_g = tmp.groupby(['Category','month']).size()
mon_g = mon_g.unstack()
for i in range(6):
    mon_g.iloc[i] = mon_g.iloc[i]/mon_g.sum(axis=1)[i]
mon_g.T.plot(figsize=(12,6),style='o-')
plt.show()
 
 

png

The per-category monthly trends largely match the overall pattern: February to June and August to December are the high season, while drug and other offense cases are relatively more frequent in January and February.

Time trends of the top crimes

ddf = tmp.groupby(['Category',pd.Grouper('date')]).size()
ddf = ddf.unstack().fillna(0)
 
 
ddf = ddf.T  #put the timestamps on the index so resample can aggregate over time
ddf.index
 
 

DatetimeIndex(['2015-05-13 23:53:00', '2015-05-13 23:33:00',
               '2015-05-13 23:30:00', '2015-05-13 23:00:00',
               '2015-05-13 22:58:00', '2015-05-13 22:30:00',
               '2015-05-13 22:06:00', '2015-05-13 22:00:00',
               '2015-05-13 21:55:00', '2015-05-13 21:40:00',
               ...
               '2003-01-06 02:00:00', '2003-01-06 01:54:00',
               '2003-01-06 01:50:00', '2003-01-06 01:36:00',
               '2003-01-06 00:55:00', '2003-01-06 00:40:00',
               '2003-01-06 00:33:00', '2003-01-06 00:31:00',
               '2003-01-06 00:20:00', '2003-01-06 00:01:00'],
              dtype='datetime64[ns]', name='date', length=306742, freq=None)

df2 = ddf.resample('M').sum()  #monthly totals; the old resample('m',how='sum') syntax is deprecated
 
 
plt.style.use('ggplot')
moav = df2.rolling(12).mean()  #12-month moving average, i.e. a smoothing window
i = 1
for cat in df2.columns:
    plt.figure(figsize=(12,15))
    ax = plt.subplot(6,1,i)
    plt.plot(df2.index,df2[cat])
    plt.plot(df2.index,moav[cat])
    plt.title(cat)
    i+=1
 
 

png

png

png

png

png

png

df2.plot()
 
 

png

Different categories clearly evolve differently over time: vehicle theft fell sharply after 2005, perhaps due to a targeted crackdown, whereas theft has trended upward since 2012.

Plotting coordinates on a map

The last two columns of the training and test data are the longitude and latitude of each incident. The analysis above showed that some districts are crime hot spots and that the category mix varies by district, so location is strongly related to the crime category. Here we display the hot spots of a given category on a map.

train[['X','Y']].describe()
 
 
                   X              Y
count  878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000
#show SF map
mapdata = np.loadtxt('sf_map_copyright_openstreetmap_contributors.txt')
plt.figure(figsize=(8,8))
plt.imshow(mapdata,cmap=plt.get_cmap('gray'))
plt.show()
 
 

png

#select the most frequent category, larceny/theft
theft=train[train['Category']=='LARCENY/THEFT']
 
 
#plotting the full training set was very slow on my machine, so take a subset;
#out-of-range coordinates could also be filtered out, e.g.:
#theft['Xok'] = theft[theft.X<-121].X
#theft['Yok'] = theft[theft.Y>40].Y
theft = theft[1:300000]
 
 
asp = mapdata.shape[0]*1.0/mapdata.shape[1]
lon_lat_box = (-122.5247, -122.3366, 37.699, 37.8299)
clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]

plt.figure(figsize=(8,8*asp))
ax = sns.kdeplot(theft.X,theft.Y,clip=clipsize,aspect=1/asp)
#ax = sns.regplot('X', 'Y', data=theft, fit_reg=False)
ax.imshow(mapdata,cmap=plt.get_cmap('gray'),extent=lon_lat_box,aspect=asp)
 
 
<matplotlib.image.AxesImage at 0x1f6a1ec4828>

 
 

png

im = plt.imread('SanFranMap.png')
plt.figure(figsize=(8,8))
ax = sns.kdeplot(theft.X,theft.Y,clip=clipsize,aspect=1/asp)
#ax = sns.regplot('X', 'Y', data=theft, fit_reg=False)
ax.imshow(im,cmap=plt.get_cmap('gray'),extent=lon_lat_box,aspect=asp)
 
 
<matplotlib.image.AxesImage at 0x1f6a22434e0>

 
 

png

Because everything happens within a single city, the X/Y (longitude/latitude) values span a very narrow range; after standardization, these numbers pin down a location only vaguely, so on their own they separate the classes poorly. Location still matters for the category, though, so for now we use PdDistrict instead.
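For completeness: if X and Y were kept as inputs, they could be scaled with the MinMaxScaler imported in the next section; a hypothetical sketch, not used in the final feature set:

from sklearn.preprocessing import MinMaxScaler

#rescale the coordinates to [0, 1]; over such a narrow range the scaled
#values separate the classes only weakly, hence the choice of PdDistrict
xy_scaled = MinMaxScaler().fit_transform(train[['X','Y']])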

Data preprocessing

Categorical features: Dates, Descript, DayOfWeek, PdDistrict, Resolution, Address
Numeric features: X, Y, year, month, day, hour
Datetime feature: date

from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
#from sklearn.feature_selection import SelectKBest
#from sklearn.feature_selection import chi2
 
 

 
 
#apply the same Dates processing to the test set
test['date'] = pd.to_datetime(test['Dates'])
test['year'] = test.date.dt.year
test['month'] = test.date.dt.month
test['day'] = test.date.dt.day
test['hour'] = test.date.dt.hour
test.info()
 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884262 entries, 0 to 884261
Data columns (total 12 columns):
Id            884262 non-null int64
Dates         884262 non-null object
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
date          884262 non-null datetime64[ns]
year          884262 non-null int64
month         884262 non-null int64
day           884262 non-null int64
hour          884262 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(4)
memory usage: 81.0+ MB

 
 
train.info()
 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 14 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
date          878049 non-null datetime64[ns]
year          878049 non-null int64
month         878049 non-null int64
day           878049 non-null int64
hour          878049 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(4), object(7)
memory usage: 93.8+ MB

 
 

Encoding the target labels

#label-encode the classification target

label = preprocessing.LabelEncoder()
target = label.fit_transform(train.Category)
target
 
 
array([37, 21, 21, ..., 16, 35, 12], dtype=int64)
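The encoding is invertible, which is what lets us put the class names back on the submission columns later. The indices follow alphabetical order, so for instance:

label.inverse_transform([37, 21])
#['WARRANTS', 'OTHER OFFENSES'], matching the first two rows of train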

 
 
#set aside the columns the train and test sets do not share
Id = test['Id']
des = train['Descript']
res = train['Resolution']
train.drop(['Category','Descript','Resolution'],axis=1,inplace=True)
test.drop('Id',axis=1,inplace=True)
 
 
#concatenate train and test so both get identical processing
full = pd.concat([train,test],keys=['train','test'])
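Because the concat was given keys=['train','test'], the two halves can later be recovered by label as well as by position; a small sketch of the equivalence:

full.loc['train'].shape  #(878049, 11), same rows as full[:train.shape[0]]
full.loc['test'].shape   #(884262, 11)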
 
 
full.info()
 
 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1762311 entries, (train, 0) to (test, 884261)
Data columns (total 11 columns):
Dates         object
DayOfWeek     object
PdDistrict    object
Address       object
X             float64
Y             float64
date          datetime64[ns]
year          int64
month         int64
day           int64
hour          int64
dtypes: datetime64[ns](1), float64(2), int64(4), object(4)
memory usage: 163.0+ MB

 
 

Selecting features

#one-hot encode DayOfWeek into numeric indicator columns
week = pd.get_dummies(full.DayOfWeek)
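This expands the single DayOfWeek column into seven 0/1 indicator columns, one per weekday, named in alphabetical order:

week.columns.tolist()
#['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']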
 
 
#PdDistrict and Address encode overlapping location information
#keep PdDistrict and one-hot encode it
full.drop('Address',axis=1,inplace=True)
dist = pd.get_dummies(full.PdDistrict)
 
 
#time features
#drop Dates and date, which duplicate the extracted year/month/day/hour
full.drop(['Dates','date'],axis=1,inplace=True)
 
 

Among the numeric time features year, month, day, and hour: the yearly trend differs across categories, the start of the year behaves differently for some categories, and rates change after 18:00, so we add two new indicator features, newy and dark.


full['newy'] = full['month'].apply(lambda x:1 if x==1 or x==2 else 0)
full['dark'] = full['hour'].apply(lambda x:1 if x>=18 and x<=24 else 0)

 
 
hour_dum = pd.get_dummies(full.hour)
 
 
year_dum = pd.get_dummies(full.year)
 
 
month_dum = pd.get_dummies(full.month)
 
 

#drop the raw columns and concatenate the encoded features
full.drop(['month','hour','day','year','DayOfWeek','PdDistrict'],axis=1,inplace=True)

#full = pd.concat(['week','dist','year'],axis=1)
#full.drop('year',axis=1,inplace=True)
full = pd.concat([full,week,dist,year_dum,month_dum,hour_dum,],axis=1)

 
 
 
 
full.isnull().sum()
 
 
newy          0
dark          0
Friday        0
Monday        0
Saturday      0
Sunday        0
Thursday      0
Tuesday       0
Wednesday     0
BAYVIEW       0
CENTRAL       0
INGLESIDE     0
MISSION       0
NORTHERN      0
PARK          0
RICHMOND      0
SOUTHERN      0
TARAVAL       0
TENDERLOIN    0
2003          0
2004          0
2005          0
2006          0
2007          0
2008          0
2009          0
2010          0
2011          0
             ..
7             0
8             0
9             0
10            0
11            0
12            0
0             0
1             0
2             0
3             0
4             0
5             0
6             0
7             0
8             0
9             0
10            0
11            0
12            0
13            0
14            0
15            0
16            0
17            0
18            0
19            0
20            0
21            0
22            0
23            0
Length: 70, dtype: int64

 
 

Creating the training and validation sets

#split off 30% of the training data for validation, using all features
training,valid,y_train,y_valid = train_test_split(full[:train.shape[0]],target,train_size=0.7,random_state=0)
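With 39 rather imbalanced classes, the split could also be stratified so that rare categories appear in both halves in the same proportion; a hypothetical variant:

training,valid,y_train,y_valid = train_test_split(
    full[:train.shape[0]], target, train_size=0.7, random_state=0, stratify=target)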
 
 
training.shape
 
 
(614634, 68)

 
 

model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
import time
 
 
training.shape
 
 
(614634, 68)
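All models below are compared on multiclass log loss, the competition metric. As a reference for what sklearn's log_loss computes with integer-encoded labels, a minimal hand-rolled sketch:

#mean negative log-probability assigned to each sample's true class
def multiclass_logloss(y_true, proba, eps=1e-15):
    proba = np.clip(proba, eps, 1-eps)  #keep log() away from zero
    return -np.log(proba[np.arange(len(y_true)), y_true]).mean()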

 
 

Logistic regression

LR = LogisticRegression(C=0.1)
lrstart = time.time()
LR.fit(training, y_train)
lrcost_time = time.time()-lrstart
predicted = np.array(LR.predict_proba(valid))
print("logistic regression log loss: %f" %(log_loss(y_valid, predicted)))
print('logistic regression training time: %f s' %(lrcost_time))
 
 
logistic regression log loss: 2.596991
logistic regression training time: 130.701451 s

 
 

Naive Bayes

NB = BernoulliNB()
nbstart = time.time()
NB.fit(training,y_train)
nbcost_time = time.time()-nbstart
predicted = np.array(NB.predict_proba(valid))
print("naive Bayes log loss: %f" %(log_loss(y_valid, predicted)))
print("naive Bayes training time: %f s" %(nbcost_time))
 
 
naive Bayes log loss: 2.607965
naive Bayes training time: 1.765910 s

 
 
train_all = np.c_[training,y_train]  #append the labels as the last column for export
train_all.shape
 
 
(614634, 69)

 
 
np.savetxt('/forBP/train.csv',train_all,fmt='%d',delimiter=',')
 
 

Random forest

from sklearn.ensemble import RandomForestClassifier

params = [12,13,14,15,16]
for par in params:
    clf = RandomForestClassifier(n_estimators=30, max_depth=par)
    #forest_start = time.time()
    clf.fit(training,y_train)
    #fcost = time.time()-forest_start
    predicted = np.array(clf.predict_proba(valid))
    print("random forest log loss: %f" %(log_loss(y_valid, predicted)))
    #print("random forest training time: %f s" %(fcost))
 
 
random forest log loss: 2.575974
random forest log loss: 2.568528
random forest log loss: 2.563786
random forest log loss: 2.559156
random forest log loss: 2.555832
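The loss is still falling at max_depth=16, so deeper trees look worth trying; a hypothetical follow-up using a grid search over both parameters (not run here):

from sklearn.model_selection import GridSearchCV

#3-fold search over tree count and depth, scored by (negated) log loss
param_grid = {'n_estimators':[30,50], 'max_depth':[16,18,20]}
gs = GridSearchCV(RandomForestClassifier(), param_grid, scoring='neg_log_loss', cv=3)
gs.fit(training, y_train)
print(gs.best_params_, -gs.best_score_)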

 
 
#write the result; keep the row index as the Id column Kaggle expects
result = NB.predict_proba(full[train.shape[0]:])
submission = pd.DataFrame(result,columns=label.classes_)
submission.to_csv('SFresult_v1.csv',index=True,index_label='Id')
 
 
submission.shape
 
 
(884262, 39)

 
 

Here we tried logistic regression, naive Bayes, and a random forest, with log loss as the objective metric. Naive Bayes trains quickly with decent accuracy, and the ensemble learner improves accuracy considerably, so next steps could be trying other ensemble methods or tuning the random forest's hyperparameters.
The feature handling here is fairly basic: the features cover time and location, all categorical, and were simply one-hot encoded. Next steps could include PCA for dimensionality reduction, or selecting and engineering new features. The text feature Descript was left unused; it could support a text-classification approach, or a keyword analysis that sheds more light on the catch-all other offenses category.
Although the sample count is larger than in my previous competitions, the feature count is still modest; as a next step I plan to feed the processed data into a BP neural network built with TensorFlow.


I'm still a beginner, so feedback is very welcome!
