Competition overview
Given user information and event information, predict which events a user will be interested in.
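Dataset
There are six files: train.csv, test.csv, users.csv, user_friends.csv, events.csv, and event_attendees.csv.
train.csv has six columns:
user: user ID
event: event ID
invited: whether the user was invited
timestamp: timestamp
interested: whether the user marked the event as interested
not_interested: whether the user marked the event as not interested
test.csv has four columns (the same as train.csv, minus interested and not_interested).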
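users.csv has seven columns:
user_id: user ID
locale: the user's locale
birthyear: the user's year of birth
gender: gender
joinedAt: when the user first used the app
location: the user's location
timezone: UTC offset
user_friends.csv holds social data about each user and has two columns, user and friends:
user: user ID
friends: IDs of the user's friends (space-separated)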
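events.csv holds data about the events and has 110 columns. The first nine are event_id, user_id, start_time, city, state, zip, country, lat, and lng:
event_id: event ID
user_id: ID of the user who created the event
start_time: start time
city, state, zip, country: venue details
lat and lng: latitude and longitude
count_1, count_2, ..., count_100: the 100 most common word stems across event names and descriptions (stemming strips tense and voice), ranked by frequency; count_N is the number of times the N-th most common stem appears in this event.
count_other: the count of all remaining words.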
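event_attendees.csv records which users responded to each event and has the columns event_id, yes, maybe, invited, and no:
event_id: event ID (the code below reads this column as event)
yes: users who said they will attend
maybe: users who said they might attend
invited: users who were invited
no: users who said they will not attend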
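The friends column, and the yes/maybe/no columns that the later code splits on spaces, store ID lists as a single string, and some cells are empty. A minimal sketch of a parsing helper, assuming missing cells arrive as None or NaN; the name split_ids is hypothetical and not part of the original code:

```python
import math

def split_ids(cell):
    """Turn a space-separated ID string into a list of ints; missing cells give []."""
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    return [int(token) for token in str(cell).split()]

print(split_ids("3197468391 3537982273"))
print(split_ids(float("nan")))
```

Converting to int matters later on: the lookup dictionaries built from train.csv and test.csv are keyed by the integer IDs that pandas reads, so string tokens would never match.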
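In short, there are three kinds of data:
- user information
- user social information
- information about the events themselves
Exploring the data is a must before doing anything else, but this post does not go through it in detail.
Starting from the brief description above, load each file, look at its basic structure, and build up your understanding of the data. This step is very important!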
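Initial thoughts:
1. The training set has very few dimensions, so more have to be built from the supplied user and event data.
2. Collaborative filtering is based on historical user-event interaction data.
3. The social data and the event information should be factored in as influences on the final result.
4. Treat it as a classification problem: whether each person is interested or not is the target, and everything that influences the outcome is a feature.
5. Those features include the recommendation scores produced by collaborative filtering.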
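Sketch of the initial approach (figure):
The data is fairly large overall, and new training features have to be built from the user and event data, so the work is split into several steps. Intermediate results are saved along the way and loaded again when needed.
The steps are:
1. Process users and events, and build the collaborative-filtering score matrix from the training data
2. Extract user similarity from the user data
3. Build features from the user social data
4. Build the event similarity data
5. Compute each event's popularity
6. Combine the data built above with the training and test data to extract features and form the final training and test sets
7. Build models on the training set from the previous step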
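Part 1. Processing the base user and event data
# Import the libraries used throughout. If you run the steps separately, import them in each one.
# They are not imported again later in this post.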
import numpy as np
import pandas as pd
import datetime
import itertools
import hashlib
import _pickle
import scipy.io as sio
import scipy.sparse as ss
import scipy.spatial.distance as ssd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from sklearn.preprocessing import normalize
# Load the training and test data
train=pd.read_csv(r'E:\python\all\first\train.csv')
test=pd.read_csv(r'E:\python\all\first\test.csv')
# Stack the user and event columns of train and test to collect the distinct values
data=pd.concat([train[['user','event']],test[['user','event']]],axis=0)
# Distinct users and events across train and test
uniqueUsers=set(data['user'])
uniqueEvents=set(data['event'])
# Build the user -> events and event -> users mappings
data=data.reset_index(drop=True)
eventsForUser=defaultdict(set)
usersForEvent=defaultdict(set)
for i in range(len(data)):
    eventsForUser[data['user'][i]].add(data['event'][i])
    usersForEvent[data['event'][i]].add(data['user'][i])
# Build the user -> index and event -> index dictionaries
userIndex=dict()
eventIndex=dict()
for i,u in enumerate(uniqueUsers):
    userIndex[u]=i
for i,e in enumerate(uniqueEvents):
    eventIndex[e]=i
# Build the user-event score matrix; scores take the values -1, 0, 1
userEventScores=ss.dok_matrix((len(uniqueUsers),len(uniqueEvents)))
for index in range(len(train)):
    i=userIndex[train['user'][index]]
    j=eventIndex[train['event'][index]]
    userEventScores[i,j]=int(train['interested'][index])-int(train['not_interested'][index])
# Save the user-event score matrix
sio.mmwrite(r'E:\python\all\third\PE_userEventScores',userEventScores)
# Collect the pairs of users that share an event, and the pairs of events that share a user
uniqueUserPairs=set()
uniqueEventPairs=set()
for event in uniqueEvents:
    users=usersForEvent[event]
    if len(users)>2:
        uniqueUserPairs.update(itertools.combinations(users,2))
for user in uniqueUsers:
    events=eventsForUser[user]
    if len(events)>2:
        uniqueEventPairs.update(itertools.combinations(events,2))
# Save the results
_pickle.dump(userIndex,open(r'E:\python\all\third\PE_userIndex.pkl','wb'))
_pickle.dump(eventIndex,open(r'E:\python\all\third\PE_eventIndex.pkl','wb'))
_pickle.dump(uniqueUserPairs,open(r'E:\python\all\secend\PE_uniqueUserPairs.pkl','wb'))
_pickle.dump(uniqueEventPairs,open(r'E:\python\all\secend\PE_uniqueEventPairs.pkl','wb'))
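The code above collects every user and event that appears in train and test, and builds a matrix of how interested each user is in each event (values 1, 0, -1, much like a rating for the event).
The sets and matrices built in this step are essential, so make sure you understand them:
Total users: 3391
Total events: 13418
uniqueUsers: set of all user IDs in train.csv and test.csv
uniqueEvents: set of all event IDs in train.csv and test.csv
userIndex: dictionary mapping each user to an index
eventIndex: dictionary mapping each event to an index
eventsForUser: dictionary; the key is a user and the value is the set of events for that user
usersForEvent: dictionary; the key is an event and the value is the set of users for that event
userEventScores: sparse matrix that stores only the non-zero values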
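To make these structures concrete, here is a toy version of the same construction on a three-row table. The data and the lowercase names are invented for illustration; sorting the IDs before indexing is an extra choice made here so that the indices are reproducible, which the code above does not do.

```python
import pandas as pd
import scipy.sparse as ss

# Hypothetical miniature of train.csv
toy = pd.DataFrame({
    "user": [10, 10, 20],
    "event": [100, 200, 100],
    "interested": [1, 0, 0],
    "not_interested": [0, 1, 0],
})

# Map each distinct ID to a row or column index
user_index = {u: i for i, u in enumerate(sorted(set(toy["user"])))}
event_index = {e: j for j, e in enumerate(sorted(set(toy["event"])))}

# Score = interested - not_interested, so each cell is 1, 0, or -1
scores = ss.dok_matrix((len(user_index), len(event_index)))
for row in toy.itertuples(index=False):
    scores[user_index[row.user], event_index[row.event]] = row.interested - row.not_interested

dense = scores.toarray()
print(dense)
```

User 10 liked event 100 and disliked event 200, so its row holds 1 and -1; user 20 saw event 100 without reacting, so its row stays at 0.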
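Part 2. Building the user similarity matrix
# Load the users data
df_users=pd.read_csv(r'E:\python\all\first\users.csv')
# Descriptive overview
df_users.head()
df_users.info()
# Preprocess the user table
# Missing values
df_users.isnull().sum()
# 1. gender (categorical; missing values become a category of their own)
df_users['gender']=df_users['gender'].fillna('NaN')
# Integer label encoding (each category is mapped to an integer code)
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df_users['gender']=le.fit_transform(df_users['gender'])
# Pass the column as x= so that recent seaborn versions accept it
sns.countplot(x=df_users['gender'])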
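# 2. joinedAt (a date; concatenating the year and month is enough)
def getJoinedYearMonth(dateString):
    try:
        dttm=datetime.datetime.strptime(dateString,'%Y-%m-%dT%H:%M:%S.%fZ')
        # The month is not zero-padded: 2012-01 becomes 20121 and 2012-11 becomes 201211
        return int(''.join([str(dttm.year),str(dttm.month)]))
    except (TypeError,ValueError):
        # Missing or malformed dates map to 0
        return 0
df_users['joinedAt']=df_users['joinedAt'].map(getJoinedYearMonth)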
# 3. locale (integer label encoding)
df_users['locale']=le.fit_transform(df_users['locale'])
# 4. birthyear: convert directly to a number
def getBirthYearInt(birthYear):
    try:
        return 0 if birthYear=='None' else int(birthYear)
    except (TypeError,ValueError):
        return 0
df_users['birthyear']=df_users['birthyear'].map(getBirthYearInt)
# 5. timezone
def getTimezoneInt(timezone):
    try:
        return int(timezone)
    except (TypeError,ValueError):
        # NaN and non-numeric values map to 0
        return 0
df_users['timezone']=df_users['timezone'].map(getTimezoneInt)
# 6. location (fill NaN, then label-encode)
df_users['location']=df_users['location'].fillna('NaN')
df_users['location']=le.fit_transform(df_users['location'])
# Preprocessing done
df_users.info()
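One detail of getJoinedYearMonth above is worth seeing in isolation: it joins str(year) and str(month) without padding, so January 2012 becomes 20121 while November 2012 becomes 201211, and the resulting numbers no longer sort chronologically. The sketch below shows a zero-padded alternative; the function name year_month is hypothetical, and switching to it would change the feature values relative to the original code.

```python
import datetime

def year_month(date_string):
    """Encode an ISO-like timestamp as the integer YYYYMM; bad input gives 0."""
    try:
        dttm = datetime.datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%S.%fZ")
    except (TypeError, ValueError):
        return 0
    # year * 100 + month keeps two digits for the month
    return dttm.year * 100 + dttm.month

print(year_month("2012-01-05T00:00:00.000Z"))
print(year_month("2012-11-05T00:00:00.000Z"))
```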
## Build the user similarity matrix
# Load the data produced in Part 1
userIndex=_pickle.load(open(r'E:\python\all\third\PE_userIndex.pkl','rb'))
uniqueUserPairs=_pickle.load(open(r'E:\python\all\secend\PE_uniqueUserPairs.pkl','rb'))
nusers=len(userIndex.keys())
# Build the user feature matrix (every column of df_users except user_id)
userMatrix=ss.dok_matrix((nusers,df_users.shape[1]-1))
colnames=df_users.columns
for j in range(len(df_users)):
    if df_users['user_id'][j] in userIndex.keys():
        i=userIndex[df_users['user_id'][j]]
        for m in range(1,len(colnames)):
            userMatrix[i,m-1]=df_users[colnames[m]][j]
# L1-normalize each column
userMatrix=normalize(userMatrix,norm='l1',axis=0,copy=False)
sio.mmwrite(r'E:\python\all\secend\US_userMatrix',userMatrix)
# Compute the user similarity matrix
userSimMatrix=ss.dok_matrix((nusers,nusers))
for i in range(0,nusers):
    userSimMatrix[i,i]=1.0
# Note: ssd.correlation returns the correlation distance (1 - Pearson r), not r itself
sim=ssd.correlation
for u1,u2 in uniqueUserPairs:
    i=userIndex[u1]
    j=userIndex[u2]
    if (i,j) not in userSimMatrix:
        # ssd.correlation expects 1-D arrays, so flatten each sparse row first
        usim=sim(userMatrix.getrow(i).toarray().ravel(),userMatrix.getrow(j).toarray().ravel())
        userSimMatrix[i,j]=usim
        userSimMatrix[j,i]=usim
sio.mmwrite(r'E:\python\all\third\US_userSimMatrix',userSimMatrix)
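The matrix above is called a similarity matrix, but scipy's ssd.correlation returns a distance: 0 for perfectly correlated vectors and 2 for perfectly anti-correlated ones. The toy vectors below are made up to show that behaviour, along with the usual conversion 1 - distance if an actual similarity is wanted. Whether to apply that conversion in the pipeline is left open here; it would change the resulting features.

```python
import numpy as np
import scipy.spatial.distance as ssd

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])   # perfectly correlated with a
c = np.array([4.0, 3.0, 2.0, 1.0])   # perfectly anti-correlated with a

dist_ab = ssd.correlation(a, b)      # close to 0
dist_ac = ssd.correlation(a, c)      # close to 2

# Pearson correlation recovered from the distance
sim_ab = 1.0 - dist_ab
sim_ac = 1.0 - dist_ac
print(dist_ab, dist_ac, sim_ab, sim_ac)
```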
Part 3. Mining the users' social relationships
# Load the user-friends table
user_friends=pd.read_csv(r'E:\python\all\first\user_friends.csv')
# Load the data needed from Part 1
userIndex=_pickle.load(open(r'E:\python\all\third\PE_userIndex.pkl','rb'))
userEventScores=sio.mmread(r'E:\python\all\third\PE_userEventScores').todense()
# Build the friend-count vector and the matrix of friends' event scores
nusers=len(userIndex.keys())
numFriends=np.zeros((nusers))
userFriends=ss.dok_matrix((nusers,nusers))
for j in range(len(user_friends)):
    user=user_friends['user'][j]
    if user in userIndex:
        # The friends cell can be missing (NaN), in which case there is nothing to split
        raw=user_friends['friends'][j]
        friends=raw.split() if isinstance(raw,str) else []
        i=userIndex[user]
        numFriends[i]=len(friends)
        for friend in friends:
            # userIndex is keyed by integer IDs, so convert the string token before the lookup
            if friend.isdigit() and int(friend) in userIndex:
                m=userIndex[int(friend)]
                # Average score the friend gave across all events
                score=userEventScores[m,:].sum()/userEventScores.shape[1]
                userFriends[i,m]+=score
                userFriends[m,i]+=score
# Normalize (L1 norm: after normalization the absolute values sum to 1)
sumNumFriends=numFriends.sum(axis=0)
numFriends=numFriends/sumNumFriends
sio.mmwrite(r'E:\python\all\third\UF_numFriends',np.matrix(numFriends))
userFriends=normalize(userFriends,norm='l1',axis=0,copy=False)
sio.mmwrite(r'E:\python\all\third\UF_userFriends',userFriends)
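The call normalize(..., norm='l1', axis=0) appears throughout this post. With axis=0 it rescales each column so that the absolute values in that column sum to 1. A small made-up matrix shows the effect:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1.0, -2.0],
              [3.0,  2.0]])

# Column 0 has absolute sum 4, and so does column 1
Xn = normalize(X, norm="l1", axis=0)
print(Xn)
print(np.abs(Xn).sum(axis=0))
```

Signs are preserved, which matters for matrices such as userFriends and eventPopularity where negative entries are possible.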
Part 4. Event similarity matrices
# Load the event data
event=pd.read_csv(r'E:\python\all\first\events.csv')
# Descriptive overview of the first nine columns
event.columns[0:9]
event.iloc[:,0:9].isnull().sum()
event.iloc[:,0:9].info()
# Clean the event data
# 1. Date column
def getJoinedYearMonth(dateString):
    try:
        dttm=datetime.datetime.strptime(dateString,"%Y-%m-%dT%H:%M:%S.%fZ")
        return int("".join([str(dttm.year),str(dttm.month)]))
    except (TypeError,ValueError):
        # Missing or malformed dates map to 0
        return 0
event['start_time']=event['start_time'].map(getJoinedYearMonth)
# 2. Location columns: hash each value to a number with hashlib
def getFeatureHash(value):
    if str(value)=='nan':
        return -1
    else:
        # str() guards against columns such as zip that pandas may parse as numbers
        return int(hashlib.sha224(str(value).encode()).hexdigest()[0:4],16)
event['city']=event['city'].map(getFeatureHash)
event['state']=event['state'].map(getFeatureHash)
event['zip']=event['zip'].map(getFeatureHash)
event['country']=event['country'].map(getFeatureHash)
# 3. Numeric columns
def getFloatValue(value):
    if str(value)=='nan':
        return 0.0
    else:
        return float(value)
event['lat']=event['lat'].map(getFloatValue)
event['lng']=event['lng'].map(getFloatValue)
# Event similarity matrices
# Load the data needed from Part 1
eventIndex=_pickle.load(open(r'E:\python\all\third\PE_eventIndex.pkl','rb'))
uniqueEventPairs=_pickle.load(open(r'E:\python\all\secend\PE_uniqueEventPairs.pkl','rb'))
# Build the event feature matrices
nevents=len(eventIndex.keys())
eventPropMatrix=ss.dok_matrix((nevents,7))
eventContMatrix=ss.dok_matrix((nevents,100))
cols=event.columns
for index in range(len(event)):
    eventId=event['event_id'][index]
    if eventId in eventIndex:
        i=eventIndex[eventId]
        # Columns 2-8: start_time, city, state, zip, country, lat, lng
        for m in range(0,7):
            eventPropMatrix[i,m]=event[cols[m+2]][index]
        # Columns 9-108: count_1 ... count_100
        for j in range(9,109):
            eventContMatrix[i,j-9]=event[cols[j]][index]
eventPropMatrix=normalize(eventPropMatrix,norm='l1',axis=0,copy=False)
sio.mmwrite(r'E:\python\all\secend\EV_eventPropMatrix',eventPropMatrix)
eventContMatrix=normalize(eventContMatrix,norm='l1',axis=0,copy=False)
sio.mmwrite(r'E:\python\all\secend\EV_eventContMatrix',eventContMatrix)
# Build the event similarity matrices
eventPropSim=ss.dok_matrix((nevents,nevents))
eventContSim=ss.dok_matrix((nevents,nevents))
psim=ssd.correlation
csim=ssd.cosine # cosine distance for the word-count data
for e1,e2 in uniqueEventPairs:
    i=eventIndex[e1]
    j=eventIndex[e2]
    if not ((i,j) in eventPropSim):
        # scipy's distance functions expect 1-D arrays, so flatten each sparse row
        epsim=psim(eventPropMatrix.getrow(i).toarray().ravel(),eventPropMatrix.getrow(j).toarray().ravel())
        eventPropSim[i,j]=epsim
        eventPropSim[j,i]=epsim
    # The cosine distance can come back as NaN (it is undefined for a zero vector)
    if not ((i,j) in eventContSim):
        ecsim=csim(eventContMatrix.getrow(i).toarray().ravel(),eventContMatrix.getrow(j).toarray().ravel())
        if np.isnan(ecsim):
            ecsim=0
        eventContSim[i,j]=ecsim
        eventContSim[j,i]=ecsim
sio.mmwrite(r'E:\python\all\third\EV_eventPropSim',eventPropSim)
sio.mmwrite(r'E:\python\all\third\EV_eventContSim',eventContSim)
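The location columns are turned into numbers by keeping the first four hex digits of a SHA-224 digest, which gives a value between 0 and 65535. The standalone sketch below mirrors getFeatureHash under a hypothetical name, feature_hash, and converts the value to a string first so that numeric inputs such as a zip code parsed as an integer do not fail on .encode().

```python
import hashlib

def feature_hash(value):
    """Hash a categorical value to an int in [0, 65535]; missing values give -1."""
    if value is None or str(value) == "nan":
        return -1
    digest = hashlib.sha224(str(value).encode("utf-8")).hexdigest()
    # Four hex digits = 16 bits, so only 65536 distinct codes exist
    return int(digest[:4], 16)

print(feature_hash("Seattle"))
print(feature_hash(float("nan")))
```

With only 65536 buckets, distinct cities can collide, and the codes carry no ordering, so a correlation distance over them is a rough signal at best.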
Part 5. Event popularity
# Load the data
event_attendees=pd.read_csv(r'E:\python\all\first\event_attendees.csv')
eventIndex=_pickle.load(open(r'E:\python\all\third\PE_eventIndex.pkl','rb'))
# Build the event popularity matrix: number of "yes" minus number of "no"
nevents=len(eventIndex)
eventPopularity=ss.dok_matrix((nevents,1))
for index in range(len(event_attendees)):
    eventId=event_attendees['event'][index]
    if eventId in eventIndex:
        i=eventIndex[eventId]
        # A missing cell (NaN) means nobody gave that answer
        if str(event_attendees['yes'][index])=='nan':
            len_y=0
        else:
            len_y=len(event_attendees['yes'][index].split(' '))
        if str(event_attendees['no'][index])=='nan':
            len_n=0
        else:
            len_n=len(event_attendees['no'][index].split(' '))
        eventPopularity[i,0]=len_y-len_n
eventPopularity=normalize(eventPopularity,norm='l1',axis=0,copy=False)
sio.mmwrite(r'E:\python\all\third\EA_eventPopularity',eventPopularity)
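The popularity score is the number of "yes" answers minus the number of "no" answers. The same count can be written with pandas string methods instead of a row loop. This is a sketch on made-up data, with count_ids as a hypothetical helper name; it only computes the raw score, not the index lookup or the normalization.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of event_attendees.csv
attendees = pd.DataFrame({
    "event": [1, 2, 3],
    "yes": ["11 12 13", np.nan, "14"],
    "no": ["15", "16 17", np.nan],
})

def count_ids(col):
    """Number of space-separated IDs per cell; missing cells count as 0."""
    return col.fillna("").str.split().str.len()

attendees["popularity"] = count_ids(attendees["yes"]) - count_ids(attendees["no"])
print(attendees[["event", "popularity"]])
```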
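The steps above produced the following datasets, summarized briefly here
(the feature extraction that follows is built on them):
userIndex # users in the training and test sets
eventIndex # events in the training and test sets
userEventScores # matrix of user scores for events
userSimMatrix # user similarity matrix (from registration data)
eventPropSim # event similarity matrix (event attributes)
eventContSim # event similarity matrix (keywords)
numFriends # number of friends per user
userFriends # how keen each user's friends are on events
eventPopularity # event popularity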
Part 6. Feature construction
# Load the original training and test data
train=pd.read_csv(r'E:\python\all\first\train.csv')
test=pd.read_csv(r'E:\python\all\first\test.csv')
# Keep the label and the columns needed from train and test
y=train['interested']
train=train[['user','event','invited']]
test=test[['user','event','invited']]
# Stack the training and test data
data=pd.concat([train.assign(is_train=1),test.assign(is_train=0)],axis=0)
data=data.reset_index(drop=True)
# Load the data built in Parts 1-5
userIndex=_pickle.load(open(r'E:\python\all\third\PE_userIndex.pkl','rb'))
eventIndex=_pickle.load(open(r'E:\python\all\third\PE_eventIndex.pkl','rb'))
userEventScores=sio.mmread(r'E:\python\all\third\PE_userEventScores').todense()
userSimMatrix=sio.mmread(r'E:\python\all\third\US_userSimMatrix').todense()
eventPropSim=sio.mmread(r'E:\python\all\third\EV_eventPropSim').todense()
eventContSim=sio.mmread(r'E:\python\all\third\EV_eventContSim').todense()
numFriends=sio.mmread(r'E:\python\all\third\UF_numFriends')
userFriends=sio.mmread(r'E:\python\all\third\UF_userFriends').todense()
eventPopularity=sio.mmread(r'E:\python\all\third\EA_eventPopularity').todense()
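# Feature construction
# User-based collaborative filtering (UserCF): a recommendation score for the event
def userReco(userId,eventId):
    i=userIndex[userId]
    j=eventIndex[eventId]
    # Scores that all users gave event j, weighted by their similarity to user i
    vs=userEventScores[:,j]
    sims=userSimMatrix[i,:]
    prod=sims*vs
    try:
        # Subtract the user's own score so that it does not feed its own prediction
        return prod[0,0]-userEventScores[i,j]
    except IndexError:
        return 0
list_ur=[]
for index in range(len(data)):
    list_ur.append(userReco(data['user'][index],data['event'][index]))
data['user_reco']=pd.Series(list_ur)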
# Item-based collaborative filtering (ItemCF): recommendation scores for the event
def eventReco(userId,eventId):
    i=userIndex[userId]
    j=eventIndex[eventId]
    # Scores user i gave all events, weighted by each event's similarity to event j
    js=userEventScores[i,:]
    psim=eventPropSim[:,j]
    csim=eventContSim[:,j]
    pprod=js*psim
    cprod=js*csim
    pscore=0
    cscore=0
    try:
        pscore=pprod[0,0]-userEventScores[i,j]
    except IndexError:
        pass
    try:
        cscore=cprod[0,0]-userEventScores[i,j]
    except IndexError:
        pass
    return pscore,cscore
list_ep=[]
list_ec=[]
for index in range(len(data)):
    pscore,cscore=eventReco(data['user'][index],data['event'][index])
    list_ep.append(pscore)
    list_ec.append(cscore)
data['evt_p_reco']=pd.Series(list_ep)
data['evt_c_reco']=pd.Series(list_ec)
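# Use the number of friends as a proxy for how sociable a user is
# The idea is that a user with many friends may be more inclined to join social events
def userPop(userId):
    if userId in userIndex:
        i=userIndex[userId]
        try:
            return numFriends[0,i]
        except IndexError:
            return 0
    else:
        return 0
list_up=[]
for index in range(len(data)):
    list_up.append(userPop(data['user'][index]))
# Use list_up here (the original assigned list_ep, which duplicated evt_p_reco)
data['user_pop']=pd.Series(list_up)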
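# Influence of friends on the user:
# how many of the user's friends are keen on joining social events
# If a user's circle of friends actively joins events, that may rub off on the user
def friendInfluence(userId):
    nusers=np.shape(userFriends)[1]
    i=userIndex[userId]
    # Sum over the whole row; sum(axis=0)[0,0] on a single row would pick up only its first entry
    return userFriends[i,:].sum()/nusers
list_ff=[]
for index in range(len(data)):
    list_ff.append(friendInfluence(data['user'][index]))
data['frnd_infl']=pd.Series(list_ff)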
# Popularity of the event itself, measured by the number of attendees
def eventPop(eventId):
    i=eventIndex[eventId]
    return eventPopularity[i,0]
list_epp=[]
for index in range(len(data)):
    list_epp.append(eventPop(data['event'][index]))
data['evt_pop']=pd.Series(list_epp)
# Assemble the training and test datasets
ocolnames=['user','event','invited','user_reco','evt_p_reco','evt_c_reco','user_pop','frnd_infl','evt_pop']
train=data[data['is_train']==1][ocolnames]
test=data[data['is_train']==0][ocolnames]
df_train=pd.concat((train,y),axis=1)
test=test.reset_index(drop=True)
# to_csv returns None when given a path, so its result is not assigned back
df_train.to_csv(r'E:\python\all\third\df_train.csv',index=False)
test.to_csv(r'E:\python\all\third\df_test.csv',index=False)
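The user_reco feature is a similarity-weighted sum of the scores other users gave the event, with the user's own score subtracted so that it does not feed its own prediction. A toy version with plain NumPy arrays makes the arithmetic visible. The matrices are invented, and the sims matrix is assumed to hold true similarities with 1.0 on the diagonal.

```python
import numpy as np

# Rows are users, columns are events; values are 1, 0, or -1
scores = np.array([[1.0,  0.0],
                   [1.0, -1.0],
                   [0.0,  1.0]])

# Hypothetical symmetric user-user similarities
sims = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])

def user_reco(i, j):
    """Similarity-weighted score of event j for user i, excluding user i's own score."""
    return sims[i] @ scores[:, j] - scores[i, j]

print(user_reco(0, 0))  # 0.8 * 1 + 0.1 * 0
print(user_reco(2, 0))  # 0.1 * 1 + 0.3 * 1
```

Subtracting scores[i, j] works because the diagonal similarity is 1.0, so the user's own score enters the weighted sum exactly once.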
Part 7. Model building
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the datasets
train=pd.read_csv(r'E:\python\all\third\df_train.csv')
test=pd.read_csv(r'E:\python\all\third\df_test.csv')
# Build the modelling dataset
ocolnames=['invited','user_reco','evt_p_reco','evt_c_reco','user_pop','frnd_infl','evt_pop']
X=train[ocolnames]
y=train['interested']
'''
# Standardization: some models need it. The features were already normalized when they were built, so it can be skipped here
from sklearn.preprocessing import StandardScaler
x_ss=StandardScaler().fit_transform(X)
data=pd.DataFrame(x_ss)
data.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99])
'''
'''
# Reduce to two dimensions with PCA for a quick visual check (optional)
from sklearn.decomposition import PCA
pca=PCA(n_components=2)
x_dr=pca.fit_transform(X)
plt.figure()
plt.scatter(x_dr[y==0,0],x_dr[y==0,1],c='red',label='no')
plt.scatter(x_dr[y==1,0],x_dr[y==1,1],c='black',label='yes')
plt.legend()
plt.title('PCA of Data')
plt.show()
'''
# Split the training data
from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest=train_test_split(X,y,test_size=0.3,random_state=0)
# 1. Logistic regression
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import accuracy_score
# Build one model with L1 regularization and one with L2
lrl1=LR(penalty='l1',solver='liblinear',C=0.5,max_iter=1000)
lrl2=LR(penalty='l2',solver='liblinear',C=0.5,max_iter=1000)
lrl1=lrl1.fit(Xtrain,ytrain)
lrl1.coef_
lrl2=lrl2.fit(Xtrain,ytrain)
lrl2.coef_
lrl1.score(Xtest,ytest)
lrl2.score(Xtest,ytest)
# Logistic regression performs poorly here. Adding polynomial features might help; that is left for you to try
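For readers who want to try the polynomial idea, the sketch below shows the mechanics on synthetic data where a linear boundary cannot work (two concentric circles). It is not run on the competition features, so it says nothing about how much they would benefit; the pipeline layout and every name in it are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: the inner circle is one class, the outer circle the other
X_toy, y_toy = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds x1^2, x1*x2, x2^2
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

linear_acc = cross_val_score(linear, X_toy, y_toy, cv=5, scoring="accuracy").mean()
poly_acc = cross_val_score(poly, X_toy, y_toy, cv=5, scoring="accuracy").mean()
print(linear_acc, poly_acc)
```

To apply it to this post's data, the same poly pipeline could be passed X and y in place of the toy arrays.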
# 2. Support vector machine
from sklearn.svm import SVC
clf=SVC(kernel='rbf',gamma='auto')
clf.fit(Xtrain,ytrain)
clf.score(Xtest,ytest)
clf.score(Xtrain,ytrain)
# k-fold cross-validation gives another view of the model's score
from sklearn.model_selection import cross_val_score
clf=SVC(kernel='rbf',gamma='auto')
score_train=cross_val_score(clf,X,y,cv=10,scoring='accuracy')
score_train.mean()
# Plot a learning curve on the sample data (shows at a glance whether the model over- or underfits)
from sklearn.model_selection import learning_curve
def plot_learning_curve(clf,title,X,y,cv=10,train_sizes=np.linspace(.1,1.0,5)):
    plt.figure()
    plt.title(title)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes,train_scores,test_scores=learning_curve(clf,X,y,cv=cv,train_sizes=train_sizes)
    train_scores_mean=np.mean(train_scores,axis=1)
    train_scores_std=np.std(train_scores,axis=1)
    test_scores_mean=np.mean(test_scores,axis=1)
    test_scores_std=np.std(test_scores,axis=1)
    plt.grid()
    # Shaded bands show one standard deviation around the mean score
    plt.fill_between(train_sizes,train_scores_mean-train_scores_std,
                     train_scores_mean+train_scores_std,alpha=0.1,color='g')
    plt.fill_between(train_sizes,test_scores_mean-test_scores_std,
                     test_scores_mean+test_scores_std,alpha=0.1,color='r')
    plt.plot(train_sizes,train_scores_mean,'o-',color='g',label='training score')
    plt.plot(train_sizes,test_scores_mean,'o-',color='r',label='testing score')
    plt.legend(loc='best')
    return plt
# Learning curve for the SVC model
s=plot_learning_curve(SVC(kernel='rbf',gamma='auto'),'SVM',X,y)
# Tuning the SVM
# 1. Tune gamma by scanning a range of values
score=[]
gamma_range=np.logspace(-10,1,50)
for i in gamma_range:
    clf=SVC(kernel='rbf',gamma=i,cache_size=5000).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),gamma_range[score.index(max(score))])
plt.plot(gamma_range,score)
plt.show()
# Narrow the range and scan again
score=[]
gamma_range=np.linspace(0.5,1,30)
for i in gamma_range:
    clf=SVC(kernel='rbf',gamma=i,cache_size=5000).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),gamma_range[score.index(max(score))])
plt.plot(gamma_range,score)
plt.show()
# Tune the penalty coefficient C
score=[]
C_range=np.linspace(0.01,30,50)
for i in C_range:
    clf=SVC(kernel='rbf',C=i,gamma=0.9310344827586207).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
# Narrow the range
score=[]
C_range=np.linspace(5,8,20)
for i in C_range:
    clf=SVC(kernel='rbf',C=i,gamma=0.9310344827586207).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),C_range[score.index(max(score))])
plt.plot(C_range,score)
plt.show()
# Learning curve with the best gamma and C found above
s=plot_learning_curve(SVC(kernel='rbf',C=6.894736842105263,gamma=0.9310344827586207),'SVM',X,y)
# The SVC performs reasonably well; next, try other models
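The scans above pick gamma first and C second, and they score every candidate on Xtest, so the hold-out set ends up doing double duty as a validation set. One possible alternative, sketched here on synthetic data, is to search both parameters jointly with cross-validation inside the training portion and touch the hold-out set only once at the end. The grid values are arbitrary examples, not the ranges used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the seven engineered features
X_toy, y_toy = make_classification(n_samples=300, n_features=7, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Every combination of C and gamma is scored by 5-fold CV on the training part only
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# The hold-out set is used once, for the final estimate
holdout_acc = search.score(X_te, y_te)
print(search.best_params_, search.best_score_, holdout_acc)
```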
# 3. Random forest
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
r=plot_learning_curve(RandomForestClassifier(n_estimators=100),'RFC',X,y)
# Tune n_estimators by scanning a list of values
score=[]
paras=[50,80,100,150,200,300,500]
for i in paras:
    clf=RandomForestClassifier(n_estimators=i).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),paras[score.index(max(score))])
plt.plot(paras,score)
plt.show()
# Tune max_depth
score=[]
paras=[15,18,20,25,28]
for i in paras:
    clf=RandomForestClassifier(n_estimators=300,max_depth=i).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),paras[score.index(max(score))])
plt.plot(paras,score)
plt.show()
# Tune max_features
score=[]
paras=[1,2,3,4,5,6,7]
for i in paras:
    clf=RandomForestClassifier(n_estimators=300,max_depth=18,max_features=i).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),paras[score.index(max(score))])
plt.plot(paras,score)
plt.show()
# Learning curve with the best parameters
r=plot_learning_curve(RandomForestClassifier(n_estimators=300,max_depth=18,max_features=4),'RFC',X,y)
# min_samples_split and min_samples_leaf can be tuned with a grid search, but the tuned model scored lower, so the defaults were kept without finer tuning. There may still be room to improve, but the defaults already perform well
from sklearn.model_selection import GridSearchCV
rfc=RandomForestClassifier(n_estimators=300,max_depth=18,max_features=4)
para_grid={'min_samples_split':[2,5,10],'min_samples_leaf':[1,5,10]}
gs=GridSearchCV(rfc,param_grid=para_grid,cv=3,scoring='accuracy')
gs.fit(X,y)
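gs.best_estimator_ # the best model found by the grid search, with its parameter values
gs.best_score_ # the score of the best parameter combination
gs_best=gs.best_estimator_ # keep the best model found by the grid search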
# 4. AdaBoost
a=plot_learning_curve(AdaBoostClassifier(n_estimators=100),'ADB',X,y)
adb=AdaBoostClassifier()
# After briefly tuning two parameters, the best setting found was:
a=plot_learning_curve(AdaBoostClassifier(n_estimators=600,learning_rate=0.8),'ADB',X,y)
# Tuning n_estimators: this should really be tuned together with learning_rate
# Because a grid search is slow, another option is to fix learning_rate and keep scanning n_estimators
# In theory, a lower learning_rate calls for many more base estimators
score=[]
paras=[700,800,900,1000,1200,1500]
for i in paras:
    clf=AdaBoostClassifier(n_estimators=i,learning_rate=0.6).fit(Xtrain,ytrain)
    score.append(clf.score(Xtest,ytest))
print(max(score),paras[score.index(max(score))])
plt.plot(paras,score)
plt.show()
# AdaBoost also exposes parameters of its base classifier, which are not explored here. Overall it does somewhat worse than the random forest above
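Refitting AdaBoost once per candidate n_estimators repeats a lot of work, because a model with 1500 estimators already contains the model with 700 as its first 700 boosting rounds. scikit-learn exposes this through staged_score, which yields the score after each round of a single fit. The sketch below uses synthetic data and a smaller ensemble to stay fast; here X_te plays the role of a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=400, n_features=7, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Fit once with the largest ensemble size of interest
clf = AdaBoostClassifier(n_estimators=200, learning_rate=0.6, random_state=0).fit(X_tr, y_tr)

# staged[k] is the accuracy when only the first k + 1 estimators are used
staged = np.array(list(clf.staged_score(X_te, y_te)))
best_n = int(staged.argmax()) + 1
print(best_n, staged.max())
```

Repeating this for a few learning_rate values gives the joint scan the comments above call for, at one fit per learning rate.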
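# 5. XGBoost
# Possibly because the dataset is not large, XGBoost does not perform as well as expected, at least relative to the random forest built above.
# It is still decent, but the first tuning step already settles on very few base estimators, and the result is not as strong as the random forest.
# No further tuning is done here; the random forest's training score of 0.927489 is a satisfactory result.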
import xgboost as xgb
cv_params = {'n_estimators': [20,50,80,100,150]}
other_params = {'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}
model=xgb.XGBClassifier(**other_params)
xg=GridSearchCV(estimator=model,param_grid=cv_params,cv=5,scoring='accuracy')
xg.fit(X,y)
xg.best_estimator_
xg.best_score_
Submitting the results
Use the models built above to predict on the test set, and arrange the predictions in the submission format.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the datasets
train=pd.read_csv(r'E:\python\all\third\df_train.csv')
test=pd.read_csv(r'E:\python\all\third\df_test.csv')
# Build the modelling dataset
ocolnames=['invited','user_reco','evt_p_reco','evt_c_reco','user_pop','frnd_infl','evt_pop']
X=train[ocolnames]
y=train['interested']
# Import cross-validation
from sklearn.model_selection import cross_val_score
# SGD classifier
from sklearn.linear_model import SGDClassifier
# loss='log' means logistic regression; scikit-learn 1.1 and later name it 'log_loss'
clf=SGDClassifier(loss='log',penalty='l2')
score=np.mean(cross_val_score(clf,X,y,cv=10,scoring='accuracy'))
# SVC classifier
from sklearn.svm import SVC
svc=SVC(kernel='rbf',C=6.894736842105263,gamma=0.9310344827586207)
score_s=np.mean(cross_val_score(svc,X,y,cv=10,scoring='accuracy'))
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier(n_estimators=300,max_depth=18,max_features=4)
score_r=np.mean(cross_val_score(rfc,X,y,cv=10,scoring='accuracy'))
# Load the test set and keep the columns needed for the result
df_test=pd.read_csv(r'E:\python\all\third\df_test.csv')
result=df_test[['user','event']]
test=df_test[ocolnames]
# Build one result table for each of the three classifiers
result_c=result.copy()
result_s=result.copy()
result_r=result.copy()
# SGD classifier (events have to be ranked per user, so a score column is needed)
clf=clf.fit(X,y)
result_c['outcome']=clf.predict(test)
result_c['dist']=clf.decision_function(test)
# SVC classifier
svc=svc.fit(X,y)
result_s['outcome']=svc.predict(test)
result_s['dist']=svc.decision_function(test)
# Random forest classifier
rfc=rfc.fit(X,y)
result_r['outcome']=rfc.predict(test)
result_r['dist']=rfc.predict_proba(test)[:,1]
# Export the SVC results in the submission format (they worked somewhat better)
result_user=[]
for user in result_s['user']:
    if user not in result_user:
        result_user.append(user)
# The submission needs just two columns, user and events; events are sorted by score and separated by spaces
list_user=[]
list_events=[]
for user in result_user:
    list_user.append(user)
    data=result_s[result_s['user']==user].copy()
    data=data.sort_values(by='dist',ascending=False)
    list_e=[]
    for event in data['event']:
        list_e.append(str(event))
    list_events.append(' '.join(list_e))
final_result=pd.DataFrame({'user':list_user,'events':list_events})
final_result=final_result[['user','events']]
# index=False keeps the row index out of the submission file
final_result.to_csv(r'E:\python\all\third\result',index=False)
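The same per-user ranking can be expressed with a pandas groupby instead of nested loops. The sketch below runs on a made-up score table with the same three columns as result_s. Unlike the loop above, it returns users sorted by ID rather than in order of first appearance.

```python
import pandas as pd

# Hypothetical scored test rows: one row per (user, event) pair
scored = pd.DataFrame({
    "user": [1, 1, 2, 2, 2],
    "event": [10, 11, 20, 21, 22],
    "dist": [0.2, 0.9, -1.0, 0.5, 0.1],
})

submission = (
    scored.sort_values(["user", "dist"], ascending=[True, False])  # best score first per user
    .groupby("user", sort=False)["event"]
    .apply(lambda events: " ".join(map(str, events)))               # space-separated event IDs
    .reset_index(name="events")
)
print(submission)
```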