kkbox-music-recommendation-challenge: Study Notes and Summary
This is a top solution from a Kaggle competition, shared in July via the Kaggle竞赛宝典 WeChat public account. I had planned to reproduce it and study it again right away, but various things got in the way; having just finished working through it, I am now organizing and summarizing what I learned.
1. Data Description
The solution is built on the LightGBM framework with fairly heavy feature engineering, which is exactly what I wanted to practice, so I reproduced and analyzed it. Enough preamble; let's start with the data:
This is a music recommendation task. The main files are as follows:
train.csv
Column | Description |
---|---|
msno | user id |
song_id | song id |
source_system_tab | the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search. |
source_screen_name | name of the layout a user sees. |
source_type | an entry point a user first plays music on mobile apps. An entry point could be album , online-playlist , song … etc. |
target | the target variable: target=1 means there are recurring listening event(s) triggered within a month after the user's very first observable listening event; target=0 otherwise. |
test.csv
Column | Description |
---|---|
id | row id (will be used for submission) |
msno | user id |
song_id | song id |
source_system_tab | the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search. |
source_screen_name | name of the layout a user sees. |
source_type | an entry point a user first plays music on mobile apps. An entry point could be album , online-playlist , song … etc. |
songs.csv
Column | Description |
---|---|
song_id | song id |
song_length | song length (in ms) |
genre_ids | genre category. Some songs have multiple genres, separated by a vertical bar (\|). |
artist_name | artist name |
composer | composer |
lyricist | lyricist |
language | language |
members.csv
Column | Description |
---|---|
msno | user id |
city | city |
bd | age. Note: this column has outlier values, please use your judgement. |
gender | gender |
registered_via | registration method |
registration_init_time | format %Y%m%d |
expiration_date | format %Y%m%d |
song_extra_info.csv
Column | Description |
---|---|
song_id | song_id |
name | the name of the song. |
isrc | International Standard Recording Code, which in theory can serve as a song identifier. However, it is worth noting that ISRCs obtained from providers have not been officially verified, so the information encoded in an ISRC, such as the country code and reference year, can be misleading or incorrect. Multiple songs can also share one ISRC, since a single recording may be re-published several times. |
As I understand it, the feature engineering here mainly consists of numerical groupby features keyed on uid and song_id, plus SVD latent-factor features of several interaction matrices, sliding-window features over time, and cumulative count features over time. The following sections walk through the feature engineering.
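To make that pattern concrete before diving in, here is a minimal, self-contained sketch (toy data, hypothetical values) of the kind of groupby count feature this solution builds at scale:
import pandas as pd

# Toy interaction log: each row is one (user, song) listening event.
log = pd.DataFrame({'msno': [0, 0, 1, 1, 1], 'song_id': [10, 11, 10, 10, 12]})
# Count features: how many records each user / each song has.
log['msno_rec_cnt'] = log.groupby('msno')['song_id'].transform('count')
log['song_rec_cnt'] = log.groupby('song_id')['msno'].transform('count')
print(log)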
2. Basic Features
The basic feature part, as I understand it, first normalizes the raw fields: label-encoding the various id columns, splitting the multi-genre field, merging the datasets, and so on.
Enough talk; here is the code:
# imports (unused imports from the original dropped; sparse/svds added since they are used below)
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import scipy.sparse as sparse
from scipy.sparse.linalg import svds
%matplotlib inline
# read the data
date_columns = ['expiration_date', 'registration_init_time']
path = './data/'
train_data = pd.read_csv(path + 'train.csv')
test_data = pd.read_csv(path + 'test.csv')
songs_data = pd.read_csv(path + 'songs.csv')
song_extra_info_data = pd.read_csv(path + 'song_extra_info.csv')
members_data = pd.read_csv(path + 'members.csv', parse_dates=date_columns)
Mark whether each song appears in train or test (1 if it does), and keep only songs and members that actually appear:
songs_appear = set(train_data['song_id'].append(test_data['song_id']))  # songs seen in train/test
# keep only songs that appear
songs_data['appeared'] = songs_data['song_id'].apply(lambda x: 1 if x in songs_appear else 0)
songs_data = songs_data[songs_data.appeared == 1]
songs_data.drop('appeared', axis=1, inplace=True)
# same filter for the extra-info table
song_extra_info_data['appeared'] = song_extra_info_data['song_id'].apply(lambda x: 1 if x in songs_appear else 0)
song_extra_info_data = song_extra_info_data[song_extra_info_data.appeared == 1]
song_extra_info_data.drop('appeared', axis=1, inplace=True)
# drop members whose msno never appears in train or test
msno_appear = set(train_data['msno'].append(test_data['msno']))
members_data['appeared'] = members_data['msno'].apply(lambda x: 1 if x in msno_appear else 0)
members_data = members_data[members_data.appeared == 1]
members_data.drop('appeared', axis=1, inplace=True)
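As an aside, the apply-with-set pattern above works, but pandas' built-in Series.isin does the same membership test vectorized; an equivalent sketch:
# Equivalent, more idiomatic filtering with Series.isin (same result as above).
songs_data = songs_data[songs_data['song_id'].isin(songs_appear)]
members_data = members_data[members_data['msno'].isin(msno_appear)]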
Label-encode msno and song_id into consecutive integers:
# encode msno
msno_encoder = LabelEncoder()
msno_encoder.fit(members_data['msno'].values)
members_data['msno'] = msno_encoder.transform(members_data['msno'])
train_data['msno'] = msno_encoder.transform(train_data['msno'])
test_data['msno'] = msno_encoder.transform(test_data['msno'])
# encode song_id
songid_encoder = LabelEncoder()
songid_encoder.fit(train_data['song_id'].append(test_data['song_id']))
train_data['song_id'] = songid_encoder.transform(train_data['song_id'])
songs_data['song_id'] = songid_encoder.transform(songs_data['song_id'])
test_data['song_id'] = songid_encoder.transform(test_data['song_id'])
song_extra_info_data['song_id'] = songid_encoder.transform(song_extra_info_data['song_id'])
Encode the remaining categorical features, i.e. the source fields describing where the event was triggered:
# encode source_system_tab, source_screen_name, source_type
columns = ['source_system_tab', 'source_screen_name', 'source_type']
for col in columns:
    print(col)
    column_encoder = LabelEncoder()
    if train_data[col].dtypes == 'O':
        column_encoder.fit(train_data[col].fillna('nan').append(test_data[col].fillna('nan')))
        train_data[col] = column_encoder.transform(train_data[col].fillna('nan'))
        test_data[col] = column_encoder.transform(test_data[col].fillna('nan'))
    else:
        column_encoder.fit(train_data[col].fillna(-1).append(test_data[col].fillna(-1)))
        train_data[col] = column_encoder.transform(train_data[col].fillna(-1))
        test_data[col] = column_encoder.transform(test_data[col].fillna(-1))
# encode city, gender, registered_via
columns = ['city', 'gender', 'registered_via']
for col in columns:
    print(col)
    column_encoder = LabelEncoder()
    if members_data[col].dtypes == 'O':
        column_encoder.fit(members_data[col].fillna('nan'))
        members_data[col] = column_encoder.transform(members_data[col].fillna('nan'))
    else:
        column_encoder.fit(members_data[col].fillna(-1))
        members_data[col] = column_encoder.transform(members_data[col].fillna(-1))
Split the genre feature (multiple genres separated by '|') into four columns: up to three genre ids plus the genre count:
# genre_ids features
def get_genreids_split(df):
    genreids_split = np.zeros((len(df), 4))
    for i in range(len(df)):
        if df[i] == 'nan':
            continue
        num_genre = str(df[i]).count('|')
        splits = str(df[i]).split('|')
        if num_genre + 1 > 2:
            genreids_split[i, 0] = int(splits[0])
            genreids_split[i, 1] = int(splits[1])
            genreids_split[i, 2] = int(splits[2])
        elif num_genre + 1 > 1:
            genreids_split[i, 0] = int(splits[0])
            genreids_split[i, 1] = int(splits[1])
        elif num_genre + 1 == 1:
            genreids_split[i, 0] = int(splits[0])
        genreids_split[i, 3] = num_genre + 1  # number of genres
    return genreids_split
genreids_split = get_genreids_split(songs_data['genre_ids'].fillna('nan').values)
songs_data['first_genre_id'] = genreids_split[:, 0]
songs_data['second_genre_id'] = genreids_split[:, 1]
songs_data['third_genre_id'] = genreids_split[:, 2]
songs_data['fourth_genre_id'] = genreids_split[:, 3]
genre_encoder = LabelEncoder()
genre_encoder.fit(songs_data['first_genre_id'].append(songs_data['second_genre_id']).append(songs_data['third_genre_id']))
songs_data['first_genre_id'] = genre_encoder.transform(songs_data['first_genre_id'])
songs_data['second_genre_id'] = genre_encoder.transform(songs_data['second_genre_id'])
songs_data['third_genre_id'] = genre_encoder.transform(songs_data['third_genre_id'])
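A quick illustration of the splitter's behavior on toy inputs (the fourth slot holds the genre count):
# Illustrative only: two genres, one genre, and a missing value.
demo = get_genreids_split(np.array(['465|958', '2022', 'nan']))
print(demo)
# expected rows: [465, 958, 0, 2], [2022, 0, 0, 1], [0, 0, 0, 0]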
For the artist column, count the artists and take the first one:
# artist_name features
def artist_count(x):
    return x.count('and') + x.count(',') + x.count(' feat') + x.count('&') + 1
def get_first_artist(x):
    if x.count('and') > 0:
        x = x.split('and')[0]
    if x.count(',') > 0:
        x = x.split(',')[0]
    if x.count(' feat') > 0:
        x = x.split(' feat')[0]
    if x.count('&') > 0:
        x = x.split('&')[0]
    return x.strip()
songs_data['artist_cnt'] = songs_data['artist_name'].apply(artist_count).astype(np.int8)
songs_data['first_artist_name'] = songs_data['artist_name'].apply(get_first_artist)
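For example (illustrative input only): note that count('and') matches the substring anywhere, so a name that merely contains 'and' would also be split; this is a rough heuristic rather than exact parsing.
print(get_first_artist('G.E.M. & Khalil Fong'))  # -> 'G.E.M.'
print(artist_count('G.E.M. & Khalil Fong'))      # -> 2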
Count features for lyricist & composer, plus the first name in each field:
# lyricist & composer features
def lyricist_or_composer_count(x):
    try:
        return x.count('and') + x.count('/') + x.count('|') + x.count('\\') + x.count(';') + x.count('&') + 1
    except Exception:  # NaN (a float) has no .count
        return 0
def get_first_lyricist_or_composer(x):
    try:
        if x.count('and') > 0:
            x = x.split('and')[0]
        if x.count(',') > 0:
            x = x.split(',')[0]
        if x.count(' feat') > 0:
            x = x.split(' feat')[0]
        if x.count('&') > 0:
            x = x.split('&')[0]
        if x.count('|') > 0:
            x = x.split('|')[0]
        if x.count('/') > 0:
            x = x.split('/')[0]
        if x.count('\\') > 0:
            x = x.split('\\')[0]
        if x.count(';') > 0:
            x = x.split(';')[0]
        return x.strip()
    except Exception:  # NaN passes through unchanged
        return x
songs_data['lyricist_cnt'] = songs_data['lyricist'].apply(lyricist_or_composer_count).astype(np.int8)
songs_data['composer_cnt'] = songs_data['composer'].apply(lyricist_or_composer_count).astype(np.int8)
songs_data['first_lyricist_name'] = songs_data['lyricist'].apply(get_first_lyricist_or_composer)
songs_data['first_composer_name'] = songs_data['composer'].apply(get_first_lyricist_or_composer)
Label-encode these first-name features:
columns = ['first_artist_name', 'first_lyricist_name', 'first_composer_name']
for col in columns:
    print(col)
    encoder = LabelEncoder()
    encoder.fit(songs_data[col].fillna('nan'))
    songs_data[col] = encoder.transform(songs_data[col].fillna('nan'))
Handle special values and outliers:
# is_featured: whether the artist field contains a ' feat' credit
songs_data['is_featured'] = songs_data['artist_name'].apply(lambda x: 1 if ' feat' in str(x) else 0).astype(np.int8)
# language
songs_data['language'] = songs_data['language'].fillna(-1)
songs_data.drop(['genre_ids', 'artist_name', 'lyricist', 'composer'], axis=1, inplace=True)
# outlier handling: bd (age) has bogus values, mask anything outside [0, 80) as NaN
members_data['bd'] = members_data['bd'].apply(lambda x: np.nan if x < 0 or x >= 80 else x)
Groupby features with song_id as the parent key:
# zero-order features keyed on song_id: song counts per category value
song = pd.DataFrame({'song_id': range(max(train_data.song_id.max(), test_data.song_id.max()) + 1)})
song = song.merge(songs_data, on='song_id', how='left')
song = song.merge(song_extra_info_data, on='song_id', how='left')
# number of songs under each song-level category
song_columns = ['language', 'first_genre_id', 'second_genre_id', 'third_genre_id', 'first_artist_name', 'first_lyricist_name', 'first_composer_name']
for col in song_columns:
    print(col)
    col_song_cnt = song.groupby(by=col)['song_id'].count().to_dict()
    song[col + '_song_cnt'] = song[col].apply(lambda x: col_song_cnt[x] if not np.isnan(x) else np.nan)
# zero-order features keyed on song_id: listening-record counts
data = train_data[['msno', 'song_id']].append(test_data[['msno', 'song_id']])
msno_rec_cnt = data.groupby(by='msno')['song_id'].count().to_dict()
members_data['msno_rec_cnt'] = members_data['msno'].apply(lambda x: msno_rec_cnt[x])
data = data.merge(song, on='song_id', how='left')
song_columns = ['song_id', 'language', 'first_genre_id', 'second_genre_id', 'third_genre_id', 'first_artist_name', 'first_lyricist_name', 'first_composer_name']
# number of listening records under each col
for col in song_columns:
    print(col)
    col_rec_cnt = data.groupby(by=col)['msno'].count().to_dict()
    song[col + '_rec_cnt'] = song[col].apply(lambda x: col_rec_cnt[x] if not np.isnan(x) else np.nan)
# count (cnt) and ratio (prob) features for source_system_tab, source_screen_name, source_type
cols = ['source_system_tab', 'source_screen_name', 'source_type']
concat = train_data.drop('target', axis=1).append(test_data.drop('id', axis=1))
msno_rec_cnt = data.groupby(by='msno')['song_id'].count().to_dict()
train_data['msno_rec_cnt'] = train_data['msno'].apply(lambda x: msno_rec_cnt[x])
test_data['msno_rec_cnt'] = test_data['msno'].apply(lambda x: msno_rec_cnt[x])
for col in cols:
    print(col)
    tmp = concat.groupby(['msno', col])['song_id'].agg([('msno_' + col + '_cnt', 'count')]).reset_index()  # occurrence count & share
    train_data = train_data.merge(tmp, on=['msno', col], how='left')
    train_data['msno_' + col + '_prob'] = train_data['msno_' + col + '_cnt'] * 1.0 / train_data['msno_rec_cnt']
    test_data = test_data.merge(tmp, on=['msno', col], how='left')
    test_data['msno_' + col + '_prob'] = test_data['msno_' + col + '_cnt'] * 1.0 / test_data['msno_rec_cnt']
train_data.drop('msno_rec_cnt', axis=1, inplace=True)
test_data.drop('msno_rec_cnt', axis=1, inplace=True)
Decode the isrc feature:
# decode isrc per its published layout (country code, registrant code, reference year), then encode
isrc = song['isrc']
song['isrc_missing'] = song['isrc'].isnull() * 1.0  # mark songs with no isrc, before encoding
song['cc'] = isrc.str.slice(0, 2)
song['xxx'] = isrc.str.slice(2, 5)
song['yy'] = isrc.str.slice(5, 7).astype(float)
song['yy'] = song['yy'].apply(lambda x: 2000 + x if x < 18 else 1900 + x)  # two-digit year -> full year
song['cc'] = LabelEncoder().fit_transform(song['cc'].fillna('nan'))
song['xxx'] = LabelEncoder().fit_transform(song['xxx'].fillna('nan'))
# isrc-related count features
columns = ['cc', 'xxx', 'yy']
for col in columns:
    print(col)
    song_ccxxxyy_cnt = song.groupby(by=col)['song_id'].count().to_dict()
    song_ccxxxyy_cnt[0] = None
    song[col + '_song_cnt'] = song[col].apply(lambda x: song_ccxxxyy_cnt[x] if not np.isnan(x) else None)
data = train_data[['msno', 'song_id']].append(test_data[['msno', 'song_id']])
data = data.merge(song, on='song_id', how='left')
columns = ['cc', 'xxx', 'yy']
for col in columns:
    print(col)
    song_ccxxxyy_cnt = data.groupby(by=col)['song_id'].count().to_dict()
    song_ccxxxyy_cnt[0] = None
    song[col + '_rec_cnt'] = song[col].apply(lambda x: song_ccxxxyy_cnt[x] if not np.isnan(x) else None)
song.drop(['name', 'isrc'], axis=1, inplace=True)
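As a concrete illustration of the slicing, using a made-up code string:
code = 'TWA471600055'          # hypothetical ISRC
cc, xxx, yy = code[0:2], code[2:5], int(code[5:7])
year = 2000 + yy if yy < 18 else 1900 + yy
print(cc, xxx, year)           # TW A47 2016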
SVD factorization features of the user x song matrix:
# user-song matrix
n_components = 30
msno = concat['msno'].values
song_id = concat['song_id'].values
# binary implicit-feedback matrix: 1 if the user ever played the song
rating = sparse.coo_matrix((np.ones(len(concat)), (msno, song_id)))
rating = (rating > 0) * 1.0
member_cnt = rating.shape[0]
song_cnt = rating.shape[1]
[u, s, vt] = svds(rating, k=n_components)
s_song = np.diag(s[::-1])
members_topics = pd.DataFrame(u[:, ::-1])
members_topics.columns = ['member_component_%d' % i for i in range(n_components)]
members_topics['msno'] = range(member_cnt)
members_data = members_data.merge(members_topics, on='msno', how='left')
song_topics = pd.DataFrame(vt.transpose()[:, ::-1])
song_topics.columns = ['song_component_%d' % i for i in range(n_components)]
song_topics['song_id'] = range(song_cnt)
song = song.merge(song_topics, on='song_id', how='left')
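The intuition: rating ≈ u · diag(s) · vᵀ, so the dot product of a member row with a singular-value-scaled song row approximates the user-song affinity. A toy check (illustrative only):
# Toy check that svds recovers a low-rank matrix: affinity ~ u[i] @ diag(s) @ vt[:, j]
toy = sparse.csr_matrix(np.array([[1., 1., 0.], [1., 0., 0.], [0., 1., 1.]]))
tu, ts, tvt = svds(toy, k=2)
approx = tu @ np.diag(ts) @ tvt
print(np.round(approx, 2))  # close to the original 0/1 pattern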
SVD factorization features of the user x artist matrix:
# user-artist matrix
n_components = 20
concat = concat.merge(song[['song_id', 'first_artist_name']])
concat = concat[concat['first_artist_name'] >= 0]
msno = concat['msno'].values
artist = concat['first_artist_name'].values
rating_tmp = sparse.coo_matrix((np.ones(len(concat)), (msno, artist.astype('int'))))
# dampened play counts plus a binary term
rating_2 = rating_tmp.log1p() * 0.3 + (rating_tmp > 0) * 1.0
artist_cnt = rating_2.shape[1]
[u, s, vt] = svds(rating_2, k=n_components)
print(s[::-1])
s_artist = np.diag(s[::-1])
members_topics = pd.DataFrame(u[:, ::-1])
members_topics.columns = ['member_artist_component_%d' % i for i in range(n_components)]
members_topics['msno'] = range(member_cnt)
members_data = members_data.merge(members_topics, on='msno', how='left')
artist_topics = pd.DataFrame(vt.transpose()[:, ::-1])
artist_topics.columns = ['artist_component_%d' % i for i in range(n_components)]
artist_topics['first_artist_name'] = range(artist_cnt)
song = song.merge(artist_topics, on='first_artist_name', how='left')
# add dot-product features between member and song/artist embeddings
members_data = members_data.sort_values('msno')
song = song.sort_values('song_id')
mem_cols = ['member_component_%d' % i for i in range(30)]
song_cols = ['song_component_%d' % i for i in range(30)]
member_embeddings = members_data[mem_cols].values
song_embeddings = song[song_cols].values
member_cols = ['member_artist_component_%d' % i for i in range(20)]
song_cols = ['artist_component_%d' % i for i in range(20)]
member_artist_embeddings = members_data[member_cols].values
song_artist_embeddings = song[song_cols].values
train_dot = np.zeros((len(train_data), 2))
test_dot = np.zeros((len(test_data), 2))
for i in range(len(train_data)):
    if i % 10000 == 0:
        print(i / train_data.shape[0])
    msno_idx = train_data['msno'].values[i]
    song_idx = train_data['song_id'].values[i]
    train_dot[i, 0] = np.dot(member_embeddings[msno_idx], np.dot(s_song, song_embeddings[song_idx]))
    train_dot[i, 1] = np.dot(member_artist_embeddings[msno_idx], np.dot(s_artist, song_artist_embeddings[song_idx]))
for i in range(len(test_data)):
    if i % 10000 == 0:
        print(i / test_data.shape[0])
    msno_idx = test_data['msno'].values[i]
    song_idx = test_data['song_id'].values[i]
    test_dot[i, 0] = np.dot(member_embeddings[msno_idx], np.dot(s_song, song_embeddings[song_idx]))
    test_dot[i, 1] = np.dot(member_artist_embeddings[msno_idx], np.dot(s_artist, song_artist_embeddings[song_idx]))
train_data['song_embeddings_dot'] = train_dot[:, 0]
train_data['artist_embeddings_dot'] = train_dot[:, 1]
test_data['song_embeddings_dot'] = test_dot[:, 0]
test_data['artist_embeddings_dot'] = test_dot[:, 1]
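The row-wise loops above are slow in pure Python; a vectorized sketch with np.einsum, reusing the same arrays, computes identical dot products in one call:
# Same result as the train loop above, fully vectorized.
m = member_embeddings[train_data['msno'].values]      # (n_rows, 30)
v = song_embeddings[train_data['song_id'].values]     # (n_rows, 30)
train_data['song_embeddings_dot'] = np.einsum('ij,jk,ik->i', m, s_song, v)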
3. Advanced Features
Sliding-window features over several window sizes: for each record, count how many times the same msno / song_id occurs within the window before it and the window after it. This part takes a long time to run.
from collections import defaultdict
## continuous index: row order is used as a pseudo-timestamp
concat = train_data[['msno', 'song_id']].append(test_data[['msno', 'song_id']])
concat['timestamp'] = range(len(concat))
## window-based counts
window_sizes = [10, 50, 100, 500, 5000, 10000]
msno_list = concat['msno'].values
song_list = concat['song_id'].values
def get_window_cnt(values, idx, window_size):
    # occurrences of values[idx] in the window before idx and in the window after (including idx)
    lower = max(0, idx - window_size)
    upper = min(len(values), idx + window_size)
    return (values[lower:idx] == values[idx]).sum(), (values[idx:upper] == values[idx]).sum()
for window_size in window_sizes:
    msno_before_cnt = np.zeros(len(concat))
    song_before_cnt = np.zeros(len(concat))
    msno_after_cnt = np.zeros(len(concat))
    song_after_cnt = np.zeros(len(concat))
    for i in range(len(concat)):
        msno_before_cnt[i], msno_after_cnt[i] = get_window_cnt(msno_list, i, window_size)
        song_before_cnt[i], song_after_cnt[i] = get_window_cnt(song_list, i, window_size)
    concat['msno_%d_before_cnt' % window_size] = msno_before_cnt
    concat['song_%d_before_cnt' % window_size] = song_before_cnt
    concat['msno_%d_after_cnt' % window_size] = msno_after_cnt
    concat['song_%d_after_cnt' % window_size] = song_after_cnt
    print('Window size for %d done.' % window_size)
Cumulative count features in time order (how many times this msno / song_id has appeared so far):
## till-now counts
msno_dict = defaultdict(lambda: 0)
song_dict = defaultdict(lambda: 0)
msno_till_now_cnt = np.zeros(len(concat))
song_till_now_cnt = np.zeros(len(concat))
for i in range(len(concat)):
    msno_till_now_cnt[i] = msno_dict[msno_list[i]]
    msno_dict[msno_list[i]] += 1
    song_till_now_cnt[i] = song_dict[song_list[i]]
    song_dict[song_list[i]] += 1
concat['msno_till_now_cnt'] = msno_till_now_cnt
concat['song_till_now_cnt'] = song_till_now_cnt
print('Till-now count done.')
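The same till-now counts can be had in one vectorized line each via pandas' groupby cumcount (equivalent result, much faster):
# Equivalent to the loop above: position of each row within its msno / song_id group.
concat['msno_till_now_cnt'] = concat.groupby('msno').cumcount().values
concat['song_till_now_cnt'] = concat.groupby('song_id').cumcount().values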
## variance of the pseudo-timestamps per user / per song
def timestamp_map(x):
    # linearly map the row index onto real unix timestamps: indices below
    # 7377418 (the train rows) span the train period, the rest span the test period
    if x < 7377418:
        x = (x - 0.0) / (7377417.0 - 0.0) * (1484236800.0 - 1471190400.0) + 1471190400.0
    else:
        x = (x - 7377417.0) / (9934207.0 - 7377417.0) * (1488211200.0 - 1484236800.0) + 1484236800.0
    return x
concat['timestamp'] = concat['timestamp'].apply(timestamp_map)
msno_mean = concat.groupby(by='msno').mean()['timestamp'].to_dict()
members_data['msno_timestamp_mean'] = members_data['msno'].apply(lambda x: msno_mean[x])
msno_std = concat.groupby(by='msno').std()['timestamp'].to_dict()
members_data['msno_timestamp_std'] = members_data['msno'].apply(lambda x: msno_std[x])
song_mean = concat.groupby(by='song_id').mean()['timestamp'].to_dict()
song['song_timestamp_mean'] = song['song_id'].apply(lambda x: song_mean[x])
song_std = concat.groupby(by='song_id').std()['timestamp'].to_dict()
song['song_timestamp_std'] = song['song_id'].apply(lambda x: song_std[x])
print('Variance done.')
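A quick endpoint check of the mapping (illustrative; the printed timestamps are UTC):
print(pd.to_datetime(timestamp_map(0), unit='s'))        # 2016-08-14 16:00 UTC
print(pd.to_datetime(timestamp_map(9934207), unit='s'))  # 2017-02-27 16:00 UTC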
Write the sliding-window features back into train/test:
## write back to train/test
features = ['msno_till_now_cnt', 'song_till_now_cnt']
for window_size in window_sizes:
    features += ['msno_%d_before_cnt' % window_size, 'song_%d_before_cnt' % window_size,
                 'msno_%d_after_cnt' % window_size, 'song_%d_after_cnt' % window_size]
features += ['timestamp']
data = concat[features].values
for i in range(len(features)):
    train_data[features[i]] = data[:len(train_data), i]
    test_data[features[i]] = data[len(train_data):, i]
train_data['timestamp'] = concat.iloc[:len(train_data)]['timestamp'].values
test_data['timestamp'] = concat.iloc[len(train_data):]['timestamp'].values
## rebuild the combined frame, now with the mapped timestamps
concat = train_data[['msno', 'song_id', 'source_type', 'source_screen_name', 'timestamp']].append(
    test_data[['msno', 'song_id', 'source_type', 'source_screen_name', 'timestamp']])
before_data = np.zeros((len(concat), 4))
Lag (difference) features: the previous and next record of the same user:
tmp = concat.groupby('msno').shift(1)  # previous record of the same msno
before_data[:, 0] = tmp['song_id'].fillna(-1).values
before_data[:, 1] = tmp['source_screen_name'].fillna(-1).values
before_data[:, 2] = tmp['source_type'].fillna(-1).values
before_data[:, 3] = tmp['timestamp'].fillna(-1).values
after_data = np.zeros((len(concat), 4))
tmp = concat.groupby('msno').shift(-1)  # next record of the same msno
after_data[:, 0] = tmp['song_id'].fillna(-1).values
after_data[:, 1] = tmp['source_screen_name'].fillna(-1).values
after_data[:, 2] = tmp['source_type'].fillna(-1).values
after_data[:, 3] = tmp['timestamp'].fillna(-1).values
print('data before done.')
## write back to train/test
idx = 0
for i in ['song_id', 'source_screen_name', 'source_type', 'timestamp']:
    train_data['before_' + i] = before_data[:len(train_data), idx]
    train_data['after_' + i] = after_data[:len(train_data), idx]
    test_data['before_' + i] = before_data[len(train_data):, idx]
    test_data['after_' + i] = after_data[len(train_data):, idx]
    idx += 1
for i in ['song_id', 'source_type', 'source_screen_name']:
    train_data['before_' + i] = train_data['before_' + i].astype(int)
    test_data['before_' + i] = test_data['before_' + i].astype(int)
    train_data['after_' + i] = train_data['after_' + i].astype(int)
    test_data['after_' + i] = test_data['after_' + i].astype(int)
# turn raw timestamps into gaps: time since the previous record / until the next one
train_data['before_timestamp'] = train_data['timestamp'] - train_data['before_timestamp']
test_data['before_timestamp'] = test_data['timestamp'] - test_data['before_timestamp']
train_data['after_timestamp'] = train_data['after_timestamp'] - train_data['timestamp']
test_data['after_timestamp'] = test_data['after_timestamp'] - test_data['timestamp']
before_data_2 = np.zeros((len(concat), 4))
tmp = concat.groupby('msno').shift(2)  # record two steps before, same msno
before_data_2[:, 0] = tmp['song_id'].fillna(-1).values
before_data_2[:, 1] = tmp['source_screen_name'].fillna(-1).values
before_data_2[:, 2] = tmp['source_type'].fillna(-1).values
before_data_2[:, 3] = tmp['timestamp'].fillna(-1).values
after_data_2 = np.zeros((len(concat), 4))
tmp = concat.groupby('msno').shift(-2)  # record two steps after, same msno
after_data_2[:, 0] = tmp['song_id'].fillna(-1).values
after_data_2[:, 1] = tmp['source_screen_name'].fillna(-1).values
after_data_2[:, 2] = tmp['source_type'].fillna(-1).values
after_data_2[:, 3] = tmp['timestamp'].fillna(-1).values
print('data before (2 steps) done.')
## write back to train/test
idx = 0
for i in ['song_id', 'source_screen_name', 'source_type', 'timestamp']:
    train_data['before_2_' + i] = before_data_2[:len(train_data), idx]
    train_data['after_2_' + i] = after_data_2[:len(train_data), idx]
    test_data['before_2_' + i] = before_data_2[len(train_data):, idx]
    test_data['after_2_' + i] = after_data_2[len(train_data):, idx]
    idx += 1
for i in ['song_id', 'source_type', 'source_screen_name']:
    train_data['before_2_' + i] = train_data['before_2_' + i].astype(int)
    test_data['before_2_' + i] = test_data['before_2_' + i].astype(int)
    train_data['after_2_' + i] = train_data['after_2_' + i].astype(int)
    test_data['after_2_' + i] = test_data['after_2_' + i].astype(int)
# gaps to the records two steps away
train_data['before_2_timestamp'] = train_data['timestamp'] - train_data['before_2_timestamp']
test_data['before_2_timestamp'] = test_data['timestamp'] - test_data['before_2_timestamp']
train_data['after_2_timestamp'] = train_data['after_2_timestamp'] - train_data['timestamp']
test_data['after_2_timestamp'] = test_data['after_2_timestamp'] - test_data['timestamp']
concat = train_data[['msno', 'song_id']].append(test_data[['msno', 'song_id']])
member_cnt = max(train_data['msno'].max(), test_data['msno'].max()) + 1
song_cnt = max(train_data['song_id'].max(), test_data['song_id'].max()) + 1
genre_id_cnt = int(song['first_genre_id'].max() + 1)
SVD features of uid against other fields:
# user x first_genre_id matrix
concat = concat.merge(song[['song_id', 'first_genre_id']])
n_components = 30
msno = concat['msno'].values
genre_id = concat['first_genre_id'].fillna(0).values
rating = sparse.coo_matrix((np.ones(len(concat)), (msno, genre_id.astype('int'))))
[u, s, vt] = svds(rating, k=n_components)
s_first_genre = np.diag(s[::-1])
members_topics = pd.DataFrame(u[:, ::-1])
members_topics.columns = ['member_first_gen_component_%d' % i for i in range(n_components)]
members_topics['msno'] = range(member_cnt)
members_data = members_data.merge(members_topics, on='msno', how='left')
first_gen_topics = pd.DataFrame(vt.transpose()[:, ::-1])
first_gen_topics.columns = ['first_gen_component_%d' % i for i in range(n_components)]
first_gen_topics['first_genre_id'] = range(genre_id_cnt)
song = song.merge(first_gen_topics, on='first_genre_id', how='left')
# user x source_type matrix
concat = train_data[['msno', 'song_id', 'source_type']].append(test_data[['msno', 'song_id', 'source_type']])
member_cnt = max(train_data['msno'].max(), test_data['msno'].max()) + 1
song_cnt = max(train_data['song_id'].max(), test_data['song_id'].max()) + 1
source_type_cnt = int(concat['source_type'].max() + 1)
msno = concat['msno'].values
source_type = concat['source_type'].fillna(0).values
rating = sparse.coo_matrix((np.ones(len(concat)), (msno, source_type)))
n_components = 10
[u, s, vt] = svds(rating, k=n_components)
s_source_type = np.diag(s[::-1])
members_topics = pd.DataFrame(u[:, ::-1])
members_topics.columns = ['member_source_type_component_%d' % i for i in range(n_components)]
members_topics['msno'] = range(member_cnt)
members_data = members_data.merge(members_topics, on='msno', how='left')
source_type_topics = pd.DataFrame(vt.transpose()[:, ::-1])
source_type_topics.columns = ['source_type_component_%d' % i for i in range(n_components)]
source_type_topics['source_type'] = range(source_type_cnt)
Combination features across fields, label-encoded:
concat = train_data[['msno', 'song_id', 'source_system_tab', 'source_screen_name', 'source_type']].append(
    test_data[['msno', 'song_id', 'source_system_tab', 'source_screen_name', 'source_type']])
concat = concat.merge(song[['song_id', 'song_length', 'first_artist_name_song_cnt', 'first_genre_id', 'first_genre_id_song_cnt', 'first_lyricist_name', 'first_lyricist_name_song_cnt']], on='song_id', how='left')
# combine the three source fields into a single categorical id
concat['source'] = concat['source_system_tab'] * 10000 + concat['source_screen_name'] * 100 + concat['source_type']
concat['source'] = LabelEncoder().fit_transform(concat['source'].values)
## member-level aggregate features
mem_add = pd.DataFrame({'msno': range(concat['msno'].max() + 1)})
data_avg = concat[['msno', 'song_length', 'first_artist_name_song_cnt', 'first_genre_id_song_cnt', 'first_lyricist_name_song_cnt']].groupby('msno').mean()
data_avg.columns = ['msno_' + i + '_mean' for i in data_avg.columns]
data_avg = data_avg.reset_index()  # bring msno back as a column for the merge
mem_add = mem_add.merge(data_avg, on='msno', how='left')
data_std = concat[['msno', 'song_length', 'first_artist_name_song_cnt', 'first_genre_id_song_cnt', 'first_lyricist_name_song_cnt']].groupby('msno').std()
data_std.columns = ['msno_' + i + '_std' for i in data_std.columns]
data_std = data_std.reset_index()
mem_add = mem_add.merge(data_std, on='msno', how='left')
concat = concat.merge(song[['song_id', 'first_artist_name', 'language']], on='song_id', how='left')
# number of distinct artists each user listened to
artist_msno = concat[['msno', 'first_artist_name']].groupby('msno').apply(lambda x: len(set(x['first_artist_name'].values)))
mem_add['artist_msno_cnt'] = artist_msno
mem_add['artist_msno_cnt'] = np.log1p(mem_add['artist_msno_cnt'])
# per-user language distribution
language_dummy = pd.get_dummies(concat['language'])
language_dummy['msno'] = concat['msno'].values
language_prob = language_dummy.groupby('msno').mean()
language_prob.columns = ['msno_language_%d' % i for i in language_prob.columns]
language_prob = language_prob.reset_index()
mem_add = mem_add.merge(language_prob, on='msno', how='left')
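The get_dummies/groupby-mean combination is essentially a normalized contingency table; pd.crosstab can express it directly (a near-equivalent sketch: crosstab drops records with missing language from the denominator, while the dummy version keeps them):
# Per-user language share in one call (rows: msno, columns: language).
language_prob_alt = pd.crosstab(concat['msno'], concat['language'], normalize='index')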
col = ['first_artist_name', 'first_genre_id', 'language', 'source']
for feat in col:
    # composite key: msno x feature value
    concat['id'] = concat['msno'] * 100000 + concat[feat].fillna(0)
    id_cnt = concat[['msno', 'id']].groupby('id').count().to_dict()['msno']
    concat['msno_' + feat + '_cnt'] = concat['id'].apply(lambda x: id_cnt[x])
msno_cnt = concat[['msno', 'song_id']].groupby('msno').count().to_dict()['song_id']
concat['msno_cnt'] = concat['msno'].apply(lambda x: msno_cnt[x])
for feat in col:
    concat['msno_' + feat + '_prob'] = concat['msno_' + feat + '_cnt'] / concat['msno_cnt']
cols = ['source_system_tab', 'source_screen_name', 'source_type']
for col in cols:
    concat['id'] = concat['song_id'] * 10000 + concat[col].fillna(0)
    id_cnt = concat[['msno', 'id']].groupby('id').count().to_dict()['msno']
    concat['song_' + col + '_cnt'] = concat['id'].apply(lambda x: id_cnt[x])
song_cnt = concat[['msno', 'song_id']].groupby('song_id').count().to_dict()['msno']
concat['song_cnt'] = concat['song_id'].apply(lambda x: song_cnt[x])
for col in cols:
    concat['song_' + col + '_prob'] = concat['song_' + col + '_cnt'] / concat['song_cnt']
# write the new columns back by position; a key merge on (msno, song_id) would
# duplicate rows for user-song pairs that occur more than once
cols_to_merge = [col for col in concat.columns if col not in train_data.columns]
cols_to_merge = [col for col in cols_to_merge if col not in song.columns]
cols_to_merge.remove('id')
cols_to_merge.remove('song_cnt')
for col in cols_to_merge:
    train_data[col] = concat[:len(train_data)][col].values
    test_data[col] = concat[len(train_data):][col].values
from collections import defaultdict
## continuous index again, this time walked from the end
concat = train_data[['msno', 'song_id']].append(test_data[['msno', 'song_id']])
concat['timestamp'] = range(len(concat))
msno_list = concat['msno'].values
song_list = concat['song_id'].values
## till-now counts in reverse order (occurrences from here to the end)
msno_dict = defaultdict(lambda: 0)
song_dict = defaultdict(lambda: 0)
msno_till_now_cnt = np.zeros(len(concat))
song_till_now_cnt = np.zeros(len(concat))
for i in range(len(concat))[::-1]:
    msno_till_now_cnt[i] = msno_dict[msno_list[i]]
    msno_dict[msno_list[i]] += 1
    song_till_now_cnt[i] = song_dict[song_list[i]]
    song_dict[song_list[i]] += 1
concat['msno_till_now_opposite_cnt'] = msno_till_now_cnt
concat['song_till_now_opposite_cnt'] = song_till_now_cnt
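As with the forward version, cumcount on the reversed frame gives the same counts vectorized:
# Equivalent to the reverse loop above.
concat['msno_till_now_opposite_cnt'] = concat[::-1].groupby('msno').cumcount()[::-1].values
concat['song_till_now_opposite_cnt'] = concat[::-1].groupby('song_id').cumcount()[::-1].values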
Time gap from each record to the last occurrence of the same user / song:
msno_last_time = concat.groupby('msno')['timestamp'].agg([('msno_timestamp','last')]).reset_index()
song_last_time = concat.groupby('song_id')['timestamp'].agg([('song_id_timestamp','last')]).reset_index()
concat = concat.merge(msno_last_time, on='msno', how='left')
concat = concat.merge(song_last_time, on='song_id', how='left')
concat['msno_2_now'] = concat['msno_timestamp'] - concat['timestamp']
concat['song_2_now'] = concat['song_id_timestamp'] - concat['timestamp']
Time difference to the previous record of the same user / song, then write back and save everything to files:
concat['nn_time_diff'] = concat.groupby('msno')['timestamp'].diff().values
concat['song_nn_time_diff'] = concat.groupby('song_id')['timestamp'].diff().values
cols = [col for col in concat.columns if col != 'msno' and col != 'song_id' and col != 'timestamp']
for col in cols:
    train_data[col] = concat[:len(train_data)][col].values
    test_data[col] = concat[len(train_data):][col].values
path = './data/'
song_extra_info_data.to_hdf(path + 'songs_extra_id.hdf', key='wsdm')
songs_data.to_hdf(path + 'songs_id_cnt_irsc_svd.hdf', key='wsdm')
song.to_hdf(path + 'song.hdf', key='wsdm')
members_data.to_hdf(path + 'members_id_cnt_svd.hdf', key='wsdm')
mem_add.to_hdf(path + 'members_add.hdf', key='wsdm')
train_data.to_hdf(path + 'train_id_cnt_dot.hdf', key='wsdm')
test_data.to_hdf(path + 'test_id_cnt_dot.hdf', key='wsdm')
4. Training
Because the dataset grows substantially after feature expansion, and memory was still insufficient even after downcasting dtypes, I used only half of the dataset in my reproduction.
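For reference, a minimal dtype-downcasting helper of the kind alluded to above (a generic sketch, not the author's exact code):
def reduce_mem(df):
    # Downcast numeric columns to the smallest dtype that holds their values.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df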
all_data = train_data.copy()
all_data = all_data.merge(song, on='song_id', how='left')
all_data = all_data.merge(members_data, on='msno', how='left')
all_data = all_data.merge(mem_add, on='msno', how='left')
all_data = all_data.merge(song_extra_info_data, on='song_id', how='left')
# hold out the last 20% (the data is in time order) as validation
train_len = int(all_data.shape[0] * 0.8)
train_data_ = all_data.iloc[:train_len]
test_data_ = all_data.iloc[train_len:]
train_features = [col for col in train_data_.columns if col!='target' and train_data_[col].dtypes!='O' and train_data_[col].dtype!='<M8[ns]']
train_label = 'target'
print(len(train_features))
dtrain = lgb.Dataset(train_data_[train_features],train_data_[train_label])
dval = lgb.Dataset(test_data_[train_features],test_data_[train_label],reference=dtrain)
params = {
    'task': 'train',
    'num_leaves': 255,
    'objective': 'binary',
    'metric': 'auc',
    'min_data_in_leaf': 15,
    'learning_rate': 0.05,
    'feature_fraction': 0.95,
    'bagging_fraction': 0.95,
    'bagging_freq': 5,
    'max_bin': 128,
    'num_threads': 64,
    'random_state': 100
}
lgb_step9 = lgb.train(params, dtrain, num_boost_round=2000,valid_sets=[dtrain,dval], early_stopping_rounds=50,verbose_eval=10)
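The post stops at validation; for completeness, here is a minimal sketch of scoring the real test set and writing a Kaggle submission, assuming the test-side frame is assembled the same way as all_data (illustrative, not the author's code):
# Illustrative only: build the test-side frame with the same merges as all_data above.
test_all = test_data.merge(song, on='song_id', how='left') \
                    .merge(members_data, on='msno', how='left') \
                    .merge(mem_add, on='msno', how='left')
preds = lgb_step9.predict(test_all[train_features], num_iteration=lgb_step9.best_iteration)
pd.DataFrame({'id': test_all['id'], 'target': preds}).to_csv('submission.csv', index=False)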
5. Summary
Working through this Kaggle project gave me many more reference points for feature engineering, and more directions to pursue in future work and study.