Deal with relational data using libFM with blocks

Chloezhao

Article link: https://blog.csdn.net/Chloezhao/article/details/53487088

Original English post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/

train.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20

and test.libfm

0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1

I will merge them, which makes the whole process easier:

dataset.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
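A minimal sketch of that merge in Python, assuming the two files are named as above. The helper name `merge_libfm` is made up for this illustration. It also returns the number of rows in each part, because the merged rows have to be cut at that point again later.

```python
def merge_libfm(train_path, test_path, out_path):
    """Concatenate two libFM files (train rows first) and return (n_train, n_test)."""
    with open(train_path) as f:
        train_rows = [line.strip() for line in f if line.strip()]
    with open(test_path) as f:
        test_rows = [line.strip() for line in f if line.strip()]
    with open(out_path, "w") as out:
        for row in train_rows + test_rows:
            out.write(row + "\n")
    return len(train_rows), len(test_rows)


# Example call (needs the two input files to exist):
# n_train, n_test = merge_libfm("train.libfm", "test.libfm", "dataset.libfm")
```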

So if we want to use the block structure, we will have 5 files:

rel_user.libfm (features 0, 1 and 6-8 are user features)

~~0 0:1 6:1~~
~~0 1:1 8:1~~

But in fact you can avoid broken feature ID numbering like (0-1, 6-8). We can compress it, so (0-1 -> 0-1 and 6-8 -> 2-4):

0 0:1 2:1
0 1:1 4:1
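The renumbering can be sketched in a few lines of Python. This is only an illustration; the helper names `make_mapping` and `remap_row` are invented here. The mapping is built from the declared ID range of the block (0, 1, 6, 7, 8 for users) and not from the IDs that happen to occur, which is why 8 becomes 4 even though 7 is never used in the toy data.

```python
def make_mapping(feature_ids):
    """Map each original feature id to its rank, giving ids 0..n-1."""
    return {old: new for new, old in enumerate(sorted(feature_ids))}


def remap_row(line, mapping):
    """Rewrite the feature ids of one libFM row; target and values are kept."""
    target, *pairs = line.split()
    out = [target]
    for pair in pairs:
        fid, value = pair.split(":")
        out.append("{}:{}".format(mapping[int(fid)], value))
    return " ".join(out)


user_map = make_mapping([0, 1, 6, 7, 8])
print(remap_row("0 0:1 6:1", user_map))  # 0 0:1 2:1
print(remap_row("0 1:1 8:1", user_map))  # 0 1:1 4:1
```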

rel_product.libfm (features 2-5 and 9 are product features). Same thing, we can compress:

~~0 2:1 9:12.5~~
~~0 3:1 9:20~~
~~0 4:1 9:78~~
~~0 5:1~~

to

0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1

rel_user.train (which is now the mapping: the first 3 lines point to the first line of rel_user.libfm | /!\ we are using 0-based indexing)

0
0
0
1
1

rel_product.train (the mapping)

0
1
2
0
1

y.train (which contains the ratings only)

5
5
4
1
1
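The five files can also be derived mechanically. Below is a rough sketch, assuming the merged rows of dataset.libfm and the feature ID ranges given above. The function `split_blocks` and its parameters are hypothetical names, and the list lookups are fine for a toy example but slow on large data.

```python
def split_blocks(rows, user_ids, product_ids):
    """Split merged libFM rows into a user block, a product block and targets.

    Returns (user_block, user_index, product_block, product_index, targets).
    Each block holds the distinct rows (compressed ids, dummy target 0);
    each index list gives, per input row, the 0-based line in its block.
    """
    user_map = {old: new for new, old in enumerate(sorted(user_ids))}
    product_map = {old: new for new, old in enumerate(sorted(product_ids))}
    blocks = {"user": [], "product": []}
    index = {"user": [], "product": []}
    targets = []
    for line in rows:
        target, *pairs = line.split()
        targets.append(target)
        parts = {"user": [], "product": []}
        for pair in pairs:
            fid, value = pair.split(":")
            fid = int(fid)
            if fid in user_map:
                parts["user"].append("{}:{}".format(user_map[fid], value))
            else:  # anything that is not a user feature must be a product feature
                parts["product"].append("{}:{}".format(product_map[fid], value))
        for name in ("user", "product"):
            block_row = " ".join(["0"] + parts[name])
            if block_row not in blocks[name]:
                blocks[name].append(block_row)
            index[name].append(blocks[name].index(block_row))
    return blocks["user"], index["user"], blocks["product"], index["product"], targets


dataset_rows = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]
users, user_idx, products, product_idx, y = split_blocks(
    dataset_rows, user_ids=[0, 1, 6, 7, 8], product_ids=[2, 3, 4, 5, 9])
n_train = 5
print(users)                  # ['0 0:1 2:1', '0 1:1 4:1']
print(products[3])            # 0 3:1
print(user_idx[:n_train])     # [0, 0, 0, 1, 1]
print(product_idx[:n_train])  # [0, 1, 2, 0, 1]
print(y[:n_train])            # ['5', '5', '4', '1', '1']
```

Running it on the merged rows, not only the training rows, is what gives the test-only product (feature 5) its own line in the product block.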

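Almost done…

Now you need to create the .x and .xt files for the user block and the product block. For this you need the tools that ship with libFM, available in bin/ after you compile them.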

./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y

You are forced to use the flag --ofiley even if rel_user.y will never be used. You can delete it every time.

Then

./bin/transpose --ifile rel_user.x --ofile rel_user.xt
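Both commands have to be repeated for every block, so they are easy to script. A minimal sketch, assuming the compiled tools live in ./bin and the block files are named as above. The helpers `block_commands` and `build_blocks` are invented for this example and use only the flags shown in the two commands.

```python
import subprocess


def block_commands(name, bin_dir="./bin"):
    """Return the convert and transpose command lines for one block."""
    convert = [bin_dir + "/convert",
               "--ifile", name + ".libfm",
               "--ofilex", name + ".x",
               "--ofiley", name + ".y"]
    transpose = [bin_dir + "/transpose",
                 "--ifile", name + ".x",
                 "--ofile", name + ".xt"]
    return [convert, transpose]


def build_blocks(names, bin_dir="./bin", run=subprocess.run):
    """Run convert then transpose for every block name, stopping on failure."""
    for name in names:
        for cmd in block_commands(name, bin_dir):
            run(cmd, check=True)


# build_blocks(["rel_user", "rel_product"])  # needs the compiled libFM tools
```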

Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, we only need to generate rel_user.test, rel_product.test and y.test.
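Since the merged file keeps the training rows first, the per-row values only need to be cut at the number of training rows. The sketch below illustrates this under a few assumptions: the index lists were worked out by hand from dataset.libfm, `write_split` is a made-up helper, and the files go to a scratch directory so that nothing is overwritten.

```python
import os
import tempfile


def write_split(prefix, values, n_train, directory="."):
    """Write values[:n_train] to <prefix>.train and the rest to <prefix>.test."""
    paths = []
    for suffix, part in ((".train", values[:n_train]), (".test", values[n_train:])):
        path = os.path.join(directory, prefix + suffix)
        with open(path, "w") as f:
            f.writelines("{}\n".format(v) for v in part)
        paths.append(path)
    return paths


# Per-row block indices for the 7 rows of dataset.libfm (5 train + 2 test),
# worked out by hand from the listings above.
user_index = [0, 0, 0, 1, 1, 1, 0]
product_index = [0, 1, 2, 0, 1, 2, 3]
targets = [5, 5, 4, 1, 1, 0, 0]

out_dir = tempfile.mkdtemp()  # scratch directory, only for this example
write_split("rel_user", user_index, 5, out_dir)
write_split("rel_product", product_index, 5, out_dir)
write_split("y", targets, 5, out_dir)
```

With these values rel_user.test contains 1 and 0, rel_product.test contains 2 and 3, and y.test holds the two placeholder targets.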

At this point, you will have a lot of files: (rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test)

And run:

./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output

It is a bit overkill for this problem, but I hope you get the point.

Now a real example

For this example, I will use ml-1m.zip, which you can get from the MovieLens dataset (1 million ratings).

ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

movies.dat (sample) / Format: MovieID::Title::Genres

1::Toy Story (1995)::Animation|Children’s|Comedy
2::Jumanji (1995)::Adventure|Children’s|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama

users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code

1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

I will create three different models.

Easiest libFM files to train without block. I’ll use those features: UserID, MovieID
Regular libFM files to train without block. I’ll use those features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
libFM files to train with block. I’ll also use those features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)

Model 1 and 2 can be created using the following code:

 
    # -*- coding: utf-8 -*-
    __author__ = 'Silbermann Thierry'
    __license__ = 'WTFPL'

    import pandas as pd
    import numpy as np


    def create_libfm(w_filename, model_lvl=1):

        # Load the data (the ml-1m files use '::' as separator)
        file_ratings = 'ratings.dat'
        data_ratings = pd.read_csv(file_ratings, delimiter='::', engine='python',
                                   names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])

        # movies.dat contains non-UTF-8 characters, hence the explicit encoding
        file_movies = 'movies.dat'
        data_movies = pd.read_csv(file_movies, delimiter='::', engine='python',
                                  encoding='latin-1',
                                  names=['MovieID', 'Name', 'Genre_list'])

        file_users = 'users.dat'
        data_users = pd.read_csv(file_users, delimiter='::', engine='python',
                                 names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

        # Transform data
        ratings = data_ratings['Ratings']
        data_ratings = data_ratings.drop(['Ratings', 'Timestamp'], axis=1)
        data_movies = data_movies.drop(['Name'], axis=1)
        data_users = data_users.drop(['ZipCode'], axis=1)

        # Attach the user and movie attributes to every rating row; a left
        # merge keeps the row order of data_ratings
        data = data_ratings.merge(data_users, on='UserID', how='left')
        data = data.merge(data_movies, on='MovieID', how='left')
        print('Data loaded')

        # Map the data: each feature gets its own contiguous range of column ids
        offset_array = [0]
        dict_array = []

        feat = ['UserID', 'MovieID']
        if model_lvl > 1:
            # Genre_list is used as one categorical value (the whole 'A|B|C'
            # string), not as one column per genre
            feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])

        for feature_name in feat:
            uniq = np.unique(data[feature_name])
            offset_array.append(len(uniq) + offset_array[-1])
            # The offset is already included here, so it must not be added
            # a second time when the file is written
            dict_array.append({key: value + offset_array[-2]
                               for value, key in enumerate(uniq)})
        print('Mapping done')

        # Create libFM file
        with open(w_filename, 'w') as w:
            for i in range(data.shape[0]):
                s = "{0}".format(ratings[i])
                for index_feat, feature_name in enumerate(feat):
                    value = data[feature_name][i]
                    if value in dict_array[index_feat]:
                        s += " {0}:1".format(dict_array[index_feat][value])
                s += '\n'
                w.write(s)


    if __name__ == '__main__':
        create_libfm('model1.libfm', 1)
        create_libfm('model2.libfm', 2)
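To make the mapping step concrete, here is a small sketch that applies the same offset idea to the five sample rows of ratings.dat shown earlier, using only UserID and MovieID (model 1). It is an illustration and not part of the original script; plain `sorted(set(...))` stands in for `np.unique` so that it runs without dependencies.

```python
# (UserID, MovieID, Rating) taken from the ratings.dat sample
rows = [(1, 1193, 5), (1, 661, 3), (1, 914, 3), (1, 3408, 4), (1, 2355, 5)]
columns = [[r[0] for r in rows], [r[1] for r in rows]]  # UserID, MovieID

offset = 0
mappings = []
for column in columns:
    uniq = sorted(set(column))
    # every feature starts where the previous one stopped
    mappings.append({key: idx + offset for idx, key in enumerate(uniq)})
    offset += len(uniq)

lines = ["{} {}:1 {}:1".format(r[2], mappings[0][r[0]], mappings[1][r[1]])
         for r in rows]
print(lines[0])  # 5 0:1 3:1
print(lines[1])  # 3 0:1 1:1
```

The single user gets column 0, and the five movies get columns 1 to 5 in sorted order of their MovieID, so movie 1193 ends up in column 3.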

So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training and a test set file, which I'll call train_m1.libfm and test_m1.libfm (same thing for model2: train_m2.libfm and test_m2.libfm).
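One possible way to do the split is sketched below. The 20% test fraction, the fixed seed and the helper names are assumptions of this example, not something the post prescribes. Both model files list the ratings in the same order, so using the same seed for both keeps the two splits comparable.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Pick a random subset of row positions as test set; keep file order."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = int(len(rows) * test_fraction)
    test_pos = set(order[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_pos]
    test = [r for i, r in enumerate(rows) if i in test_pos]
    return train, test


def split_file(path, train_path, test_path, test_fraction=0.2, seed=0):
    """Split one libFM file into a training file and a test file."""
    with open(path) as f:
        rows = f.readlines()
    train, test = split_rows(rows, test_fraction, seed)
    with open(train_path, "w") as f:
        f.writelines(train)
    with open(test_path, "w") as f:
        f.writelines(test)


# split_file("model1.libfm", "train_m1.libfm", "test_m1.libfm")
# split_file("model2.libfm", "train_m2.libfm", "test_m2.libfm")
```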

Then you run libFM like this:

./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1

But I guess you already know how to do that.
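Once libFM has finished, the predictions can be compared with the true ratings of the test file. The sketch below is a minimal example and assumes that the prediction file (output_m1 above) holds one prediction per line, in the same order as the rows of test_m1.libfm. The helper names are illustrative.

```python
import math


def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(targets) != len(predictions):
        raise ValueError("length mismatch")
    se = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(se / len(targets))


def read_targets(libfm_path):
    """Read the target (first token) of every row of a libFM file."""
    with open(libfm_path) as f:
        return [float(line.split()[0]) for line in f if line.strip()]


def read_predictions(out_path):
    """Read one prediction per line from the prediction file."""
    with open(out_path) as f:
        return [float(line) for line in f if line.strip()]


# print(rmse(read_targets("test_m1.libfm"), read_predictions("output_m1")))
print(rmse([5.0, 3.0], [4.0, 3.0]))  # sqrt(0.5), about 0.707
```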
