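0. Theory
Compute the correlation between movies; movies that are highly correlated are treated as similar movies.
Measures of correlation:
Pearson correlation coefficient: measures linear correlation. If r = 0, we can only say that x and y have no linear relationship, not that they are unrelated. The larger the absolute value of the coefficient, the stronger the correlation: the closer it is to 1 or -1, the stronger the correlation, and the closer it is to 0, the weaker.
Spearman correlation coefficient: a nonparametric measure of the dependence between two variables. It uses a monotonic function to assess how the two variables are related. If the data contain no repeated values and the two variables are perfectly monotonically related, the Spearman coefficient is +1 or −1.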
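The difference between the two coefficients can be checked on a few made-up numbers. The sketch below is only an illustration: the helper names `pearson` and `spearman` and the sample series are assumptions, not part of the original notebook. Spearman is computed here as the Pearson correlation of the ranks, which is its usual definition.

```python
import pandas as pd


def pearson(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation coefficient (strength of the linear relationship)."""
    return x.corr(y)  # Series.corr uses Pearson by default


def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())


# Symmetric parabola: y is fully determined by x, yet there is no linear trend.
x_sym = pd.Series([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
r_parabola = pearson(x_sym, x_sym ** 2)

# Monotonic but nonlinear relationship: y = x**3 on positive x.
x_pos = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
r_cubic = pearson(x_pos, x_pos ** 3)
rho_cubic = spearman(x_pos, x_pos ** 3)

print("Pearson,  y = x^2:", round(r_parabola, 3))
print("Pearson,  y = x^3:", round(r_cubic, 3))
print("Spearman, y = x^3:", round(rho_cubic, 3))
```

The parabola gives a Pearson coefficient of essentially zero even though y depends entirely on x, while the cubic gives a Spearman coefficient of 1 and a Pearson coefficient below 1.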
1. Finding movie similarity
import pandas as pd
# u.data is tab-separated; keep only the first three columns
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")
# u.item is pipe-separated; keep only the movie id and the title
m_cols = ['movie_id', 'title']
movies = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")
# Attach the title to every rating (merges on the shared movie_id column)
ratings = pd.merge(movies, ratings)
ratings.head()
   movie_id  title             user_id  rating
0         1  Toy Story (1995)      308       4
1         1  Toy Story (1995)      287       5
2         1  Toy Story (1995)      148       4
3         1  Toy Story (1995)      280       4
4         1  Toy Story (1995)       66       3
movieRatings = ratings.pivot_table(index=['user_id'], columns=['title'], values='rating')
movieRatings.head()
title    'Til There Was You (1997)  1-900 (1994)  101 Dalmatians (1996)  \
user_id
0                              NaN           NaN                    NaN
1                              NaN           NaN                    2.0
2                              NaN           NaN                    NaN
3                              NaN           NaN                    NaN
4                              NaN           NaN                    NaN

title    12 Angry Men (1957)  187 (1997)  2 Days in the Valley (1996)  \
user_id
0                        NaN         NaN                          NaN
1                        5.0         NaN                          NaN
2                        NaN         NaN                          NaN
3                        NaN         2.0                          NaN
4                        NaN         NaN                          NaN

title    20,000 Leagues Under the Sea (1954)  2001: A Space Odyssey (1968)  \
user_id
0                                        NaN                           NaN
1                                        3.0                           4.0
2                                        NaN                           NaN
3                                        NaN                           NaN
4                                        NaN                           NaN

title    3 Ninjas: High Noon At Mega Mountain (1998)  39 Steps, The (1935)  ...  \
user_id
0                                                NaN                   NaN  ...
1                                                NaN                   NaN  ...
2                                                1.0                   NaN  ...
3                                                NaN                   NaN  ...
4                                                NaN                   NaN  ...

title    Yankee Zulu (1994)  Year of the Horse (1997)  You So Crazy (1994)  \
user_id
0                       NaN                       NaN                  NaN
1                       NaN                       NaN                  NaN
2                       NaN                       NaN                  NaN
3                       NaN                       NaN                  NaN
4                       NaN                       NaN                  NaN

title    Young Frankenstein (1974)  Young Guns (1988)  Young Guns II (1990)  \
user_id
0                              NaN                NaN                   NaN
1                              5.0                3.0                   NaN
2                              NaN                NaN                   NaN
3                              NaN                NaN                   NaN
4                              NaN                NaN                   NaN

title    Young Poisoner's Handbook, The (1995)  Zeus and Roxanne (1997)  \
user_id
0                                          NaN                      NaN
1                                          NaN                      NaN
2                                          NaN                      NaN
3                                          NaN                      NaN
4                                          NaN                      NaN

title    unknown  Á köldum klaka (Cold Fever) (1994)
user_id
0            NaN                                 NaN
1            4.0                                 NaN
2            NaN                                 NaN
3            NaN                                 NaN
4            NaN                                 NaN
5 rows × 1664 columns
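The reshaping done by `pivot_table` is easier to see on a tiny table. The following is a minimal sketch on hypothetical data (the frame `toy` and its values are invented for illustration): each (user, title) pair becomes one cell, and every pair without a rating becomes NaN, which is why the real matrix above is mostly NaN.

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (user, movie) rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title":   ["A", "B", "A", "B", "C"],
    "rating":  [5, 3, 4, 2, 1],
})

# Wide format: one row per user, one column per title.
wide = toy.pivot_table(index="user_id", columns="title", values="rating")
print(wide)

# Share of the user x movie grid that actually holds a rating.
density = wide.notna().to_numpy().mean()
print("filled cells:", round(density, 3))
```

Note that `pivot_table` averages duplicate (user, title) pairs by default, so the values come back as floats even when the input ratings are integers.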
starWarsRatings = movieRatings['Star Wars (1977)']
starWarsRatings.head()
user_id
0 5.0
1 5.0
2 5.0
3 NaN
4 5.0
Name: Star Wars (1977), dtype: float64
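# Correlate every movie's rating column with the Star Wars ratings
similarMovies = movieRatings.corrwith(starWarsRatings)
# Drop movies that have no computable correlation with Star Wars
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
df.head(10)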
C:\Users\Administrator\anaconda3\lib\site-packages\numpy\lib\function_base.py:2845: RuntimeWarning: Degrees of freedom <= 0 for slice
c = cov(x, y, rowvar, dtype=dtype)
C:\Users\Administrator\anaconda3\lib\site-packages\numpy\lib\function_base.py:2704: RuntimeWarning: divide by zero encountered in divide
c *= np.true_divide(1, fact)
                                            0
title
'Til There Was You (1997)            0.872872
1-900 (1994)                        -0.645497
101 Dalmatians (1996)                0.211132
12 Angry Men (1957)                  0.184289
187 (1997)                           0.027398
2 Days in the Valley (1996)          0.066654
20,000 Leagues Under the Sea (1954)  0.289768
2001: A Space Odyssey (1968)         0.230884
39 Steps, The (1935)                 0.106453
8 1/2 (1963)                        -0.142977
similarMovies.sort_values(ascending=False)
title
Commandments (1997) 1.0
Cosi (1996) 1.0
No Escape (1994) 1.0
Stripes (1981) 1.0
Man of the Year (1995) 1.0
...
For Ever Mozart (1996) -1.0
Frankie Starlight (1995) -1.0
I Like It Like That (1994) -1.0
American Dream (1990) -1.0
Theodore Rex (1995) -1.0
Length: 1410, dtype: float64
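The perfect scores of 1.0 and -1.0 at both ends of this list most plausibly come from movies that share only a couple of raters with Star Wars: any two points lie exactly on a line. The sketch below reproduces the effect on hypothetical data (the frame `toy` and its column names are invented) and counts the overlapping raters next to each correlation.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical rating matrix: rows are users, columns are movies.
toy = pd.DataFrame({
    "Target":    [5.0, 4.0, 1.0, 2.0, 5.0],
    "WellRated": [4.0, 5.0, 2.0, 1.0, 4.0],  # shares 5 raters with Target
    "Obscure":   [nan, nan, 1.0, 2.0, nan],  # shares only 2 raters
})

target = toy["Target"]
corr = toy.corrwith(target)

# Number of users who rated both the target and each movie:
# keep the rows where the target is rated, then count non-NaN cells per column.
overlap = toy.loc[target.notna()].count()

summary = pd.DataFrame({"similarity": corr, "overlap": overlap})
print(summary)
```

With only two shared raters, "Obscure" gets a correlation of 1 even though that says very little about real similarity, so a minimum number of ratings is needed before the scores can be trusted.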
import numpy as np
# For every title: number of ratings (size) and average rating (mean)
movieStats = ratings.groupby('title').agg({'rating': [np.size, np.mean]})
movieStats.head()
                          rating
                            size      mean
title
'Til There Was You (1997)      9  2.333333
1-900 (1994)                   5  2.600000
101 Dalmatians (1996)        109  2.908257
12 Angry Men (1957)          125  4.344000
187 (1997)                    41  3.024390
# Keep only movies with at least 100 ratings
popularMovies = movieStats.loc[movieStats['rating']['size'] >= 100]
popularMovies.sort_values([('rating', 'mean')], ascending=False)[:15]
                                       rating
                                         size      mean
title
Close Shave, A (1995)                     112  4.491071
Schindler's List (1993)                   298  4.466443
Wrong Trousers, The (1993)                118  4.466102
Casablanca (1942)                         243  4.456790
Shawshank Redemption, The (1994)          283  4.445230
Rear Window (1954)                        209  4.387560
Usual Suspects, The (1995)                267  4.385768
Star Wars (1977)                          584  4.359589
12 Angry Men (1957)                       125  4.344000
Citizen Kane (1941)                       198  4.292929
To Kill a Mockingbird (1962)              219  4.292237
One Flew Over the Cuckoo's Nest (1975)    264  4.291667
Silence of the Lambs, The (1991)          390  4.289744
North by Northwest (1959)                 179  4.284916
Godfather, The (1972)                     413  4.283293
similarMovies = pd.DataFrame(similarMovies, columns=['similarity'])
similarMovies.head()
                           similarity
title
'Til There Was You (1997)    0.872872
1-900 (1994)                -0.645497
101 Dalmatians (1996)        0.211132
12 Angry Men (1957)          0.184289
187 (1997)                   0.027398
df = popularMovies.join(similarMovies).reset_index()
df.head()
C:\Users\Administrator\AppData\Local\Temp\ipykernel_888\955805574.py:2: FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left, 1 on the right)
df = popularMovies.join(similarMovies).reset_index()
   title                         (rating, size)  (rating, mean)  similarity
0  101 Dalmatians (1996)                    109        2.908257    0.211132
1  12 Angry Men (1957)                      125        4.344000    0.184289
2  2001: A Space Odyssey (1968)             259        3.969112    0.230884
3  Absolute Power (1997)                    127        3.370079    0.085440
4  Abyss, The (1989)                        151        3.589404    0.203709
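The FutureWarning above is raised because `popularMovies` has two column levels while `similarMovies` has one, and the message says such joins will stop working in a later pandas version. One possible way around it, shown here as a sketch on hypothetical stand-in frames (`stats` and `sims` are invented names and values), is to flatten the two-level columns into plain tuple labels before joining; the resulting column labels are the same `(rating, size)` and `(rating, mean)` seen in the output above.

```python
import pandas as pd

titles = pd.Index(["Movie A", "Movie B"], name="title")

# Hypothetical stand-in for popularMovies: two-level columns.
stats = pd.DataFrame(
    [[109, 2.9], [125, 4.3]],
    index=titles,
    columns=pd.MultiIndex.from_tuples([("rating", "size"), ("rating", "mean")]),
)
# Hypothetical stand-in for similarMovies: single-level columns.
sims = pd.DataFrame({"similarity": [0.21, 0.18]}, index=titles)

# Flatten the MultiIndex into a single level whose labels are tuples,
# so both frames have the same number of column levels.
flat = stats.copy()
flat.columns = flat.columns.to_flat_index()

joined = flat.join(sims).reset_index()
print(joined)
```

Assigning plain names instead, for example `flat.columns = ['size', 'mean']`, would also work and gives shorter labels, at the cost of changing the column names used afterwards.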
df.sort_values('similarity', ascending=False)[:15]
     title                                               (rating, size)  (rating, mean)  similarity
295  Star Wars (1977)                                               584        4.359589    1.000000
99   Empire Strikes Back, The (1980)                                368        4.206522    0.748353
255  Return of the Jedi (1983)                                      507        4.007890    0.672556
247  Raiders of the Lost Ark (1981)                                 420        4.252381    0.536117
24   Austin Powers: International Man of Mystery (1...             130        3.246154    0.377433
298  Sting, The (1973)                                              241        4.058091    0.367538
162  Indiana Jones and the Last Crusade (1989)                      331        3.930514    0.350107
235  Pinocchio (1940)                                               101        3.673267    0.347868
119  Frighteners, The (1996)                                        115        3.234783    0.332729
176  L.A. Confidential (1997)                                       297        4.161616    0.319065
326  Wag the Dog (1997)                                             137        3.510949    0.318645
94   Dumbo (1941)                                                   123        3.495935    0.317656
49   Bridge on the River Kwai, The (1957)                           165        4.175758    0.316580
232  Philadelphia Story, The (1940)                                 104        4.115385    0.314272
200  Miracle on 34th Street (1994)                                  101        3.722772    0.310921
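The whole procedure (pivot, correlate, filter by popularity, sort) can be collected into one function. This is a sketch under assumptions rather than the book's code: the function name `similar_movies`, its `min_ratings` parameter, and the toy data are all hypothetical, and the statistics are computed with flat `size` and `mean` columns so that no multi-level join is needed.

```python
import pandas as pd


def similar_movies(ratings: pd.DataFrame, title: str, min_ratings: int = 100) -> pd.DataFrame:
    """Rank movies by the Pearson correlation of their user ratings with `title`.

    `ratings` must have the columns user_id, title and rating.
    """
    # Users as rows, movie titles as columns, NaN where a user gave no rating.
    wide = ratings.pivot_table(index="user_id", columns="title", values="rating")
    similarity = wide.corrwith(wide[title]).dropna().rename("similarity")
    # Rating count and average per movie, with single-level column names.
    stats = ratings.groupby("title")["rating"].agg(["size", "mean"])
    popular = stats[stats["size"] >= min_ratings]
    return popular.join(similarity).sort_values("similarity", ascending=False)


# Hypothetical toy data: (user_id, title, rating).
rows = [
    (1, "Target", 5), (2, "Target", 4), (3, "Target", 2), (4, "Target", 1),
    (1, "Sequel", 5), (2, "Sequel", 4), (3, "Sequel", 1), (4, "Sequel", 2),
    (1, "Opposite", 1), (2, "Opposite", 2), (3, "Opposite", 4), (4, "Opposite", 5),
    (1, "Rare", 3), (2, "Rare", 4),
]
toy_ratings = pd.DataFrame(rows, columns=["user_id", "title", "rating"])

result = similar_movies(toy_ratings, "Target", min_ratings=3)
print(result)
```

On the toy data, "Rare" has only two ratings and is filtered out, while the remaining movies are ordered from most to least similar to "Target". For the MovieLens data the call would look like `similar_movies(ratings, 'Star Wars (1977)')`.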
2. References
《Python数据科学与机器学习:从入门到实践》 (Python Data Science and Machine Learning: From Beginner to Practice), by Frank Kane (弗兰克•凯恩), USA
Source code download: https://www.ituring.com.cn/book/2426