(12-3-06) Anime Recommendation System: Data Analysis (6)

Latest recommended article published on 2024-11-06 15:14:23

码农三叔

Article link: https://blog.csdn.net/asd343442/article/details/137457490

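12.4.9 Anime Genres

(1) We now explore the genres in the anime dataset. First, the genre field is split on commas and expanded into one row per genre with the explode function. Each genre is then converted to title case. Finally, the total number of unique genres and the number of occurrences of each genre are output.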

# Work on a copy so the assignments below do not trigger SettingWithCopyWarning
top_anime_temp3 = top_anime[["genre"]].copy()
# Split on commas, tolerating whitespace on either side of each comma
top_anime_temp3["genre"] = top_anime_temp3["genre"].str.split(r"\s*,\s*")
# Expand each list of genres into one row per genre
top_anime_temp3 = top_anime_temp3.explode("genre")
top_anime_temp3["genre"] = top_anime_temp3["genre"].str.title()

print(f'Total unique genres are {len(top_anime_temp3["genre"].unique())}')
print('Occurrences of unique genres:')
top_anime_temp3["genre"].value_counts().to_frame().T.style.set_properties(**{"background-color": "#2a9d8f","color":"white","border": "1.5px  solid black"})

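Running the code outputs the total number of unique genres and the number of occurrences of each genre, as shown in Figure 12-21.

Figure 12-21 Total number of unique genres and occurrences of each genre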
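
To make the split-and-explode step concrete, here is a minimal sketch on a made-up three-row DataFrame. The `toy` frame, its titles, and its genres are illustrative assumptions, not rows from the real dataset. It shows how one comma-separated string becomes one row per genre before counting.

```python
import pandas as pd

# Hypothetical toy data; the article's code uses top_anime instead.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["action, comedy", "comedy ,drama", "action"],
})

genres = (
    toy["genre"]
    .str.split(r"\s*,\s*")  # split on commas with optional surrounding spaces
    .explode()              # one row per genre
    .str.title()            # normalise capitalisation
)

counts = genres.value_counts()
print(counts.to_dict())
```

Because the inconsistent spacing around commas is absorbed by the split pattern, "comedy" from the first and second rows is counted as the same genre.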

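(2) Use the WordCloud library to create a word cloud that shows the distribution of genres in the anime dataset. The word cloud has a black background, uses the RdYlGn colormap, and has a maximum font size of 100. Finally, the show() method displays the generated word cloud.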

# Import the required libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Create the WordCloud object; str.cat takes a literal separator (not a regex), so join with ", "
wordcloud = WordCloud(width=800, height=250, background_color="black", colormap="RdYlGn",
                      max_font_size=100, stopwords=None, repeat=True).generate(top_anime["genre"].str.cat(sep=", "))

# Draw the word cloud
print("Let's see what the genre word cloud looks like\n")
plt.figure(figsize=(20, 8), facecolor="#ffd100")
plt.imshow(wordcloud)
plt.axis("off")
plt.margins(x=0, y=0)
plt.tight_layout(pad=0)
plt.show()

The generated word cloud is shown in Figure 12-22.

Figure 12-22 The generated word cloud
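
A word cloud sizes each word by how often it occurs. As a rough sketch of that counting step, the snippet below tallies genres with the standard library's `Counter`. The `genre_strings` list is invented sample data standing in for the genre column. A dictionary like `frequencies` could also be passed to `WordCloud.generate_from_frequencies` if you would rather control the counts yourself than let the library tokenize raw text.

```python
from collections import Counter

# Invented sample of comma-separated genre strings (stand-in for top_anime["genre"]).
genre_strings = ["Action, Comedy", "Comedy, Drama", "Action, Sci-Fi", "Comedy"]

# Split each string on commas, strip whitespace, and count every genre.
frequencies = Counter(
    genre.strip()
    for text in genre_strings
    for genre in text.split(",")
)

print(frequencies.most_common(2))
```

Counting on whole genre names keeps a hyphenated label such as "Sci-Fi" as a single key.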

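12.4.10 Final Data Preprocessing

(1) The following code performs the final data preprocessing. First, user ratings with a value of -1 are replaced with NaN. Next, the dropna function removes the rows that contain NaN values. Finally, the number of null values remaining in the processed data is output.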

import numpy as np

data = fulldata.copy()
# Assign the result back; inplace=True on a selected column may not update data in recent pandas versions
data["user_rating"] = data["user_rating"].replace(to_replace=-1, value=np.nan)
data = data.dropna(axis=0)
print("Null values after final pre-processing :")
data.isna().sum().to_frame().T.style.set_properties(**{"background-color": "#2a9d8f","color":"white","border": "1.5px  solid black"})
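
The effect of this step is easy to check on a small example. The sketch below uses a made-up `toy` frame (its rows are assumptions for illustration) in which two of four ratings are -1, and confirms that only the rated rows survive.

```python
import numpy as np
import pandas as pd

# Made-up ratings; -1 stands for a missing rating in this sketch.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "name": ["A", "B", "A", "C"],
    "user_rating": [8, -1, -1, 10],
})

# Replace the sentinel with NaN, then drop every row that contains a NaN.
toy["user_rating"] = toy["user_rating"].replace(-1, np.nan)
cleaned = toy.dropna(axis=0)

print(len(toy), "->", len(cleaned))
print(cleaned.isna().sum().to_dict())
```

Note that the rating column becomes floating point once NaN is introduced, which is why the later pivot tables show values such as 10.0.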

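(2) The following code first counts the number of ratings per user and keeps only the users with at least 50 ratings. It then uses the pivot_table function to build a pivot table with anime names as rows, user IDs as columns, and user ratings as values. Missing values are filled with 0.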

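# Number of ratings per user
selected_users = data["user_id"].value_counts()
# Keep only users with at least 50 ratings
data = data[data["user_id"].isin(selected_users[selected_users >= 50].index)]

# Rows: anime names, columns: user IDs, values: ratings (0 where a user has no rating)
data_pivot_temp = data.pivot_table(index="name", columns="user_id", values="user_rating").fillna(0)
data_pivot_temp.head()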

The output is:

user_id	3	5	7	11	14	17	21	23	24	27	...	73495	73499	73500	73501	73502	73503	73504	73507	73510	73515
name																					
&quot;0&quot;	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
&quot;Bungaku Shoujo&quot; Kyou no Oyatsu: Hatsukoi	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	10.0	0.0	0.0	0.0	0.0	0.0
&quot;Bungaku Shoujo&quot; Memoire	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	6.0	0.0
&quot;Bungaku Shoujo&quot; Movie	0.0	0.0	0.0	0.0	8.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	10.0	0.0	0.0	0.0	0.0	0.0
&quot;Eiji&quot;	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0
5 rows × 32967 columns
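
The filter-then-pivot pattern can be sketched on a handful of rows. Everything in `toy` is hypothetical, and the threshold `MIN_RATINGS` is lowered from 50 to 2 purely so that the toy data is not filtered away.

```python
import pandas as pd

# Hypothetical ratings: users 1 and 2 are active, user 3 rated only one title.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "name": ["A", "B", "C", "A", "B", "C"],
    "user_rating": [8.0, 6.0, 9.0, 7.0, 5.0, 10.0],
})

MIN_RATINGS = 2  # the article uses 50; lowered here for the toy data

counts = toy["user_id"].value_counts()
active = toy[toy["user_id"].isin(counts[counts >= MIN_RATINGS].index)]

# Rows are titles, columns are users, and missing ratings become 0.
pivot = active.pivot_table(index="name", columns="user_id", values="user_rating").fillna(0)
print(pivot)
```

User 3 disappears from the columns, and the cell for title C and user 2 is filled with 0 because that user never rated it. Since pivot_table averages by default, duplicate (name, user) pairs would be collapsed to their mean.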

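(3) The following code defines a text_cleaning function that removes or replaces HTML entities and other special strings in anime names. The function is applied to the anime names in the dataset, and pivot_table is then used again to build a pivot table with anime names as rows, user IDs as columns, and user ratings as values.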

import re

def text_cleaning(text):
    text = re.sub(r'&quot;', '', text)
    # Escape the dot so it matches a literal "." rather than any character
    text = re.sub(r'\.hack//', '', text)
    # The specific &#039; patterns must run before the generic removal, otherwise they can never match
    text = re.sub(r'A&#039;s', '', text)
    text = re.sub(r'I&#039;', 'I\'', text)
    text = re.sub(r'&#039;', '', text)
    text = re.sub(r'&amp;', 'and', text)

    return text

data["name"] = data["name"].apply(text_cleaning)

# Rebuild the pivot table with the cleaned names
data_pivot = data.pivot_table(index="name", columns="user_id", values="user_rating").fillna(0)
print("After cleaning the anime names, let's see how they look.")
data_pivot.head()

The output is:

user_id	3	5	7	11	14	17	21	23	24	27	...	73495	73499	73500	73501	73502	73503	73504	73507	73510	73515
name																					
0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
001	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
009 Re:Cyborg	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
009-1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
009-1: RandB	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5 rows × 32967 columns
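
The strings handled by text_cleaning (&quot;, &#039;, &amp;) are HTML character references. As an alternative sketch, not the approach used in this article, Python's standard `html.unescape` decodes them generically. It restores the original characters instead of deleting them, so the resulting names would differ from the output shown above. The helper name `decode_name` and the sample strings are made up for illustration.

```python
import html

def decode_name(text: str) -> str:
    """Decode HTML character references such as &quot;, &#039; and &amp;."""
    return html.unescape(text)

# Illustrative sample names containing HTML entities.
samples = [
    "&quot;Bungaku Shoujo&quot; Movie",
    "009-1: R&amp;B",
    "I&#039;m here",
]

decoded = [decode_name(s) for s in samples]
for before, after in zip(samples, decoded):
    print(f"{before!r} -> {after!r}")
```

Whether to decode or strip is a design choice: decoding keeps names closer to their real titles, while stripping (as text_cleaning does) yields simpler keys for lookups.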