Content-Based Recommendation in Python: Building a Content-Based Hotel Recommender

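This project uses the Seattle hotel dataset and applies n-grams, TF-IDF, and cosine similarity to recommend similar hotels. First, the hotel descriptions are preprocessed: stop words are removed and word frequencies are counted. Next, a TF-IDF feature matrix is built and the cosine similarity between every pair of hotels is computed. Finally, the 10 hotels most similar to the one the user selected are recommended.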

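Project description:

Using the Seattle hotel dataset, recommend the 10 other hotels most similar to a hotel chosen by the user.

The dataset has three fields: hotel name, address, and description.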

Dataset preview:

[Figure: dataset preview]

Steps:

1. Explore the data and import the required packages:

import pandas as pd
import numpy as np
# The stop word list requires a one-time nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
import matplotlib.pyplot as plt

pd.options.display.max_columns = 30

# Jupyter magic: render plots inline in the notebook
%matplotlib inline

# Use a font that can render Chinese characters in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")

# Explore the data
print(df.head())
print('Number of hotels in the dataset:', len(df))

                             name  \
0  Hilton Garden Seattle Downtown
1          Sheraton Grand Seattle
2   Crowne Plaza Seattle Downtown
3    Kimpton Hotel Monaco Seattle
4              The Westin Seattle

                                           address  \
0  1821 Boren Avenue, Seattle Washington 98101 USA
1   1400 6th Avenue, Seattle, Washington 98101 USA
2                  1113 6th Ave, Seattle, WA 98101
3                   1101 4th Ave, Seattle, WA98101
4   1900 5th Avenue, Seattle, Washington 98101 USA

                                                desc
0  Located on the southern tip of Lake Union, the...
1  Located in the city's vibrant core, the Sherat...
2  Located in the heart of downtown Seattle, the ...
3  What?s near our hotel downtown Seattle locatio...
4  Situated amid incredible shopping and iconic a...

Number of hotels in the dataset: 152

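def print_description(index):
    # Select the description and name of the hotel at the given row index
    rows = df.loc[df.index == index, ['desc', 'name']].values
    # Only print when the index actually exists in the DataFrame
    if len(rows) > 0:
        desc, name = rows[0]
        print(desc)
        print('Name:', name)

print('Description of the hotel at index 10:')
print_description(10)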

Description of the hotel at index 10:

Soak up the vibrant scene in the Living Room Bar and get in the mix with our live music and DJ series before heading to a memorable dinner at TRACE. Offering inspired seasonal fare in an award-winning atmosphere, it's a not-to-be-missed culinary experience in downtown Seattle. Work it all off the next morning at FIT®, our state-of-the-art fitness center before wandering out to explore many of the area's nearby attractions, including Pike Place Market, Pioneer Square and the Seattle Art Museum. As always, we've got you covered during your time at W Seattle with our signature Whatever/Whenever® service - your wish is truly our command.

Name: W Seattle

Use CountVectorizer to build a trigram bag-of-words model and find the 20 most frequent trigrams in the hotel descriptions.

# Return the top-k n-grams found in the hotel descriptions
def get_top_n_words(corpus, n=1, k=None):
    # Build the n-gram count matrix (English stop words removed)
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    print('feature names:')
    # All n-grams in the vocabulary
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    print(vec.get_feature_names_out())
    print('bag of words:')
    print(bag_of_words.toarray())
    # Total count of each n-gram over the whole corpus
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort by frequency, highest first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]

common_words = get_top_n_words(df['desc'], 3, 20)
print(common_words)

df1 = pd.DataFrame(common_words, columns=['desc', 'count'])
df1.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='Top 20 trigrams in hotel descriptions after removing stop words')
plt.show()

[Figure: bar chart of the top 20 trigrams in the hotel descriptions]

2. Preprocess the text

# Text preprocessing
# Raw strings avoid invalid-escape warnings in the regular expressions
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

# Clean a single description
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace selected special characters (such as punctuation) with a space
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # Delete every character matched by BAD_SYMBOLS_RE
    text = BAD_SYMBOLS_RE.sub('', text)
    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

# Clean the desc column
df['desc_clean'] = df['desc'].apply(clean_text)
print(df['desc_clean'].head())
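The effect of the cleaning steps is easier to see on a single sentence. The sketch below repeats the same two regular expressions, but uses a small hand-written stop word set (`DEMO_STOPWORDS`, an assumption made so the example runs without downloading the NLTK list) and a made-up sample sentence.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")

# Hypothetical stand-in for the NLTK English stop word list
DEMO_STOPWORDS = {"in", "the", "is", "a", "of", "and"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()                        # lowercase everything
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # listed symbols become spaces
    text = BAD_SYMBOLS_RE.sub("", text)        # other symbols are deleted
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "Located in the city's vibrant core, the Sheraton (Downtown) is open 24/7!"
cleaned = clean_text_demo(sample)
print(cleaned)
# located citys vibrant core sheraton downtown open 24 7
```

Note the difference between the two patterns: "/" is replaced by a space, so "24/7" becomes two tokens, while the apostrophe is deleted outright, so "city's" becomes "citys".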

3. Extract text features with TF-IDF

# Modeling
df.set_index('name', inplace=True)

# Extract text features with TF-IDF over unigrams, bigrams, and trigrams
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0.01, stop_words='english')

# Build the TF-IDF representation of the cleaned descriptions
tfidf_matrix = tf.fit_transform(df['desc_clean'])

print('TFIDF feature names:')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(tf.get_feature_names_out())
print(len(tf.get_feature_names_out()))

print('tfidf_matrix:')
print(tfidf_matrix)
print(tfidf_matrix.toarray())
print(tfidf_matrix.shape)

4. Compute the cosine similarity between hotels

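# Compute the cosine similarity between hotels with a linear kernel.
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# of two rows equals their cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_similarities)
print(cosine_similarities.shape)

# df.index holds the hotel names; this Series maps row position to name
indices = pd.Series(df.index)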

[[1.         0.0391713  0.10519839 ... 0.04506191 0.01188579 0.02732358]
 [0.0391713  1.         0.06121291 ... 0.06131857 0.01508036 0.03706011]
 [0.10519839 0.06121291 1.         ... 0.09179925 0.04235642 0.05607314]
 ...
 [0.04506191 0.06131857 0.09179925 ... 1.         0.0579826  0.04145794]
 [0.01188579 0.01508036 0.04235642 ... 0.0579826  1.         0.0172546 ]
 [0.02732358 0.03706011 0.05607314 ... 0.04145794 0.0172546  1.        ]]

(152, 152)
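The reason a linear kernel can stand in for cosine similarity here is that the TF-IDF rows already have unit length. The following sketch checks this on a few made-up descriptions (`toy_docs` is an assumption, not data from the article) by comparing `linear_kernel` with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions
toy_docs = [
    "airport hotel with free shuttle",
    "hotel near the airport with parking",
    "historic bed and breakfast downtown",
]

# Default TfidfVectorizer settings L2-normalize every row
X = TfidfVectorizer().fit_transform(toy_docs)

dot = linear_kernel(X, X)       # plain dot products between rows
cos = cosine_similarity(X, X)   # dot products divided by the row norms

same = np.allclose(dot, cos)
print(same)
print(np.round(dot, 3))
```

The matrix is symmetric with ones on the diagonal, which matches the shape of the 152 x 152 output above. If the vectorizer were created with `norm=None`, the two functions would no longer agree and `cosine_similarity` would be the one to use.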

5. Recommend the top 10 hotels by similarity

# Recommend the top 10 hotels for a given hotel name, based on the similarity matrix
def recommendations(name, cosine_similarities=cosine_similarities):
    # Find the row position of the requested hotel
    idx = indices[indices == name].index[0]
    print('idx=', idx)
    # Similarity of this hotel to every other hotel, sorted from most to least similar.
    # The hotel itself is dropped explicitly instead of assuming it sorts first.
    score_series = pd.Series(cosine_similarities[idx]).drop(idx).sort_values(ascending=False)
    # Row positions of the 10 most similar hotels
    top_10_indexes = list(score_series.iloc[:10].index)
    # Map the row positions back to hotel names
    recommended_hotels = [df.index[i] for i in top_10_indexes]
    return recommended_hotels

print(recommendations('Hilton Seattle Airport & Conference Center'))
print(recommendations('The Bacon Mansion Bed and Breakfast'))

The recommendations are as follows:

idx= 49

['Embassy Suites by Hilton Seattle Tacoma International Airport', 'DoubleTree by Hilton Hotel Seattle Airport', 'Seattle Airport Marriott', 'Motel 6 Seattle Sea-Tac Airport South', 'Knights Inn Tukwila', 'Four Points by Sheraton Downtown Seattle Center', 'Radisson Hotel Seattle Airport', 'Hampton Inn Seattle/Southcenter', 'Home2 Suites by Hilton Seattle Airport', 'Red Lion Hotel Seattle Airport Sea-Tac']

idx= 116

['11th Avenue Inn Bed and Breakfast', 'Shafer Baillie Mansion Bed & Breakfast', 'Gaslight Inn', 'Bed and Breakfast Inn Seattle', 'Chittenden House Bed and Breakfast', 'Hyatt House Seattle', 'Mozart Guest House', 'Silver Cloud Hotel - Seattle Broadway', 'WorldMark Seattle - The Camlin', 'Pensione Nichols Bed and Breakfast']

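Summary:

General steps for content-based hotel recommendation:

Step 1: extract features from the hotel descriptions (desc). N-grams: use sequences of N consecutive words as features. TF-IDF: select terms according to the document-frequency thresholds (min_df, max_df) and build the TF-IDF matrix.

Step 2: compute the similarity matrix between hotels using cosine similarity.

Step 3: for a given hotel, output the Top-K hotels with the highest similarity.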

Original link: https://blog.csdn.net/lu_yunjie/article/details/108060364
