项目描述:
基于西雅图酒店数据集,基于用户选择的酒店,为其推荐相似度高的Top10个其他酒店。
数据集包含三个字段:酒店姓名、地址、以及内容描述。
数据集展示:
方法步骤:
1.数据探索及导入相关包:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
pd.options.display.max_columns = 30
import matplotlib.pyplot as plt
%matplotlib inline
# 支持中文
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
# 数据探索
print(df.head())
print('数据集中的酒店个数:', len(df))
name \
0 Hilton Garden Seattle Downtown
1 Sheraton Grand Seattle
2 Crowne Plaza Seattle Downtown
3 Kimpton Hotel Monaco Seattle
4 The Westin Seattle
address \
0 1821Boren Avenue, Seattle Washington 98101 USA
1 14006th Avenue, Seattle, Washington 98101 USA
2 11136th Ave, Seattle, WA 98101
3 11014th Ave, Seattle, WA98101
4 19005th Avenue, Seattle, Washington 98101 USA
desc
0 Located on the southern tip of Lake Union, the...
1 Located in the city's vibrant core, the Sherat...
2 Located in the heart of downtown Seattle, the ...
3 What?s near our hotel downtown Seattle locatio...
4 Situated amid incredible shopping and iconic a...
数据集中的酒店个数: 152
def print_description(index):
example = df[df.index == index][['desc', 'name']].values[0]
if len(example) > 0:
print(example[0])
print('Name:', example[1])
print('第10个酒店的描述:')
print_description(10)
第10个酒店的描述:
Soak up the vibrant scene in the Living Room Bar and get in the mix with our live music and DJ series before heading to a memorable dinner at TRACE. Offering inspired seasonal fare in an award-winning atmosphere, it's a not-to-be-missed culinary experience in downtown Seattle. Work it all off the next morning at FIT®, our state-of-the-art fitness center before wandering out to explore many of the area's nearby attractions, including Pike Place Market, Pioneer Square and the Seattle Art Museum. As always, we've got you covered during your time at W Seattle with our signature Whatever/Whenever® service - your wish is truly our command.
Name: W Seattle
通过 CounterVectorizer建立三元词袋模型,统计酒店描述中,出现top20多的词。
# 得到酒店描述中n-gram特征中的TopK个
def get_top_n_words(corpus, n=1, k=None):
# 统计ngram词频矩阵
vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(corpus)
bag_of_words = vec.transform(corpus)
"""
print('feature names:')
print(vec.get_feature_names())
print('bag of words:')
print(bag_of_words.toarray())
"""
print('feature names:')
print(vec.get_feature_names())#获得所有文本的关键字
print('bag of words:')
print(bag_of_words.toarray())
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
# 按照词频从大到小排序
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
return words_freq[:k]
common_words = get_top_n_words(df['desc'], 3, 20)
print(common_words)
df1 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
df1.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='去掉停用词后,酒店描述中的Top20单词')
plt.show()
2.对文本进行预处理
# 文本预处理
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
# 对文本进行清洗
def clean_text(text):
# 全部小写
text = text.lower()
# 用空格替代一些特殊符号,如标点
text = REPLACE_BY_SPACE_RE.sub(' ', text)
# 移除BAD_SYMBOLS_RE
text = BAD_SYMBOLS_RE.sub('', text)
# 从文本中去掉停用词
text = ' '.join(word for word in text.split() if word not in STOPWORDS)
return text
# 对desc字段进行清理
df['desc_clean'] = df['desc'].apply(clean_text)
print(df['desc_clean'].head())
3.采用TF-IDF提取文本特征
# 建模
df.set_index('name', inplace = True)
# 使用TF-IDF提取文本特征
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0.01, stop_words='english')
#对文本数据进行tfidf特征表示
tfidf_matrix = tf.fit_transform(df['desc_clean'])
print('TFIDF feature names:')
print(tf.get_feature_names())
print(len(tf.get_feature_names()))
print('tfidf_matrix:')
print(tfidf_matrix)
print(tfidf_matrix.toarray())
print(tfidf_matrix.shape)
4.计算酒店之间的余弦相似度
# 计算酒店之间的余弦相似度(线性核函数)
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_similarities)
print(cosine_similarities.shape)
indices = pd.Series(df.index) #df.index是酒店名称
[[1. 0.0391713 0.10519839 ... 0.04506191 0.01188579 0.02732358]
[0.0391713 1. 0.06121291 ... 0.06131857 0.01508036 0.03706011]
[0.10519839 0.06121291 1. ... 0.09179925 0.04235642 0.05607314]
...
[0.04506191 0.06131857 0.09179925 ... 1. 0.0579826 0.04145794]
[0.01188579 0.01508036 0.04235642 ... 0.0579826 1. 0.0172546 ]
[0.02732358 0.03706011 0.05607314 ... 0.04145794 0.0172546 1. ]]
(152, 152)
5.基于相似度推荐top10的酒店
# 基于相似度矩阵和指定的酒店name,推荐TOP10酒店
def recommendations(name, cosine_similarities = cosine_similarities):
recommended_hotels = []
# 找到想要查询酒店名称的idx
idx = indices[indices == name].index[0]
print('idx=', idx)
# 对于idx酒店的余弦相似度向量按照从大到小进行排序
score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending = False)
# 取相似度最大的前10个(除了自己以外)
top_10_indexes = list(score_series.iloc[1:11].index)
# 放到推荐列表中
for i in top_10_indexes:
recommended_hotels.append(list(df.index)[i])
return recommended_hotels
print(recommendations('Hilton Seattle Airport & Conference Center'))
print(recommendations('The Bacon Mansion Bed and Breakfast'))
推荐结果如下:
idx= 49
['Embassy Suites by Hilton Seattle Tacoma International Airport', 'DoubleTree by Hilton Hotel Seattle Airport', 'Seattle Airport Marriott', 'Motel 6 Seattle Sea-Tac Airport South', 'Knights Inn Tukwila', 'Four Points by Sheraton Downtown Seattle Center', 'Radisson Hotel Seattle Airport', 'Hampton Inn Seattle/Southcenter', 'Home2 Suites by Hilton Seattle Airport', 'Red Lion Hotel Seattle Airport Sea-Tac']
idx= 116
['11th Avenue Inn Bed and Breakfast', 'Shafer Baillie Mansion Bed & Breakfast', 'Gaslight Inn', 'Bed and Breakfast Inn Seattle', 'Chittenden House Bed and Breakfast', 'Hyatt House Seattle', 'Mozart Guest House', 'Silver Cloud Hotel - Seattle Broadway', 'WorldMark Seattle - The Camlin', 'Pensione Nichols Bed and Breakfast']
总结:
基于酒店内容推荐的一般步骤:
Step1,对酒店描述(Desc)进行特征提取 N-Gram,提取N个连续字的集合,作为特征 TF-IDF,按照(min_df, max_df)提取关键词,并生成TFIDF矩阵
Step2,计算酒店之间的相似度矩阵 余弦相似度
Step3,对于指定的酒店,选择相似度最大的Top-K个酒店进行输出
原文链接:https://blog.csdn.net/lu_yunjie/article/details/108060364