[Data Mining Case Study 02] Recommending Similar Hotels Based on Hotel Text Descriptions

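Choosing the right hotel is an important decision when planning a trip. With so many hotels to choose from, however, finding one that matches personal preferences can be a challenge. This article shows how to build a hotel recommender system based on the similarity of description content, using the Seattle_Hotels dataset to give users personalized hotel recommendations. For a given hotel, the system recommends the 10 other hotels whose descriptions are most similar to it.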

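I. Steps for building a recommender system

1. Dataset overview

Seattle_Hotels is a dataset of hotels in Seattle (dataset download link). It contains three fields: hotel name, address, and description. Each row represents one hotel; the exact format is shown in the code implementation section.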

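2. Data preprocessing

Before building the recommender system, we need to preprocess the data. This includes handling missing values, cleaning the data, and converting data types. We use the Pandas library in Python to load and process the Seattle_Hotels dataset and to make sure the data is complete and consistent.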
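
The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration under assumptions: the rows are invented stand-ins for Seattle_Hotels.csv, and `preprocess` is a hypothetical helper, not part of the original code.

```python
import pandas as pd

# Toy stand-in for Seattle_Hotels.csv; these rows are invented for illustration.
raw = pd.DataFrame({
    "name": ["Hotel A", "Hotel B", "Hotel B", "Hotel C"],
    "address": ["1 First Ave", "2 Second Ave", "2 Second Ave", "3 Third Ave"],
    "desc": ["  Cozy rooms near the lake. ", "Downtown business hotel.",
             "Downtown business hotel.", None],
})

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a description, remove exact duplicates, tidy whitespace."""
    cleaned = frame.dropna(subset=["desc"]).drop_duplicates().copy()
    cleaned["desc"] = cleaned["desc"].astype(str).str.strip()
    return cleaned.reset_index(drop=True)

clean = preprocess(raw)
print(clean)
```

Whether rows with a missing description should be dropped or filled depends on the data; dropping is only one reasonable choice for a system that relies entirely on the description text.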

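3. Feature engineering

To build a recommender system, we need to extract meaningful features from the hotel data. In general these can include a hotel's location, star rating, facilities, and so on, processed with feature engineering techniques such as one-hot encoding and standardization so that the later similarity computation and recommendation algorithm work accurately. In the implementation in Part II, the only feature source is the description text, which is vectorized with CountVectorizer and TfidfVectorizer.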
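
As a rough sketch of one-hot encoding and standardization, the example below builds a small feature table. Note the assumptions: the `neighborhood` and `stars` columns are hypothetical and do not exist in the Seattle_Hotels dataset, which only has name, address, and description.

```python
import pandas as pd

# Hypothetical structured attributes; NOT columns of Seattle_Hotels.
hotels = pd.DataFrame({
    "name": ["Hotel A", "Hotel B", "Hotel C"],
    "neighborhood": ["Downtown", "Lake Union", "Downtown"],
    "stars": [3.0, 4.0, 5.0],
})

# One-hot encoding: one 0/1 column per category value.
one_hot = pd.get_dummies(hotels["neighborhood"], prefix="area", dtype=int)

# Standardization (z-score): subtract the mean, divide by the population std.
stars_std = (hotels["stars"] - hotels["stars"].mean()) / hotels["stars"].std(ddof=0)

features = pd.concat([one_hot, stars_std.rename("stars_std")], axis=1)
print(features)
```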

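4. Similarity computation

A similarity-based recommender system relies on computing the similarity between hotels. Commonly used measures include Euclidean distance and cosine similarity, both of which are easy to compute in Python. By computing similarity, we can find hotels that are close to what the user likes. The implementation in Part II applies linear_kernel to TF-IDF vectors to obtain the similarity matrix.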
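
To make the difference between the two measures concrete, here is a small pure-Python sketch. The three vectors are invented for illustration, and `cosine_sim` and `euclidean_distance` are hypothetical helper names.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

hotel_a = [1, 2, 0]
hotel_b = [2, 4, 0]  # same direction as hotel_a, twice as long
hotel_c = [0, 0, 3]  # shares no non-zero dimension with hotel_a

print(cosine_sim(hotel_a, hotel_b), euclidean_distance(hotel_a, hotel_b))
print(cosine_sim(hotel_a, hotel_c), euclidean_distance(hotel_a, hotel_c))
```

Cosine similarity ignores vector length, so `hotel_a` and `hotel_b` count as identical in direction even though their Euclidean distance is non-zero. This is one reason cosine similarity is a common choice for text, where longer descriptions would otherwise look far from shorter ones.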

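5. Recommendation algorithm

Once the similarity between hotels has been computed, a recommendation algorithm can use the user's preferences and past behavior to generate a personalized list of hotel recommendations. Commonly used algorithms include user-based collaborative filtering and item-based collaborative filtering, both of which can be implemented in Python. The implementation in Part II is content-based: it ranks hotels purely by the similarity of their descriptions.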
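
Item-based collaborative filtering can be sketched as follows. This is an illustration only: the rating data is invented (Seattle_Hotels contains no user ratings), and the function names are hypothetical.

```python
import math

# Hypothetical user -> {hotel: rating} data, invented for illustration.
ratings = {
    "alice": {"Hotel A": 5, "Hotel B": 4},
    "bob": {"Hotel A": 4, "Hotel B": 5, "Hotel C": 1},
    "carol": {"Hotel C": 5, "Hotel D": 4},
}

def item_vectors(ratings):
    """Represent each hotel as a vector of user ratings (0 = not rated)."""
    users = sorted(ratings)
    items = sorted({hotel for user_ratings in ratings.values() for hotel in user_ratings})
    return {item: [ratings[u].get(item, 0) for u in users] for item in items}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_item_based(user, ratings, top_n=2):
    """Score each unseen hotel by its similarity to hotels the user rated."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, cand_vec in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(cand_vec, vectors[item]) * rating for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend_item_based("alice", ratings))
```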

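6. System evaluation and improvement

After the recommender system is built, it needs to be evaluated and improved. Common evaluation metrics such as precision and recall can be used to measure its performance. If the system performs poorly, possible improvements include introducing a latent factor model or using deep learning.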
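
Precision and recall for a top-k recommendation list could be computed along these lines. The recommended list and the set of relevant hotels are invented for illustration, and `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k: share of the top-k that is relevant.
    Recall@k: share of all relevant items that appear in the top-k."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: what the system suggested vs. what the user actually liked.
recommended = ["Hotel A", "Hotel B", "Hotel C", "Hotel D"]
relevant = {"Hotel B", "Hotel D", "Hotel E"}

p, r = precision_recall_at_k(recommended, relevant, k=4)
print(p, r)
```

Both metrics require ground-truth relevance data, such as bookings or ratings, which the Seattle_Hotels dataset does not provide.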

II. Python implementation of similar-hotel recommendation based on hotel text descriptions

Import the required packages

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import random
# # import cufflinks
# # from plotly.offline import iplot
# cufflinks.go_offline()

Load the data and take a look at it

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()

name	address	desc
0	Hilton Garden Seattle Downtown	1821 Boren Avenue, Seattle Washington 98101 USA	Located on the southern tip of Lake Union, the...
1	Sheraton Grand Seattle	1400 6th Avenue, Seattle, Washington 98101 USA	Located in the city's vibrant core, the Sherat...
2	Crowne Plaza Seattle Downtown	1113 6th Ave, Seattle, WA 98101	Located in the heart of downtown Seattle, the ...
3	Kimpton Hotel Monaco Seattle	1101 4th Ave, Seattle, WA98101	What?s near our hotel downtown Seattle locatio...
4	The Westin Seattle	1900 5th Avenue, Seattle, Washington 98101 USA	Situated amid incredible shopping and iconic a...

Check the shape of the data and one full description

df.shape
# (152, 3)
df['desc'][100]
'On a budget in Seattle or looking for something different? The historic charm and "home away from home" atmosphere of The Baroness will be sure to make you feel like one of the family. Conveniently located on First Hill, we are proud to be part of the Virginia Mason Hospital campus and only minutes from Harborview Medical Center and Swedish Hospital. The Baroness Hotel is a great option for short or long term medical, patient or family stays. Whether you are visiting the area\'s world-class medical facilities or on a budget vacation, our goal is to ensure a wonderful stay. Guest Amenities: Complimentary Internet access, Two twin, one or two queen studios with mini fridge and microwave, Two twin or one queen suites with full kitchens, Laundry facilities available, Flat screen cable television with HBO, Complimentary local calls, Ice and vending machines located in the lobby, Coffee maker and hairdryers in all guestrooms, Room service available seven days a week from the Rhododendron Cafe, Limited wheelchair accessibility, Guest library and business center, Printing & fax services available, 100% non-smoking and pet free, Rooms are not air conditioned - fans are available, Self-parking available at Virginia Mason hospital for a fee.'

Look at the main words used in the hotel descriptions
Apply CountVectorizer() to all descriptions to obtain the text feature matrix bag_of_words.

vec = CountVectorizer().fit(df['desc'])
bag_of_words = vec.transform(df['desc'])

bag_of_words.toarray()
array([[0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]], dtype=int64)

bag_of_words.shape
(152, 3200)

Count the number of occurrences of each word

sum_words = bag_of_words.sum(axis=0)
sum_words
matrix([[ 1, 11, 11, ...,  2,  6,  2]], dtype=int64)
words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
words_freq
[('located', 108),
 ('on', 129),
 ('the', 1258),
 ('southern', 1),
 ('tip', 1),
 ('of', 536),
 ('lake', 41),
 ('union', 33),
 ('hilton', 12),
 ('garden', 11),
 ('inn', 89),
 ('seattle', 533),
 ('downtown', 133),
 ('hotel', 295),
 ('is', 271),
 ('perfectly', 6),
 ('for', 216),
 ('business', 87),
 ('and', 1062),
 ('leisure', 18),
 ('neighborhood', 35),
 ('home', 57),
 ('to', 471),
 ('numerous', 1),
 ('major', 12),
...
 ('driving', 3),
 ('those', 4),
 ('coming', 2),
 ('tac', 15),
 ...]

Sort the word-frequency results

words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
words_freq
[('the', 1258),
 ('and', 1062),
 ('of', 536),
 ('seattle', 533),
 ('to', 471),
 ('in', 449),
 ('our', 359),
 ('you', 304),
 ('hotel', 295),
 ('with', 280),
 ('is', 271),
 ('at', 231),
 ('from', 224),
 ('for', 216),
 ('your', 186),
 ('or', 161),
 ('center', 151),
 ('are', 136),
 ('downtown', 133),
 ('on', 129),
 ('we', 128),
 ('free', 123),
 ('as', 117),
 ('located', 108),
 ('rooms', 106),
...
 ('outdoors', 3),
 ('fans', 3),
 ('athletic', 3),
 ('begin', 3),
 ...]
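
Wrap the steps above in a function

def get_top_n_words(corpus,n=None):
    # Learn the vocabulary and build the document-term count matrix
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each column to get the total count of every word
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    # Most frequent words first; n=None returns all of them
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    return words_freq[:n]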

Get the 20 most frequent words

common_words=get_top_n_words(df['desc'],20)
common_words
[('the', 1258),
 ('and', 1062),
 ('of', 536),
 ('seattle', 533),
 ('to', 471),
 ('in', 449),
 ('our', 359),
 ('you', 304),
 ('hotel', 295),
 ('with', 280),
 ('is', 271),
 ('at', 231),
 ('from', 224),
 ('for', 216),
 ('your', 186),
 ('or', 161),
 ('center', 151),
 ('are', 136),
 ('downtown', 133),
 ('on', 129)]

Convert the word-frequency results to a DataFrame

df1 = pd.DataFrame(common_words,columns=['desc','count'])
df1.head()

desc	count
0	the	1258
1	and	1062
2	of	536
3	seattle	533
4	to	471
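# pandas' built-in plot() is used here; yTitle/linecolor are cufflinks iplot() arguments
df1.groupby('desc').sum()['count'].sort_values().plot(kind='barh',title='Top 20 words before removing stop words')

Redefine the function so that English stop words are removed
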
def get_top_n_words(corpus,n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    return words_freq[:n]

The 20 most frequent words after removing English stop words

common_words=get_top_n_words(df['desc'],20)
df2 = pd.DataFrame(common_words,columns=['desc','count'])
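# iplot() needs cufflinks, whose import is commented out above, so pandas' plot() is used instead
df2.groupby('desc').sum()['count'].sort_values().plot(kind='barh',title='Top 20 words after removing stop words')

Redefine the function again to count n-grams of 1 to 3 words, still without stop words
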
def get_top_n_words(corpus,n=None):
    vec = CountVectorizer(stop_words='english',ngram_range=(1,3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key=lambda x:x[1],reverse=True)
    return words_freq[:n]
common_words=get_top_n_words(df['desc'],20)
df3 = pd.DataFrame(common_words,columns=['desc','count'])
# The title now matches the vectorizer settings above: stop words removed, ngram_range=(1,3)
df3.groupby('desc').sum()['count'].sort_values().plot(kind='barh',title='Top 20 n-grams after removing stop words, ngram_range=(1,3)')
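
To see what the ngram_range setting does to the vocabulary, the following sketch runs CountVectorizer on two invented phrases. The phrases are illustrative only and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented phrases, small enough to list every n-gram by hand.
docs = ["free airport shuttle", "free breakfast"]

# ngram_range=(1, 3) keeps single words, word pairs, and word triples.
vec = CountVectorizer(ngram_range=(1, 3)).fit(docs)
features = sorted(vec.vocabulary_)
print(features)
```

The vocabulary now contains multi-word entries such as "free airport shuttle" alongside single words. This is why the feature count grows quickly when n-grams are enabled.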

Some statistics on the descriptions

df['word_count']=df['desc'].apply(lambda x:len(str(x).split()))

df.head()

name	address	desc	word_count
0	Hilton Garden Seattle Downtown	1821 Boren Avenue, Seattle Washington 98101 USA	Located on the southern tip of Lake Union, the...	184
1	Sheraton Grand Seattle	1400 6th Avenue, Seattle, Washington 98101 USA	Located in the city's vibrant core, the Sherat...	152
2	Crowne Plaza Seattle Downtown	1113 6th Ave, Seattle, WA 98101	Located in the heart of downtown Seattle, the ...	147
3	Kimpton Hotel Monaco Seattle	1101 4th Ave, Seattle, WA98101	What?s near our hotel downtown Seattle locatio...	150
4	The Westin Seattle	1900 5th Avenue, Seattle, Washington 98101 USA	Situated amid incredible shopping and iconic a...	151

Visualize the distribution of description word counts

df['word_count'].plot(kind='hist',bins=50)

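Text cleaning

sub_replace = re.compile('[^0-9a-z #+_]')
# Alternative: the full NLTK list, e.g. set(stopwords.words('english'))
custom_stopwords = ['the','a','an','in']
def clean_txt(text):
    # Lowercase first, otherwise the regex below strips every capital letter
    text = text.lower()
    # Remove everything except digits, lowercase letters, spaces, '#', '+' and '_'
    text = sub_replace.sub('',text)
    # Drop the stop words and keep the result
    text = ' '.join(word for word in text.split() if word not in custom_stopwords)
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)
df.head()
# The output below comes from a run in which the lowercased text and the stop-word
# filtering were not kept, so capital letters are missing (e.g. 'ocated', 'ake nion')
# and 'the' is still present.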
name	address	desc	word_count	desc_clean
0	Hilton Garden Seattle Downtown	1821 Boren Avenue, Seattle Washington 98101 USA	Located on the southern tip of Lake Union, the...	184	ocated on the southern tip of ake nion the ilt...
1	Sheraton Grand Seattle	1400 6th Avenue, Seattle, Washington 98101 USA	Located in the city's vibrant core, the Sherat...	152	ocated in the citys vibrant core the heraton r...
2	Crowne Plaza Seattle Downtown	1113 6th Ave, Seattle, WA 98101	Located in the heart of downtown Seattle, the ...	147	ocated in the heart of downtown eattle the awa...
3	Kimpton Hotel Monaco Seattle	1101 4th Ave, Seattle, WA98101	What?s near our hotel downtown Seattle locatio...	150	hats near our hotel downtown eattle location h...
4	The Westin Seattle	1900 5th Avenue, Seattle, Washington 98101 USA	Situated amid incredible shopping and iconic a...	151	ituated amid incredible shopping and iconic at...
df['desc'][0]
df['desc_clean'][0]

Similarity computation

df.set_index('name',inplace = True)
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,3),stop_words='english')
tfidf_matrix=tf.fit_transform(df['desc_clean'])
tfidf_matrix.shape
(152, 27976)
cosine_similarity =linear_kernel(tfidf_matrix,tfidf_matrix)
cosine_similarity.shape
(152, 152)
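
The result of linear_kernel can be treated as cosine similarity here because TfidfVectorizer normalizes each row to unit L2 length by default (norm='l2'), so the plain dot product of two rows equals their cosine. The check below is a small sketch on three invented sentences, not on the hotel data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Invented mini-corpus for illustration.
docs = [
    "hotel near lake union with free parking",
    "downtown hotel with free breakfast",
    "quiet inn close to the airport",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

dot_products = linear_kernel(tfidf, tfidf)     # plain dot products
cosines = sk_cosine_similarity(tfidf, tfidf)   # explicit cosine similarity

same = np.allclose(dot_products, cosines)
print(same)
```

On already normalized rows, linear_kernel skips the extra normalization step that cosine_similarity performs, which is a common reason to prefer it.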
cosine_similarity[0]
array([1.00000000e+00, 1.07618507e-02, 2.39000494e-02, 5.46873017e-03,
       2.64161143e-02, 1.05158253e-02, 1.70265099e-02, 1.26932177e-02,
       6.55905011e-03, 1.89826340e-02, 1.01682769e-02, 5.81427763e-03,
       8.97164751e-03, 5.11332703e-03, 6.98081551e-03, 1.46651716e-02,
       1.01506328e-02, 3.48428336e-02, 1.05628890e-02, 2.03920044e-02,
       2.31715424e-02, 8.66803402e-03, 4.19927749e-03, 1.25464260e-02,
       1.35516385e-02, 1.90864472e-02, 2.92211862e-02, 5.29767659e-03,
       2.34027898e-02, 1.84009370e-02, 1.11063777e-02, 3.24877554e-02,
       1.59088468e-02, 2.03903610e-02, 3.34542421e-02, 2.08424726e-02,
       6.37061770e-03, 7.22769959e-03, 1.76879937e-02, 3.40610778e-02,
       1.39733856e-02, 7.16109150e-03, 1.40189178e-02, 3.08597799e-02,
       3.31898710e-02, 1.32485388e-02, 3.49498978e-02, 1.03401842e-02,
       2.91144195e-02, 1.41758154e-02, 2.22237640e-02, 1.64940308e-02,
       3.11683463e-02, 1.59544326e-02, 2.61636177e-02, 1.26140542e-02,
       2.14668363e-02, 2.62642643e-02, 4.91030598e-03, 2.78596805e-02,
       1.96779398e-02, 9.81505558e-03, 3.88536015e-02, 2.78932747e-02,
       1.53453198e-02, 9.00494748e-03, 2.90988366e-02, 7.52572710e-03,
       1.50339228e-02, 7.23229675e-03, 2.08907559e-02, 1.46102170e-02,
       2.38744140e-02, 2.08593020e-02, 2.05556244e-02, 5.08364922e-02,
       2.49582978e-03, 1.22351607e-02, 9.69353352e-03, 2.47634675e-02,
       6.16721807e-03, 1.28568641e-02, 8.52080157e-04, 4.25496742e-03,
       1.19408976e-02, 3.78787891e-02, 8.76879249e-03, 2.78619543e-03,
       6.72632425e-03, 1.21664341e-02, 7.22174485e-03, 6.21120314e-03,
       9.28807898e-03, 5.01326402e-03, 1.47909582e-02, 1.18810730e-02,
       5.55255877e-03, 1.46679942e-02, 1.23004765e-02, 2.59809457e-02,
...
       1.49672160e-02, 1.59649598e-02, 2.58764614e-02, 5.00635020e-03,
       2.27410363e-02, 9.26581208e-03, 1.35304359e-02, 1.40490270e-02,
       1.66688259e-02, 2.27161327e-02, 2.78165984e-02, 3.70680069e-03,
       3.48439660e-03, 2.76986975e-03, 1.85339056e-02, 7.80938853e-03,
       3.97319010e-03, 8.70843653e-03, 2.53198268e-03, 7.08322188e-03])
indices = pd.Series(df.index)
indices[:5]
0    Hilton Garden Seattle Downtown
1            Sheraton Grand Seattle
2     Crowne Plaza Seattle Downtown
3     Kimpton Hotel Monaco Seattle 
4                The Westin Seattle
Name: name, dtype: object
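def recommendations(name,cosine_similarity):
    # Row position of the queried hotel in the similarity matrix
    idx = indices[indices == name].index[0]
    # Similarity of every hotel to the queried one, highest first
    score_series = pd.Series(cosine_similarity[idx]).sort_values(ascending=False)
    # Skip the first entry (expected to be the hotel itself) and keep the next 10
    top_10_indexes = list(score_series[1:11].index)
    # Map row positions back to hotel names
    recommended_hotels = [df.index[i] for i in top_10_indexes]
    return recommended_hotels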
recommendations('Hilton Garden Seattle Downtown',cosine_similarity)
['Staybridge Suites Seattle Downtown - Lake Union',
 'Silver Cloud Inn - Seattle Lake Union',
 'Residence Inn by Marriott Seattle Downtown/Lake Union',
 'MarQueen Hotel',
 'Embassy Suites by Hilton Seattle Tacoma International Airport',
 'Silver Cloud Hotel - Seattle Broadway',
 'The Loyal Inn',
 'Homewood Suites by Hilton Seattle Downtown',
 'Inn at Queen Anne',
 'SpringHill Suites Seattle\xa0Downtown']
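
Slicing with [1:11] assumes the queried hotel is always ranked first in its own row. If two hotels had identical descriptions, that would not be guaranteed. A possible variant that excludes the queried hotel by index is sketched below. The similarity matrix and names are invented, and `top_n_similar` is a hypothetical helper rather than part of the original code.

```python
import numpy as np

def top_n_similar(similarity, names, query, n=3):
    """Return the n names most similar to query, never including query itself."""
    idx = names.index(query)
    order = np.argsort(similarity[idx])[::-1]  # row positions, most similar first
    return [names[i] for i in order if i != idx][:n]

# Invented 4x4 similarity matrix for illustration.
names = ["Hotel A", "Hotel B", "Hotel C", "Hotel D"]
similarity = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.7, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
])

print(top_n_similar(similarity, names, "Hotel A", n=2))
```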

This completes the whole recommendation workflow. The code files will be packaged and uploaded later.
