饭店点餐系统的需求分析_酒店评论的情绪分析主题建模

饭店点餐系统的需求分析Web scraping, Sentiment analysis, LDA topic modeling网站抓取,情感分析,LDA主题建模项目概况(Project Overview)In this project, we are going to scrape hotel reviews of “Hotel Beresford” located in San Franci...
摘要由CSDN通过智能技术生成

饭店点餐系统的需求分析

Web scraping, Sentiment analysis, LDA topic modeling

网站抓取,情感分析,LDA主题建模

项目概况(Project Overview)

In this project, we are going to scrape hotel reviews of “Hotel Beresford” located in San Francisco, CA from the website bookings.com. Then, we are going to do some data exploration, generate WordClouds, perform sentiment analysis and create an LDA topic model.

在本项目中,我们将从bookings.com网站上删除位于加利福尼亚州旧金山的“ Hotel Beresford”酒店点评。 然后,我们将进行一些数据探索,生成WordCloud,执行情感分析并创建LDA主题模型。

问题陈述 (Problem Statement)

The project goal is to use text analytics and Natural Language Processing (NLP) to extract actionable insights from the reviews and help the hotel improve their guest satisfactions.

该项目的目标是使用文本分析和自然语言处理(NLP)从评论中提取可行的见解,并帮助酒店提高客人的满意度。

方法论 (Methodologies)

(1) Web Scraping

(1)网页抓取

The hotel reviews will be scraped from bookings.com by using requests with BeautifulSoup. The detailed steps are covered in the next section.

通过使用BeautifulSoup的请求,将从bookings.com删除酒店评论。 下一节将介绍详细步骤。

(2) Exploratory Data Analysis (EDA)

(2)探索性数据分析(EDA)

We will use pie chart, histogram, and seaborn violin plot to get a better understanding of the reviews and ratings data.

我们将使用饼图,直方图和seaborn小提琴图来更好地了解评论和评级数据。

(3) WordClouds

(3)字云

In order to generate more meaningful WordClouds, we will customize some extra stop words and use lemmatization to remove closely redundant words.

为了生成更有意义的词云,我们将自定义一些额外的停用词,并使用词形化技术来删除紧密冗余的词。

(4) Sentiment Analysis

(4)情感分析

The sentiment analysis helps to classify the polarity and subjectivity of the overall reviews and determine whether the expressed opinion in the reviews is mostly positive, negative, or neutral.

情绪分析有助于对总体评论的极性和主观性进行分类,并确定评论中表达的观点大部分是正面,负面或中立的。

(5) LDA Topic Model

(5)LDA主题模型

In natural language processing, the latent Dirichlet allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. We will use GridSearch to find the best topic model. The two tuning parameters are: (1) n_components: number of topics and (2) learning_decay (which controls the learning rate)

在自然语言处理中,潜在的狄利克雷分配是一种生成的统计模型,该模型允许未观察到的组解释观察集,这些观察组解释了为什么数据的某些部分相似。 我们将使用GridSearch查找最佳主题模型。 这两个调整参数是:(1) n_components :主题数;(2) learning_decay (控制学习率)

指标 (Metrics)

To diagnose the model performance, we will take a look at the perplexity and log-likelihood scores of the LDA model.

为了诊断模型的性能,我们将研究LDA模型的困惑度和对数似然分数。

Perplexity captures how surprised a model is of new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set. Log-likelihood is a measure of how plausible model parameters are given the data.

困惑捕获了一个模型对从未见过的新数据感到惊讶的程度,并将其度量为保持测试集的标准化对数似然率。 对数似然性是对给定数据合理模型参数的一种度量。

A model with higher log-likelihood and lower perplexity is considered to be a good model. However, perplexity might not be the best measure to evaluate topic models because it doesn’t consider the context and semantic associations between words. (Read this article to learn more)

具有较高对数可能性和较低困惑度的模型被认为是一个很好的模型。 但是,困惑可能不是评估主题模型的最佳方法,因为它没有考虑单词之间的上下文和语义关联。 (阅读本文以了解更多信息)

如何取消评论? (How to Scrape the Reviews?)

In this project, we are going to scrape the reviews of “Hotel Beresford” located in San Francisco, CA . To scrape any websites, we need to first find the pattern of the URL and then inspect the web page. However, we see that this link is extremely long.

在此项目中,我们将刮除位于加利福尼亚州旧金山的“ Hotel Beresford”的评论。 要抓取任何网站,我们需要首先找到URL的模式,然后检查网页。 但是,我们看到此链接非常长。

https://www.booking.com/reviews/us/hotel/beresford.html?label=gen173nr-1FCA0o7AFCCWJlcmVzZm9yZEgzWARokQKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4ArGo6_oFwAIB0gIkODFmNjUzODEtMWU5Ny00ZjIzLWI2MWEtYjBjZGU2NzI0ZWYz2AIF4AIB;sid=a1571564bf0a35365a839937489b2ef6;customer_type=total;hp_nav=0;old_page=0;order=featuredreviews;page=1;r_lang=en;rows=75&

After several tries, you will realize that using the link below can also generate the same page. If you want to scrape other hotels, simply replace “beresford” with any other hotel names booking.com uses. Entering a page number at the end would bring you to the review page you want to see.

经过几次尝试,您将意识到使用下面的链接也可以生成相同的页面。 如果要刮擦其他酒店,只需将“ beresford”替换为booking.com使用的任何其他酒店名称。 在末尾输入页码将带您进入要查看的评论页面。

https://www.booking.com/reviews/us/hotel/beresford.html?page=

Right click anywhere on the web page and select “Inspect” to view the HTML & CSS script of web elements. Here we find the html tags of the review section we want to scrape is “ul.review_list”.

右键单击网页上的任意位置,然后选择“检查”以查看Web元素HTML和CSS脚本。 在这里,我们发现要抓取的评论部分的html标签为“ ul.review_list”。

Image for post
Review Section that includes reviewers’ information and both positive and negative reviews
审核部分,其中包括审核者的信息以及正面和负面评论

Under this tag, we want to scrape the following information:

在此标签下,我们要抓取以下信息:

1. Basic information of the reviewer and reviews:

1.审稿人和审稿的基本信息:

  • Rating Score

    评分分数

  • Reviewer Name

    审稿人姓名

  • Reviewer’s Nationality

    审稿人的国籍

  • Overall Review (contains both positive & negative reviews)

    总体评价(包含正面和负面评论)

  • Reviewer Reviewed Times

    评论者评论时间

  • Review Date

    审核日期

  • Review Tags (Trip type, such as business trip, leisure trip, etc.)

    审核标签(旅行类型,例如商务旅行,休闲旅行等)

2. Positive reviews

2.正面评价

3. Negative Reviews

3.负面评价

Image for post
Image for post
Image for post
Reviewer’s info | Ratings | Review date
评论者信息| 评分| 审核日期
Image for post
  • 1
    点赞
  • 16
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值