Can Machine Learning Be a Revenue Booster in Your Airbnb Business?

In this blog we will look at some basic approaches meant to give us clues and inspiration for using data science and machine learning techniques to improve an existing AirBnB business, or to start a profitable one.

The Seattle AirBnB homes data-set, which we decided to use to demonstrate the research and analysis techniques, can be found at the link below.

Seattle AirBnB Data

The following three questions are not the crucial ones to answer in order to employ data science to increase our revenues. However, they can help you find your own crucial questions, and they are good examples for demonstrating some technical solutions along the way.

1. Can we train a model which could predict the rating with mae < 10?

2. Can we identify a useful set of features with a meaningful impact on the target?

3. How do prices of apartments vary by number of beds available?

Action plan

Having a good plan means seeing a light at the end of the tunnel before we start the work.

CRISP-DM stands for Cross Industry Standard Process for Data Mining. In our research we will try to follow the steps of this methodology (at least in order to answer the first two questions).

[Image: CRISP-DM methodology (1996)]

Business Understanding

The Business Understanding is a very important part of the process and ignoring this step can lead to a lot of wasted time or completely ruined projects.

This step usually includes talking to SMEs, business owners and business experts, and reading articles about the business. It often overlaps with the Data Understanding phase, as the data-set fields can give us a clue about what to ask.

I was a bit lucky in this case, because some time ago I earned an AirBnB superhost badge, so I knew how the business runs and what to expect. However, for those who are not familiar with AirBnB hosting, I would recommend starting here:

Data Understanding

In the Data Understanding part we preview the data and focus on key points such as:

  • Basic data characteristics such as shape and column data types
  • Column names
  • Missing values across the rows & columns
  • Distribution and count of values in the columns
  • A closer look at the target and at other interesting and/or suspicious columns (features)
  • Visualizing the data if needed

For this project I did not need to install any libraries to run the code beyond the Anaconda distribution of Python (version 3.6).

When downloading the data you will receive several files. In our case, we have used only `listings.csv`.

We loaded the CSV file (CSV stands for comma-separated values) into a data-frame using pandas without any specific parameters (defaults only).

There were 3818 rows and 92 columns in the raw data-set.

[Image: loading the CSV into a pandas dataframe]

62 columns were of object type (text, dates, or numbers stored as text) and 30 columns were of numeric type.

In the first preview I realized that we need to get rid of some text fields to be able to preview the data easily. However, in a real-life project I would recommend taking a closer look at these text fields too, as they can turn into key features if correct preprocessing & feature engineering are applied. A couple of text & URL fields that are not useful for research & model training were identified and listed for later removal.

In the next steps we prepared a list of columns that have more than 90% missing values, as well as a list of columns containing single-value and high-cardinality fields.

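A sketch of how these drop-candidate lists can be built (the mini data-frame and its column names are hypothetical stand-ins for the real listings data):

```python
import pandas as pd

# Hypothetical stand-in for the listings data.
df = pd.DataFrame({
    "license": [None, None, None, None, None],      # ~100% missing
    "scrape_id": [1, 1, 1, 1, 1],                   # single value
    "listing_url": ["u1", "u2", "u3", "u4", "u5"],  # unique per row
    "beds": [1, 2, 2, 3, 1],
})

# Columns with more than 90% of values missing.
mostly_missing = [c for c in df.columns if df[c].isna().mean() > 0.9]

# Columns holding at most one distinct non-null value.
single_value = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Text columns where every row is unique -> high cardinality.
high_cardinality = [c for c in df.columns
                    if df[c].dtype == object and df[c].nunique() == len(df)]

print(mostly_missing, single_value, high_cardinality)
```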
What is high-cardinality?

In the data science world, cardinality refers to the number of unique values contained in a particular column, or field, of a database / data-frame. So if we are about to train a model on a data-set and some categorical columns contain a high number of unique values, we should consider whether to remove these fields, or whether to reduce the number of unique values by clipping (replacing unimportant values with one default value). Actions like this can have a positive impact on our results.

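Clipping can be sketched like this (the neighbourhood values and the `min_count` threshold are hypothetical):

```python
import pandas as pd

# Hypothetical categorical column with a long tail of rare values.
s = pd.Series(["Seattle", "Seattle", "Seattle", "Ballard", "Ballard",
               "Fremont", "Queen Anne"])

# Keep categories seen at least `min_count` times; clip the rest
# to one default value, reducing the cardinality.
min_count = 2
counts = s.value_counts()
keep = counts[counts >= min_count].index
clipped = s.where(s.isin(keep), "other")

print(clipped.value_counts().to_dict())
```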
Before we move to the Data Preparation part, we can also try to find perfectly correlated columns (features) and add some of them to the removal list, as having groups of perfectly (or highly) correlated columns in a data-set can slow down the training process and, worse, lead to a biased model.

[Image: identifying highly correlated features]
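One common way to find such pairs is to scan the upper triangle of the absolute correlation matrix (the toy frame and the 0.95 threshold below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame: "b" is a perfect copy of "a" (b = 2 * a).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)
```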

Data Preparation

This is the more enjoyable part: all the struggling with unreadable data is over, and we should already have a plan for how to make our data-set nice and tidy.

Data preparation for our project has two parts:

  • Feature engineering
  • Data preprocessing

In the Feature engineering section we replace the date object fields with timestamps and convert currency fields (object) into float-type fields. (Feel free to try more advanced feature engineering if possible.)

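A minimal sketch of both conversions (the raw values shown are hypothetical):

```python
import pandas as pd

# Hypothetical raw fields as they appear in the AirBnB CSV.
df = pd.DataFrame({
    "host_since": ["2011-06-15", "2014-01-02"],
    "price": ["$85.00", "$1,250.00"],
})

# Dates: parse the string and keep a numeric timestamp the model can use.
df["host_since"] = pd.to_datetime(df["host_since"]).astype("int64")

# Currency: strip '$' and the thousands separator, then cast to float.
df["price"] = (df["price"]
               .str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
               .astype(float))

print(df.dtypes.tolist())
```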
In Data preprocessing the unwanted columns are dropped, missing values in numeric columns are filled with the appropriate mean value, missing values in categorical fields are replaced with an empty string, and last but not least, the rows with missing target values are removed.

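The preprocessing steps can be sketched together as follows (the mini frame is hypothetical; `listing_url` stands in for the unwanted columns):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing numeric, categorical and the target column.
df = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0, 2.0],
    "property_type": ["House", None, "Apartment", "House"],
    "listing_url": ["u1", "u2", "u3", "u4"],             # unwanted column
    "review_scores_rating": [95.0, 80.0, np.nan, 90.0],  # target
})

# 1. Drop the unwanted columns collected earlier.
df = df.drop(columns=["listing_url"])

# 2. Remove the rows where the target itself is missing.
df = df.dropna(subset=["review_scores_rating"])

# 3. Fill numeric gaps with the column mean ...
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# 4. ... and categorical gaps with an empty string.
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("")

print(df.shape)
```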
What is the target? The target variable of a data-set is the feature about which you want to gain a deeper understanding. A supervised machine learning algorithm uses historical data to learn patterns and uncover relationships between the other features of your data-set and the target. In this project the 'rating' mentioned in question 1 is our target. The name of this feature (column) in the AirBnB data-set is actually 'review_scores_rating'.

After the feature engineering and data preprocessing functions were applied to the data-frame, we ended up with only 3171 rows and 42 columns.

[Image: re-shaped data]

Modeling & Evaluation

This might be the final step needed to answer question 1. We train the model and calculate the score, which in our case is represented by the mean absolute error.

However, there are a few preparation steps to complete before we can fit (train) the model and evaluate it:

  • Split the data-frame into label & features
  • Dummy the categorical variables
  • Split the data into train & test sets

Not sure what dummy variables are? Categorical variables can be used directly in some machine learning classification algorithms, but they should be decomposed into dummy variables if possible. A dummy variable is a binary variable coded as 1 or 0 to represent the presence or absence of a value. When it comes to a regression algorithm, categorical variables definitely need to be turned into numerical ones. There are several approaches to such numerical encoding; in our case we use the one-hot encoding implemented by the pandas get_dummies function.

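For example (the `room_type` values here are hypothetical):

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"room_type": ["Entire home", "Private room", "Entire home"]})

# One-hot encoding: one binary indicator column per category.
dummies = pd.get_dummies(df, columns=["room_type"])

print(sorted(dummies.columns))
```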
Finally we can select the right algorithm, fit (train) it, and evaluate our model. The decision was made to use GradientBoostingRegressor.

Why a Regressor? The main difference between regression and classification algorithms is that regression algorithms are used to predict continuous values such as price, salary, or age, while classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam. Guess where 'review_scores_rating' belongs.

[Image: defining and training the model]
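The training and scoring steps can be sketched as follows. Since the prepared AirBnB frame is not reproduced here, a synthetic feature matrix stands in for it, and the regressor is used with its default hyper-parameters (an assumption):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the 'review_scores_rating' target.
rng = np.random.RandomState(42)
X = rng.rand(300, 5)
y = 80 + 15 * X[:, 0] + rng.normal(0, 1, 300)  # ratings roughly in 80-95

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Mean absolute error on the held-out test split.
mae = mean_absolute_error(y_test, model.predict(X_test))
print(round(mae, 2))
```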

And how did it go with the mean absolute error of our predictions?

[Image: the resulting mean absolute error. It went alright.]

Since our mae is just above 4, the answer is obviously "Yes, we can train a model which could predict review_scores_rating with mae < 10."

However, I am fairly sure that with better feature selection and more advanced feature engineering we could get even better results. I also have to mention that the "Evaluation" we use here is very simple: basically, we only use the test mae score.

The last part of the CRISP-DM process should be "Deployment". This is more applicable in a real-life project, but let's consider the solution to the next question our Deployment.

Deployment?

Question 2: Can we identify a useful set of features with a meaningful impact on the target?

There are various data-science approaches for identifying important features. But whichever we choose, the best idea is to combine the approach with common sense and hands-on experience (if possible).

I decided to use the impurity-based 'feature_importances_' attribute of GradientBoostingRegressor.

The pros are fast calculation and easy retrieval; the con is that the approach is biased, as it tends to inflate the importance of continuous features and high-cardinality categorical variables. This is why we should use our common sense too, rather than just extracting the top-rated features and focusing on them.

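Retrieving and ranking the importances looks like this (synthetic data, where only the first, hypothetically named feature actually drives the target):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic example: only "number_of_reviews" really drives the target.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3),
                 columns=["number_of_reviews", "noise_1", "noise_2"])
y = 50 * X["number_of_reviews"] + rng.normal(0, 1, 200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, sorted so the top features read first.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```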
[Image: feature importance chart (apologies for the dark fonts)]

Getting the appropriate information from the feature importance chart is not just about selecting the top n features. Now is the time to use common sense too (and let's support it with a heatmap):

[Image: Seaborn correlation heatmap]

The first feature in the chart is "number_of_reviews". It's obvious that this feature correlates with "host_since" and that it is important. More reviews mean more experience and also a stabilized rating.

Let's have a look at the second feature, "neighbourhood_cleansed". This is a categorical variable, so I decided to use a boxplot for the top 20 areas. Even though it is not clear why review_scores_rating differs so much across the areas, the plot can help us decide whether to buy a property in a given area (if that option is open to us), or whether we should be more careful (and do some deeper research) to earn good ratings if we already offer a property in an area such as the University District.

[Image: boxplot of review_scores_rating by neighbourhood]

Now we can skip a couple of features and have a look at the prices. Let's say we have a 2-bed apartment and we need to set a price. Someone told us that a lower price will help us gain better ratings. However, from a simple scatterplot (where x = target and y = price) we somehow cannot figure out if that person was right or wrong:

[Image: scatterplot of rating vs. price]

So a function was made which creates a scatterplot out of a data-frame grouped by target, with the median aggregate function applied to 'price'. An upgrade compared with the previous plot is the removal of outliers:

[Image: median price by rating. This time the dependency is obvious.]
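The core of that function can be sketched as follows (the rating/price pairs are hypothetical, and the 1.5 * IQR rule is one common choice for the outlier removal; the plotting call is left as a comment so the aggregation stays the focus):

```python
import pandas as pd

# Hypothetical (rating, price) pairs; 999 is an obvious price outlier.
df = pd.DataFrame({
    "review_scores_rating": [80, 80, 90, 90, 100, 100, 100],
    "price": [70.0, 90.0, 100.0, 120.0, 130.0, 150.0, 999.0],
})

# Drop price outliers with the 1.5 * IQR rule ...
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# ... then take the median price per rating value.
medians = df[mask].groupby("review_scores_rating")["price"].median()

# medians.plot(style="o")  # would render the scatterplot from the article
print(medians.to_dict())
```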

So the previous chart shows that lower prices have no impact on higher ratings. On the contrary, higher score ratings have higher price medians.

We don't need to review all the features from the feature importance list to be able to answer question 2:

Yes, we can identify a useful set of features with a meaningful impact on the target; however, a deeper investigation of how each feature impacts the target is necessary.

Question 3: How do prices of apartments vary by number of beds available?

As we have a wide range of values (1 to 15) in the “beds” field, a simple clipping function was applied to reduce the range to only 4 categories.

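A minimal sketch of such clipping (the beds values below are hypothetical):

```python
import pandas as pd

# Hypothetical 'beds' values spanning the 1-15 range.
beds = pd.Series([1, 2, 3, 4, 7, 15])

# Everything above 3 falls into one "4+" bucket -> 4 categories total.
beds_cat = beds.clip(upper=4).astype(str).replace("4", "4+")

print(beds_cat.tolist())
```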
[Image: boxplot of price by beds category]

The boxplot above is the answer to question 3. We can see that there is some price overlap (even between the 1-bed and 4+ bed apartments). On the other hand, each category has a slightly greater median than the previous one. Categories 3 and 4+ have nearly identical medians, but the minimum of 4+ is somewhere around 100 while the minimum of category 3 is close to 60. Interestingly, the 1st quartile of the 4+ category is actually lower than that of category 3.

Conclusions

We can train a model with the requested score, but the question is how to apply it in real life and how to benefit from it. At the moment it looks like the feature_importances_ attribute of our model is much more valuable than the prediction itself.

The feature importances can help us identify a useful set of features with a meaningful impact on 'review_scores_rating', but we need to support these findings with additional research and common sense.

The number of beds in an apartment affects the price meaningfully, but there are many cases where even a 1-bed apartment is more expensive than a 4+ bed apartment. So it's obvious that other aspects matter too.

Overall, data science with machine learning can be a revenue booster in our AirBnB business. It can help us make the right decisions, whether we need to support an existing hosting or start a new one, and whether we already have a property or are planning to buy one. We just need to customize our business questions, focus on the appropriate data, do deep research and try various approaches.

All the related code and details can be found on Github: data-science blog.

Matej Svitek

Translated from: https://medium.com/@mt.svitek/can-machine-learning-be-a-revenue-booster-in-your-airbnb-business-522a7450b972
