Machine Learning Lends You a Hand. Part 1

cullen2012

Tags: python, machine learning, artificial intelligence, data analysis, big data

Original article: https://habr.com/en/post/468053/

Have you ever looked for a flat? Would you like to add some machine learning and make the process more interesting?

Today we will look at applying machine learning to find an optimal flat.

Introduction

First of all, let me explain what "an optimal flat" means. A flat has a set of characteristics such as "area", "district", "number of balconies" and so on, and for these features we expect a specific price. It looks like a function which takes several parameters and returns a number. Or maybe a black box which provides some magic.

But… there is a big "but". Sometimes you face a flat which is overpriced for a set of reasons, such as a good location. Also, there are more prestigious districts in the centre of a city and less prestigious ones on the outskirts. Or… sometimes people want to sell their apartments because they are moving to another part of the Earth. In other words, there are many factors which can affect the price. Does it sound familiar?

A little step aside

Before I continue, let me make a little lyrical digression. I lived in Yekaterinburg (a city between Europe and Asia, and one of the host cities of the 2018 FIFA World Cup) for 5 years.

I was in love with these concrete jungles. And I hated that city for its winter and its public transport. It is a growing city, and every month there are thousands and thousands of flats for sale.

Yes, it is an overcrowded, polluted city. At the same time, it is a good place for analysing a real estate market. I collected a lot of advertisements for flats from the Internet, and I will use that information below.

Also, I tried to visualize different offers on a map of Yekaterinburg. Yes, it is the eye-catching picture from the habracut, and it was made with Kepler.gl.

Over 2 thousand 1-bedroom flats were sold in Yekaterinburg in July 2019. Their prices ranged from less than a million to almost 14 million roubles.

The points show the geo-position of the flats. The colour of a point represents the price: the lower the price, the closer the colour is to blue; the higher the price, the closer it is to red. You can think of it as an analogy with cold and warm colours: the warmer the colour, the bigger the price. Please remember this: the redder the colour, the higher the value of something. The same idea works for blue, but in the direction of the lowest price.

Now you have a general overview of the picture, and the time to analyse has come.

Goal

What did I want when I lived in Yekaterinburg? I was looking for a good enough flat or, in terms of ML, I wanted to build a model which would give me a recommendation about buying.

On the one hand, if a flat is overpriced, the model should recommend waiting for the price to decrease by showing the expected price for that flat. On the other hand, if the price is good enough according to the state of the market, perhaps I should consider that offer.

Of course, nothing is ideal, and I was ready to accept some error in the calculations. Usually this kind of task is judged by the mean prediction error, and I was ready for a 10% error. For example, if you have 2-3 million Russian roubles, you can ignore a mistake of 200-300 thousand; you can afford it. Or so it seemed to me.
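The decision rule described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `recommend`, the use of the 10% tolerance as a hard threshold, and the sample prices are hypothetical and are not taken from the original dataset.

```python
def recommend(listed_price, expected_price, tolerance=0.10):
    """Compare a listed price with the model's expected price.

    The name, the threshold logic and the return values are
    illustrative assumptions, not part of the original article.
    """
    # Relative gap between what the seller asks and what the model expects
    relative_gap = (listed_price - expected_price) / expected_price
    if relative_gap > tolerance:
        return "wait"      # overpriced beyond the accepted error
    return "consider"      # within tolerance, or cheaper than expected


# Hypothetical prices in roubles
print(recommend(3_000_000, 2_500_000))  # the listing is 20% above the expectation
print(recommend(2_600_000, 2_500_000))  # the listing is 4% above the expectation
```

A relative gap is used rather than an absolute one so that the same 10% tolerance works for both cheap and expensive flats.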

Preparing

As I mentioned before, there were a lot of apartments, so let's look at them closely.

import pandas as pd

df = pd.read_csv('flats.csv')
df.shape

2310 flats for one month; we could extract something useful from that. What about a general data overview?

df.describe()

There is nothing extraordinary: longitude, latitude, the price of a flat (the label "cost"), and so on. Yes, at that time I used "cost" instead of "price". I hope it will not lead to misunderstanding; please treat them as the same.

Cleaning

Does every record have the same meaning? Some of them represent flats that are more like a cubicle: you could work there, but you would not like to live there. They are small cramped rooms, not real flats. Let's remove them.

df = df[df.total_area >= 20]

Predicting the price of a flat is one of the oldest problems in economics and related fields. Long before the term "ML" existed, people tried to guess the price from square metres/feet. So, we look at these columns/labels and try to get their distributions.

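import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
numerical_fields = ['total_area', 'cost']
for col in numerical_fields:
    # histplot replaces the deprecated sns.distplot; NaN values are dropped first
    sns.histplot(df[col].dropna(), kde=True, stat="density", color="r", label=col)
    plt.show()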

Well… there is nothing special, it looks like a normal distribution. Perhaps we need to go deeper?

sns.pairplot(df[numerical_fields])

Oops… something is wrong there. Let's clean the outliers in these fields and try to analyse our data again.

#Remove outliers
df = df[abs(df.total_area - df.total_area.mean()) <= (3 * df.total_area.std())]
df = df[abs(df.cost - df.cost.mean()) <= (3 * df.cost.std())]
#Redraw our data
sns.pairplot(df[numerical_fields])

The outliers are gone, and now it looks better.
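To see what the three-sigma filter above does, here is a small self-contained sketch on synthetic data. The helper name `drop_outliers` and the generated numbers are assumptions made for the demonstration; only the rule itself (keep values within three standard deviations of the mean) comes from the code above.

```python
import numpy as np
import pandas as pd

# Synthetic areas: 200 ordinary flats plus one absurd value (demo data only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"total_area": np.append(rng.normal(40, 5, 200), 400.0)})


def drop_outliers(frame, column, n_sigmas=3):
    """Keep rows whose value lies within n_sigmas standard deviations of the mean."""
    deviation = (frame[column] - frame[column].mean()).abs()
    return frame[deviation <= n_sigmas * frame[column].std()]


cleaned = drop_outliers(toy, "total_area")
print(len(toy), "->", len(cleaned))
```

Note that the mean and the standard deviation are themselves inflated by the outliers, so this rule is a rough first pass rather than a robust filter.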

Transformation

The label "year", which holds the year of construction, should be transformed into something more informative. Let it be the age of the building, in other words how old a specific house is.

df['age'] = 2019 - df['year']

Let's have a look at the result.

df.head()

There are all kinds of data: categorical values, NaN values, a text description and some geo-information (longitude and latitude). Let us put aside the last ones, because at this stage they are useless. We will come back to them later.

df.drop(columns=["lon","lat","description"],inplace=True)

Categorical data

Usually, for categorical data, people use different kinds of encoding or tools like CatBoost, which make it possible to work with them as with numerical variables. But could we use something more logical and more intuitive? Now is the time to make our data more understandable without losing its meaning.

Districts

Well, there are over twenty possible districts. Could we add over 20 additional variables to our model? Of course we could, but… should we? We are people and we can compare things, can't we? First of all, not every district is equivalent to another. In the city centre the price of one square metre is higher; further from downtown it decreases. Does it sound logical? Could we use that? Yes, we could definitely match every district with a specific coefficient: the further the district, the cheaper the flats.

After this matching, and with another web map service (ArcGIS Online), the map changed but kept a similar look.

I used the same idea as for the visualization of flats. The most "prestigious" and "expensive" districts are coloured red and the least ones blue. The colour temperature, do you remember it? Also, we ought to do some manipulation on our dataframe.

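district_map = {'alpha': 2,
                'beta': 4,
                # ... (entries elided in the original)
                'delta': 3,
                # ... (entries elided in the original)
                'epsilon': 1}
# Normalise the spelling, then swap each district name for its coefficient
df.district = df.district.str.lower()
df.replace({"district": district_map}, inplace=True)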
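To make the trade-off concrete, the sketch below compares one-hot encoding with the single-column mapping on a toy frame. The listings are invented, and the coefficients simply reuse the placeholder names from the mapping above; none of it is real data.

```python
import pandas as pd

# Hypothetical listings; district names reuse the placeholders from the mapping
toy = pd.DataFrame({"district": ["Alpha", "beta", "Delta", "epsilon", "alpha"]})
toy["district"] = toy["district"].str.lower()

# One-hot encoding: one new column per distinct district
one_hot = pd.get_dummies(toy["district"])

# Ordinal mapping: a single numeric column (coefficients are illustrative)
district_rank = {"alpha": 2, "beta": 4, "delta": 3, "epsilon": 1}
toy["district_rank"] = toy["district"].map(district_rank)

print(one_hot.shape[1], "one-hot columns vs 1 mapped column")
print(toy)
```

The mapped column keeps the model small and readable, at the cost of assuming that the hand-picked coefficients reflect how districts really compare.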

The same approach will be used to describe the internal quality of the flat. Sometimes it needs some repair, and sometimes a flat is in good shape and ready for living. In other cases, you should spend additional money to make it look better (change the taps, paint the walls). Coefficients can be used here as well.

repair = {'A': 1,
          'B': 0.6,
          'C': 0.7,
          'D': 0.8}
# Unknown repair state is treated as 'D'; assign back instead of inplace on a column
df['repair'] = df['repair'].fillna('D')
df.replace({"repair": repair}, inplace=True)

By the way, about walls. Of course, they also influence the price of a flat. Modern material is better than older material, and brick is better than concrete. Wooden walls are quite a controversial point: perhaps they are a good choice for the countryside, but not so good for urban life.

We use the same approach as before, plus we make a guess for the rows we know nothing about. Yes, sometimes people do not provide all the information about their flat. Still, based on history we can try to guess the wall material: for a specific period of time (for example, the period of Khrushchev's leadership) we know the typical building material.

walls_map = {'brick': 1.0,
             # ... (entries elided in the original)
             'concrete': 0.8,
             'block': 0.8,
             # ... (entries elided in the original)
             'monolith': 0.9,
             'wood': 0.4}
# Guess the missing wall material from the construction year, newest era first
df.loc[df['walls'].isna() & (df['year'] >= 2010), 'walls'] = 'monolith'
df.loc[df['walls'].isna() & (df['year'] >= 2000), 'walls'] = 'concrete'
df.loc[df['walls'].isna() & (df['year'] >= 1990), 'walls'] = 'block'
# Anything still unknown is treated as 'block'
df.loc[df['walls'].isna(), 'walls'] = 'block'
df.replace({"walls": walls_map}, inplace=True)
df.drop(columns=['year'], inplace=True)

Also, there is information about the balcony. In my humble opinion, a balcony is a really useful thing, so I could not help taking it into account. Unfortunately, there are some null values. If the authors of the advertisements had filled in this field, we would have more realistic information. Well, if there is no information, it will mean "there is no balcony".

df['balcony'] = df['balcony'].fillna(0)

After that, we drop the column with the year of construction (the age is a good replacement for it). We also remove the column with the type of building, because it has a lot of NaN values and I have not found any way to fill these gaps. And we drop all the remaining rows that contain NaN.

df.drop(columns=['type_house'],inplace=True)
df = df.astype(np.float64)
df.dropna(inplace=True)

Checking

So… we used a non-standard approach and replaced categorical values with their numerical representation. And now we have finished transforming our data. A part of the data has been dropped, but in general it is quite a good dataset. Let's look at the correlation between the independent variables.

def show_correlation(df):
    sns.set(style="whitegrid")
    corr = df.corr() * 100
    # Select upper triangle of correlation matrix
    # (np.bool was removed from NumPy, the builtin bool is used instead)
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(15, 11))
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10)
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
                linewidths=1, cbar_kws={"shrink": .7}, annot=True,
                fmt=".2f")
    plt.show()
    # df[columns] = scale(df[columns])
    return df
df1 = show_correlation(df.drop(columns=['cost']))

Erm… it became very interesting.

Positive correlation: total area and balconies. Why not? If our flat is big, there will be a balcony.

Negative correlation: total area and age. The newer the flat, the bigger the living area. Sounds logical: new flats are more spacious than older ones.

Age and balcony. The older the flat, the fewer balconies it has. It seems like a correlation through another variable. Perhaps it is a triangle Age-Balcony-Area, where one variable has an implicit influence on another. Let's put that on hold for a while.

Age and district. The older the flat, the higher the probability that it is located in one of the more prestigious districts. Could it be related to the higher price near the centre?
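One way to probe the "correlation through another variable" idea is a partial correlation, which measures how two variables relate after the linear effect of a third is removed. The sketch below uses purely synthetic data in which, by construction, area drives both age and balconies; the numbers and the helper names `corr` and `partial_corr` are assumptions, and the result says nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic data (an assumption): area drives both age and balconies,
# while age and balconies have no direct link to each other.
area = rng.normal(50, 10, n)
age = -0.5 * area + rng.normal(0, 5, n)
balcony = 0.05 * area + rng.normal(0, 0.5, n)


def corr(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


raw = corr(age, balcony)
controlled = partial_corr(age, balcony, area)
print(f"raw: {raw:.2f}, controlling for area: {controlled:.2f}")
```

If the same check on the real columns also pushed the age-balcony correlation towards zero, that would support the triangle hypothesis.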

Also, we can look at the correlation with the dependent variable.

plt.figure(figsize=(6,6))
corr = df.corr()*100.0
sns.heatmap(corr[['cost']],
            cmap= sns.diverging_palette(220, 10),
            center=0,
            linewidths=1, cbar_kws={"shrink": .7}, annot=True,
            fmt=".2f")

Here we go…

There is a very strong correlation between the area of a flat and its price: if you want a bigger place to live, it will require more money. There is a negative correlation in the pairs "age/cost" and "district/cost": a flat in a newer house is less affordable than one in an old house, and flats further from the centre are cheaper. Anyhow, it seems clear and understandable, so I decided to go with it.

Model

For tasks like predicting a flat's price, linear regression is the usual choice. Given the significant correlations from the previous stage, we can try it as well. It is a workhorse which is suitable for many tasks. Let's prepare our data for the next steps.

from sklearn.model_selection import train_test_split 
y = df.cost
X = df.drop(columns=['cost'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Also, we create some simple functions for prediction and evaluation of the result. Let's make our first attempt to predict the price!

from sklearn.metrics import r2_score

def predict(X, y_test, model):
    y = model.predict(X)
    # R^2 on the held-out data, expressed as a percentage
    score = round((r2_score(y_test, y) * 100), 2)
    print(f'Score on {model.__class__.__name__} is {score}')
    return score
def train_model(X, y, regressor):
    model = regressor.fit(X, y)
    return model

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
model = train_model(X_train, y_train, regressor)
predict(X_test, y_test, model)

Well… a score of 76.67%. Note that the function reports R² (the coefficient of determination) rather than accuracy in the classification sense. Is it a big number or not? From my point of view, it is not bad. Moreover, it is a good starting point. Of course, it is not ideal, and there is potential for improvement.
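The number printed by `predict` comes from `r2_score`, so it helps to see what that metric measures. Below is a minimal sketch that computes R² by hand with NumPy; the function name `r2` and the sample prices are hypothetical and serve only as a worked example.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Hypothetical prices in millions of roubles
actual = [2.0, 2.5, 3.0, 3.5, 4.0]
predicted = [2.1, 2.4, 3.2, 3.3, 4.1]
print(round(r2(actual, predicted) * 100, 2))
```

A value of 1 means the predictions match perfectly, while 0 means the model does no better than always predicting the mean price.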

At the same time, we tried to predict only one part of the data. What about applying the same strategy to the other parts? Yes, it is time for cross-validation.

def do_cross_validation(X, y, model):
    from sklearn.model_selection import KFold, cross_val_score
    # 10 shuffled folds; each fold is used once as the test set
    fold = KFold(n_splits=10, shuffle=True, random_state=0)
    scores_on_this_split = cross_val_score(estimator=model, X=X, y=y, cv=fold, scoring='r2')
    scores_on_this_split = np.round(scores_on_this_split * 100, 2)
    mean_accuracy = scores_on_this_split.mean()
    print(f'Cross-validation score on {model.__class__.__name__} is {mean_accuracy}')
    return mean_accuracy
do_cross_validation(X, y, model)

The cross-validation gives us another result: 73 is less than 76. But it is still a good candidate until the moment we have a better one. Also, it suggests that linear regression works quite stably on our dataset.
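A mean score alone does not show how much the folds differ, so the stability claim is easier to judge with the spread across folds next to the mean. Here is a NumPy-only sketch of K-fold evaluation for an ordinary least squares fit; the data is synthetic, and the helper `kfold_r2` is a hypothetical name, not a function from the article or from scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): price depends linearly on area plus noise
area = rng.uniform(20, 80, 300)
price = 50_000 * area + rng.normal(0, 300_000, 300)
X_toy = np.column_stack([np.ones_like(area), area])  # intercept column + feature


def kfold_r2(X, y, n_splits=10, seed=0):
    """Return the R^2 of a least squares fit on each held-out fold."""
    order = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        # Fit on the training folds only
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residual = y[test_idx] - X[test_idx] @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return np.array(scores)


scores = kfold_r2(X_toy, price)
print(f"mean R^2: {scores.mean():.3f}, std across folds: {scores.std():.3f}")
```

A small standard deviation relative to the mean is what "works quite stably" would look like in numbers.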

And now it is time for the last step.

We will look at the best feature of linear regression: interpretability. This family of models, as opposed to more complex ones, is much easier to understand. There are just some features with coefficients: you can put your numbers into the equation, do some simple math and get a result.
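The "simple math" mentioned above is just an intercept plus a weighted sum of the features. The sketch below spells it out with made-up numbers: the intercept, the coefficients and the sample flat are all hypothetical and are not the values learned by the model in this article.

```python
# Hypothetical coefficients: NOT the values learned in the article
intercept = 500_000
coefficients = {"total_area": 45_000, "age": -8_000, "balcony": 60_000}


def predict_price(flat):
    """Linear model: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * flat[name] for name in coefficients)


# A made-up flat: 40 square metres, 10 years old, one balcony
flat = {"total_area": 40, "age": 10, "balcony": 1}
print(predict_price(flat))
```

Each coefficient reads directly as "roubles per unit of the feature", which is what makes the model easy to argue with.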

Let's try to interpret our model.

def estimate_model(model):
    sns.set(style="white", context="talk")
    f, ax = plt.subplots(1, 1, figsize=(10, 10), sharex=True)
    sns.barplot(x=model.coef_, y=X.columns, palette="vlag", ax=ax)
    # Print the integer value of each coefficient next to its bar
    for i, v in enumerate(model.coef_.astype(int)):
        ax.text(v + 3, i + .25, str(v), color='black')
    ax.set_title("Coefficients")
estimate_model(regressor)

The picture looks quite logical. Balcony/Walls/Area/Repair make a positive contribution to a flat's price. The further away the flat is, the bigger the negative contribution. The same applies to age: the older the flat, the lower the price.

So, it was a fascinating journey. We started from scratch, used an untypical approach to data transformation based on the human point of view (numbers instead of dummy variables), and checked the variables and their relations to each other. After that, we built our simple model and used cross-validation to test it. And as the cherry on the cake, we looked at the internals of the model, which gives us confidence in our approach.

But! It is not the end of our journey, only a break. We will try to change our model in the future, and maybe (just maybe) that will increase the accuracy of prediction.

Thanks for reading!