Machine learning lends you a hand. Part 2


Have you thought about the influence of the nearest metro station on the price of your flat? What about the several kindergartens around your apartment? Are you ready to plunge into the world of geo-spatial data?

The world provides so much information…

  

What is it all about?

In the previous part, we had some data and tried to find a good enough offer on the real estate market in Yekaterinburg.

We arrived at a point where the cross-validation accuracy was near 73%. However, every coin has two sides, and 73% accuracy means 27% error. How could we reduce that? What is the next step?



Spatial data is coming to help

What about getting more data from the environment? We can use the geo-context and some spatial data. People rarely spend their entire life at home. Sometimes they go to shops or pick up their kids from daycare. Their children grow up and go to school, university, and so on.

Or… sometimes they need medical help and look for a hospital. And a very important thing is public transport, at the very least the metro. In other words, there are many things nearby that have an impact on pricing.

Let me show you a list of them (a sketch of how such points could be turned into features follows the list):

  • Public transport stops
  • Shops
  • Kindergartens
  • Hospitals/medical institutions
  • Educational institutions
  • Metro
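
I will not go through the data collection itself here, but here is a minimal sketch of how such points could be turned into numeric features. It assumes each group of POIs is just a NumPy array of (latitude, longitude) pairs; the function names are hypothetical, not the code actually used for the article.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between points, in kilometres
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def count_within(flat, points, radius_km=0.5):
    # Number of POIs (stops, shops, kindergartens, ...) within radius_km of the flat
    d = haversine_km(flat[0], flat[1], points[:, 0], points[:, 1])
    return int((d <= radius_km).sum())

def distance_to_nearest(flat, points):
    # Distance to the nearest POI of a kind (e.g. the nearest metro station), in km
    return float(haversine_km(flat[0], flat[1], points[:, 0], points[:, 1]).min())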

Visualization of the new data

After getting that information from different sources, I made a visualisation.

There are some points on the map of the most prestigious (and expensive) district of Yekaterinburg.

  • Red points - flats
  • Orange - stops
  • Yellow - shops
  • Green - kindergartens
  • Blue - education
  • Indigo - medical
  • Violet - metro

Yes, a rainbow is here.
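
The interactive map itself is not reproduced here. Below is a minimal sketch of how a similar picture could be drawn with matplotlib; the dataframe names in the usage example are hypothetical, and the original visualisation may well have used a different mapping tool.

import matplotlib.pyplot as plot

def plot_layers(layers):
    # layers: list of (dataframe, colour) pairs, each frame with 'lat'/'lon' columns
    fig, ax = plot.subplots(figsize=(10, 10))
    for frame, colour in layers:
        ax.scatter(frame['lon'], frame['lat'], c=colour, s=10, label=colour)
    ax.legend()
    return ax

# Usage with hypothetical dataframes:
# plot_layers([(flats, 'red'), (stops, 'orange'), (shops, 'yellow'),
#              (kindergartens, 'green'), (schools, 'blue'),
#              (hospitals, 'indigo'), (metro, 'violet')])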

Overview

Now we have a dataset which is enriched with geodata and has some new information:

df.head(10)
df.describe()

A good old model

Let's try the same approach as before:

from sklearn.model_selection import train_test_split

y = df.cost
X = df.drop(columns=['cost'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then we train our model again, cross our fingers, and try to predict the price of a flat once more.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
model = regressor.fit(X_train, y_train)
do_cross_validation(X_test, y_test, model)
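
The helper do_cross_validation comes from the previous part and is not shown here. A minimal sketch of what it could look like, assuming it simply wraps sklearn's cross_val_score and reports the mean score as a percentage (the real implementation from part 1 may differ):

from sklearn.model_selection import cross_val_score

def do_cross_validation(X, y, model, cv=10):
    # Hypothetical helper: mean cross-validated score (R^2 for regressors), as a percentage
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean() * 100.0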

Hmm… it looks better than the previous result of 73% accuracy. What about trying to interpret it? Our previous model had a good enough ability to explain the flat price.

estimate_model(regressor)
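
estimate_model is also a helper from the previous part. A rough sketch under the assumption that it simply lists each feature next to its fitted coefficient so that we can eyeball the signs; the name and behaviour here are assumptions, not the original code:

import pandas as pd

def estimate_model(fitted_regressor):
    # Hypothetical helper: pair every feature with its linear-regression coefficient
    return pd.Series(fitted_regressor.coef_, index=X.columns).sort_values()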

Oops… Our new model works well with the old features, but the behaviour with the new ones seems odd.

For example, according to the model a bigger number of educational or medical institutions leads to a decrease in the price of a flat. The number of stops near a flat is in the same situation, although it should make an additional positive contribution to the flat price. The new model is more accurate, but it does not fit with real life.

Something is broken

Let's consider what happened.

First of all, I want to remind you that the key feature of our linear regression is… erm… linearity. Yes, Captain Obvious is here.

If your data is compatible with the idea "the bigger/smaller X is, the bigger/smaller Y will be", linear regression is a good tool. But geodata is more complex than we expected.

For instance:

  • A bus stop near your flat is good, but if there are around 5 of them, it means a noisy street, and people would rather avoid buying a flat nearby.
  • If there is a university, it should have a good influence on the price; at the same time a crowd of students near your home is not so pleasant if you are not a very sociable person.
  • A metro station near your home is good, but if the nearest metro is an hour away on foot, it should not matter.

As you see, it depends on many factors and points of view. The nature of our geodata is not linear, so we cannot simply extrapolate its impact. At the same time, why does the model with bizarre coefficients work better than the previous one?

import matplotlib.pyplot as plot
import seaborn as sns

plot.figure(figsize=(10, 10))
corr = df.corr() * 100.0
sns.heatmap(corr[['cost']],
            cmap=sns.diverging_palette(220, 10),
            center=0,
            linewidths=1, cbar_kws={"shrink": .7}, annot=True,
            fmt=".2f")

The correlation with the flat's price

It looks interesting. We saw a similar picture in the previous part. There is a negative correlation between the distance to the nearest metro and the price. And this factor has a bigger impact on accuracy than some of the older ones.

Meanwhile, our model works messily and does not see the dependencies between the aggregated data and the target variable. The simplicity of linear regression has its own limits.

The king is dead, long live the king!

And if linear regression is not suitable for our case, what could be better? If only our model could be "smarter"…

Fortunately, we have an approach which should work better because it is more… flexible and has a built-in "if this, do that; otherwise, do something else" mechanism.

The Decision Tree appears on the scene.

A decision tree can have a different depth; usually, it works well when the depth is 3 or bigger. And the max_depth parameter has the biggest influence on the result. Let's write some code for checking depths from 3 to 32:

from sklearn.tree import DecisionTreeRegressor
import pandas as pd

data = []
for x in range(3, 32):
    regressor = DecisionTreeRegressor(max_depth=x, random_state=42)
    model = regressor.fit(X_train, y_train)
    accuracy = do_cross_validation(X, y, model)
    data.append({'max_depth': x, 'accuracy': accuracy})
data = pd.DataFrame(data)
ax = sns.lineplot(x="max_depth", y="accuracy", data=data)
max_result = data.loc[data['accuracy'].idxmax()]
ax.set_title(f'Max accuracy-{max_result.accuracy}\nDepth {max_result.max_depth} ')

"More" does not mean "best"

Well… when the max_depth of the tree is equal to 8, the accuracy is above 77%. And it would be a good achievement if we did not think about the limits of that approach. Let's have a look at how it works with max_depth=2.

from IPython.core.display import Image, SVG
from sklearn.tree import export_graphviz
from graphviz import Source

two_level_regressor = DecisionTreeRegressor(max_depth=2, random_state=42)
model = two_level_regressor.fit(X_train, y_train)
graph = Source(export_graphviz(model, out_file=None,
                               feature_names=X.columns,
                               filled=True))
SVG(graph.pipe(format='svg'))

In this picture, we can see that there are only 4 possible predictions. When you use DecisionTreeRegressor, it works differently from linear regression. Just differently. It does not use contributions of factors (coefficients); instead, DecisionTreeRegressor uses "likelihood", and the predicted price of a flat will be the same as that of the flats most similar to it. We can show this by predicting our prices with that tree.

y = two_level_regressor.predict(X_test)
errors = pd.DataFrame(data=y,columns=['errors'])
f, ax = plot.subplots(figsize=(12, 12))
sns.countplot(x="errors", data=errors)

And every prediction will match one of these values. When we use max_depth=8, we can expect no more than 2^8 = 256 different values for more than 2000 flats, since a binary tree of depth 8 has at most 256 leaves. Maybe that is fine for classification problems, but it is not flexible enough for our case.
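
A quick way to check this limit on our data; a small sketch that refits the depth-8 tree from the experiment above and counts its distinct predictions:

import numpy as np

# A binary tree of depth 8 has at most 2**8 = 256 leaves, hence at most 256 distinct predictions
deep_tree = DecisionTreeRegressor(max_depth=8, random_state=42).fit(X_train, y_train)
print(len(np.unique(deep_tree.predict(X_test))))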

Wisdom of the crowd

If you try to predict the score of the World Cup final, there is a big probability you will be mistaken. At the same time, if you ask all the judges of the championship for their opinion, you will have a better chance of guessing. If you ask independent experts, trainers, and judges, and then do some magic with their answers, your chances will increase significantly. It looks like a presidential election.

An ensemble of several "primitive" trees can give more than each of them alone. And RandomForestRegressor is the tool we will use.

First of all, let's consider the basic parameters: max_depth, max_features, and the number of trees in the model.

Number of trees

In accordance with "How Many Trees in a Random Forest?", the best choice is 128 trees. Increasing the number of trees further does not lead to a significant improvement in accuracy, but increases the training time.
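
This plateau can also be verified on our own data; a small sketch that sweeps the number of trees at a fixed depth (the max_depth=8 used here is an arbitrary value for the check, not one taken from the article):

from sklearn.ensemble import RandomForestRegressor

for n in [16, 32, 64, 128, 256]:
    forest = RandomForestRegressor(n_estimators=n, max_depth=8,
                                   max_features=6, random_state=42)
    forest.fit(X_train, y_train)
    print(n, do_cross_validation(X, y, forest))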

Maximum number of features

Right now our model has 12 features. Half of them are old ones related to the characteristics of a flat, the others are related to the geo-context. So I decided to give each of them a chance. Let it be 6 features per tree.

Maximum depth of a tree

For that parameter, we can analyse a learning curve.

from sklearn.ensemble import RandomForestRegressor

data = []
for x in range(1, 32):
    regressor = RandomForestRegressor(random_state=42, max_depth=x,
                                      n_estimators=128, max_features=6)
    model = regressor.fit(X_train, y_train)
    accuracy = do_cross_validation(X, y, model)
    data.append({'max_depth': x, 'accuracy': accuracy})
data = pd.DataFrame(data)
f, ax = plot.subplots(figsize=(10, 10))
sns.lineplot(x="max_depth", y="accuracy", data=data)
max_result = data.loc[data['accuracy'].idxmax()]
ax.set_title(f'Max accuracy-{max_result.accuracy}\nDepth {max_result.max_depth} ')
86% accuracy is amazing

Whoa… over 86% accuracy with max_depth=16 against 77% with a single decision tree. It looks amazing, doesn't it?
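
With the parameters settled, the final model whose feature importances we look at below can be refit explicitly. A sketch assuming the values chosen above (max_depth=16, n_estimators=128, max_features=6):

# Refit the random forest with the chosen hyperparameters
regressor = RandomForestRegressor(random_state=42, max_depth=16,
                                  n_estimators=128, max_features=6)
model = regressor.fit(X_train, y_train)
do_cross_validation(X, y, model)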

Conclusion

Well… now we have a better prediction result than the previous ones; 86% is near the finish line. The last check: let's look at the feature importances. Did geodata give any benefit to our model?

feat_importances = model.feature_importances_
feat_importances = pd.Series(feat_importances, index=X.columns)
feat_importances.nlargest(5).plot(kind='barh')

The top 5 most important features

Some old features still affect the result. At the same time, the distance to the nearest metro and to kindergartens also has an effect. And it sounds logical.

Without a doubt, geodata helped us to improve our model.

Thanks for reading!

P.S.

Our journey is not finished yet. 86% accuracy is a tremendous result for real data. Meanwhile, there is still a small gap between the 14% error we have and the 10% mean error we expect. In the next chapter of our story, we will try to overcome this barrier, or at least decrease this error.

The IPython notebook is available.

Translated from: https://habr.com/en/post/470710/
