Using data science for house hunting in Montreal

Introduction

I happen to live in Montreal, in a condo on the edge of the McGill Ghetto. It is close to Saint Laurent Boulevard, or "the Main" as locals call it, with all its attractions: bars, restaurants, night clubs and drunken students. Once upon a time, on a particularly lively night, listening to the sounds of McGill frosh students drunkenly heading home after a hard night of studying, I thought that it might be a good idea to move into my own house, a little bit further away from the action.

It was not my first rodeo buying real estate in Montreal, but it was my first time buying a house. So I decided to do a little bit of research before trusting my money to a real estate agent. I quickly realized that I can't afford a house anywhere close to a subway station on the island, but I could possibly afford a duplex or a triplex, where tenants would cover part of my mortgage. Whether that works out depends not only on the price of the house, but also on the rent, or potential rent, that the tenants could be paying.

So, being a visual person with a background in research, I wanted to see a map of how much things cost around the island and how much revenue I could get. In the States, and even in Ontario, there are services like Zillow that show some of this information, but for Montreal I couldn't find anything apart from the realtor association APCIQ. Maybe my preference for using English is to blame.

So, after a few weeks of studying realtor.ca and Kijiji, I wrote a Python script to scrape information from them, using some resources I found on GitHub: https://github.com/Froren/realtorca. Also, the city of Montreal has an open data website that helps to fill in some blanks.

After the data is collected by the web scrapers, it is processed in R using the tidyverse and Simple Features for R (the sf package). I found an excellent resource on how to process geospatial information in R: Geocomputation with R. I used ggplot2 to make graphs and tmap (thematic maps) for map making.

Now I have more than a year's worth of data to study.

Data pre-processing

I preprocess the data by first converting it into simple features format, and then transforming it from the geographic coordinate reference system (longitude and latitude, EPSG:4326) to a North American projected system used for Quebec (NAD83 / MTM zone 8, EPSG:32188).

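library(tidyverse)
library(sf)

# read the scraped listings (the path is elided in the original)
property <- read_csv("....") %>% 
  # build point geometries from longitude/latitude
  st_as_sf(coords=c("lng","lat"), crs=4326) %>% 
  # reproject to a metric coordinate system
  st_transform(crs=32188)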

Condo price

First, I wanted to evaluate how much I could get for my condo. I need to define my neighborhood and find all the condos for sale around me.

Neighborhood map

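library(geojsonsf)  # provides geojson_sf()

neighbourhood <- geojson_sf("quartierreferencehabitation.geojson") %>%
  st_transform(32188) %>% 
  filter(nom_qr %in% c("Saint-Louis", "Milton-Parc")) %>% 
  summarize() %>%       # merge the two districts into a single polygon
  st_buffer(dist=0)     # zero-width buffer cleans up the merged geometry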

Selecting condos for sale.

# inner spatial join: keep only the listings that fall inside the neighbourhood
neighbors <- st_join(property, neighbourhood, left=FALSE)

Using a basemap from OpenStreetMap.

library(tmaptools)  # provides read_osm()
osm_neighbourhood <- read_osm(st_bbox(neighbourhood %>% st_transform(4326)), ext=1.5, type="esri")

Drawing the results using the tmap package.

library(tmap)
library(tmaptools)

# ref_home (the reference property) is defined elsewhere in the full source
tm_shape(osm_neighbourhood) + tm_rgb(alpha=0.7)+
  tm_shape(neighbourhood) + tm_borders(col='red',alpha=0.8)  + 
  tm_shape(neighbors) + tm_symbols(shape=3,size=0.2,alpha=0.8) +
  tm_shape(ref_home) + tm_symbols(col='red',shape=4,size=0.5,alpha=0.8)+
  tm_compass(position=c("right", "bottom"))+
  tm_scale_bar(position=c("right", "bottom"))

Neighbourhood condo prices

Now I can show the prices and see how they depend on the condo surface area and on whether there is a parking spot. And if I use a simple linear regression, I can get a first approximation of what my condo might be worth.

Linear model

More formally, I can use a linear model to predict the price and its confidence interval.

model_price_lm <- lm(mprice ~ parking:area_interior , data=neighbors_)

## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                41861.30   22421.28   1.867   0.0628 .  
## parkingFALSE:area_interior   436.65      23.56  18.530   <2e-16 ***
## parkingTRUE:area_interior    511.95      19.40  26.393   <2e-16 ***

So, in my neighborhood every square foot in a condo without parking adds 437$ to the base price of 42k$, and with parking it is 512$ per square foot. Now I can make a prediction of the price: 443k$, with confidence interval [422k$, 465k$].
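
A minimal sketch of how such a prediction and its interval can be obtained with predict(). The listings here are simulated, and the data frame `listings`, the object `my_condo` and the 95% level are assumptions made for illustration, not the data or settings behind the numbers above.

```r
# Simulated listings; column names mirror the model above, values are made up
set.seed(1)
n <- 200
listings <- data.frame(
  area_interior = runif(n, 400, 1500),                     # square feet
  parking       = sample(c(TRUE, FALSE), n, replace = TRUE)
)
listings$mprice <- 40000 +
  ifelse(listings$parking, 510, 440) * listings$area_interior +
  rnorm(n, sd = 30000)

# Same formula as above: a separate slope per parking status
fit_lm <- lm(mprice ~ parking:area_interior, data = listings)

# Hypothetical condo to price
my_condo <- data.frame(area_interior = 800, parking = TRUE)

# Point estimate with a confidence interval for the expected price
pred_lm <- predict(fit_lm, newdata = my_condo,
                   interval = "confidence", level = 0.95)
print(round(pred_lm))
```

The result is a one-row matrix with columns fit, lwr and upr. Using interval = "prediction" instead would give the wider interval for a single new listing rather than for the average price.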

However, if I look at the difference between what my model predicts for all the condos in the neighborhood and their listed prices, I can see that the error depends on the predicted value:

This violates one of the conditions under which simple linear regression can be used, namely that the error variance is constant. This kind of behaviour is called heteroscedasticity (it is also loosely referred to as overdispersion), and there are several ways of dealing with it. In particular, I found in the literature that I should be using a generalized linear model with an inverse Gaussian distribution for the errors and a logarithmic link function.

Generalized linear model

The estimate using the generalized linear model is the following:

model_price_glm <- glm(mprice ~ parking:area_interior , data=neighbors_, 
                       family=inverse.gaussian(link="log"))

This gives a prediction of 436k$ [422k$, 452k$].
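
For a GLM, predict() does not take an interval argument, so one common approach is to build the interval on the link (log) scale and transform it back. The sketch below assumes that approach and uses simulated data; it is not necessarily how the interval quoted above was computed, and `listings` and `my_condo` are hypothetical.

```r
set.seed(2)
n <- 200
listings <- data.frame(
  area_interior = runif(n, 400, 1500),
  parking       = sample(c(TRUE, FALSE), n, replace = TRUE)
)
# Multiplicative noise, so that the spread grows with the price
listings$mprice <- exp(12.3 + 0.0008 * listings$area_interior +
                       0.0001 * listings$parking * listings$area_interior +
                       rnorm(n, sd = 0.15))

fit_glm <- glm(mprice ~ parking:area_interior, data = listings,
               family = inverse.gaussian(link = "log"))

my_condo <- data.frame(area_interior = 800, parking = TRUE)

# Predict on the log scale, build the interval there, then back-transform
p   <- predict(fit_glm, newdata = my_condo, type = "link", se.fit = TRUE)
eta <- as.numeric(p$fit)
se  <- as.numeric(p$se.fit)
z   <- qnorm(0.975)                      # normal approximation, 95% level
pred_glm <- exp(c(fit = eta, lwr = eta - z * se, upr = eta + z * se))
print(round(pred_glm))
```

Because the interval is symmetric on the log scale, it comes out asymmetric around the estimate on the dollar scale.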

Note that I am ignoring the number of rooms, the floor of the building and the location of the condo for simplicity. It is possible to plug them all into the regression, but that would increase the number of parameters and make the modelling results more difficult to interpret. Also, many parameters are correlated: for example, bigger apartments tend to have more rooms, and more of them come with parking.

Now, to make it simpler to compare different properties, I can estimate the price per square foot and how it is affected by different factors.

Again, using a generalized linear model with an inverse Gaussian distribution and a log link:
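
The call that fits this model is not shown in the text. Judging by the coefficient names in the output (parkingTRUE, bedrooms2 and so on), it plausibly looks like the sketch below. The formula is an assumption inferred from that output, and the data are simulated.

```r
set.seed(3)
n <- 300
condos <- data.frame(
  parking  = sample(c(TRUE, FALSE), n, replace = TRUE),
  bedrooms = factor(sample(1:4, n, replace = TRUE,
                           prob = c(0.4, 0.35, 0.2, 0.05)))
)
# Simulated price per square foot with multiplicative effects
condos$price_sqft <- 500 * ifelse(condos$parking, 1.12, 1) *
  c(1, 0.98, 0.98, 0.85)[as.integer(condos$bedrooms)] *
  exp(rnorm(n, sd = 0.1))

# Assumed formula: parking and number of bedrooms as categorical effects
model_psqft <- glm(price_sqft ~ parking + bedrooms, data = condos,
                   family = inverse.gaussian(link = "log"))

# exp() turns additive log-scale coefficients into multiplicative factors
multipliers <- exp(coef(model_psqft))
print(round(multipliers, 3))
```

With a log link, the exponentiated intercept is the price per square foot of the baseline category, and every other exponentiated coefficient is a factor that multiplies it.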

Price per square foot

It's easy to make sense of the regression results:

# with a log link, exponentiated coefficients act as multiplicative factors
print(exp(coef(model_psqft)))

## (Intercept) parkingTRUE   bedrooms2   bedrooms3   bedrooms4 
## 501.7826165   1.1215192   0.9769839   0.9818974   0.8349424

So, a square foot is worth about 502$ for the baseline category, parking adds 12%, two bedrooms reduce the price by 2.3%, three bedrooms by 1.8% and four bedrooms by 17% (given the same total price).

The predicted price of my condo is 431k$ [414k$, 449k$].

Longitudinal condo price model

All my previous models show results based on the condos on the market during the last year, without trying to account for price changes. It would be interesting to see how prices change with time. I have no idea how prices should behave and, considering the seasonal rise and fall in prices, there is no reason to think that there is a steady linear trend. So, first, I could just smooth the data using the loess function.

Loess smoothing

If I pile all the data together:

But if I try to separate by number of bedrooms, the results look rather random, since the data might be too sparse.
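
A minimal sketch of this kind of loess smoothing in base R, on simulated data. The variable names, the span of 0.5 and the band of two standard errors are assumptions chosen for illustration, not the settings used for the figures.

```r
set.seed(4)
n <- 300
day <- sort(runif(n, 0, 365))            # days since the first listing
price_sqft <- 480 + 0.1 * day + 15 * sin(2 * pi * day / 365) +
  rnorm(n, sd = 40)
listings <- data.frame(day, price_sqft)

# Local regression; span sets the share of the data used by each local fit
fit_loess <- loess(price_sqft ~ day, data = listings, span = 0.5)

# Evaluate the smooth trend on a grid inside the observed range
grid  <- data.frame(day = seq(min(day), max(day), length.out = 24))
trend <- predict(fit_loess, newdata = grid, se = TRUE)

plot(listings$day, listings$price_sqft, col = "grey",
     xlab = "day", ylab = "price per sq. ft.")
lines(grid$day, trend$fit, lwd = 2)
lines(grid$day, trend$fit + 2 * trend$se.fit, lty = 2)
lines(grid$day, trend$fit - 2 * trend$se.fit, lty = 2)
```

Fitting the same smoother separately on each bedroom group leaves far fewer points per local fit, which is one way to see why the per-group curves become unstable.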

So, it seems that I would rather have an overall smooth variation in price while taking into account some features of the condos, i.e. there is actually no reason to think that two-bedroom condos are gaining value more slowly than three-bedroom ones. But the proportion of different apartment types on the market varies with time, which would bias the results.

So, I am going to use generalized additive models (GAM), where I can model the overall change in price using a smooth function while taking into account the differences between different kinds of condos.

Longitudinal condo price model: GAM model

library(mgcv)

# price model with time
# bs="cr" (cubic regression spline) is an argument of s(), not of gam()
model_psqft_t <- gam(price_sqft ~ bedrooms + parking + s(start_date, k=24, bs="cr"),
          data=neighbors_, method='REML',
          family=inverse.gaussian(link="log"))

It still looks like the prices are going up.

Using this model, the predicted price is 468k$ [435k$, 503k$].
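
A sketch of how a time-dependent estimate with an interval can be read off such a GAM, again on simulated data. The numeric `day` column stands in for the listing date, and `new_condo` is a hypothetical property. The interval is built on the log scale in the same way as for a GLM.

```r
library(mgcv)

set.seed(5)
n <- 400
condos <- data.frame(
  day      = runif(n, 0, 365),
  parking  = factor(sample(c(FALSE, TRUE), n, replace = TRUE)),
  bedrooms = factor(sample(1:3, n, replace = TRUE))
)
# Slow upward drift plus a seasonal wave, with multiplicative noise
condos$price_sqft <- 480 *
  exp(0.0003 * condos$day + 0.03 * sin(2 * pi * condos$day / 365)) *
  ifelse(condos$parking == "TRUE", 1.12, 1) * exp(rnorm(n, sd = 0.1))

fit_gam <- gam(price_sqft ~ bedrooms + parking + s(day, k = 24, bs = "cr"),
               data = condos, method = "REML",
               family = inverse.gaussian(link = "log"))

# Hypothetical condo, priced near the end of the observation period
new_condo <- data.frame(
  day      = 360,
  parking  = factor("TRUE", levels = levels(condos$parking)),
  bedrooms = factor("2", levels = levels(condos$bedrooms))
)
p   <- predict(fit_gam, newdata = new_condo, type = "link", se.fit = TRUE)
eta <- as.numeric(p$fit)
se  <- as.numeric(p$se.fit)
pred_gam <- exp(c(fit = eta, lwr = eta - 1.96 * se, upr = eta + 1.96 * se))
print(round(pred_gam))
```

Multiplying the predicted price per square foot by the surface area of the property gives an estimate of the total price. The interval widens towards the ends of the observation period, where the smooth is supported by fewer listings.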

How long would it take to sell

Another important question: how long would it take to sell? For this, one can use survival analysis. Technically, it looks like some types of condos sell faster than others, but the difference is not big. It looks like half of the condos disappear from the market within 60 days:
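
A minimal sketch of such an analysis with the survival package, using a Kaplan-Meier estimate on simulated listings. Treating listings that were still active when scraping stopped as censored is an assumption about how the data would be coded. All column names are hypothetical.

```r
library(survival)

set.seed(6)
n <- 300
true_days <- rexp(n, rate = log(2) / 60)  # simulated with a 60-day median
cutoff    <- runif(n, 30, 365)            # hypothetical observation window
listings <- data.frame(
  days    = pmin(true_days, cutoff),
  sold    = as.integer(true_days <= cutoff),  # 1 = left the market, 0 = censored
  parking = sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Kaplan-Meier curve for all listings and the median time on the market
km_all <- survfit(Surv(days, sold) ~ 1, data = listings)
median_days <- unname(summary(km_all)$table["median"])
print(median_days)

# Separate curves by parking, with a log-rank test for the difference
km_by_parking <- survfit(Surv(days, sold) ~ parking, data = listings)
plot(km_by_parking, lty = 1:2,
     xlab = "days on the market", ylab = "share still listed")
print(survdiff(Surv(days, sold) ~ parking, data = listings))
```

A listing disappearing from the site is only a proxy for a sale, so the median here measures time on the market rather than time to a confirmed transaction.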

Plex price estimate

Similarly, when I am looking at a potential plex, I would like to know how much houses cost in its neighborhood, say within a 2 km radius of a plex I was interested in at some point:
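
One way to make this selection with sf is st_is_within_distance() on projected coordinates, where distances are in metres. The sketch below uses simulated points, and `plexes` and `ref_plex` are hypothetical names. The same result could be obtained with st_buffer() followed by st_join().

```r
library(sf)

# Hypothetical listings in a projected CRS with metre units (EPSG:32188)
set.seed(7)
n <- 500
plexes <- st_as_sf(
  data.frame(x = runif(n, 295000, 305000),
             y = runif(n, 5040000, 5050000),
             price = round(rlnorm(n, log(6e5), 0.25))),
  coords = c("x", "y"), crs = 32188
)
# Hypothetical plex of interest
ref_plex <- st_sfc(st_point(c(300000, 5045000)), crs = 32188)

# TRUE for listings within 2000 m of the reference point
is_near <- st_is_within_distance(plexes, ref_plex, dist = 2000,
                                 sparse = FALSE)[, 1]
nearby <- plexes[is_near, ]

print(nrow(nearby))
print(summary(nearby$price))
```

The distance threshold is only meaningful in metres because the data were first transformed to a projected coordinate system.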

The price distribution is

Here I can see that the seller is asking slightly more than the average for the neighborhood, but at the same time the variability is quite high. For plexes many more parameters are important than for condos, like the size of the backyard, the year the building was built and how much the existing tenants are paying.

Using a GLM similar to the one for condos, the estimate for the price is the following: 567k$ [522k$, 616k$]

To estimate rental prices in the neighborhood, I can find all the nearby apartments listed on Kijiji during the last year.

The price distribution gives me an idea of how much I could potentially be getting from the tenants. Of course, there might be existing tenants already, so it would also show me whether what they are paying is close to what is currently on the market.

Spatial prices

Average over neighborhood

Remember, my original question was to see a map of prices in Montreal. The simplest approach would be to calculate the median rental price per neighborhood and show it on the map, like the following:

# median rent of two-bedroom listings inside each neighbourhood polygon
rent_by_quartier <- aggregate(kijiji_geo_p %>% filter(bedrooms==2) %>% 
  dplyr::select(price), mtl_p, median, join = st_contains)

Since I am not actually looking everywhere on the island, here is the central part. The blue cross is where I go to work.

This map looks interesting, but it seems unrealistic to assume that prices change sharply at the borders of neighborhoods. So, I would prefer to use a method that allows for a smooth spatial change in prices. I can actually use generalized additive models again, as for the time course estimate, but with spatial coordinates.

Rental prices spatial GAM model

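# two-dimensional smooth over the projected coordinates x and y;
# bs="cr" is dropped because that basis only supports one-dimensional smooths
model_rent_geo_whole <- gam(price~bedrooms+s(x,y,k=100),
        data=rent, method='REML',
        family=inverse.gaussian(link="log"))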
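
To turn such a model into a map, it can be evaluated on a regular grid of points. The sketch below shows the idea on simulated rents. The coordinates, the grid spacing and the smaller basis size k = 50 are assumptions made to keep the example quick.

```r
library(mgcv)

set.seed(8)
n <- 800
rent <- data.frame(
  x = runif(n, 0, 10000), y = runif(n, 0, 10000),   # metres
  bedrooms = factor(sample(1:3, n, replace = TRUE))
)
# Rents decay with distance from a hypothetical centre at (5000, 5000)
d <- sqrt((rent$x - 5000)^2 + (rent$y - 5000)^2)
rent$price <- 1500 * exp(-d / 20000) *
  c(0.8, 1, 1.25)[as.integer(rent$bedrooms)] * exp(rnorm(n, sd = 0.1))

fit_geo <- gam(price ~ bedrooms + s(x, y, k = 50), data = rent,
               method = "REML", family = inverse.gaussian(link = "log"))

# Regular grid of prediction points for two-bedroom apartments
grid <- expand.grid(x = seq(500, 9500, by = 500),
                    y = seq(500, 9500, by = 500))
grid$bedrooms <- factor("2", levels = levels(rent$bedrooms))
grid$rent_hat <- as.numeric(predict(fit_geo, newdata = grid,
                                    type = "response"))
print(head(grid))
```

The grid of predictions can then be converted to a raster or to sf points and drawn over a basemap in the same way as the listings.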

Rental prices in the central area, which is the most interesting one for me.

Plexes price spatial model

In the same fashion, I can model the distribution of prices per square foot for triplexes with a three-bedroom main apartment and parking.

Surface area for a triplex with 3br and parking

Now that I have the spatial price distribution, I can also model the surface area distribution. Technically, this can be done using data from the city website, but for this example I am using only properties that were on the market.

Triplex profitability (rent per year / triplex total price)

This way I can roughly estimate the profitability of triplexes in different parts of town, by calculating the total price and dividing the potential yearly income from two rented two-bedroom apartments by it. Of course, this is a very rough estimate, since I am assuming that all triplexes will have two 4 1/2 apartments for rent.
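
As a worked example of this ratio, following the definition in the heading (rent per year divided by total price): all the numbers below are made-up placeholders, not values taken from the models.

```r
# Placeholder inputs; in the analysis these come from the spatial models
price_sqft   <- 350     # price per square foot, $
surface_sqft <- 3000    # total surface of the triplex, sq. ft.
rent_4half   <- 1400    # monthly rent of one two-bedroom (4 1/2), $

total_price   <- price_sqft * surface_sqft
yearly_rent   <- 2 * rent_4half * 12        # two apartments, twelve months
profitability <- yearly_rent / total_price  # rent per year / total price

cat(sprintf("total price: %d$, yearly rent: %d$, profitability: %.1f%%\n",
            total_price, yearly_rent, 100 * profitability))
# prints: total price: 1050000$, yearly rent: 33600$, profitability: 3.2%
```

The ratio ignores taxes, maintenance, vacancies and financing costs, so it is best read as a way to compare areas with each other rather than as a return on investment.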

Plex longitudinal price model: Plateau, Ahuntsic, Rosemont, Villeray

Finally, using the same idea that was used for tracking condo prices during the year, I can track plex prices in the boroughs that interest me.

Conclusions

I did this research to study the distribution of prices in Montreal and to familiarize myself with geospatial modelling in R. I didn't have access to the actual sale prices, only to the asking prices, so the results should be taken with a grain of salt.

Source code and data

The complete source of the scripts used for this publication is publicly available on GitHub (https://github.com/vfonov/re_mtl). A version of this article rendered using rmarkdown is available at http://www.ilmarin.info/re_mtl/stats_eng.html

Interactive map of prices distribution

Results are also shown in an interactive dashboard at http://www.ilmarin.info/re_mtl/

Translated from: https://habr.com/en/post/490546/
