Three Different Lessons from Three Different Clustering Analyses: Data Science Capstone

张_伟_杰

Original article: https://medium.com/@msuginoo/three-different-lessons-from-three-different-clustering-analyses-data-science-capstone-5f2be29cb3b2

Map Risk Clusters of Neighbourhoods in the Time of Pandemic: A Case of Buenos Aires

Introduction

Every year is unique. But 2020 brought the world a singular challenge: the global COVID-19 pandemic. It spread rapidly to every part of the globe, and the Autonomous City of Buenos Aires (CABA: Ciudad Autónoma de Buenos Aires) was no exception.

In this particular setting, to frame my capstone project, I imagined a hypothetical corporate client from abroad in the food industry (catering business), hereafter the Client, that plans to relocate its representative family to the city of Buenos Aires (CABA) ahead of its future entry into Argentina once the pandemic-related restrictions are lifted. Since this would be its very first entry into Buenos Aires, the city is still unknown territory for the Client.

Deeply concerned about two risks, the general security risk (crime) and the pandemic risk (COVID-19), the Client wants to exclude high-risk neighbourhoods when selecting a location for the plan. In addition, the Client wants to capture the characteristics of neighbourhoods based on popular commercial venue categories such as restaurants, shops, and sports facilities. In this context, the Client hired me as an independent data analyst to conduct preliminary research for its future plan.

The Client stressed that this is a first-round preliminary analysis ahead of a more extended study for business expansion. Based on the findings of this preliminary analysis, the Client wants to explore the scope of the future analysis. Simply put, the Client wants the preliminary analysis done in a short time and on a small budget, to get a taste of the subject.

The Client sets the following three objectives for this preliminary assignment.

Identify outlier high-risk neighbourhoods (the Outlier Neighbourhood/Cluster) in terms of the two risks: the general security risk (crime) and the pandemic risk (COVID-19).
Segment non-outlier neighbourhoods into several clusters (the Non-Outlier Clusters) and rank them based on a single quantitative risk metric (a compound risk metric of the general security risk and the pandemic risk).
Use the Foursquare API to characterise the Non-Outlier Neighbourhoods in terms of popular venues and, if possible, segment the Non-Outlier Neighbourhoods according to their Foursquare venue profiles.

The Autonomous City of Buenos Aires (CABA) is densely populated: approximately 3 million people in an area of 203 km². Each neighbourhood has its own distinct area and population. The city is divided into 48 administrative divisions, known as ‘barrios’, which I will simply call ‘neighbourhoods’ in this report.

The Client expressed concern about how the variation in population density among neighbourhoods might affect the results. Both risks of concern to the Client, the general security risk (crime) and the pandemic risk (COVID-19), are likely affected by population density. In particular, the fact that ‘social distancing’ is key to preventing COVID-19 suggests that population density is a significant attribute for the pandemic risk: the hypothesis is that the higher the population density, the higher the infection rate. The same may hold for the general security risk. This preconception needs to be checked against the actual data in the course of the project and kept in mind during the analysis. Nevertheless, the Client asked me to scale the risk metrics by ‘population density’ for the first round of the project.
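
One possible reading of "scaling by population density" is sketched below in Python. It is a minimal illustration, not the project's actual code: the neighbourhood names and figures are invented, and dividing the raw count by density is an assumption about what the scaling means.

```python
# Illustrative figures only: these are NOT real values for any barrio.
neighbourhoods = [
    {"name": "A", "population": 150_000, "area_km2": 6.0, "crime_count": 9_000},
    {"name": "B", "population": 40_000, "area_km2": 8.0, "crime_count": 2_000},
]


def population_density(population, area_km2):
    """Residents per square kilometre."""
    return population / area_km2


def scale_by_density(raw_metric, density):
    """Divide a raw risk count by population density (assumed meaning of 'scaling')."""
    return raw_metric / density


for n in neighbourhoods:
    n["density"] = population_density(n["population"], n["area_km2"])
    n["crime_per_density"] = scale_by_density(n["crime_count"], n["density"])

# City-wide density from the figures quoted in the text: about 3 million people on 203 km².
city_density = population_density(3_000_000, 203)
```

With these made-up numbers, neighbourhood A has more incidents than B in absolute terms, yet a lower density-scaled value, which is exactly the kind of reordering the scaling is meant to expose.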

Overall, the Client showed great enthusiasm for Machine Learning and asked me to use machine learning models to achieve all three of the objectives above.

That is the background (business problem) scenario for this capstone project. On the one hand, the scenario is entirely hypothetical. On the other hand, the project handles real data.

To cut a long story short, for the three objectives above I ran three different clustering machine-learning models, and I took three different lessons from them, all of them valuable. In the Discussion section of this article, I will stress these different implications from the perspective of Data Science project management.

For now, I will invite you to walk through the process of the analysis.

The code for the project can be viewed at the following link to my GitHub repository:

· Code: https://github.com/Hyper-Phronesis/Capstone-1/blob/master/Capstone%20Three%20Different%20Lessons%20from%20Three%20Different%20Clusterings.ipynb

Now, let’s start.

Business Understanding and Analytical Approach

At the beginning of a Data Science project, we need to clarify the following two basic questions:

What needs to be solved. (Business Understanding)
What kind of approach we need to take in order to achieve the objective. (Analytical Approach)

In this project, the Client has already specified both. What the Client wants is risk profiling, venue profiling, and clustering of neighbourhoods. These are all analyses of the status quo, in other words, descriptive analysis; potentially, they might also involve diagnostic analysis (what happened or what is happening). Put differently, the Client is not asking for a forecast (predictive analysis) or for a way to solve the problem (prescriptive analysis), at least at this preliminary stage. These points set the overall direction of our analysis.

Now all is clear. Let’s move on to the next step and start talking about data.

A. Data Section

A1. Data Requirements:

By analogy with cooking, Data Requirements are like a recipe that lists the ingredients we need for the dish: that is, the kind of data we need for the analysis. The three objectives set by the Client determine the data requirements as follows:

(1) Basic information about the neighbourhoods in Buenos Aires.

The area and the population of each neighbourhood
The geographical coordinates that determine the administrative border of each neighbourhood (for map visualization)

(2) Risk statistics:

For the first and the second objectives, I need to gather the following historical statistics to construct a compound risk metric that profiles neighbourhoods from the perspectives of both the general security risk (crime) and the pandemic risk (COVID-19).

General security risk statistics (crime incidents) by neighbourhood
Pandemic risk statistics (COVID-19 confirmed cases) by neighbourhood

(3) Foursquare Data:

For the third objective, the Client specifically requires me to use the Foursquare API to characterise each Non-Outlier Neighbourhood.

A2. Data Sources

Based on the data requirements, I explored the publicly available data and found the following relevant sources.

(1) Basic info of the neighbourhoods of CABA:

The area and the population of all the relevant neighbourhoods, from Wikipedia: https://en.wikipedia.org/wiki/Neighbourhoods_of_Buenos_Aires
The city government of Buenos Aires provides a GeoJSON file containing the geographical coordinates that define the administrative boundaries of the barrios (neighbourhoods) of Buenos Aires: https://data.buenosaires.gob.ar/dataset/barrios/archivo/1c3d185b-fdc9-474b-b41b-9bd960a3806e
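
To make the role of the boundary file concrete, here is a minimal Python sketch of reading neighbourhood polygons from a GeoJSON FeatureCollection. The embedded sample is hand-made, and the property key "barrio" is an assumption about the real file's schema, which should be checked before use.

```python
import json

# A tiny hand-made FeatureCollection standing in for the city's GeoJSON file.
# The property key "barrio" is an ASSUMPTION about the real file's schema.
sample = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"barrio": "EXAMPLE"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-58.50, -34.60], [-58.48, -34.60],
                         [-58.48, -34.62], [-58.50, -34.62],
                         [-58.50, -34.60]]]
      }
    }
  ]
}
""")


def outer_ring(geometry):
    """Exterior ring of a Polygon, or of the first polygon in a MultiPolygon."""
    if geometry["type"] == "Polygon":
        return geometry["coordinates"][0]
    if geometry["type"] == "MultiPolygon":
        return geometry["coordinates"][0][0]
    raise ValueError(f"unsupported geometry type: {geometry['type']}")


def rough_centre(ring):
    """Mean of the ring's vertices, skipping the repeated closing point.

    This is a vertex average, not a true area centroid; it is only meant
    for placing a label or marker on a map.
    """
    points = ring[:-1] if ring[0] == ring[-1] else ring
    lon = sum(p[0] for p in points) / len(points)  # GeoJSON order is [lon, lat]
    lat = sum(p[1] for p in points) / len(points)
    return lat, lon


centres = {
    feature["properties"]["barrio"]: rough_centre(outer_ring(feature["geometry"]))
    for feature in sample["features"]
}
```

Note that GeoJSON stores positions as longitude first, then latitude, while most mapping libraries expect latitude first; the sketch swaps the order on return for that reason.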

(2) Historical risk statistics.

Crime Statistics: a CSV file compiled and uploaded by Rama to his GitHub repository: https://github.com/ramadis/delitos-caba/releases/download/3.0/delitos.csv
COVID-19 Statistics: the city government’s website provides COVID-19 statistics by neighbourhood: https://cdn.buenosaires.gob.ar/datosabiertos/datasets/salud/casos-covid-19/casos_covid19.xlsx

(3) Foursquare Data for Popular Venues by Neighbourhood:

As per the Client’s requirement, I specifically use the Foursquare API to characterise each Non-Outlier Neighbourhood.
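
As a rough illustration of what "characterising a neighbourhood by popular venues" can mean, the Python sketch below turns a list of venue records into per-neighbourhood category shares and top categories. The records are invented, and the step that would fetch them from the API is deliberately left out.

```python
from collections import Counter, defaultdict

# Hypothetical (neighbourhood, venue category) pairs; real ones would be
# extracted from the venue search results returned by the API.
venues = [
    ("A", "Restaurant"), ("A", "Restaurant"), ("A", "Gym"), ("A", "Cafe"),
    ("B", "Park"), ("B", "Restaurant"),
]


def count_categories(pairs):
    """Count venues per category within each neighbourhood."""
    counts = defaultdict(Counter)
    for neighbourhood, category in pairs:
        counts[neighbourhood][category] += 1
    return counts


def category_shares(counts):
    """Convert counts to shares so neighbourhoods of different size are comparable."""
    return {
        n: {cat: c / sum(cnt.values()) for cat, c in cnt.items()}
        for n, cnt in counts.items()
    }


counts = count_categories(venues)
shares = category_shares(counts)
top_venues = {n: [cat for cat, _ in cnt.most_common(2)] for n, cnt in counts.items()}
```

The share vectors in `shares` are the kind of venue profile that could later be fed to a clustering model, which is an assumption about the downstream step rather than something stated here.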

A3. Data Collection

What follows now is data collection, data understanding, and data preparation. Together, these parts usually occupy the majority of a project's time, e.g. in the range of 60–70%.

For this article, I will compress the description of these time-consuming parts and outline only the highlights.

After downloading all the relevant data from the sources above, I performed data reconciliation: cleaning the data and transforming it into a coherent format. Thereafter, I consolidated all the relevant data into two datasets: the “Risk Profile of Neighbourhoods” dataset and the “Foursquare Venue Profile” dataset. The first 5 rows of each dataset are presented below to illustrate their components.
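
A typical reconciliation problem when combining sources like these is that the same neighbourhood is spelled differently in each. The Python sketch below shows one way to normalise names before joining; the rows, column names, and values are hypothetical, and this is not necessarily how the project handled it.

```python
import unicodedata


def normalise(name):
    """Upper-case, trim, and strip accents so 'Núñez' and 'NUÑEZ ' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name.strip().upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical rows from two sources; the numbers are placeholders, not real data.
basic_info = [{"name": "Núñez", "population": 1000, "area_km2": 2.0}]
crime = [{"barrio": "NUÑEZ ", "crime_score": 12.5}]


def merge_on_name(left, left_key, right, right_key):
    """Inner-join two lists of dicts on the normalised neighbourhood name."""
    index = {normalise(row[right_key]): row for row in right}
    merged = []
    for row in left:
        key = normalise(row[left_key])
        if key in index:
            combined = {"neighbourhood": key}
            combined.update({k: v for k, v in row.items() if k != left_key})
            combined.update({k: v for k, v in index[key].items() if k != right_key})
            merged.append(combined)
    return merged


risk_profile = merge_on_name(basic_info, "name", crime, "barrio")
```

Because this is an inner join, any neighbourhood missing from either source silently drops out, so comparing the merged row count against the expected 48 is a sensible sanity check.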

The first 5 rows of “Risk Profile of Neighbourhoods”:

Image: Neighbourhoods’ Features & Risk Data: the first 5 rows

The first 5 rows of “Foursquare Venue Profile”:

Below is an outline of the data limitations.

(1) Crime Statistics: “Crime Severity Score”

The compiled crime data covers only the period from Jan 1, 2016 to Dec 31, 2018. For the purposes of the project, I assume that the data for the available period is good enough to serve as a representative proxy for the risk characteristics of each neighbourhood.

The original crime statistics had 7 crime categories. They were weighted according to the severity of each crime category and transformed into a single metric, the “Crime Severity Score”.
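
The weighting step can be pictured as a weighted sum of incident counts. The Python sketch below is only an illustration: the category names and weights are invented, since the actual seven categories and their weights are not listed here, and any further transformation applied in the project is not shown.

```python
# Hypothetical categories and weights; the real seven categories and their
# weights are not given in the text.
SEVERITY_WEIGHTS = {"homicide": 10.0, "robbery": 3.0, "theft": 1.0}


def crime_severity_score(counts, weights=SEVERITY_WEIGHTS):
    """Weighted sum of incident counts.

    A category without a weight raises KeyError, so unweighted categories
    cannot slip through unnoticed.
    """
    return sum(weights[category] * n for category, n in counts.items())


example_counts = {"homicide": 2, "robbery": 10, "theft": 50}
score = crime_severity_score(example_counts)  # 2*10 + 10*3 + 50*1
```

The choice of weights drives the ranking, so in practice they would need to be justified and, ideally, tested for sensitivity.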

(2) COVID-19 Statistics: “COVID-19 Confirmed Cases”

To measure the pandemic risk, I simply extracted the cumulative confirmed cases of COVID-19 for each neighbourhood. I did not net out the recovered cases. Thus, the COVID-19 statistic in this analysis is a gross figure. My assumption here is that the gross data will serve as a proxy for the empirical risk profile of COVID-19 infection.
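
Counting gross cumulative confirmed cases per neighbourhood amounts to a filtered group-by. The Python sketch below uses a small inline CSV in place of the city's spreadsheet; the column names and the "confirmado" label are assumptions, not the real file's schema.

```python
import csv
import io
from collections import Counter

# Hypothetical extract: column names and labels are ASSUMPTIONS, and the real
# source is an .xlsx file that would need converting or a spreadsheet reader.
raw = io.StringIO(
    "barrio,clasificacion\n"
    "A,confirmado\n"
    "A,descartado\n"
    "A,confirmado\n"
    "B,confirmado\n"
)

# Gross cumulative count: every confirmed case is counted once and recovered
# cases are not subtracted, matching the "gross figure" described above.
confirmed = Counter(
    row["barrio"]
    for row in csv.DictReader(raw)
    if row["clasificacion"] == "confirmado"
)
```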

(3) Foursquare Data:
