
Recently, I was intrigued by the news that Tanzania has attained lower-middle-income status under the World Bank's classification, five years ahead of projection. Curious about how that judgement is made, I took a look at the World Bank's official website.

Basically, the World Bank classifies the world's economies into four income groups (high, upper-middle, lower-middle and low) based on Gross National Income (GNI) per capita in current US$.

Undoubtedly, GNI per capita is a useful indicator: it represents the average income of an economy's residents, which in turn captures the country's overall level of economic development. Nonetheless, I believe the picture is richer than that, since countries within the same income group may still differ a lot in many respects.

Therefore, using a group of interesting indicators selected from the World Bank database, I first apply factor analysis to see what dimensions these indicators represent, and then apply cluster analysis to re-classify the economies. Hopefully, this article helps us understand the world a little better. The code for this article is available in the author's GitHub repository.

Data

First, 29 indicators covering different aspects are selected for this work. To avoid the bias that level indicators with very different scales could introduce, such as Gross Domestic Product (GDP) or population size, I have mainly chosen ratio or growth indicators, including some non-traditional ones such as diabetes prevalence and mobile cellular subscriptions.

Indicators selected

Figure: List of selected indicators

From the list above, you may notice that the reference year differs across indicators because of data availability. I believe this is one of the major reasons why the World Bank uses a single indicator (GNI per capita) to classify economies.

The best we can do is to pick indicators for which a reasonable number of economies report data (> 140), and then use the most recent year available. After filtering, 159 countries are included in this exercise.
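
The filtering rule described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the long-format layout, the column names, the toy values and the helper `latest_year_with_coverage` are all hypothetical, and the threshold is lowered to 3 so that the toy data can pass it.

```python
import pandas as pd

# Hypothetical long-format extract: one row per (country, indicator, year).
records = [
    ("A", "rural_pop", 2018, 40.0), ("B", "rural_pop", 2018, 55.0), ("C", "rural_pop", 2018, 20.0),
    ("A", "rural_pop", 2019, 39.0),  # 2019 is reported by one country only
    ("A", "gdp_growth", 2019, 2.1), ("B", "gdp_growth", 2019, 5.3), ("C", "gdp_growth", 2019, 1.0),
]
df = pd.DataFrame(records, columns=["country", "indicator", "year", "value"])

# The article requires more than 140 economies; 3 is used here for the toy data.
MIN_COUNTRIES = 3


def latest_year_with_coverage(group, min_countries):
    """Return the most recent year in which at least min_countries report a value."""
    coverage = group.dropna(subset=["value"]).groupby("year")["country"].nunique()
    eligible = coverage[coverage >= min_countries]
    return int(eligible.index.max()) if not eligible.empty else None


chosen_year = {
    name: latest_year_with_coverage(group, MIN_COUNTRIES)
    for name, group in df.groupby("indicator")
}

# Keep only the chosen year of each indicator, then pivot to a country x indicator table.
mask = df["year"] == df["indicator"].map(chosen_year)
wide = df[mask].pivot(index="country", columns="indicator", values="value").dropna()
```

In this toy case the rural population indicator falls back to 2018 because 2019 is too sparsely reported, which mirrors why the reference years differ across indicators.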

Correlation matrix

Now, let's look at the relationships between the indicators by plotting a correlation matrix.
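
The author's plotting code is not reproduced in this copy of the article. As a rough sketch of the underlying computation, the snippet below builds a Pearson correlation matrix with NumPy. The three indicators are synthetic stand-ins generated for illustration, not World Bank data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries = 159

# Synthetic stand-ins for three indicators (not World Bank data).
electricity = rng.uniform(20, 100, n_countries)
drinking_water = electricity + rng.normal(0, 5, n_countries)      # moves with electricity
rural_share = 100 - electricity + rng.normal(0, 5, n_countries)   # moves against it

X = np.column_stack([electricity, drinking_water, rural_share])

# Pearson correlations between the columns (rows are countries).
corr = np.corrcoef(X, rowvar=False)
```

The resulting square matrix can then be handed to any heatmap routine for plotting.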

Figure: Correlation matrix of the selected indicators

From the correlation matrix, we can observe some interesting but reasonable relationships. For example:

(a) Positive relationship between access to electricity (% of population) and the percentage of people using at least basic drinking water services: electricity and drinking water are both basic services in society. They tend to be developed at a similar stage, and hence reach a similar level of accessibility within a country.

(b) Positive relationship between vulnerable employment (% of total employment) and employment in agriculture (% of total employment): compared with employment in the industrial and services sectors, employment in agriculture tends to be more vulnerable.

(c) Negative relationship between rural population (% of population) and individuals using the Internet (% of population): the higher the share of rural population, the less developed the economy is likely to be. The share of rural population is therefore negatively correlated with the percentage of individuals using the Internet, which reflects the technological development of an economy.

Factor analysis

In fact, there are many other interesting relationships among the variables. To understand the whole picture faster and better, we can apply factor analysis to reduce the 29 indicators to a smaller number of factors.

But how many factors should we keep? We can get an idea from a scree plot, with the factor number on the x-axis and the eigenvalue on the y-axis. Generally, a factor is retained if its eigenvalue is greater than or close to one. The scree plot below suggests that there could be 7 factors.
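
The eigenvalues behind a scree plot come from the correlation matrix of the indicators. The sketch below computes them for synthetic data (8 made-up indicators driven by 2 hidden factors) and counts how many reach one. It uses a strict cutoff of 1.0, whereas the article also accepts values close to one, so treat the count as a starting point rather than a rule.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars, n_latent = 159, 8, 2

# Synthetic data: 8 indicators driven by 2 hidden factors plus noise.
latent = rng.normal(size=(n_obs, n_latent))
weights = rng.normal(size=(n_latent, n_vars))
X = latent @ weights + 0.5 * rng.normal(size=(n_obs, n_vars))

corr = np.corrcoef(X, rowvar=False)

# eigvalsh returns ascending eigenvalues; reverse them for the scree plot order.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Number of factors whose eigenvalue is at least one.
n_factors = int(np.sum(eigenvalues >= 1.0))
```

Plotting `eigenvalues` against `range(1, n_vars + 1)` gives the scree plot itself.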

For the concepts behind factor analysis, this article gives a good explanation.


Scree plot

Figure: Scree plot for factor analysis

Variance explained by loadings

With 7 factors, 71% of the total variance of the 29 indicators is explained. The higher this percentage, the more of the original information the factors retain.

Factor loadings

Next, we look at the heatmap of factor loadings. A loading is essentially the correlation coefficient between a variable and a factor, and its square indicates how much of that variable's variance the factor explains.
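
To make the link between loadings and explained variance concrete, here is a minimal sketch that extracts unrotated loadings from the correlation matrix by the principal-component method. The article does not say which extraction or rotation method was used, so this is an assumption for illustration, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_vars, k = 159, 6, 2

# Synthetic data: 6 indicators driven by 2 hidden factors plus noise.
latent = rng.normal(size=(n_obs, k))
X = latent @ rng.normal(size=(k, n_vars)) + 0.5 * rng.normal(size=(n_obs, n_vars))

corr = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1][:k]  # indices of the k largest eigenvalues

# Loadings: correlation between each variable (row) and each factor (column).
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])

# Share of each variable's variance captured by the k factors together.
communalities = (loadings ** 2).sum(axis=1)

# Share of total variance explained by each factor, and by all k factors.
explained_per_factor = (loadings ** 2).sum(axis=0) / n_vars
total_explained = explained_per_factor.sum()
```

Under this method, the sum of squared loadings in a column equals that factor's eigenvalue, which is why the scree plot and the explained-variance figure tell a consistent story.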

Figure: Heatmap of factor loadings

Let's look in depth at what the 7 factors mean, and see which variables correlate highly with each factor. Please note that the interpretation below is subjective.

Factor 0: Access to essential services in society (access to electricity for rural / urban population, individuals using the Internet, people using at least basic drinking water services, mobile cellular subscriptions)

Factor 1: Youth employment situation (employment to population ratio, ages 15–24; labor force participation rate for ages 15–24, female / male)

Factor 2: Overall economic growth (GDP growth rate and GDP per capita growth rate)

Factor 3: Industrial development (value added of industry (including construction) and CO2 emissions)

Factor 4: Health situation (diabetes prevalence and PM2.5 air pollution). Many studies have linked particulates from cars to diabetes. If you are interested, this article is a good one.

Factor 5: Capability in manufacturing and trade of manufactured goods (employment in industry, merchandise trade and value added of manufacturing)

Factor 6: Professional services development (trade in services, value added of services and secure Internet servers)

Now we know that the 7 factors represent seven quite different aspects of an economy. However, a few indicators, namely birth rate, death rate and infant mortality rate, have very low correlation with all 7 factors. This may make sense, as these indicators are more like the end-products of many elements of the economy, which makes it difficult to assign them to any of the factors above.

Cluster analysis

Next, we apply cluster analysis to classify the economies. In what follows, we use one of the most common methods, hierarchical clustering, with a bottom-up (agglomerative) approach, Euclidean distance and Ward's method to measure similarity. For a detailed explanation of hierarchical clustering, this article gives a very good lesson.

Standardization

Each indicator has its own scale. For example, the rural share of the population is typically a much larger number than the GDP growth rate. To prevent such scale differences from giving the indicators unequal weights and leading to unreliable conclusions, we first have to standardize the data.
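
A minimal sketch of z-score standardization, assuming the usual "subtract the mean, divide by the standard deviation" definition (the article does not state which scaler was used). The two columns are made-up values for a rural population share and a GDP growth rate.

```python
import numpy as np

# Two hypothetical indicators on very different scales (rows are countries).
X = np.array([
    [80.0, 2.0],  # rural population share (%), GDP growth (%)
    [60.0, 4.0],
    [40.0, 6.0],
    [20.0, 8.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)           # population standard deviation (ddof=0)
X_scaled = (X - mean) / std   # each column now has mean 0 and unit variance
```

After scaling, a one-unit difference means "one standard deviation" in every column, so no indicator dominates the Euclidean distances merely because its raw numbers are larger.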

Hierarchical cluster analysis

After standardizing the data, we can perform the clustering with AgglomerativeClustering, a class from the scikit-learn library.

To visualize the clustering result, we use a dendrogram, a tree-like diagram that records the sequence of merges or splits. Note, however, that the number of clusters finally chosen is a judgement call. With too many clusters, the classification becomes too detailed; with too few, the economies may not be well separated.
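
One way to build such a tree is with SciPy's hierarchy module, sketched below on synthetic data (three well-separated groups of ten points). The article does not show how its dendrogram was produced, so the choice of SciPy here is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(3)

# Synthetic standardized data: three well-separated groups of 10 "economies" each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Bottom-up merge history using Ward's method on Euclidean distances.
Z = linkage(X, method="ward", metric="euclidean")

# no_plot=True returns the tree layout without drawing it (no matplotlib needed).
tree = dendrogram(Z, no_plot=True)

# Cutting the tree into a chosen number of clusters is the analyst's decision.
labels = fcluster(Z, t=3, criterion="maxclust")
```

Dropping `no_plot=True` draws the tree with matplotlib; the heights of the merges are what one inspects when deciding where to cut.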

Dendrogram

Figure: Dendrogram for hierarchical clustering

From the dendrogram, 12 clusters looks like a reasonable choice. Based on this, we next apply AgglomerativeClustering to the dataset, setting n_clusters to 12, affinity to Euclidean distance, and linkage to Ward's method.
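
A hedged sketch of that call on synthetic data follows. It uses 4 clusters instead of 12 because the toy dataset only contains four groups. The distance argument is left at its default: Ward linkage works on Euclidean distances, and newer scikit-learn releases name this argument `metric` rather than `affinity`, so omitting it keeps the example portable across versions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)

# Synthetic standardized data: 4 separated groups of 15 points in 5 dimensions.
centers = 10.0 * np.eye(4, 5)
X = np.vstack([c + rng.normal(scale=0.5, size=(15, 5)) for c in centers])

# Ward linkage with the default (Euclidean) distance.
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X)  # one cluster label per row of X
```

With the article's data, `X` would be the standardized indicator table and `n_clusters` would be 12.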

Country list by clusters

The clustering result is shown as a country list below.

Figure: Country list by clusters

Characteristics of the clusters

After grouping the 159 countries into 12 clusters, the most important task is to understand the characteristics of each cluster and why its countries are grouped together. So let's look at the heatmap below. I have extracted the 20 variables that correlate highly with the 7 factors and sorted them by factor, e.g. the first five rows represent Factor 0 (access to essential services in society).
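
One plausible way to build such a profile is to average each standardized indicator within each cluster. The article does not state how its heatmap values were computed, so the sketch below is an assumption, with hypothetical column names and values.

```python
import pandas as pd

# Hypothetical standardized indicators with cluster labels already assigned.
data = pd.DataFrame(
    {
        "access_electricity": [1.2, 1.0, -0.9, -1.3],
        "gdp_growth": [-0.5, -0.7, 0.8, 0.4],
        "cluster": [0, 0, 1, 1],
    },
    index=["W", "X", "Y", "Z"],
)

# Mean of every indicator within each cluster; transpose so rows are indicators.
profile = data.groupby("cluster").mean().T
```

Each column of `profile` then describes one cluster, and each row one indicator, which is the layout a heatmap of cluster characteristics needs.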

Figure: Heatmap of cluster characteristics for the 20 selected variables

Based on these characteristics, I further group the 12 clusters into 4 broad categories (most developed, more developed, less developed and least developed economies). Even within each category, however, the clusters' characteristics still vary somewhat. Please refer to the detailed descriptions below.

Most developed economies

Common characteristics: excellent access to essential services in society, good industrial development and a relatively good health situation

Cluster 0 (United States, United Kingdom and Japan): high youth labor force participation but slow economic growth

Cluster 2 (France, Italy and Spain): great capability in trading manufactured goods but low youth labor force participation

Cluster 7 (Ireland and Luxembourg): very fast economic growth and excellent professional services development but relatively weak capability in manufacturing

More developed economies

Common characteristics: good access to essential services but relatively low levels of both youth labor force participation and professional services development

Cluster 1 (Brazil, Argentina and Uruguay): poor industrial development and weak capability in manufacturing and trade of manufactured goods

Cluster 5 (Qatar and Saudi Arabia): excellent industrial development but a very poor health situation

Cluster 8 (China, Korea and South Africa): good industrial development and manufacturing capability with moderate economic growth

Less developed economies

Common characteristics: moderate capability in manufacturing and trade of manufactured goods but a low level of professional services development and poor access to essential services in society

Cluster 6 (India, Egypt and Bangladesh): great industrial development but very low youth labor force participation and a poor health situation

Cluster 9 (Vietnam and Cambodia): very high youth labor force participation and very fast economic growth

Cluster 11 (Mexico, Indonesia and Philippines): average industrial development and economic growth

Least developed economies

Common characteristics: moderate economic growth but very poor industrial development and access to essential services in society

Cluster 3 (Afghanistan, Pakistan and Cameroon): very weak capability in manufacturing and trade of manufactured goods and a relatively poorer health situation

Cluster 4 (Zimbabwe and Uganda): very high youth labor force participation but poor professional services development and very weak capability in manufacturing and trade of manufactured goods

Cluster 10 (Djibouti and Namibia): above-average capability in manufacturing and trade of manufactured goods but low youth labor force participation

To summarize, please refer to the heatmap below. The number in each box is the cluster's rank among all clusters in that aspect (factor). The smaller the number, the better the cluster performs in that aspect.
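
Such a ranking can be derived from per-cluster scores with a single pandas call. The sketch below assumes that a higher score means better performance, which would need to be reversed for aspects where lower is better (for instance a factor driven by diabetes prevalence). The scores and column names are hypothetical.

```python
import pandas as pd

# Hypothetical average factor scores per cluster (rows: clusters, columns: factors).
scores = pd.DataFrame(
    {"essential_services": [1.4, 0.2, -1.1], "economic_growth": [-0.3, 0.9, 0.5]},
    index=pd.Index([0, 1, 2], name="cluster"),
)

# Rank 1 goes to the highest score in each column (no ties in this toy data).
ranks = scores.rank(ascending=False).astype(int)
```

Here cluster 0 ranks first on essential services but last on economic growth, the kind of contrast the summary heatmap is meant to surface.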

Figure: Heatmap of cluster rankings by factor

Comparison with the World Bank's current classification

Last but not least, it is interesting to compare our classification (most developed, more developed, less developed and least developed) with the World Bank's (high, upper-middle, lower-middle and low income).

Figure: Comparison of our classification with the World Bank's

Based on the comparison table, 55% of the countries fall into the same group under the two classification methods. Surprisingly, the match rate reaches 70% for the high-income group and 94% for the low-income group. In contrast, it is relatively low (< 40%) for the two middle-income groups.
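
The comparison table and match rates can be computed along the following lines. The six economies and their labels are invented for illustration, and the tier-to-tier mapping (most developed to high income, and so on) is the pairing implied by the article rather than something it spells out in code.

```python
import pandas as pd

# Hypothetical labels for six economies under the two schemes.
world_bank = ["high", "high", "upper-middle", "lower-middle", "low", "low"]
ours = ["most", "most", "less", "less", "least", "least"]

# Assumed pairing of our four categories with the World Bank's four income groups.
tier_of = {"most": "high", "more": "upper-middle", "less": "lower-middle", "least": "low"}

df = pd.DataFrame({"world_bank": world_bank, "ours": [tier_of[o] for o in ours]})

# Counts for every pair of groups (the comparison table).
table = pd.crosstab(df["world_bank"], df["ours"])

df["match"] = df["world_bank"] == df["ours"]
overall_rate = df["match"].mean()                         # share placed in the same tier
rate_by_group = df.groupby("world_bank")["match"].mean()  # match rate per income group
```

In this toy example five of the six economies land in the same tier, and the single mismatch sits in a middle-income group.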

Conclusion

The result above indicates that Gross National Income (GNI) per capita may show only half of the picture. There are many other stories beyond it, especially for the middle-income / developing economies. The economic models and social situations of these countries can differ a lot even when their GNI per capita is similar.

This article has used two popular statistical methods, factor analysis and cluster analysis, to help us understand economies along different dimensions and classify countries. I hope it raises your interest in analyzing the world's economies along more dimensions and in thinking beyond the official classification.

Thank you very much, and see you next time.


If you are interested in the application of cluster analysis to stock selection, you may want to read another article of mine. Thanks.

Translated from: https://towardsdatascience.com/factor-analysis-cluster-analysis-on-countries-classification-1bdb3d8aa096
