Factor Analysis: Characterising Companies Based on Financial Metrics

Note: all the code is available in the repo: https://quanp.readthedocs.io/en/latest/tutorials.html

Previously, the author performed principal component analysis on the financial metrics of all the S&P500 companies and found that the first 5 PCs carried most of the explained variance. This article does not intend to replicate that work; it is recommended to read through the previous article before proceeding.

A principal component (dimension) from PCA can be considered a factor: a space made up of a set of features. Fundamentally, PCA, or the similar Factor Analysis (FA), allows variables that are correlated with one another but largely independent of other subsets of variables to combine into components/factors. Both PCA and FA summarise patterns of correlation among observed variables and reduce a large number of observed variables (features/dimensions) to a smaller number of components/factors. Frequently, such factor/component analyses produce an operational definition for an underlying process by using the correlations/contributions (loadings) of the observed variables in a factor/component (Tabachnick & Fidell, 2013).

Here, we will dive into the first 5 PCs/factors and their respective underlying features.

1. Download the data

Here, we get the 505 S&P500 member companies listed on Wikipedia and retrieve a list of fundamental metrics for each company from the TD Ameritrade API (all functions are available in the quanp tools).

The list of potentially useful fundamental variables/features:-

[Figure: table of the fundamental variables/features retrieved]
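The article performs both steps with quanp helper functions, shown only as screenshots. One practical detail worth sketching, independent of quanp: Wikipedia lists class shares with a dot (e.g. BRK.B), while many market-data APIs expect a dash. The `clean_ticker` helper below is a hypothetical illustration, not quanp's or TD Ameritrade's actual API, and the dash convention is an assumption you should verify against your own data source:

```python
def clean_ticker(symbol: str) -> str:
    # Wikipedia lists class shares with a dot (e.g. "BRK.B"); many market-data
    # APIs expect a dash instead ("BRK-B"). Verify the convention your own
    # data source uses; this helper is purely illustrative.
    return symbol.strip().upper().replace(".", "-")

wiki_symbols = ["MMM", "BRK.B", "BF.B"]   # tiny illustrative subset, not all 505
api_symbols = [clean_ticker(s) for s in wiki_symbols]
print(api_symbols)   # ['MMM', 'BRK-B', 'BF-B']
```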

2. Simple features preprocessing

Factor analysis (FA) and principal component analysis (PCA) do not enforce any distributional assumptions on the variables. However, if the variables are normally distributed, the analysed result is usually enhanced (Tabachnick & Fidell, 2013). Here, as in the previous article, we apply only two simple, standard preprocessing steps: a log(x+1) transformation followed by standardization scaling.

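A minimal sketch of these two steps with NumPy and scikit-learn, on toy data standing in for the fundamentals matrix (the article's own code appears only as a screenshot):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy stand-in for the fundamentals matrix (rows = companies, cols = metrics)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=1.0, sigma=1.0, size=(6, 3))

X_log = np.log1p(X)                               # log(x + 1) tames right skew
X_scaled = StandardScaler().fit_transform(X_log)  # zero mean, unit variance

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```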

3. Dimensional reduction using PCA

PCA is a form of orthogonal rotation, so the extracted components/factors are uncorrelated with each other. Running PCA reduces the dimensionality of the data, reveals the main axes of variation, and denoises the data. Previously, we found that the 'elbow' point of the PCA variance ratios suggests that at least the first five PCs are useful for characterising the companies.

[Figure: PCA variance-ratio scree plot]
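A small self-contained illustration of reading the variance ratios off a fitted scikit-learn PCA; the synthetic low-rank matrix here stands in for the preprocessed fundamentals:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic stand-in for the fundamentals matrix: rank-2 signal plus noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
# the 'elbow' sits where the ratios flatten out; here the first two PCs
# dominate by construction
print(np.cumsum(ratios).round(3))
```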

For instance, we also found that the Information Technology, Financials, and Energy sectors separate from low to high along PC1.


4. Dissecting PCA outputs

As noted above, PCA and FA fundamentally allow variables that are correlated with one another but largely independent of other subsets of variables to combine into components/factors, and the loadings of the observed variables in a factor/component often provide an operational definition for an underlying process (Tabachnick & Fidell, 2013).

To obtain the PCA eigenvectors (the cosines of rotation of variables into components) and eigenvalues (the proportion of overall variance explained), and from them the loadings of each component (eigenvectors scaled by their respective eigenvalues; loadings are the covariances between variables and components), we exported the data from the AnnData object to a pandas DataFrame and re-ran PCA using sklearn.decomposition.PCA.


We can calculate the loadings of each component using the following formula:-

loadings = eigenvectors × √(eigenvalues)
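In scikit-learn terms (a sketch on toy data), `components_` holds the unit-norm eigenvectors row-wise and `explained_variance_` holds the eigenvalues, so the loadings are the eigenvectors scaled by the square roots of the eigenvalues. A useful sanity check: the squared loadings of each component sum back to its eigenvalue.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[:, 1] += X[:, 0]                     # induce some correlation structure

pca = PCA().fit(X)
# components_ rows are the (unit-norm) eigenvectors; scaling them by the
# square roots of the eigenvalues gives the loadings
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.shape)                  # variables x components
```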

From the heatmap below, we can see some patterns of loadings explaining each PC, but they are not very obvious. This is because the goal of PCA is to extract maximum variance from a dataset with a few orthogonal components, providing only an empirical summary of the dataset. We proceed to Factor Analysis to see whether its loadings reveal clearer underlying patterns for each factor.

[Figure: heatmap of PCA loadings]

5. Factor analysis (FA)

In contrast to PCA, the goal of FA (with an orthogonal rotation) is to reproduce the correlation matrix with a few orthogonal factors.

Testing for the factorability of R

A matrix that is 'factorable' should include several sizable correlations. The expected size depends, to some extent, on N (larger sample sizes tend to produce smaller correlations), but if no correlation exceeds 0.30, the use of FA is questionable because there is probably nothing to factor-analyse.

The Bartlett test of sphericity tests the hypothesis that the correlations in a correlation matrix are zero, but it is not very discriminating: with a substantial sample size the test is likely to be significant even if the correlations are very low. The test is therefore recommended only if there are fewer than 5 samples per variable. In our case, we have 23 variables and 505 samples (companies) in total, well above the maximum of 115 samples that would qualify under this recommendation, so we do not qualify for this test. We perform it anyway to see whether it is significant: the p-value is 0 (significant), although given the sample size this is not a reliable conclusion.

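The article runs this test through a library call shown only as a screenshot. As a sketch, Bartlett's statistic can be computed directly from the standard formula, chi-squared = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom, where R is the correlation matrix of n samples over p variables:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # standard statistic: chi2 = -(n - 1 - (2p + 5) / 6) * ln|R|
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)

rng = np.random.default_rng(2)
noise = rng.normal(size=(500, 5))                # independent columns
base = rng.normal(size=(500, 1))
corr = base + 0.1 * rng.normal(size=(500, 5))    # strongly correlated columns

_, p_noise = bartlett_sphericity(noise)
_, p_corr = bartlett_sphericity(corr)
print(p_corr < 0.05)   # correlated data: the null is clearly rejected
```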

Alternatively, the Kaiser-Meyer-Olkin (KMO) test (Kaiser's measure of sampling adequacy) is the ratio of the sum of squared correlations to the sum of squared correlations plus the sum of squared partial correlations. The value approaches 1 if the partial correlations are small. Values of 0.6 and above are recommended for a good FA.

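A minimal reimplementation of the KMO measure exactly as described above, with the partial correlations taken from the inverse of the correlation matrix (the article itself uses a library call shown only as a screenshot):

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d                 # matrix of partial correlations
    np.fill_diagonal(partial, 0.0)
    off = R.copy()
    np.fill_diagonal(off, 0.0)         # keep only off-diagonal correlations
    r2 = (off ** 2).sum()
    p2 = (partial ** 2).sum()
    return r2 / (r2 + p2)

rng = np.random.default_rng(3)
base = rng.normal(size=(500, 2))
X = base @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(500, 6))
score = kmo(X)
print(round(score, 2))
```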

Estimating the number of factors and filtering for variables with communalities > 0.2

Here, as we are only interested in how many significant factors to use for the following work, we ran a simple factor analysis without any rotation. In fact, we could also just use the estimate based on the PCA variance ratios (eigenvalues) above.

[Figure: factor-analysis scree plot]

Again, we see that the elbow point on the scree plot falls at about F5. Next, the communalities of the variables/features are inspected to see whether the variables are well defined by the solution. Communalities indicate the percentage of variance in a variable that overlaps with variance in the factors. Ideally, we should drop variables with low communalities, for example excluding those below 0.2. Here, only 22 of 34 variables had communalities > 0.2. However, for the exploratory purposes of this dataset, we do not remove those variables.

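A sketch of the communality check using scikit-learn's FactorAnalysis on toy data (the article's own tooling appears only as a screenshot). On standardized data, a variable's communality is the sum of its squared loadings across the retained factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.5 * rng.normal(size=(300, 8))
Z = StandardScaler().fit_transform(X)

fa = FactorAnalysis(n_components=2).fit(Z)        # unrotated, as in the text
# on standardized data, a variable's communality is the sum of its squared
# loadings across the retained factors
communalities = (fa.components_ ** 2).sum(axis=0)
keep = communalities > 0.2
print(int(keep.sum()), "of", len(keep), "variables pass the 0.2 cut-off")
```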

Factor analysis with Varimax (orthogonal) rotation and maximum-likelihood factor extraction

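The article's code for this step is shown only as a screenshot. As a stand-in sketch, scikit-learn's FactorAnalysis (version 0.24 and later) supports a varimax rotation directly; this is not necessarily the library the author used, but it illustrates the effect of the rotation, which concentrates each variable's loading on a single factor:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 3))
mix = np.zeros((3, 9))
for f in range(3):
    # each factor drives its own block of three variables
    mix[f, 3 * f:3 * f + 3] = rng.normal(1.0, 0.2, size=3)
X = latent @ mix + 0.3 * rng.normal(size=(300, 9))
Z = StandardScaler().fit_transform(X)

fa = FactorAnalysis(n_components=3, rotation="varimax").fit(Z)
loadings = fa.components_.T            # variables x factors, after rotation
# after varimax, each variable should load mainly on a single factor
print(np.abs(loadings).argmax(axis=1))
```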
We can revisit the correlation matrix plot of the features/variables of this dataset, shown below. The underlying variables within a given factor (see the loadings heatmap) are usually highly correlated with one another but only weakly correlated with the underlying variables of other factors.

[Figures: feature correlation matrix; heatmap of varimax-rotated factor loadings]

As a rule of thumb, only variables with loadings of 0.32 and above are interpreted. The greater the loading, the more the variable is a pure measure of the factor. Comrey and Lee (1992) suggest that loadings of:-

  • >0.71 (50% overlapping variance) are considered excellent;
  • >0.63 (40% overlapping variance) very good;
  • >0.55 (30% overlapping variance) good;
  • >0.45 (20% overlapping variance) fair;
  • 0.32 (10% overlapping variance) poor.

In summary, based on the loadings heatmap above:

  • Factor 1 (FA1) seems to suggest the operational/return performance or valuation of a company;
  • FA2 is more correlated with the volatility of a company's share value/market capital;
  • FA3 seems to suggest long-term debt obligations;
  • FA4 seems to suggest short-term debt obligations (i.e. quick/current ratio); and lastly
  • FA5 seems to be correlated with the gross-margin performance of a company.

6. Author's final opinion

Factor analysis (FA) seems to be a better dimensional-reduction approach than PCA (which, compared with FA, is more of an empirical summary) when we are interested in a theoretical solution uncontaminated by unique and error variability, and have designed the study around underlying constructs expected to produce scores on the observed variables. In this FA study, we used Varimax rotation (an orthogonal rotation, so all the factors are uncorrelated with each other); the resulting factors could in turn serve as useful latent variables in a factor model for multiple linear or logistic regressions, e.g. to predict positive or negative return of a stock price. We will try out these latent variables/factors in the next article to see whether they predict such events well.

References:-

  1. Loadings vs eigenvectors in PCA: when to use one or another? https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

  2. Making sense of principal component analysis, eigenvectors & eigenvalues https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/35653#35653

  3. Tabachnick & Fidell. Using Multivariate Statistics, Sixth Edition. Pearson, 2013; ISBN-13: 9780205956227.

Translated from: https://medium.com/swlh/factor-analysis-characterising-companies-based-on-financial-metrics-3d5fcc4e8b6f
