PCA Dimensionality Reduction for Large-Scale Data


Author: V丶Chao


Article link: https://blog.csdn.net/u011698800/article/details/107915834


20200810 -

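0. Introduction

I have recently been working on text visualization. The text is vectorized with sklearn's CountVectorizer followed by TF-IDF, which leaves me with data that is both high-dimensional and large in volume. Running sklearn's PCA on it directly is slow, and PCA has no n_jobs parameter for parallel execution. Spark, however, already ships a PCA implementation, so I want to use Spark for this step.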

1. Spark's PCA algorithm

1.1 Official usage example

>>> from pyspark.ml.feature import PCA, PCAModel
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
...     (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
...     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data,["features"])
>>> pca = PCA(k=2, inputCol="features", outputCol="pca_features")
>>> model = pca.fit(df)
>>> model.transform(df).collect()[0].pca_features
DenseVector([1.648..., -4.013...])
>>> model.explainedVariance
DenseVector([0.794..., 0.205...])
>>> pcaPath = temp_path + "/pca"
>>> pca.save(pcaPath)
>>> loadedPca = PCA.load(pcaPath)
>>> loadedPca.getK() == pca.getK()
True
>>> modelPath = temp_path + "/pca-model"
>>> model.save(modelPath)
>>> loadedModel = PCAModel.load(modelPath)
>>> loadedModel.pc == model.pc
True
>>> loadedModel.explainedVariance == model.explainedVariance
True

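The code above is the example from the official Spark documentation (version 2.4.4); spark and temp_path there are presumably supplied by the documentation's test setup. It shows that using PCA in Spark is not much different from using it elsewhere.
What I actually care about is this: my starting data has already been preprocessed into high-dimensional vectors of type numpy.ndarray. How do I pass data in that form to Spark and then apply the code above? The key part is the data variable in the example: how should it be built?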
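One possible way to bridge numpy and Spark, sketched under assumptions: each row of the array becomes a Vectors.dense value in a one-column DataFrame, mirroring the data list in the official example. The helper names ndarray_to_rows and ndarray_to_vector_df are hypothetical, and spark is assumed to be an existing SparkSession. Everything passes through the driver here, so this is only reasonable when the array fits in driver memory.

```python
import numpy as np


def ndarray_to_rows(X):
    """Turn a 2-D array into one-element tuples holding plain Python floats."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_samples, n_features)")
    return [(row.tolist(),) for row in X]


def ndarray_to_vector_df(spark, X, col="features"):
    """Build a Spark DataFrame with a single vector column from a 2-D array.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported lazily so that ndarray_to_rows stays usable without pyspark.
    from pyspark.ml.linalg import Vectors

    rows = [(Vectors.dense(values),) for (values,) in ndarray_to_rows(X)]
    return spark.createDataFrame(rows, [col])
```

With a running session, ndarray_to_vector_df(spark, X) would give a DataFrame whose features column can be used as the inputCol of PCA.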

1.2 My own usage

While searching on Google I found some code [1] that uses the iris dataset as its example. Let's walk through its steps.

1.2.1 Loading the data

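import pandas as pd
from sklearn.datasets import load_iris
from pyspark.sql import SparkSession

# reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

iris = load_iris()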
X = iris['data']
y = iris['target']

data = pd.DataFrame(X, columns = iris.feature_names)
dataset = spark.createDataFrame(data, iris.feature_names)
dataset.show(6)

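The code above does the following:
1) loads the iris data
2) creates a pandas DataFrame
3) creates a Spark DataFrame
In other words, this is where the numpy data gets connected to a Spark DataFrame.

1.2.2 Assembling the features into a single column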

from pyspark.ml.feature import VectorAssembler

# specify the input columns' name and
# the combined output column's name
assembler = VectorAssembler(
    inputCols = iris.feature_names, outputCol = 'features')

# use it to transform the dataset and select just
# the output column
df = assembler.transform(dataset).select('features')
df.show(6)
# output :
'''
+-----------------+
|         features|
+-----------------+
|[5.1,3.5,1.4,0.2]|
|[4.9,3.0,1.4,0.2]|
|[4.7,3.2,1.3,0.2]|
|[4.6,3.1,1.5,0.2]|
|[5.0,3.6,1.4,0.2]|
|[5.4,3.9,1.7,0.4]|
+-----------------+
only showing top 6 rows
'''

An earlier post of mine, 《Spark机器学习实例》 (Spark Machine Learning Examples), performed the same step, but the code at the time looked like this:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
data = iris_data.rdd.map(lambda row: LabeledPoint(row[-1], Vectors.dense(row[:-1])))

Both do essentially the same thing: they gather the feature values into a single vector.

1.2.3 Standardizing the vectors

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    inputCol = 'features', 
    outputCol = 'scaledFeatures',
    withMean = True,
    withStd = True
).fit(df)

# when we transform the dataframe, the old
# feature will still remain in it
df_scaled = scaler.transform(df)
df_scaled.show(6)

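There is not much to say here: the vectors are standardized (withMean centers each feature, withStd scales it to unit standard deviation) and then passed to PCA.

1.2.4 Applying PCA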

from pyspark.ml.feature import PCA

n_components = 2
pca = PCA(
    k = n_components, 
    inputCol = 'scaledFeatures', 
    outputCol = 'pcaFeatures'
).fit(df_scaled)

df_pca = pca.transform(df_scaled)
print('Explained Variance Ratio', pca.explainedVariance.toArray())
df_pca.show(6)

The code above is the core of the PCA step; it produces the reduced-dimension data.

1.2.5 Retrieving the reduced data
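To make the explained variance ratio less of a black box, the sketch below does PCA in plain numpy through an eigendecomposition of the covariance matrix. It is an illustration of the technique, not Spark's code, and pca_via_covariance is a hypothetical name. The input is the three vectors from the official example in 1.1, and the ratios it prints should land close to the 0.794 and 0.205 shown there. The sign of each projected coordinate is arbitrary, so the projected values themselves need not match.

```python
import numpy as np


def pca_via_covariance(X, k):
    """Project X onto its top-k principal components.

    Returns the projected (centered) data and the explained variance ratios.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)       # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    components = eigvecs[:, order]
    ratio = eigvals[order] / eigvals.sum()
    return centered @ components, ratio


# The three rows from the official example, written out densely.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])
projected, ratio = pca_via_covariance(X, k=2)
print(ratio)
```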

import numpy as np

# not sure if this is the best way to do it
X_pca = df_pca.rdd.map(lambda row: row.pcaFeatures).collect()
X_pca = np.array(X_pca)

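It should be possible to fetch this column directly by name, without going through the RDD.

1.3 Summary

There is little more to say about using PCA itself: pass the parameters its interface describes and read back the results. The key part is the preprocessing, that is, getting the data into a form that the PCA step can consume.
The code above should be enough; the next step is to try it on my own data.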
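A minimal sketch of that idea, assuming the usual DataFrame behavior: select the column, collect the rows, and index each row by column name. The helper name vector_column_to_numpy is hypothetical. Like the original snippet, it pulls every row to the driver.

```python
import numpy as np


def vector_column_to_numpy(df, col):
    """Collect one vector column of a Spark DataFrame into a 2-D numpy array."""
    rows = df.select(col).collect()
    # Each row is indexed by column name; toArray() converts the vector to an ndarray.
    return np.array([row[col].toArray() for row in rows])
```

For the DataFrame above this would be called as X_pca = vector_column_to_numpy(df_pca, 'pcaFeatures').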

References

[1] spark pca
