Predictive Data Analytics with Apache Spark -- Principal Component Analysis (PCA)

This is the fourth article in Boutros El-Gamil's series on predictive data analytics with Apache Spark. The original post is available at http://www.data-automaton.com/2019/01/10/predictive-data-analytics-with-apache-spark-part-4-principal-component-analysis/

The first three parts are:

  1. Predictive Data Analytics with Apache Spark -- Introduction

  2. Predictive Data Analytics with Apache Spark -- Data Preparation

  3. Predictive Data Analytics with Apache Spark -- Feature Engineering


In the previous article we generated dozens of new features from the existing ones, which greatly expanded the dimensionality of our data. To go further with predictive modeling, we need to reduce that dimensionality. Lowering the number of dimensions lets us visualize the data more effectively, test different parameter settings of machine learning algorithms to optimize our predictive solution, and make better use of memory and storage. One of the most common procedures for dimensionality reduction is the Principal Component Analysis (PCA) algorithm. The function below applies PCA to a Spark dataframe: it takes a Spark DF as input, reduces a subset of its numeric features, and produces a target number of principal component (PC) features. The function returns the same input dataframe with the PCs appended to it, together with the list of variances explained by the generated PCs.

%%time
from functools import reduce

# imports used below (already available if you followed the earlier parts of this series)
from pyspark.sql import functions as F
from pyspark.ml.feature import RFormula, PCA

def extract(row):
    '''
    This function extracts the PC features row by row as an RDD object.
    INPUTS:
    @row: row of the PCA result dataframe
    '''
    return (row.key, ) + tuple(float(x) for x in row.pcaFeatures.values)

def add_PCA_features(df, features, PC_num):
    '''
    This function adds PCs to a Spark dataframe.
    INPUTS:
    @df: Spark dataframe
    @features: list of data features in @df
    @PC_num: number of required PCs
    OUTPUTS:
    @df_new: updated Spark dataframe
    @pca_variance: list of variances of the PCA features
    '''
    # create unique ID for @df
    df = df.withColumn("key", F.monotonically_increasing_id())

    '''
    1. use RFormula to create the feature vector
    In this step, we generate an ID column for the scaled-feature DF
    and use this ID column to join the data features into one vector.
    '''
    # use RFormula to create the feature vector
    formula = RFormula(formula="~" + "+".join(features))

    # create ML pipeline for feature processing
    pipeline = formula.fit(df).transform(df)

    # select both "key" and "features" out of the pipeline
    output = pipeline.select("key", "features")

    '''
    2. build a PCA model and fit the data to it
    In this step, we build a PCA model with the desired number of PCs
    and train (fit) our features to that model.
    '''
    # init PCA object with "features" as input and "pcaFeatures" as output
    pca = PCA(k=PC_num, inputCol="features", outputCol="pcaFeatures")

    # build the PCA model using the pipeline output
    model = pca.fit(output)

    # get the PCA result by transforming the features with the model
    result = model.transform(output).select("key", "pcaFeatures")

    # get the vector of variances covered by each PC
    pca_variance = model.explainedVariance

    '''
    3. convert the PCs to dataframe columns
    In this step, we take the PCs generated in the last step
    and append them to the scaled-data dataframe.
    '''
    # get the PCA output as a new Spark dataframe
    pca_outcome = result.rdd.map(extract).toDF(["key"])

    # get the column names of pca_outcome
    oldColumns = pca_outcome.schema.names

    # set new names for the PCA features
    newColumns = ["key"]
    for i in range(1, PC_num + 1):
        newColumns.append('PCA_' + str(i))

    # rename the columns of the PCA dataframe
    pca_result = reduce(lambda pca_outcome, idx: pca_outcome.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                        range(len(oldColumns)), pca_outcome)

    # join the PCA df to the data df
    df_new = df.join(pca_result, 'key', 'left')

    return df_new, pca_variance

# set the list of normalized and standardized data features
features_to_PCA = sort_alphanumerically([s for s in set(train_df.columns) if ("_4" in s)])

# add PCs to the data features
train_df, train_pca_variance = add_PCA_features(train_df, features_to_PCA, 10)
print("Train Data PCA Variances: ", train_pca_variance)

test_df, test_pca_variance = add_PCA_features(test_df, features_to_PCA, 10)
print("Test Data PCA Variances: ", test_pca_variance)

After running the function above, we obtain the desired number of PC features. The figure below shows the cumulative data variance captured by the first 10 PCs in the training and test datasets.
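To reproduce the numbers behind such a plot, the explained-variance vectors returned by add_PCA_features can simply be cumulated. A minimal sketch, assuming the vectors behave like pyspark.ml.linalg.DenseVector (as returned by pyspark.ml.feature.PCA) and that NumPy and Matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

# explainedVariance is a DenseVector; convert it to a NumPy array and cumulate
train_cum = np.cumsum(train_pca_variance.toArray())
test_cum = np.cumsum(test_pca_variance.toArray())
print("cumulative train variance:", train_cum)
print("cumulative test variance:", test_cum)

# plot cumulative explained variance against the number of PCs
plt.plot(range(1, len(train_cum) + 1), train_cum, marker='o', label='train')
plt.plot(range(1, len(test_cum) + 1), test_cum, marker='o', label='test')
plt.xlabel('number of principal components')
plt.ylabel('cumulative explained variance')
plt.legend()
plt.show()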

As the figure shows, the PCA algorithm manages to compress almost all of the data variance into the first 4 PC features. This means we have effectively reduced the data dimensionality from 24 to 4 without losing variance. Now let's visualize our PC features.

As shown above, the generated PCs have very different scales. Therefore, we need to scale these features using normalization and standardization procedures.
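One quick way to see these scale differences is to summarize the PC columns before any scaling. A minimal sketch, assuming the PC columns created above follow the PCA_1 ... PCA_10 naming used in add_PCA_features:

# summary statistics of the raw PC columns; min, max and stddev vary strongly per column
pc_cols = [c for c in train_df.columns if c.startswith("PCA_") and c.count('_') == 1]
train_df.select(pc_cols).describe().show()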

2. Normalizing the PC Features

To normalize the PC features, we apply the normalization procedure across all engines (i.e., we do not partition the data by engine before normalizing). The reason is that we want globally normalized features over all engines that can be used later in the learning process. The following function adds globally normalized versions of the PC features.

%%time
from pyspark.sql.types import DoubleType

# Normalization
def add_normalized_features_unpartitioned(df, features):
    '''
    this function squashes columns in a Spark dataframe into the [0,1] domain.
    INPUTS:
    @df: Spark dataframe
    @features: list of data features in @df to be normalized
    OUTPUTS:
    @df: updated Spark dataframe
    '''
    # loop over data features
    for f in features:
        # compute min, max values for data feature 'f'
        cur_min = df.agg({f: "min"}).collect()[0][0]
        cur_max = df.agg({f: "max"}).collect()[0][0]

        print(f, ' cur_min: ', cur_min, ' cur_max: ', cur_max)

        # create UDF instance
        normalize_Udf = F.udf(lambda value: (value - cur_min) / (cur_max - cur_min), DoubleType())

        # add the normalized data feature
        df = df.withColumn(f + '_norm', normalize_Udf(df[f]))

    return df

# get PC features
pca_features = sort_alphanumerically([s for s in set(train_df.columns) if ("PCA_" in s and s.count('_') == 1)])

# add normalized features to df
print("Train Data:\n")
train_df = add_normalized_features_unpartitioned(train_df, pca_features)

print("\nTest Data:\n")
test_df = add_normalized_features_unpartitioned(test_df, pca_features)

3. Standardizing the PC Features

As with normalization, we can also generate standardized PC features without partitioning the data by engine. The following function does the job.

%%time
def add_standardized_features_unpartitioned(df, features):
    '''
    this function adds standardized features with zero mean and unit variance
    for each data feature in the Spark DF.
    INPUTS:
    @df: Spark dataframe
    @features: list of data features in @df to be standardized
    OUTPUTS:
    @df: updated Spark dataframe
    '''
    for f in features:
        # compute mean and standard deviation for each data feature
        cur_mean = float(df.describe(f).filter("summary = 'mean'").select(f).collect()[0].asDict()[f])
        cur_std = float(df.describe(f).filter("summary = 'stddev'").select(f).collect()[0].asDict()[f])

        print(f, ' cur_mean: ', cur_mean, ' cur_std: ', cur_std)

        # create UDF instance (subtract the mean and divide by the standard deviation)
        standardize_Udf = F.udf(lambda value: (value - cur_mean) / cur_std, DoubleType())

        # add the standardized data feature
        df = df.withColumn(f + '_scaled', standardize_Udf(df[f]))

    return df

# get PC features
pca_features = sort_alphanumerically([s for s in set(train_df.columns) if ("PCA_" in s and s.count('_') == 1)])

# add standardized features to df
print("Train Data:")
train_df = add_standardized_features_unpartitioned(train_df, pca_features)

print("\nTest Data:")
test_df = add_standardized_features_unpartitioned(test_df, pca_features)

The figure below shows the standardized PC features for all engines.

4. Saving the Data in Parquet Format

After applying all the data preparation and feature engineering steps listed above, it is time to save our Spark dataframes to storage. In Apache Spark, the most recommended format for saving dataframes is Parquet, a columnar storage format that can hold large datasets with many columns in a reduced amount of storage space. The following lines save the training and test dataframes in .parquet format.

train_df.write.mode('overwrite').parquet(os.getcwd() + '/' + path + 'train_FD001_preprocessed.parquet')
test_df.write.mode('overwrite').parquet(os.getcwd() + '/' + path + 'test_FD001_preprocessed.parquet')
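To verify the write succeeded, the files can be read back with spark.read.parquet. A minimal check, assuming the same SparkSession (spark) and path variable used above:

# reload the preprocessed training data and inspect its schema and row count
reloaded_df = spark.read.parquet(os.getcwd() + '/' + path + 'train_FD001_preprocessed.parquet')
reloaded_df.printSchema()
print("rows:", reloaded_df.count())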

5. Complete Code

The code for this tutorial can be found in my GitHub repository: https://github.com/boutrosrg/Predictive-Maintenance-In-PySpark

