This is the fourth article in Boutros El-Gamil's series on predictive data analytics with Apache Spark: http://www.data-automaton.com/2019/01/10/predictive-data-analytics-with-apache-spark-part-4-principal-component-analysis/
The first three parts of the series are available on the same site.
In the previous article, we generated dozens of new features from the existing ones, which greatly expanded the dimensionality of our data. To go further with predictive data modeling, we need to reduce that dimensionality. Reducing the data's dimensions allows us to visualize the data more efficiently, test different parameter settings of machine learning algorithms to optimize our predictive solution, and make the best use of memory and storage. One of the most common procedures for dimensionality reduction is the Principal Component Analysis (PCA) algorithm. The following function applies the PCA algorithm to a Spark dataframe: it takes a Spark DF as input, reduces a subset of its numeric features, and generates the desired number of principal component (PC) features. The function returns the same input dataframe with the PCs appended to it, along with a list of the variances explained by the generated PCs.
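Before moving to the Spark implementation below, the idea of PCA can be illustrated with a minimal numpy sketch (the toy data here is made up for illustration and is unrelated to the engine dataset used in this series):

```python
import numpy as np

# Toy data: 100 samples, 3 features, two of which are strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# Center the data, then take the SVD; squared singular values are
# proportional to the variance explained by each principal component
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()

# Project onto the first 2 components to reduce 3 dimensions to 2
X_pca = Xc @ Vt[:2].T
print(explained)  # most of the variance sits in the leading components
```

Because the first two input features are nearly collinear, the first component alone captures most of the variance, which is exactly the effect we exploit on the engine data below.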
%%time
from functools import reduce
from pyspark.sql import functions as F
from pyspark.ml.feature import RFormula, PCA

def extract(row):
    '''
    This function extracts PC features row by row from an RDD object
    INPUTS:
    @row: Row object holding 'key' and the 'pcaFeatures' vector
    '''
    return (row.key, ) + tuple(float(x) for x in row.pcaFeatures.values)

def add_PCA_features(df, features, PC_num):
    '''
    This function adds PCs to a Spark dataframe.
    INPUTS:
    @df: Spark dataframe
    @features: list of data features in @df
    @PC_num: number of required PCs
    OUTPUTS:
    @df_new: updated Spark dataframe
    @pca_variance: list of variances of PCA features
    '''
    # create unique ID for @df
    df = df.withColumn("key", F.monotonically_increasing_id())

    '''1. use RFormula to create the feature vector
    In this step, we generate an ID column for the scaled feature DF, and use this ID column to join the
    data features into one vector.'''
    # use RFormula to create feature vector
    formula = RFormula(formula="~" + "+".join(features))
    # create ML pipeline for feature processing
    pipeline = formula.fit(df).transform(df)
    # select both "key" and "features" out of pipeline
    output = pipeline.select("key", "features")

    '''2. build PCA model, and fit data to it
    In this step, we build a PCA model with the desired number of PCs, and train (fit) our features to that model.'''
    # init PCA object with "features" as input and "pcaFeatures" as output
    pca = PCA(k=PC_num, inputCol="features", outputCol="pcaFeatures")
    # build PCA model using output pipeline
    model = pca.fit(output)
    # get PCA result by transforming features with the fitted model
    result = model.transform(output).select("key", "pcaFeatures")
    # get vector of variances covered by each PC
    pca_variance = model.explainedVariance

    '''3. convert PCs to dataframe columns
    In this step, we convert the PCs generated in the last step into columns, and append them to the input dataframe.'''
    # get PCA output as new Spark dataframe ("key" names the first column;
    # the remaining columns get default names and are renamed below)
    pca_outcome = result.rdd.map(extract).toDF(["key"])
    # get column names of pca_outcome
    oldColumns = pca_outcome.schema.names
    # set new names for PCA features
    newColumns = ["key"]
    for i in range(1, PC_num + 1):
        newColumns.append('PCA_' + str(i))
    # rename the columns of the PCA dataframe
    pca_result = reduce(lambda pca_outcome, idx: pca_outcome.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                        range(len(oldColumns)), pca_outcome)
    # join PCA df to data df
    df_new = df.join(pca_result, 'key', 'left')
    return df_new, pca_variance
# set list of normalized and standardized data features
features_to_PCA = sort_alphanumerically([s for s in set(train_df.columns) if ("_4" in s)])
# add PCs to data features
train_df, train_pca_variance = add_PCA_features(train_df, features_to_PCA, 10)
print ("Train Data PCA Variances: ", train_pca_variance)
test_df, test_pca_variance = add_PCA_features(test_df, features_to_PCA, 10)
print ("Test Data PCA Variances: ", test_pca_variance)
After executing the functions above, we obtain the desired number of PC features. The figure below depicts the cumulative data variance captured by the first 10 PCs in the training and test datasets.
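The cumulative curve shown in the figure can be reproduced directly from the `explainedVariance` vector that `add_PCA_features` returns. The variance values below are made-up placeholders for illustration, not the actual output of the model:

```python
import numpy as np

# Hypothetical per-PC explained-variance ratios (placeholders, not the
# actual values returned by model.explainedVariance in this tutorial)
pca_variance = [0.55, 0.25, 0.10, 0.06, 0.02, 0.01, 0.005, 0.003, 0.001, 0.001]

# Cumulative variance captured by the first k PCs
cumulative = np.cumsum(pca_variance)
for k, v in enumerate(cumulative, start=1):
    print('first %d PCs capture %.1f%% of variance' % (k, 100 * v))
```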
As the figure shows, the PCA algorithm managed to compress almost all of the data variance into the first 4 PC features. This means we effectively reduced the data dimensionality from 24 to 4 without losing variance. Now, let's visualize our PC features.
As shown in the figure above, the generated PCs have different scales. Therefore, we need to scale these features using normalization and standardization procedures.
2. Normalizing PC Features
To normalize the PC features, we apply the normalization procedure across all engines (i.e., we do not want to partition the data by engine before normalizing). The reason is that we want to generate globally normalized features across all engines for later use in the learning process. The following function adds globally normalized PC features.
%%time
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Normalization
def add_normalized_features_unpartitioned(df, features):
    '''
    this function squashes columns of a Spark dataframe into the [0,1] domain.
    INPUTS:
    @df: Spark dataframe
    @features: list of data features in @df to be normalized
    OUTPUTS:
    @df: updated Spark dataframe
    '''
    # loop over data features
    for f in features:
        # compute min, max values for data feature 'f'
        cur_min = df.agg({f: "min"}).collect()[0][0]
        cur_max = df.agg({f: "max"}).collect()[0][0]
        print(f, ' cur_min: ', cur_min, ' cur_max: ', cur_max)
        # create UDF instance (note: assumes cur_max != cur_min,
        # i.e. the feature is not constant)
        normalize_Udf = F.udf(lambda value: (value - cur_min) / (cur_max - cur_min), DoubleType())
        # add normalized data feature
        df = df.withColumn(f + '_norm', normalize_Udf(df[f]))
    return df
# get PC features
pca_features = sort_alphanumerically([s for s in set(train_df.columns) if ("PCA_" in s and s.count('_') == 1)])
# add normalized features to df
print ("Train Data:\n")
train_df = add_normalized_features_unpartitioned(train_df, pca_features)
print ("\nTest Data:\n")
test_df = add_normalized_features_unpartitioned(test_df, pca_features)
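The min-max formula used in the UDF above can be sanity-checked on a plain Python list (toy numbers, unrelated to the engine data):

```python
# Toy values standing in for one PC column (illustration only)
values = [4.0, -2.0, 7.0, 1.0]

cur_min, cur_max = min(values), max(values)
# Same formula as normalize_Udf above; assumes cur_max != cur_min
normalized = [(v - cur_min) / (cur_max - cur_min) for v in values]
print(normalized)  # every value now lies in [0, 1]
```

The smallest input maps to 0, the largest to 1, and everything else lands in between, which is exactly the [0,1] squashing the function promises.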
3. Standardizing PC Features
As with PC normalization, we can also generate standardized PC features without partitioning the data by engine. The following function does the job for us.
%%time
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def add_standardized_features_unpartitioned(df, features):
    '''
    this function adds standardized features with 0 mean and unit variance for each data feature in a Spark DF
    INPUTS:
    @df: Spark dataframe
    @features: list of data features in @df to be standardized
    OUTPUTS:
    @df: updated Spark dataframe
    '''
    for f in features:
        # compute mean and standard deviation for data feature 'f'
        cur_mean = float(df.describe(f).filter("summary = 'mean'").select(f).collect()[0].asDict()[f])
        cur_std = float(df.describe(f).filter("summary = 'stddev'").select(f).collect()[0].asDict()[f])
        print(f, ' cur_mean: ', cur_mean, ' cur_std: ', cur_std)
        # create UDF instance (subtract mean, then divide by standard deviation)
        standardize_Udf = F.udf(lambda value: (value - cur_mean) / cur_std, DoubleType())
        # add standardized data feature
        df = df.withColumn(f + '_scaled', standardize_Udf(df[f]))
    return df
# get PC features
pca_features = sort_alphanumerically([s for s in set(train_df.columns) if ("PCA_" in s and s.count('_') == 1)])
# add standardized features to df
print ("Train Data:")
train_df = add_standardized_features_unpartitioned(train_df, pca_features)
print ("\nTest Data:")
test_df = add_standardized_features_unpartitioned(test_df, pca_features)
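The standardization formula can likewise be checked on a plain Python list (toy numbers, unrelated to the engine data). Note that Spark's `describe()` reports the sample standard deviation, which `statistics.stdev` also computes:

```python
import statistics

# Toy values standing in for one PC column (illustration only)
values = [4.0, -2.0, 7.0, 1.0]

cur_mean = statistics.mean(values)
cur_std = statistics.stdev(values)  # sample stddev, matching Spark's describe()
# Same formula as standardize_Udf above
scaled = [(v - cur_mean) / cur_std for v in values]

print(round(statistics.mean(scaled), 10))   # ~0
print(round(statistics.stdev(scaled), 10))  # ~1
```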
The figure below shows the standardized PC features across all engines.
4. Saving Data in Parquet Format
After applying all of the data preparation and feature engineering steps listed above, it is time to save our Spark dataframes to storage. In Apache Spark, the most recommended format for saving dataframes is Parquet, a columnar storage format that can store large datasets with many columns in a reduced storage footprint. The following lines save the training and test dataframes in .parquet format.
train_df.write.mode('overwrite').parquet(os.getcwd() + '/' + path + 'train_FD001_preprocessed.parquet')
test_df.write.mode('overwrite').parquet(os.getcwd() + '/' + path + 'test_FD001_preprocessed.parquet')
5. Full Code
The code for this tutorial can be found in my Github repository: https://github.com/boutrosrg/Predictive-Maintenance-In-PySpark