Unsupervised learning is a machine learning task that finds hidden patterns and structure in a dataset without the aid of labeled responses. It is ideal when you only have access to input data and training data is unavailable or hard to obtain. Common methods include cluster analysis, topic modeling, anomaly detection, and principal component analysis.
Cluster Analysis with K-Means
Cluster analysis is an unsupervised machine learning task for grouping unlabeled observations that share some similarities. Popular clustering use cases include customer segmentation, fraud analysis, and anomaly detection. Clustering is also frequently used to generate training data for classifiers when training data is scarce or unavailable. K-Means is one of the most popular unsupervised learning algorithms for clustering. Spark MLlib includes a more scalable implementation of K-Means known as K-Means||. Figure 4-1 shows K-Means grouping the observations in the Iris dataset into three distinct clusters.
Figure 4-1. Clustering the Iris dataset with K-Means
Figure 4-2 shows the K-Means algorithm in action, with the observations shown as squares and the cluster centroids as triangles. Figure 4-2 (a) shows the original dataset. K-Means works by randomly assigning centroids that serve as the starting point for each cluster (Figure 4-2 (b) and (c)). The algorithm iteratively assigns each data point to its nearest centroid based on Euclidean distance. It then calculates a new centroid for each cluster by computing the mean of all the points belonging to that cluster (Figure 4-2 (d) and (e)). The algorithm stops iterating when a predefined number of iterations is reached, or when every data point has been assigned to its nearest centroid and no more reassignments can be performed (Figure 4-2 (f)).
Figure 4-2. The K-Means algorithm in action
K-Means requires the user to provide the number of clusters, k, to the algorithm. There are ways to find the optimal number of clusters for your dataset. We will discuss the elbow and silhouette methods later in this chapter.
Example
Let's work through a simple customer segmentation example. We will use a small dataset of eight observations with a mix of three categorical features and two continuous features. Before we begin, we need to address another limitation of K-Means. K-Means cannot directly handle categorical features such as gender ("M", "F"), marital status ("M", "S"), and state ("CA", "NY") and requires all features to be continuous. Real-world datasets, however, often include a combination of categorical and continuous features. Fortunately, we can still use K-Means with categorical features by converting them into a numeric format.
This is not as simple as it sounds. For example, to convert marital status from its string representation "M" and "S" into numbers, you might think that mapping 0 to "M" and 1 to "S" would work for K-Means. As you learned in Chapter 2, this is known as integer or label encoding. But it introduces another wrinkle. Integers have a natural ordering (0 < 1 < 2) that some machine learning algorithms such as K-Means may misinterpret, assuming that one categorical value is "greater" than another simply because it is encoded as an integer, when no such ordinal relationship exists in the data. This can produce unexpected results. To address this problem, we use another type of encoding called one-hot encoding. After converting the categorical features to integers (using StringIndexer), we use one-hot encoding (using OneHotEncoderEstimator) to represent the categorical features as binary vectors. For example, the state feature ("CA", "NY", "MA", "AZ") is one-hot encoded in Table 4-1.
Table 4-1. One-hot encoding the state feature
CA | NY | MA | AZ |
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
Feature scaling is another important preprocessing step for K-Means. As discussed in Chapter 2, feature scaling is considered best practice and a requirement for many machine learning algorithms that involve distance calculations. Feature scaling is especially important if the data is measured in different scales. Certain features may have an extremely wide range of values, causing them to dominate other features. Feature scaling ensures that each feature contributes proportionately to the final distance. For our example, we will rescale our features to have a mean of 0 and unit variance (a standard deviation of 1) using the StandardScaler estimator, as shown in Listing 4-1.
// Let's start by creating some sample data.
val custDF = Seq(
(100, 29000,"M","F","CA",25),
(101, 36000,"M","M","CA",46),
(102, 5000,"S","F","NY",18),
(103, 68000,"S","M","AZ",39),
(104, 2000,"S","F","CA",16),
(105, 75000,"S","F","CA",41),
(106, 90000,"M","M","MA",47),
(107, 87000,"S","M","NY",38)
).toDF("customerid", "income","maritalstatus","gender","state","age")
// Perform some preprocessing steps.
import org.apache.spark.ml.feature.StringIndexer
val genderIndexer = new StringIndexer()
.setInputCol("gender")
.setOutputCol("gender_idx")
val stateIndexer = new StringIndexer()
.setInputCol("state")
.setOutputCol("state_idx")
val mstatusIndexer = new StringIndexer()
.setInputCol("maritalstatus")
.setOutputCol("maritalstatus_idx")
import org.apache.spark.ml.feature.OneHotEncoderEstimator
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array("gender_idx","state_idx","maritalstatus_idx"))
.setOutputCols(Array("gender_enc","state_enc","maritalstatus_enc"))
val custDF2 = genderIndexer.fit(custDF).transform(custDF)
val custDF3 = stateIndexer.fit(custDF2).transform(custDF2)
val custDF4 = mstatusIndexer.fit(custDF3).transform(custDF3)
custDF4.select("gender_idx","state_idx","maritalstatus_idx").show
+----------+---------+-----------------+
|gender_idx|state_idx|maritalstatus_idx|
+----------+---------+-----------------+
| 0.0| 0.0| 1.0|
| 1.0| 0.0| 1.0|
| 0.0| 1.0| 0.0|
| 1.0| 3.0| 0.0|
| 0.0| 0.0| 0.0|
| 0.0| 0.0| 0.0|
| 1.0| 2.0| 1.0|
| 1.0| 1.0| 0.0|
+----------+---------+-----------------+
val custDF5 = encoder.fit(custDF4).transform(custDF4)
custDF5.printSchema
root
|-- customerid: integer (nullable = false)
|-- income: integer (nullable = false)
|-- maritalstatus: string (nullable = true)
|-- gender: string (nullable = true)
|-- state: string (nullable = true)
|-- age: integer (nullable = false)
|-- gender_idx: double (nullable = false)
|-- state_idx: double (nullable = false)
|-- maritalstatus_idx: double (nullable = false)
|-- gender_enc: vector (nullable = true)
|-- state_enc: vector (nullable = true)
|-- maritalstatus_enc: vector (nullable = true)
custDF5.select("gender_enc","state_enc","maritalstatus_enc").show
+-------------+-------------+-----------------+
| gender_enc| state_enc|maritalstatus_enc|
+-------------+-------------+-----------------+
|(1,[0],[1.0])|(3,[0],[1.0])| (1,[],[])|
| (1,[],[])|(3,[0],[1.0])| (1,[],[])|
|(1,[0],[1.0])|(3,[1],[1.0])| (1,[0],[1.0])|
| (1,[],[])| (3,[],[])| (1,[0],[1.0])|
|(1,[0],[1.0])|(3,[0],[1.0])| (1,[0],[1.0])|
|(1,[0],[1.0])|(3,[0],[1.0])| (1,[0],[1.0])|
| (1,[],[])|(3,[2],[1.0])| (1,[],[])|
| (1,[],[])|(3,[1],[1.0])| (1,[0],[1.0])|
+-------------+-------------+-----------------+
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("income","gender_enc", "state_enc", "maritalstatus_enc", "age"))
.setOutputCol("features")
val custDF6 = assembler.transform(custDF5)
custDF6.printSchema
root
|-- customerid: integer (nullable = false)
|-- income: integer (nullable = false)
|-- maritalstatus: string (nullable = true)
|-- gender: string (nullable = true)
|-- state: string (nullable = true)
|-- age: integer (nullable = false)
|-- gender_idx: double (nullable = false)
|-- state_idx: double (nullable = false)
|-- maritalstatus_idx: double (nullable = false)
|-- gender_enc: vector (nullable = true)
|-- state_enc: vector (nullable = true)
|-- maritalstatus_enc: vector (nullable = true)
|-- features: vector (nullable = true)
custDF6.select("features").show(false)
+----------------------------------+
|features |
+----------------------------------+
|[29000.0,1.0,1.0,0.0,0.0,0.0,25.0]|
|(7,[0,2,6],[36000.0,1.0,46.0]) |
|[5000.0,1.0,0.0,1.0,0.0,1.0,18.0] |
|(7,[0,5,6],[68000.0,1.0,39.0]) |
|[2000.0,1.0,1.0,0.0,0.0,1.0,16.0] |
|[75000.0,1.0,1.0,0.0,0.0,1.0,41.0]|
|(7,[0,4,6],[90000.0,1.0,47.0]) |
|[87000.0,0.0,0.0,1.0,0.0,1.0,38.0]|
+----------------------------------+
import org.apache.spark.ml.feature.StandardScaler
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)
val custDF7 = scaler.fit(custDF6).transform(custDF6)
custDF7.printSchema
root
|-- customerid: integer (nullable = false)
|-- income: integer (nullable = false)
|-- maritalstatus: string (nullable = true)
|-- gender: string (nullable = true)
|-- state: string (nullable = true)
|-- age: integer (nullable = false)
|-- gender_idx: double (nullable = false)
|-- state_idx: double (nullable = false)
|-- maritalstatus_idx: double (nullable = false)
|-- gender_enc: vector (nullable = true)
|-- state_enc: vector (nullable = true)
|-- maritalstatus_enc: vector (nullable = true)
|-- features: vector (nullable = true)
|-- scaledFeatures: vector (nullable = true)
custDF7.select("scaledFeatures").show(8,65)
+-----------------------------------------------------------------+
| scaledFeatures |
+-----------------------------------------------------------------+
|[0.8144011366375091,1.8708286933869707,1.8708286933869707,0.0,...|
|(7,[0,2,6],[1.0109807213431148,1.8708286933869707,3.7319696616...|
|[0.1404139890754326,1.8708286933869707,0.0,2.160246899469287,0...|
|(7,[0,5,6],[1.9096302514258834,1.9321835661585918,3.1640612348...|
|[0.05616559563017304,1.8708286933869707,1.8708286933869707,0.0...|
|[2.106209836131489,1.8708286933869707,1.8708286933869707,0.0,0...|
|(7,[0,4,6],[2.5274518033577866,2.82842712474619,3.813099436871...|
|[2.443203409912527,0.0,0.0,2.160246899469287,0.0,1.93218356615...|
+-----------------------------------------------------------------+
// We will create two clusters.
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans()
.setFeaturesCol("scaledFeatures")
.setPredictionCol("prediction")
.setK(2)
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline()
.setStages(Array(genderIndexer, stateIndexer,
mstatusIndexer, encoder, assembler, scaler, kmeans))
val model = pipeline.fit(custDF)
val clusters = model.transform(custDF)
clusters.select("customerid","income","maritalstatus",
"gender","state","age","prediction")
.show
+----------+------+-------------+------+-----+---+----------+
|customerid|income|maritalstatus|gender|state|age|prediction|
+----------+------+-------------+------+-----+---+----------+
| 100| 29000| M| F| CA| 25| 1|
| 101| 36000| M| M| CA| 46| 0|
| 102| 5000| S| F| NY| 18| 1|
| 103| 68000| S| M| AZ| 39| 0|
| 104| 2000| S| F| CA| 16| 1|
| 105| 75000| S| F| CA| 41| 0|
| 106| 90000| M| M| MA| 47| 0|
| 107| 87000| S| M| NY| 38| 0|
+----------+------+-------------+------+-----+---+----------+
import org.apache.spark.ml.clustering.KMeansModel
val kmeansModel = model.stages.last.asInstanceOf[KMeansModel]
kmeansModel.clusterCenters.foreach(println)
[1.9994952044341603,0.37416573867739417,0.7483314773547883,0.4320493798938574,0.565685424949238,1.159310139695155,3.4236765156588613]
[0.3369935737810382,1.8708286933869707,1.247219128924647,0.7200822998230956,0.0,1.288122377439061,1.5955522466340666]
Listing 4-1. Customer segmentation example using K-Means
We evaluate our clusters by computing the within-cluster sum of squared errors (WSSSE). Examining the WSSSE using the "elbow method" is commonly used to help determine the optimal number of clusters. The elbow method works by fitting the model with a range of values for k and plotting them against the WSSSE. Visually inspect the line chart, and if it resembles a bent arm, the point where it bends on the curve (the "elbow") indicates the most optimal value for k.
val wssse = kmeansModel.computeCost(clusters)
wssse: Double = 32.09801038868844
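To make the elbow method concrete, the following sketch (an illustrative addition, not part of Listing 4-1) fits K-Means for several values of k on the scaled features and prints the WSSSE for each; the range of k and the seed are arbitrary choices.
import org.apache.spark.ml.clustering.KMeans
// A minimal sketch of the elbow method: fit K-Means for a range of k,
// collect the WSSSE for each value, and look for the "elbow" when the
// (k, WSSSE) pairs are plotted. Assumes the DataFrame `clusters` with a
// "scaledFeatures" column from Listing 4-1.
val costs = (2 to 6).map { k =>
  val kmModel = new KMeans()
    .setFeaturesCol("scaledFeatures")
    .setK(k)
    .setSeed(1234L)
    .fit(clusters)
  (k, kmModel.computeCost(clusters))
}
costs.foreach { case (k, cost) => println(s"k=$k WSSSE=$cost") }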
Another method of evaluating cluster quality is by computing the silhouette coefficient score. The silhouette score provides a metric of how close each point in one cluster is to points in the other clusters. The larger the silhouette score, the better the quality of the cluster. A score closer to 1 indicates that the points are closer to the centroid of the cluster. A score closer to 0 indicates that the points are closer to other clusters, and a negative value indicates the points may have been assigned to the wrong cluster.
import org.apache.spark.ml.evaluation.ClusteringEvaluator
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(clusters)
silhouette: Double = 0.6722088068201866
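Along the same lines, you can use the silhouette score to compare candidate values of k. The sketch below is an illustrative addition (not from the original listing); it drops the existing prediction column before re-clustering and scores each candidate model on the scaled features.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// A minimal sketch: compute the silhouette score for several values of k.
// Assumes the DataFrame `clusters` with a "scaledFeatures" column from
// Listing 4-1. The existing "prediction" column is dropped so that each
// candidate model can add its own.
val silhouetteEval = new ClusteringEvaluator()
  .setFeaturesCol("scaledFeatures")
val base = clusters.drop("prediction")
(2 to 6).foreach { k =>
  val scored = new KMeans()
    .setFeaturesCol("scaledFeatures")
    .setK(k)
    .setSeed(1234L)
    .fit(base)
    .transform(base)
  println(s"k=$k silhouette=${silhouetteEval.evaluate(scored)}")
}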
Topic Modeling with Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) was developed in 2003 by David M. Blei, Andrew Ng, and Michael I. Jordan, although a similar algorithm used in population genetics was also proposed by Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly in 2000. LDA, as applied to machine learning, is based on a graphical model and was the first algorithm included in Spark MLlib built on GraphX. Latent Dirichlet Allocation is widely used for topic modeling. Topic models automatically derive the themes (or topics) in a group of documents (Figure 4-3). These topics can be used for content-based recommendations, document classification, dimensionality reduction, and featurization.
Figure 4-3. Grouping documents by topic using LDA
Although Spark MLlib has extensive text mining and preprocessing features, it lacks several capabilities found in most enterprise-grade NLP libraries, such as lemmatization, stemming, and sentiment analysis, to mention a few. We will need some of these features for our topic modeling example later in this chapter. This is a good time to introduce Stanford CoreNLP for Spark and Spark NLP from John Snow Labs.
Stanford CoreNLP for Spark
Stanford CoreNLP is a production-grade NLP library developed by the NLP research group at Stanford University. CoreNLP supports multiple languages such as Arabic, Chinese, English, French, and German. It provides a native Java API, as well as a web API and a command-line interface. There are also third-party APIs for major programming languages such as R, Python, Ruby, and Lua. Xiangrui Meng, a software engineer from Databricks, developed a Stanford CoreNLP wrapper for Spark (see Listing 4-2).
spark-shell --packages databricks:spark-corenlp:0.4.0-spark2.4-scala2.11 --jars stanford-corenlp-3.9.1-models.jar
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val dataDF = Seq(
(1, "Kevin Durant was the 2019 All-Star NBA Most Valuable Player."),
(2, "Stephen Curry is the best clutch three-point shooter in the NBA."),
(3, "My game is not as good as it was 20 years ago."),
(4, "Michael Jordan is the greatest NBA player of all time."),
(5, "The Lakers currently have one of the worst performances in the NBA."))
.toDF("id", "text")
dataDF.show(false)
+---+-------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------+
|1 |Kevin Durant was the 2019 All-Star NBA Most Valuable Player. |
|2 |Stephen Curry is the best clutch three-point shooter in the NBA. |
|3 |My game is not as good as it was 20 years ago. |
|4 |Michael Jordan is the greatest NBA player of all time. |
|5 |The Lakers currently have one of the worst performances in the NBA.|
+---+-------------------------------------------------------------------+
// Stanford CoreNLP lets you chain text processing functions. Let's split
// the documents into sentences and then tokenize the sentences into words.
import com.databricks.spark.corenlp.functions._
val dataDF2 = dataDF
.select(explode(ssplit('text)).as('sen))
.select('sen, tokenize('sen).as('words))
dataDF2.show(5,30)
+------------------------------+------------------------------+
| sen| words|
+------------------------------+------------------------------+
|Kevin Durant was the 2019 A...|[Kevin, Durant, was, the, 2...|
|Stephen Curry is the best c...|[Stephen, Curry, is, the, b...|
|My game is not as good as i...|[My, game, is, not, as, goo...|
|Michael Jordan is the great...|[Michael, Jordan, is, the, ...|
|The Lakers currently have o...|[The, Lakers, currently, ha...|
+------------------------------+------------------------------+
// Perform sentiment analysis on the sentences. The scale
// ranges from 0 for strong negative to 4 for strong positive.
val dataDF3 = dataDF
.select(explode(ssplit('text)).as('sen))
.select('sen, tokenize('sen).as('words), sentiment('sen).as('sentiment))
dataDF3.show(5,30)
+------------------------------+------------------------------+---------+
| sen| words|sentiment|
+------------------------------+------------------------------+---------+
|Kevin Durant was the 2019 A...|[Kevin, Durant, was, the, 2...| 1|
|Stephen Curry is the best c...|[Stephen, Curry, is, the, b...| 3|
|My game is not as good as i...|[My, game, is, not, as, goo...| 1|
|Michael Jordan is the great...|[Michael, Jordan, is, the, ...| 3|
|The Lakers currently have o...|[The, Lakers, currently, ha...| 1|
+------------------------------+------------------------------+---------+
Listing 4-2. A brief introduction to Stanford CoreNLP for Spark
Visit Databricks' spark-corenlp GitHub page for a complete list of the functions available in Stanford CoreNLP for Spark.
Spark NLP from John Snow Labs
The Spark NLP library from John Snow Labs natively supports the Spark ML Pipeline API. It is written in Scala and includes Scala and Python APIs. It provides several advanced features such as a tokenizer, lemmatizer, stemmer, entity and date extractors, a part-of-speech tagger, sentence boundary detection, a spell checker, and named entity recognition, to mention a few.
Annotators provide the NLP capabilities in Spark NLP. An annotation is the result of a Spark NLP operation. There are two types of annotators: annotator approaches and annotator models. An annotator approach represents a Spark MLlib estimator; it fits a model with data to produce an annotator model or transformer. An annotator model is a transformer that takes a dataset and adds a column with the result of the annotation. Because annotators are represented as Spark estimators and transformers, they integrate easily with the Spark Pipeline API. Spark NLP gives users several ways to access its functionality, which we cover in the following sections.
Pretrained Pipelines
Spark NLP includes pretrained pipelines for quick text annotation. Spark NLP provides a pretrained pipeline called explain_document_ml that accepts text as input (see Listing 4-3). Pretrained pipelines include popular text processing features and provide a quick-and-dirty way to use Spark NLP without too much fuss.
spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val annotations = PretrainedPipeline("explain_document_ml").annotate("I visited Greece last summer. It was a great trip. I went swimming in Mykonos.")
annotations("sentence")
res7: Seq[String] = List(I visited Greece last summer.,
It was a great trip., I went swimming in Mykonos.)
annotations("token")
res8: Seq[String] = List(I, visited, Greece, last, summer, .,
It, was, a, great, trip, ., I, went, swimming, in, Mykonos, .)
annotations("lemma")
res9: Seq[String] = List(I, visit, Greece, last, summer, .,
It, be, a, great, trip, ., I, go, swim, in, Mykonos, .)
Listing 4-3. Spark NLP pretrained pipeline example
Pretrained Pipelines with Spark DataFrames
Pretrained pipelines also work with Spark DataFrames, as shown in Listing 4-4.
val data = Seq("I visited Greece last summer. It was a great trip. I went swimming in Mykonos.").toDF("text")
val annotations = PretrainedPipeline("explain_document_ml").transform(data)
annotations.show()
+--------------------+--------------------+--------------------+
| text| document| sentence|
+--------------------+--------------------+--------------------+
|I visited Greece ...|[[document, 0, 77...|[[document, 0, 28...|
+--------------------+--------------------+--------------------+
+--------------------+
| token|
+--------------------+
|[[token, 0, 0, I,...|
+--------------------+
+--------------------+--------------------+--------------------+
| checked| lemma| stem|
+--------------------+--------------------+--------------------+
|[[token, 0, 0, I,...|[[token, 0, 0, I,...|[[token, 0, 0, i,...|
+--------------------+--------------------+--------------------+
+--------------------+
| pos|
+--------------------+
|[[pos, 0, 0, PRP,...|
+--------------------+
Listing 4-4. Spark NLP pretrained pipeline with Spark DataFrames
Pretrained Pipelines with Spark MLlib Pipelines
Pretrained pipelines can be used with Spark MLlib pipelines (see Listing 4-5). Note that a special transformer called Finisher is needed to display the tokens in a human-readable format.
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline
val data = Seq("I visited Greece last summer. It was a great trip. I went swimming in Mykonos.").toDF("text")
val finisher = new Finisher()
.setInputCols("sentence", "token", "lemma")
val explainPipeline = PretrainedPipeline("explain_document_ml").model
val pipeline = new Pipeline()
.setStages(Array(explainPipeline,finisher))
pipeline.fit(data).transform(data).show(false)
+--------------------------------------------------+
|text |
+--------------------------------------------------+
|I visited Greece last summer. It was a great trip.|
+--------------------------------------------------+
+----------------------------+
| text |
+----------------------------+
|I went swimming in Mykonos. |
+----------------------------+
+-----------------------------------------------------+
|finished_sentence |
+-----------------------------------------------------+
|[I visited Greece last summer., It was a great trip. |
+-----------------------------------------------------+
+-----------------------------+
| finished_sentence |
+-----------------------------+
|,I went swimming in Mykonos.]|
+-----------------------------+
+-----------------------------------------------------------------+
|finished_token |
+-----------------------------------------------------------------+
|[I, visited, Greece, last, summer, ., It, was, a, great, trip, .,|
+-----------------------------------------------------------------+
+--------------------------------------------------------------+
|finished_lemma |
+--------------------------------------------------------------+
|[I, visit, Greece, last, summer, ., It, be, a, great, trip, .,|
+--------------------------------------------------------------+
+--------------------------------+
| finished_lemma |
+--------------------------------+
|, I, go, swim, in, Mykonos, .] |
+--------------------------------+
Listing 4-5. Pretrained pipeline with Spark MLlib pipeline example
Creating Your Own Spark MLlib Pipeline
You can use the annotators directly from your own Spark MLlib pipeline, as shown in Listing 4-6.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val data = Seq("I visited Greece last summer. It was a great trip. I went swimming in Mykonos.").toDF("text")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val regexTokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val finisher = new Finisher()
.setInputCols("token")
.setCleanAnnotations(false)
val pipeline = new Pipeline()
.setStages(Array(documentAssembler,
sentenceDetector,regexTokenizer,finisher))
pipeline.fit(Seq.empty[String].toDF("text"))
.transform(data)
.show()
+--------------------+--------------------+--------------------+
| text| document| sentence|
+--------------------+--------------------+--------------------+
|I visited Greece ...|[[document, 0, 77...|[[document, 0, 28...|
+--------------------+--------------------+--------------------+
+--------------------+--------------------+
| token| finished_token|
+--------------------+--------------------+
|[[token, 0, 0, I,...|[I, visited, Gree...|
+--------------------+--------------------+
Listing 4-6. Creating your own Spark MLlib pipeline example
Spark NLP LightPipeline
Spark NLP provides another class of pipeline called LightPipeline. It is similar to a Spark MLlib pipeline, but instead of taking advantage of Spark's distributed processing capabilities, execution happens locally. LightPipeline is appropriate when processing a small amount of data and low-latency execution is required (see Listing 4-7).
import com.johnsnowlabs.nlp.base._
val trainedModel = pipeline.fit(Seq.empty[String].toDF("text"))
val lightPipeline = new LightPipeline(trainedModel)
lightPipeline.annotate("I visited Greece last summer.")
Listing 4-7. Spark NLP LightPipeline example
Spark NLP OCR Module
Spark NLP includes an OCR module that lets users create Spark DataFrames from PDF files. The OCR module is not included in the core Spark NLP library. To use it, you need to include a separate package and specify an additional repository, as shown in the spark-shell command in Listing 4-8.
spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.1.0,javax.media.jai:com.springsource.javax.media.jai.core:1.1.3
--repositories http://repo.spring.io/plugins-release
import com.johnsnowlabs.nlp.util.io.OcrHelper
val myOcrHelper = new OcrHelper
val data = myOcrHelper.createDataset(spark, "/my_pdf_files/")
val documentAssembler = new DocumentAssembler().setInputCol("text")
documentAssembler.transform(data).select("text","filename").show(1,45)
+------------------------------------------+
| text|
+------------------------------------------+
|this is a PDF document. Have a great day. |
+------------------------------------------+
+--------------------------------------------+
| filename |
+--------------------------------------------+
|file:/my_pdf_files/document.pdf |
+--------------------------------------------+
Listing 4-8. Spark NLP OCR module example
Spark NLP is a powerful library with many more capabilities not covered in this brief introduction. To learn more about Spark NLP, visit http://nlp.johnsnowlabs.com.
Example
Now that we have everything we need, we can proceed with our topic modeling example. We will use Latent Dirichlet Allocation to classify more than one million news headlines, published over a period of 15 years, by topic. The dataset can be downloaded from Kaggle and is provided courtesy of the Australian Broadcasting Corporation, made available by Rohit Kulkarni.
We can use either Spark NLP from John Snow Labs or the Stanford CoreNLP package to provide us with additional text processing capabilities. For this example, we will use the Stanford CoreNLP package (see Listing 4-9).
spark-shell --packages databricks:spark-corenlp:0.4.0-spark2.4-scala2.11 --jars stanford-corenlp-3.9.1-models.jar
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// Define the schema.
var newsSchema = StructType(Array (
StructField("publish_date", IntegerType, true),
StructField("headline_text", StringType, true)
))
// Read the data.
val dataDF = spark.read.format("csv")
.option("header", "true")
.schema(newsSchema)
.load("abcnews-date-text.csv")
// Inspect the data.
dataDF.show(false)
+------------+--------------------------------------------------+
|publish_date|headline_text |
+------------+--------------------------------------------------+
|20030219 |aba decides against community broadcasting licence|
|20030219 |act fire witnesses must be aware of defamation |
|20030219 |a g calls for infrastructure protection summit |
|20030219 |air nz staff in aust strike for pay rise |
|20030219 |air nz strike to affect australian travellers |
|20030219 |ambitious olsson wins triple jump |
|20030219 |antic delighted with record breaking barca |
|20030219 |aussie qualifier stosur wastes four memphis match |
|20030219 |aust addresses un security council over iraq |
|20030219 |australia is locked into war timetable opp |
|20030219 |australia to contribute 10 million in aid to iraq |
|20030219 |barca take record as robson celebrates birthday in|
|20030219 |bathhouse plans move ahead |
|20030219 |big hopes for launceston cycling championship |
|20030219 |big plan to boost paroo water supplies |
|20030219 |blizzard buries united states in bills |
|20030219 |brigadier dismisses reports troops harassed in |
|20030219 |british combat troops arriving daily in kuwait |
|20030219 |bryant leads lakers to double overtime win |
|20030219 |bushfire victims urged to see centrelink |
+------------+--------------------------------------------------+
only showing top 20 rows
// Remove punctuation.
val dataDF2 = dataDF
.withColumn("headline_text",
regexp_replace((dataDF("headline_text")), "[^a-zA-Z0-9 ]", ""))
// We will use Stanford CoreNLP to perform lemmatization. As discussed
// earlier, lemmatization derives the root form of inflected words. For
// example, "camping", "camps", "camper", and "camped" are all inflected
// forms of "camp". Reducing inflected words to their root form helps
// reduce the complexity of performing natural language processing. A
// similar process known as stemming also reduces inflected words to
// their root form, but it does so by crudely chopping off affixes, even
// though the root form may not be a valid word. In contrast,
// lemmatization ensures that the root form is a valid word through
// morphological analysis and the use of a vocabulary.
import com.databricks.spark.corenlp.functions._
val dataDF3 = dataDF2
.select(explode(ssplit('headline_text)).as('sen))
.select('sen, lemma('sen)
.as('words))
dataDF3.show
+--------------------+--------------------+
| sen| words|
+--------------------+--------------------+
|aba decides again...|[aba, decide, aga...|
|act fire witnesse...|[act, fire, witne...|
|a g calls for inf...|[a, g, call, for,...|
|air nz staff in a...|[air, nz, staff, ...|
|air nz strike to ...|[air, nz, strike,...|
|ambitious olsson ...|[ambitious, olsso...|
|antic delighted w...|[antic, delighted...|
|aussie qualifier ...|[aussie, qualifie...|
|aust addresses un...|[aust, address, u...|
|australia is lock...|[australia, be, l...|
|australia to cont...|[australia, to, c...|
|barca take record...|[barca, take, rec...|
|bathhouse plans m...|[bathhouse, plan,...|
|big hopes for lau...|[big, hope, for, ...|
|big plan to boost...|[big, plan, to, b...|
|blizzard buries u...|[blizzard, bury, ...|
|brigadier dismiss...|[brigadier, dismi...|
|british combat tr...|[british, combat,...|
|bryant leads lake...|[bryant, lead, la...|
|bushfire victims ...|[bushfire, victim...|
+--------------------+--------------------+
only showing top 20 rows
// We will remove stop words such as "a", "be", and "to". Stop
// words do not contribute to the meaning of a document.
import org.apache.spark.ml.feature.StopWordsRemover
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("filtered_stopwords")
val dataDF4 = remover.transform(dataDF3)
dataDF4.show
+--------------------+--------------------+--------------------+
| sen| words| filtered_stopwords|
+--------------------+--------------------+--------------------+
|aba decides again...|[aba, decide, aga...|[aba, decide, com...|
|act fire witnesse...|[act, fire, witne...|[act, fire, witne...|
|a g calls for inf...|[a, g, call, for,...|[g, call, infrast...|
|air nz staff in a...|[air, nz, staff, ...|[air, nz, staff, ...|
|air nz strike to ...|[air, nz, strike,...|[air, nz, strike,...|
|ambitious olsson ...|[ambitious, olsso...|[ambitious, olsso...|
|antic delighted w...|[antic, delighted...|[antic, delighted...|
|aussie qualifier ...|[aussie, qualifie...|[aussie, qualifie...|
|aust addresses un...|[aust, address, u...|[aust, address, u...|
|australia is lock...|[australia, be, l...|[australia, lock,...|
|australia to cont...|[australia, to, c...|[australia, contr...|
|barca take record...|[barca, take, rec...|[barca, take, rec...|
|bathhouse plans m...|[bathhouse, plan,...|[bathhouse, plan,...|
|big hopes for lau...|[big, hope, for, ...|[big, hope, launc...|
|big plan to boost...|[big, plan, to, b...|[big, plan, boost...|
|blizzard buries u...|[blizzard, bury, ...|[blizzard, bury, ...|
|brigadier dismiss...|[brigadier, dismi...|[brigadier, dismi...|
|british combat tr...|[british, combat,...|[british, combat,...|
|bryant leads lake...|[bryant, lead, la...|[bryant, lead, la...|
|bushfire victims ...|[bushfire, victim...|[bushfire, victim...|
+--------------------+--------------------+--------------------+
only showing top 20 rows
// Generate n-grams. An n-gram is a sequence of "n" words, often used
// to discover the relationships of words in a document. For example,
// "Los Angeles" is a bigram, while "Los" and "Angeles" are unigrams.
// "Los" and "Angeles" are more meaningful when combined as a single
// entity, "Los Angeles". Determining the optimal number for "n" depends
// on the use case and the language used in the document. For our
// example, we will generate unigrams, bigrams, and trigrams.
import org.apache.spark.ml.feature.NGram
val unigram = new NGram()
.setN(1)
.setInputCol("filtered_stopwords")
.setOutputCol("unigram_words")
val dataDF5 = unigram.transform(dataDF4)
dataDF5.printSchema
root
|-- sen: string (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- filtered_stopwords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- unigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
val bigram = new NGram()
.setN(2)
.setInputCol("filtered_stopwords")
.setOutputCol("bigram_words")
val dataDF6 = bigram.transform(dataDF5)
dataDF6.printSchema
root
|-- sen: string (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- filtered_stopwords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- unigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
|-- bigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
val trigram = new NGram()
.setN(3)
.setInputCol("filtered_stopwords")
.setOutputCol("trigram_words")
val dataDF7 = trigram.transform(dataDF6)
dataDF7.printSchema
root
|-- sen: string (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- filtered_stopwords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- unigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
|-- bigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
|-- trigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
// We combine the unigrams, bigrams, and trigrams into a single
// vocabulary. We concatenate the words and store them in the
// "ngram_words" column using Spark SQL.
dataDF7.createOrReplaceTempView("dataDF7")
val dataDF8 = spark.sql("select sen,words,filtered_stopwords,unigram_words,bigram_words,trigram_words,concat(concat(unigram_words,bigram_words),trigram_words) as ngram_words from dataDF7")
dataDF8.printSchema
root
|-- sen: string (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- filtered_stopwords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- unigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
|-- bigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
|-- trigram_words: array (nullable = true)
| |-- element: string (containsNull = false)
|-- ngram_words: array (nullable = true)
| |-- element: string (containsNull = false)
dataDF8.select("ngram_words").show(20,65)
+-----------------------------------------------------------------+
|ngram_words |
+-----------------------------------------------------------------+
|[aba, decide, community, broadcasting, licence, aba decide, de...|
|[act, fire, witness, must, aware, defamation, act fire, fire w...|
|[g, call, infrastructure, protection, summit, g call, call inf...|
|[air, nz, staff, aust, strike, pay, rise, air nz, nz staff, st...|
|[air, nz, strike, affect, australian, traveller, air nz, nz st...|
|[ambitious, olsson, win, triple, jump, ambitious olsson, olsso...|
|[antic, delighted, record, break, barca, antic delighted, deli...|
|[aussie, qualifier, stosur, waste, four, memphis, match, aussi...|
|[aust, address, un, security, council, iraq, aust address, add...|
|[australia, lock, war, timetable, opp, australia lock, lock wa...|
|[australia, contribute, 10, million, aid, iraq, australia cont...|
|[barca, take, record, robson, celebrate, birthday, barca take,...|
|[bathhouse, plan, move, ahead, bathhouse plan, plan move, move...|
|[big, hope, launceston, cycling, championship, big hope, hope ...|
|[big, plan, boost, paroo, water, supplies, big plan, plan boos...|
|[blizzard, bury, united, state, bill, blizzard bury, bury unit...|
|[brigadier, dismiss, report, troops, harass, brigadier dismiss...|
|[british, combat, troops, arrive, daily, kuwait, british comba...|
|[bryant, lead, laker, double, overtime, win, bryant lead, lead...|
|[bushfire, victim, urge, see, centrelink, bushfire victim, vic...|
+-----------------------------------------------------------------+
only showing top 20 rows
// Use CountVectorizer to convert the text data into vectors of token counts.
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val cv = new CountVectorizer()
.setInputCol("ngram_words")
.setOutputCol("features")
val cvModel = cv.fit(dataDF8)
val dataDF9 = cvModel.transform(dataDF8)
val vocab = cvModel.vocabulary
vocab: Array[String] = Array(police, man, new, say, plan, charge, call,
council, govt, fire, court, win, interview, back, kill, australia, find,
death, urge, face, crash, nsw, report, water, get, australian, qld, take,
woman, wa, attack, sydney, year, change, murder, hit, health, jail, claim,
day, child, miss, hospital, car, home, sa, help, open, rise, warn, school,
world, market, cut, set, accuse, die, seek, drug, make, boost, may, coast,
government, ban, job, group, fear, mp, two, talk, service, farmer, minister, election, fund, south, road, continue, lead, worker, first, national, test, arrest, work, rural, go, power, price, cup, final, concern, green, china, mine, fight, labor, trial, return, flood, deal, north, case, push, pm, melbourne, law, driver, one, nt, want, centre, record, ...
// We use IDF to scale the features generated by CountVectorizer.
// Scaling the features generally improves performance.
import org.apache.spark.ml.feature.IDF
val idf = new IDF()
.setInputCol("features")
.setOutputCol("features2")
val idfModel = idf.fit(dataDF9)
val dataDF10 = idfModel.transform(dataDF9)
dataDF10.select("features2").show(20,65)
+-----------------------------------------------------------------+
| features2 |
+-----------------------------------------------------------------+
|(262144,[154,1054,1140,15338,19285],[5.276861439995834,6.84427...|
|(262144,[9,122,711,727,3141,5096,23449],[4.189486226673463,5.1...|
|(262144,[6,734,1165,1177,1324,43291,96869],[4.070620900306447,...|
|(262144,[48,121,176,208,321,376,424,2183,6231,12147,248053],[4...|
|(262144,[25,176,208,376,764,3849,12147,41079,94670,106284],[4....|
|(262144,[11,1008,1743,10833,128493,136885],[4.2101466208496285...|
|(262144,[113,221,3099,6140,9450,16643],[5.120230688038215,5.54...|
|(262144,[160,259,483,633,1618,4208,17750,187744],[5.3211036079...|
|(262144,[7,145,234,273,321,789,6163,10334,11101,32988],[4.0815...|
|(262144,[15,223,1510,5062,5556],[4.393970862600795,5.555011224...|
|(262144,[15,145,263,372,541,3896,15922,74174,197210],[4.393970...|
|(262144,[27,113,554,1519,3099,13499,41664,92259],[4.5216508634...|
|(262144,[4,131,232,5636,6840,11444,37265],[3.963488754657374,5...|
|(262144,[119,181,1288,1697,2114,49447,80829,139670],[5.1266204...|
|(262144,[4,23,60,181,2637,8975,9664,27571,27886],[3.9634887546...|
|(262144,[151,267,2349,3989,7631,11862],[5.2717309555002725,5.6...|
|(262144,[22,513,777,12670,33787,49626],[4.477068652869369,6.16...|
|(262144,[502,513,752,2211,5812,7154,30415,104812],[6.143079025...|
|(262144,[11,79,443,8222,8709,11447,194715],[4.2101466208496285...|
|(262144,[18,146,226,315,2877,5160,19389,42259],[4.414350240692...|
+-----------------------------------------------------------------+
only showing top 20 rows
// The scaled features can then be passed to LDA.
import org.apache.spark.ml.clustering.LDA
val lda = new LDA()
.setK(30)
.setMaxIter(10)
val model = lda.fit(dataDF10)
val topics = model.describeTopics
topics.show(20,30)
+-----+------------------------------+------------------------------+
|topic| termIndices| termWeights|
+-----+------------------------------+------------------------------+
| 0|[2, 7, 16, 9482, 9348, 5, 1...|[1.817876125380732E-4, 1.09...|
| 1|[974, 2, 3, 5189, 5846, 541...|[1.949552388785536E-4, 1.89...|
| 2|[2253, 4886, 12, 6767, 3039...|[2.7922272919208327E-4, 2.4...|
| 3|[6218, 6313, 5762, 3387, 27...|[1.6618313204146235E-4, 1.6...|
| 4|[0, 1, 39, 14, 13, 11, 2, 1...|[1.981809243111437E-4, 1.22...|
| 5|[4, 7, 22, 11, 2, 3, 79, 92...|[2.49620962563534E-4, 2.032...|
| 6|[15, 32, 319, 45, 342, 121,...|[2.885684164769467E-5, 2.45...|
| 7|[2298, 239, 1202, 3867, 431...|[3.435238376348344E-4, 3.30...|
| 8|[0, 4, 110, 3, 175, 38, 8, ...|[1.0177738516279581E-4, 8.7...|
| 9|[1, 19, 10, 2, 7, 8, 5, 0, ...|[2.2854683602607976E-4, 1.4...|
| 10|[1951, 1964, 16, 33, 1, 5, ...|[1.959705576881449E-4, 1.92...|
| 11|[12, 89, 72, 3, 92, 63, 62,...|[4.167255720848278E-5, 3.19...|
| 12|[4, 23, 13, 22, 73, 18, 70,...|[1.1641833113477034E-4, 1.1...|
| 13|[12, 1, 5, 16, 185, 132, 24...|[0.008769073702733892, 0.00...|
| 14|[9151, 13237, 3140, 14, 166...|[8.201099412213086E-5, 7.85...|
| 15|[9, 1, 0, 11, 3, 15, 32, 52...|[0.0032039727688580703, 0.0...|
| 16|[1, 10, 5, 56, 27, 3, 16, 1...|[5.252120584885086E-5, 4.05...|
| 17|[12, 1437, 4119, 1230, 5303...|[5.532790361864421E-4, 2.97...|
| 18|[12, 2459, 7836, 8853, 7162...|[6.862552774818539E-4, 1.83...|
| 19|[21, 374, 532, 550, 72, 773...|[0.0024665346250921432, 0.0...|
+-----+------------------------------+------------------------------+
only showing top 20 rows
// Determine the size of the vocabulary.
model.vocabSize
res27: Int = 262144
// Extract the topic words. The describeTopics method returns the
// dictionary indices from CountVectorizer's output. We will use a
// custom user-defined function to map the words to the indices.
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf
val extractWords = udf( (x : WrappedArray[Int]) => { x.map(i => vocab(i)) })
val topics = model
.describeTopics
.withColumn("words", extractWords(col("termIndices")))
topics.select("topic","termIndices","words").show(20,30)
+-----+------------------------------+------------------------------+
|topic| termIndices| words|
+-----+------------------------------+------------------------------+
| 0|[2, 7, 16, 9482, 9348, 5, 1...|[new, council, find, abuse ...|
| 1|[974, 2, 3, 5189, 5846, 541...|[2016, new, say, china sea,...|
| 2|[2253, 4886, 12, 6767, 3039...|[nathan, interview nathan, ...|
| 3|[6218, 6313, 5762, 3387, 27...|[new guinea, papua new guin...|
| 4|[0, 1, 39, 14, 13, 11, 2, 1...|[police, man, day, kill, ba...|
| 5|[4, 7, 22, 11, 2, 3, 79, 92...|[plan, council, report, win...|
| 6|[15, 32, 319, 45, 342, 121,...|[australia, year, india, sa...|
| 7|[2298, 239, 1202, 3867, 431...|[sach, tour, de, tour de, d...|
| 8|[0, 4, 110, 3, 175, 38, 8, ...|[police, plan, nt, say, fun...|
| 9|[1, 19, 10, 2, 7, 8, 5, 0, ...|[man, face, court, new, cou...|
| 10|[1951, 1964, 16, 33, 1, 5, ...|[vic country, vic country h...|
| 11|[12, 89, 72, 3, 92, 63, 62,...|[interview, price, farmer, ...|
| 12|[4, 23, 13, 22, 73, 18, 70,...|[plan, water, back, report,...|
| 13|[12, 1, 5, 16, 185, 132, 24...|[interview, man, charge, fi...|
| 14|[9151, 13237, 3140, 14, 166...|[campese, interview terry, ...|
| 15|[9, 1, 0, 11, 3, 15, 32, 52...|[fire, man, police, win, sa...|
| 16|[1, 10, 5, 56, 27, 3, 16, 1...|[man, court, charge, die, t...|
| 17|[12, 1437, 4119, 1230, 5303...|[interview, redback, 666, s...|
| 18|[12, 2459, 7836, 8853, 7162...|[interview, simon, intervie...|
| 19|[21, 374, 532, 550, 72, 773...|[nsw, asylum, seeker, asylu...|
+-----+------------------------------+------------------------------+
only showing top 20 rows
// Extract the term weights from describeTopics.
val wordsWeight = udf( (x : WrappedArray[Int],
y : WrappedArray[Double]) =>
{ x.map(i => vocab(i)).zip(y)}
)
val topics2 = model
.describeTopics
.withColumn("words", wordsWeight(col("termIndices"), col("termWeights")))
val topics3 = topics2
.select("topic", "words")
.withColumn("words", explode(col("words")))
topics3.show(50,false)
+-----+------------------------------------------------+
|topic|words |
+-----+------------------------------------------------+
|0 |[new, 1.4723785654465323E-4] |
|0 |[council, 1.242876719889358E-4] |
|0 |[thursday, 1.1710009304019913E-4] |
|0 |[grandstand thursday, 1.0958369194828903E-4] |
|0 |[two, 8.119593156862581E-5] |
|0 |[charge, 7.321024120305904E-5] |
|0 |[find, 6.98723717903146E-5] |
|0 |[burley griffin, 6.474176573486395E-5] |
|0 |[claim, 6.448801852215021E-5] |
|0 |[burley, 6.390953777977556E-5] |
|1 |[say, 1.9595383103126804E-4] |
|1 |[new, 1.7986957579978078E-4] |
|1 |[murder, 1.7156446166835784E-4] |
|1 |[las, 1.6793241095301546E-4] |
|1 |[vegas, 1.6622904053495525E-4] |
|1 |[las vegas, 1.627321199362179E-4] |
|1 |[2016, 1.4906599207615762E-4] |
|1 |[man, 1.3653760511354596E-4] |
|1 |[call, 1.3277357539424398E-4] |
|1 |[trump, 1.250570735309821E-4] |
|2 |[ntch, 5.213678388314454E-4] |
|2 |[ntch podcast, 4.6907569870744537E-4] |
|2 |[podcast, 4.625754070258578E-4] |
|2 |[interview, 1.2297477650126824E-4] |
|2 |[trent, 9.319817855283612E-5] |
|2 |[interview trent, 8.967384560094343E-5] |
|2 |[trent robinson, 7.256857525120274E-5] |
|2 |[robinson, 6.888930961680287E-5] |
|2 |[interview trent robinson, 6.821800839623336E-5]|
|2 |[miss, 6.267572268770148E-5] |
|3 |[new, 8.244153432249302E-5] |
|3 |[health, 5.269269109549137E-5] |
|3 |[change, 5.1481361386635024E-5] |
|3 |[first, 3.474601129571304E-5] |
|3 |[south, 3.335342687995096E-5] |
|3 |[rise, 3.3245575277669534E-5] |
|3 |[country, 3.26422466284622E-5] |
|3 |[abuse, 3.25594250748893E-5] |
|3 |[start, 3.139959761950907E-5] |
|3 |[minister, 3.1327427652213426E-5] |
|4 |[police, 1.756612187665565E-4] |
|4 |[man, 1.2903801461819285E-4] |
|4 |[petero, 8.259870531430337E-5] |
|4 |[kill, 8.251557569137285E-5] |
|4 |[accuse grant, 8.187325944352362E-5] |
|4 |[accuse grant bail, 7.609807356711693E-5] |
|4 |[find, 7.219731162848223E-5] |
|4 |[attack, 6.804063612991027E-5] |
|4 |[day, 6.772554893634948E-5] |
|4 |[jail, 6.470525327671485E-5] |
+-----+------------------------------------------------+
only showing top 50 rows
// Finally, we split the words and the weights into separate fields.
val topics4 = topics3
.select(col("topic"), col("words")
.getField("_1").as("word"), col("words")
.getField("_2").as("weight"))
topics4.show(50, false)
+-----+------------------------+---------------------+
|topic|word |weight |
+-----+------------------------+---------------------+
|0 |new |1.4723785654465323E-4|
|0 |council |1.242876719889358E-4 |
|0 |thursday |1.1710009304019913E-4|
|0 |grandstand thursday |1.0958369194828903E-4|
|0 |two |8.119593156862581E-5 |
|0 |charge |7.321024120305904E-5 |
|0 |find |6.98723717903146E-5 |
|0 |burley griffin |6.474176573486395E-5 |
|0 |claim |6.448801852215021E-5 |
|0 |burley |6.390953777977556E-5 |
|1 |say |1.9595383103126804E-4|
|1 |new |1.7986957579978078E-4|
|1 |murder |1.7156446166835784E-4|
|1 |las |1.6793241095301546E-4|
|1 |vegas |1.6622904053495525E-4|
|1 |las vegas |1.627321199362179E-4 |
|1 |2016 |1.4906599207615762E-4|
|1 |man |1.3653760511354596E-4|
|1 |call |1.3277357539424398E-4|
|1 |trump |1.250570735309821E-4 |
|2 |ntch |5.213678388314454E-4 |
|2 |ntch podcast |4.6907569870744537E-4|
|2 |podcast |4.625754070258578E-4 |
|2 |interview |1.2297477650126824E-4|
|2 |trent |9.319817855283612E-5 |
|2 |interview trent |8.967384560094343E-5 |
|2 |trent robinson |7.256857525120274E-5 |
|2 |robinson |6.888930961680287E-5 |
|2 |interview trent robinson|6.821800839623336E-5 |
|2 |miss |6.267572268770148E-5 |
|3 |new |8.244153432249302E-5 |
|3 |health |5.269269109549137E-5 |
|3 |change |5.1481361386635024E-5|
|3 |first |3.474601129571304E-5 |
|3 |south |3.335342687995096E-5 |
|3 |rise |3.3245575277669534E-5|
|3 |country |3.26422466284622E-5 |
|3 |abuse |3.25594250748893E-5 |
|3 |start |3.139959761950907E-5 |
|3 |minister |3.1327427652213426E-5|
|4 |police |1.756612187665565E-4 |
|4 |man |1.2903801461819285E-4|
|4 |petero |8.259870531430337E-5 |
|4 |kill |8.251557569137285E-5 |
|4 |accuse grant |8.187325944352362E-5 |
|4 |accuse grant bail |7.609807356711693E-5 |
|4 |find |7.219731162848223E-5 |
|4 |attack |6.804063612991027E-5 |
|4 |day |6.772554893634948E-5 |
|4 |jail |6.470525327671485E-5 |
+-----+------------------------+---------------------+
only showing top 50 rows
Listing 4-9. Topic modeling with LDA
For brevity, I only show the first 50 rows, covering 4 of the 30 topics. If you closely inspect the words in each topic, you will see recurring themes that can be used to classify the headlines.
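To see how these topics can be used to classify the headlines, the sketch below (an illustrative addition, not part of Listing 4-9) applies the fitted LDA model to the featurized headlines and assigns each one the topic with the highest probability from the topicDistribution column that LDA produces; the UDF and column names introduced here are for illustration only.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}
// A minimal sketch: label each headline with its most probable topic.
// Assumes the fitted LDA `model` and the featurized DataFrame `dataDF10`
// from Listing 4-9. LDA's transform adds a "topicDistribution" vector.
val dominantTopic = udf { (dist: Vector) => dist.argmax }
val classified = model.transform(dataDF10)
  .withColumn("topic", dominantTopic(col("topicDistribution")))
classified.select("sen", "topic").show(10, false)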
Anomaly Detection with Isolation Forest
Anomaly or outlier detection identifies rare observations that deviate significantly from, and stand out from, the majority of the dataset. It is frequently used to discover fraudulent financial transactions, identify cybersecurity threats, or perform predictive maintenance, to mention a few use cases. Anomaly detection is a popular area of research in the field of machine learning. Several anomaly detection techniques have been invented over the years with varying degrees of effectiveness. For this chapter, I will cover one of the most effective anomaly detection techniques, called Isolation Forest. Isolation Forest is a tree-based ensemble algorithm for anomaly detection developed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou.
Unlike most anomaly detection techniques, Isolation Forest tries to explicitly detect actual outliers instead of identifying normal data points. Isolation Forest operates on the fact that there are usually a small number of outliers in a dataset, which makes them susceptible to the process of isolation. Isolating outliers from normal data points is efficient because it requires fewer conditions; in contrast, isolating normal data points generally involves more conditions. As shown in Figure 4-4 (b), the anomalous data point is isolated with just one division, while the normal data point requires five divisions to isolate. When the data is represented as a tree structure, anomalies are more likely to end up much closer to the root node, at a much shallower depth, than normal data points. As shown in Figure 4-4 (a), the outlier (8, 12) has a tree depth of 1, while the normal data point (9, 15) has a tree depth of 5.
Isolation Forest does not require feature scaling since the distance threshold used in detecting outliers is based on tree depth. It works well with large and small datasets, and it does not require a training dataset since it is an unsupervised learning technique.
Figure 4-4. Number of divisions required to isolate anomalous and normal data points with Isolation Forest
Similar to other tree-based ensembles, Isolation Forest is built on a collection of decision trees known as isolation trees, with each tree using a subset of the entire dataset. The anomaly score is computed as the average anomaly score of the trees in the forest. The anomaly score is derived from the number of conditions required to split a data point. An anomaly score close to 1 signifies an anomaly, while a score lower than 0.5 signifies a non-anomalous observation (Figure 4-5).
Figure 4-5. Detecting anomalies with Isolation Forest
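For reference, the anomaly score described above is commonly defined as s(x, n) = 2^(-E(h(x))/c(n)), where E(h(x)) is the average path length of point x over the isolation trees and c(n) is the average path length of an unsuccessful search in a binary search tree built from n samples. The sketch below is an illustrative addition (not from the source) that computes this score for a given set of path lengths.
// A small sketch of the Isolation Forest anomaly score,
// s(x, n) = 2^(-E(h(x)) / c(n)). h(x) are the path lengths of a point
// across the isolation trees; c(n) normalizes by the average path length
// of an unsuccessful BST search over n samples.
object IForestScore {
  // Harmonic number approximation H(i) ~ ln(i) + Euler's constant.
  private def harmonic(i: Double): Double = math.log(i) + 0.5772156649
  // Average path length of an unsuccessful search in a BST of size n.
  def c(n: Double): Double =
    if (n > 2) 2 * harmonic(n - 1) - 2 * (n - 1) / n
    else if (n == 2) 1.0
    else 0.0
  // Anomaly score for a point given its path length in each tree.
  def score(pathLengths: Seq[Double], numSamples: Int): Double = {
    val avgPath = pathLengths.sum / pathLengths.size
    math.pow(2, -avgPath / c(numSamples.toDouble))
  }
}
// Short average paths (easily isolated points) push the score toward 1,
// while longer paths push it toward and below 0.5.
println(IForestScore.score(Seq(2.0, 3.0, 2.0), 256))    // ~0.85, anomalous-looking
println(IForestScore.score(Seq(10.0, 12.0, 11.0), 256)) // ~0.47, normal-looking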
Isolation Forest has been found to outperform other anomaly detection methods in both accuracy and performance. Figures 4-6 and 4-7 show a performance comparison of Isolation Forest and one-class support vector machine, another well-known outlier detection algorithm. The first test evaluated both algorithms against normal observations belonging to a single group (Figure 4-6), while the second test evaluated both algorithms against observations belonging to two uneven clusters (Figure 4-7). Isolation Forest outperformed one-class SVM in both cases.
Figure 4-6. Isolation Forest vs. one-class SVM: normal observations in a single group (image courtesy of Alejandro Correa Bahnsen)
Figure 4-7. Isolation Forest vs. one-class SVM: uneven clusters (image courtesy of Alejandro Correa Bahnsen)
Spark-iForest is an implementation of the Isolation Forest algorithm for Spark, developed by Fangzhou Yang with the help of several contributors. It is available as an external third-party package and is not included in the standard Apache Spark MLlib library. You can find more information on Spark-iForest, as well as the latest JAR file, by visiting the Spark-iForest GitHub page at https://github.com/titicaca/spark-iforest.
Parameters
Here is a list of the parameters supported by Spark-iForest. As you can see, some of the parameters are similar to those of other tree-based ensembles such as Random Forest.
- maxFeatures: The number of features to draw from the data to train each tree (>0). If maxFeatures <= 1, the algorithm will draw maxFeatures * totalFeatures features. If maxFeatures > 1, the algorithm will draw maxFeatures features.
- maxDepth: The height limit used in constructing a tree (>0). The default value will be about log2(numSamples).
- numTrees: The number of trees in the iForest model (>0).
- maxSamples: The number of samples to draw from the data to train each tree (>0). If maxSamples <= 1, the algorithm will draw maxSamples * totalSamples samples. If maxSamples > 1, the algorithm will draw maxSamples samples. The total memory required is about maxSamples * numTrees * 4 + maxSamples * 8 bytes.
- contamination: The proportion of outliers in the dataset; the value should be in (0, 1). It is used only in the prediction phase to convert anomaly scores into predicted labels. To enhance performance, the anomaly score threshold is obtained by approximate quantile calculation. You can set the parameter approxQuantileRelativeError greater than 0 to calculate an approximate quantile threshold of anomaly scores for large datasets.
- approxQuantileRelativeError: The relative error for the approximate quantile calculation (0 <= value <= 1); the default is 0 for computing the exact value, which can be expensive for large datasets.
- bootstrap: If true, individual trees are fit on random subsets of the training data sampled with replacement. If false, sampling without replacement is performed.
- seed: The seed used by the random number generator.
- featuresCol: The features column name; the default is "features".
- anomalyScoreCol: The anomaly score column name; the default is "anomalyScore".
- predictionCol: The prediction column name; the default is "prediction".
Example
We will use Spark-iForest to predict the occurrence of breast cancer using the Breast Cancer Wisconsin dataset (Table 4-2), available from the UCI Machine Learning Repository (see Listing 4-10).
Table 4-2. Breast Cancer Wisconsin dataset
Index | Feature | Domain |
1 | Sample code number | id number |
2 | Clump thickness | 1–10 |
3 | Uniformity of cell size | 1–10 |
4 | Uniformity of cell shape | 1–10 |
5 | Marginal adhesion | 1–10 |
6 | Single epithelial cell size | 1–10 |
7 | Bare nuclei | 1–10 |
8 | Bland chromatin | 1–10 |
9 | Normal nucleoli | 1–10 |
10 | Mitoses | 1–10 |
11 | Class | (2 for benign, 4 for malignant) |
spark-shell --jars spark-iforest-1.0-SNAPSHOT.jar
import org.apache.spark.sql.types._
var dataSchema = StructType(Array(
StructField("id", IntegerType, true),
StructField("clump_thickness", IntegerType, true),
StructField("ucell_size", IntegerType, true),
StructField("ucell_shape", IntegerType, true),
StructField("marginal_ad", IntegerType, true),
StructField("se_cellsize", IntegerType, true),
StructField("bare_nuclei", IntegerType, true),
StructField("bland_chromatin", IntegerType, true),
StructField("normal_nucleoli", IntegerType, true),
StructField("mitosis", IntegerType, true),
StructField("class", IntegerType, true)
))
val dataDF = spark.read.option("inferSchema", "true")
.schema(dataSchema)
.csv("/files/breast-cancer-wisconsin.csv")
dataDF.printSchema
//The dataset contains 16 rows with missing attribute values.
//We'll remove them for this exercise.
val dataDF2 = dataDF.filter("bare_nuclei is not null")
val seed = 1234
val Array(trainingData, testData) = dataDF2.randomSplit(Array(0.8, 0.2), seed)
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().setInputCol("class").setOutputCol("label")
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("clump_thickness",
"ucell_size", "ucell_shape", "marginal_ad", "se_cellsize", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis"))
.setOutputCol("features")
import org.apache.spark.ml.iforest._
import org.apache.spark.ml.Pipeline
val iForest = new IForest()
.setMaxSamples(150)
.setContamination(0.30)
.setBootstrap(false)
.setSeed(seed)
.setNumTrees(100)
.setMaxDepth(50)
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, assembler, iForest))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
predictions.select("id","features","anomalyScore","prediction").show()
+------+--------------------+-------------------+----------+
| id| features| anomalyScore|prediction|
+------+--------------------+-------------------+----------+
| 63375|[9.0,1.0,2.0,6.0,...| 0.6425205920636737| 1.0|
| 76389|[10.0,4.0,7.0,2.0...| 0.6475157383643779| 1.0|
| 95719|[6.0,10.0,10.0,10...| 0.6413247885878359| 1.0|
|242970|[5.0,7.0,7.0,1.0,...| 0.6156526231532693| 1.0|
|353098|[4.0,1.0,1.0,2.0,...|0.45686731187686386| 0.0|
|369565|[4.0,1.0,1.0,1.0,...|0.45957810648090186| 0.0|
|390840|[8.0,4.0,7.0,1.0,...| 0.6387497388682214| 1.0|
|412300|[10.0,4.0,5.0,4.0...| 0.6104797020175959| 1.0|
|466906|[1.0,1.0,1.0,1.0,...|0.41857428772927696| 0.0|
|476903|[10.0,5.0,7.0,3.0...| 0.6152957125696049| 1.0|
|486283|[3.0,1.0,1.0,1.0,...|0.47218763124223706| 0.0|
|557583|[5.0,10.0,10.0,10...| 0.6822227844447365| 1.0|
|636437|[1.0,1.0,1.0,1.0,...|0.41857428772927696| 0.0|
|654244|[1.0,1.0,1.0,1.0,...| 0.4163657637214968| 0.0|
|657753|[3.0,1.0,1.0,4.0,...|0.49314746153500594| 0.0|
|666090|[1.0,1.0,1.0,1.0,...|0.45842258207090547| 0.0|
|688033|[1.0,1.0,1.0,1.0,...|0.41857428772927696| 0.0|
|690557|[5.0,1.0,1.0,1.0,...| 0.4819098604217553| 0.0|
|704097|[1.0,1.0,1.0,1.0,...| 0.4163657637214968| 0.0|
|770066|[5.0,2.0,2.0,2.0,...| 0.5125093127301371| 0.0|
+------+--------------------+-------------------+----------+
only showing top 20 rows
Listing 4-10. Anomaly detection with Isolation Forest
We cannot use BinaryClassificationEvaluator to evaluate our Isolation Forest model since it expects a rawPrediction field to be present in the output. Spark-iForest generates an anomalyScore field instead of rawPrediction. We will use BinaryClassificationMetrics to evaluate the model instead.
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.Row
val binaryMetrics = new BinaryClassificationMetrics(
predictions.select("prediction", "label").rdd.map {
case Row(prediction: Double, label: Double) => (prediction, label)
}
)
println(s"AUC: ${binaryMetrics.areaUnderROC()}")
AUC: 0.9532866199532866
Dimensionality Reduction with Principal Component Analysis
Principal component analysis (PCA) is an unsupervised machine learning technique used for reducing the dimensionality of the feature space. It detects correlations between features and generates a reduced number of linearly uncorrelated features while retaining most of the variance in the original dataset. These more compact, linearly uncorrelated features are called principal components. The principal components are sorted in descending order of their explained variance. Dimensionality reduction is essential when there are a large number of features in your dataset. Machine learning use cases in the fields of genomics and industrial analytics, for example, usually involve thousands or even millions of features. High dimensionality makes models more complex, increasing the chance of overfitting. At some point, adding more features actually decreases the performance of the model. In addition, training on high-dimensional data requires significant computing resources. These problems are collectively known as the curse of dimensionality. Dimensionality reduction techniques aim to overcome the curse of dimensionality.
Note that the principal components generated by PCA will not be interpretable. This is a deal-breaker in situations where you need to understand why a prediction was made. Furthermore, it is essential to standardize your dataset before applying PCA to prevent features that are on the largest scale from being considered more important than other features.
Example
For our example, we will use PCA on the Iris dataset to project the four-dimensional feature vectors into two-dimensional principal components (see Listing 4-11).
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.types._
val irisSchema = StructType(Array (
StructField("sepal_length", DoubleType, true),
StructField("sepal_width", DoubleType, true),
StructField("petal_length", DoubleType, true),
StructField("petal_width", DoubleType, true),
StructField("class", StringType, true)
))
val dataDF = spark.read.format("csv")
.option("header", "false")
.schema(irisSchema)
.load("/files/iris.data")
dataDF.printSchema
root
|-- sepal_length: double (nullable = true)
|-- sepal_width: double (nullable = true)
|-- petal_length: double (nullable = true)
|-- petal_width: double (nullable = true)
|-- class: string (nullable = true)
dataDF.show
+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width| class|
+------------+-----------+------------+-----------+-----------+
| 5.1| 3.5| 1.4| 0.2|Iris-setosa|
| 4.9| 3.0| 1.4| 0.2|Iris-setosa|
| 4.7| 3.2| 1.3| 0.2|Iris-setosa|
| 4.6| 3.1| 1.5| 0.2|Iris-setosa|
| 5.0| 3.6| 1.4| 0.2|Iris-setosa|
| 5.4| 3.9| 1.7| 0.4|Iris-setosa|
| 4.6| 3.4| 1.4| 0.3|Iris-setosa|
| 5.0| 3.4| 1.5| 0.2|Iris-setosa|
| 4.4| 2.9| 1.4| 0.2|Iris-setosa|
| 4.9| 3.1| 1.5| 0.1|Iris-setosa|
| 5.4| 3.7| 1.5| 0.2|Iris-setosa|
| 4.8| 3.4| 1.6| 0.2|Iris-setosa|
| 4.8| 3.0| 1.4| 0.1|Iris-setosa|
| 4.3| 3.0| 1.1| 0.1|Iris-setosa|
| 5.8| 4.0| 1.2| 0.2|Iris-setosa|
| 5.7| 4.4| 1.5| 0.4|Iris-setosa|
| 5.4| 3.9| 1.3| 0.4|Iris-setosa|
| 5.1| 3.5| 1.4| 0.3|Iris-setosa|
| 5.7| 3.8| 1.7| 0.3|Iris-setosa|
| 5.1| 3.8| 1.5| 0.3|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 20 rows
dataDF.describe().show(5,15)
+-------+---------------+---------------+---------------+---------------+
|summary| sepal_length| sepal_width| petal_length| petal_width|
+-------+---------------+---------------+---------------+---------------+
| count| 150| 150| 150| 150|
| mean|5.8433333333...|3.0540000000...|3.7586666666...|1.1986666666...|
| stddev|0.8280661279...|0.4335943113...|1.7644204199...|0.7631607417...|
| min| 4.3| 2.0| 1.0| 0.1|
| max| 7.9| 4.4| 6.9| 2.5|
+-------+---------------+---------------+---------------+---------------+
+--------------+
| class|
+--------------+
| 150|
| null|
| null|
| Iris-setosa|
|Iris-virginica|
+--------------+
val labelIndexer = new StringIndexer()
.setInputCol("class")
.setOutputCol("label")
val dataDF2 = labelIndexer.fit(dataDF).transform(dataDF)
dataDF2.printSchema
root
|-- sepal_length: double (nullable = true)
|-- sepal_width: double (nullable = true)
|-- petal_length: double (nullable = true)
|-- petal_width: double (nullable = true)
|-- class: string (nullable = true)
|-- label: double (nullable = false)
dataDF2.show
+------------+-----------+------------+-----------+-----------+-----+
|sepal_length|sepal_width|petal_length|petal_width| class|label|
+------------+-----------+------------+-----------+-----------+-----+
| 5.1| 3.5| 1.4| 0.2|Iris-setosa| 0.0|
| 4.9| 3.0| 1.4| 0.2|Iris-setosa| 0.0|
| 4.7| 3.2| 1.3| 0.2|Iris-setosa| 0.0|
| 4.6| 3.1| 1.5| 0.2|Iris-setosa| 0.0|
| 5.0| 3.6| 1.4| 0.2|Iris-setosa| 0.0|
| 5.4| 3.9| 1.7| 0.4|Iris-setosa| 0.0|
| 4.6| 3.4| 1.4| 0.3|Iris-setosa| 0.0|
| 5.0| 3.4| 1.5| 0.2|Iris-setosa| 0.0|
| 4.4| 2.9| 1.4| 0.2|Iris-setosa| 0.0|
| 4.9| 3.1| 1.5| 0.1|Iris-setosa| 0.0|
| 5.4| 3.7| 1.5| 0.2|Iris-setosa| 0.0|
| 4.8| 3.4| 1.6| 0.2|Iris-setosa| 0.0|
| 4.8| 3.0| 1.4| 0.1|Iris-setosa| 0.0|
| 4.3| 3.0| 1.1| 0.1|Iris-setosa| 0.0|
| 5.8| 4.0| 1.2| 0.2|Iris-setosa| 0.0|
| 5.7| 4.4| 1.5| 0.4|Iris-setosa| 0.0|
| 5.4| 3.9| 1.3| 0.4|Iris-setosa| 0.0|
| 5.1| 3.5| 1.4| 0.3|Iris-setosa| 0.0|
| 5.7| 3.8| 1.7| 0.3|Iris-setosa| 0.0|
| 5.1| 3.8| 1.5| 0.3|Iris-setosa| 0.0|
+------------+-----------+------------+-----------+-----------+-----+
only showing top 20 rows
import org.apache.spark.ml.feature.VectorAssembler
val features = Array("sepal_length","sepal_width","petal_length","petal_width")
val assembler = new VectorAssembler()
.setInputCols(features)
.setOutputCol("features")
val dataDF3 = assembler.transform(dataDF2)
dataDF3.printSchema
root
|-- sepal_length: double (nullable = true)
|-- sepal_width: double (nullable = true)
|-- petal_length: double (nullable = true)
|-- petal_width: double (nullable = true)
|-- class: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
dataDF3.show
+------------+-----------+------------+-----------+-----------+-----+
|sepal_length|sepal_width|petal_length|petal_width| class|label|
+------------+-----------+------------+-----------+-----------+-----+
| 5.1| 3.5| 1.4| 0.2|Iris-setosa| 0.0|
| 4.9| 3.0| 1.4| 0.2|Iris-setosa| 0.0|
| 4.7| 3.2| 1.3| 0.2|Iris-setosa| 0.0|
| 4.6| 3.1| 1.5| 0.2|Iris-setosa| 0.0|
| 5.0| 3.6| 1.4| 0.2|Iris-setosa| 0.0|
| 5.4| 3.9| 1.7| 0.4|Iris-setosa| 0.0|
| 4.6| 3.4| 1.4| 0.3|Iris-setosa| 0.0|
| 5.0| 3.4| 1.5| 0.2|Iris-setosa| 0.0|
| 4.4| 2.9| 1.4| 0.2|Iris-setosa| 0.0|
| 4.9| 3.1| 1.5| 0.1|Iris-setosa| 0.0|
| 5.4| 3.7| 1.5| 0.2|Iris-setosa| 0.0|
| 4.8| 3.4| 1.6| 0.2|Iris-setosa| 0.0|
| 4.8| 3.0| 1.4| 0.1|Iris-setosa| 0.0|
| 4.3| 3.0| 1.1| 0.1|Iris-setosa| 0.0|
| 5.8| 4.0| 1.2| 0.2|Iris-setosa| 0.0|
| 5.7| 4.4| 1.5| 0.4|Iris-setosa| 0.0|
| 5.4| 3.9| 1.3| 0.4|Iris-setosa| 0.0|
| 5.1| 3.5| 1.4| 0.3|Iris-setosa| 0.0|
| 5.7| 3.8| 1.7| 0.3|Iris-setosa| 0.0|
| 5.1| 3.8| 1.5| 0.3|Iris-setosa| 0.0|
+------------+-----------+------------+-----------+-----------+-----+
+-----------------+
| features|
+-----------------+
|[5.1,3.5,1.4,0.2]|
|[4.9,3.0,1.4,0.2]|
|[4.7,3.2,1.3,0.2]|
|[4.6,3.1,1.5,0.2]|
|[5.0,3.6,1.4,0.2]|
|[5.4,3.9,1.7,0.4]|
|[4.6,3.4,1.4,0.3]|
|[5.0,3.4,1.5,0.2]|
|[4.4,2.9,1.4,0.2]|
|[4.9,3.1,1.5,0.1]|
|[5.4,3.7,1.5,0.2]|
|[4.8,3.4,1.6,0.2]|
|[4.8,3.0,1.4,0.1]|
|[4.3,3.0,1.1,0.1]|
|[5.8,4.0,1.2,0.2]|
|[5.7,4.4,1.5,0.4]|
|[5.4,3.9,1.3,0.4]|
|[5.1,3.5,1.4,0.3]|
|[5.7,3.8,1.7,0.3]|
|[5.1,3.8,1.5,0.3]|
+-----------------+
//We will standardize the four attributes (sepal_length, sepal_width,
//petal_length, and petal_width) using StandardScaler, even though they
//all have the same scale and measure the same quantity. As discussed
//earlier, standardization is considered best practice and is a
//requirement for many algorithms such as PCA to execute optimally.
import org.apache.spark.ml.feature.StandardScaler
val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)
val dataDF4 = scaler.fit(dataDF3).transform(dataDF3)
dataDF4.printSchema
root
|-- sepal_length: double (nullable = true)
|-- sepal_width: double (nullable = true)
|-- petal_length: double (nullable = true)
|-- petal_width: double (nullable = true)
|-- class: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- scaledFeatures: vector (nullable = true)
// Generate two principal components.
val pca = new PCA()
.setInputCol("scaledFeatures")
.setOutputCol("pcaFeatures")
.setK(2)
.fit(dataDF4)
val dataDF5 = pca.transform(dataDF4)
dataDF5.printSchema
root
|-- sepal_length: double (nullable = true)
|-- sepal_width: double (nullable = true)
|-- petal_length: double (nullable = true)
|-- petal_width: double (nullable = true)
|-- class: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- scaledFeatures: vector (nullable = true)
|-- pcaFeatures: vector (nullable = true)
dataDF5.select("scaledFeatures","pcaFeatures").show(false)
+-------------------------------------------------------------------------+
|scaledFeatures |
+-------------------------------------------------------------------------+
|[6.158928408838787,8.072061621390857,0.7934616853039358,0.26206798787142]|
|[5.9174018045706,6.9189099611921625,0.7934616853039358,0.26206798787142] |
|[5.675875200302412,7.38017062527164,0.7367858506393691,0.26206798787142] |
|[5.555111898168318,7.149540293231902,0.8501375199685027,0.26206798787142]|
|[6.038165106704694,8.302691953430596,0.7934616853039358,0.26206798787142]|
|[6.52121831524107,8.99458294954981,0.9634891892976364,0.52413597574284] |
|[5.555111898168318,7.841431289351117,0.7934616853039358,0.39310198180713]|
|[6.038165106704694,7.841431289351117,0.8501375199685027,0.26206798787142]|
|[5.313585293900131,6.688279629152423,0.7934616853039358,0.26206798787142]|
|[5.9174018045706,7.149540293231902,0.8501375199685027,0.13103399393571] |
|[6.52121831524107,8.533322285470334,0.8501375199685027,0.26206798787142] |
|[5.7966385024365055,7.841431289351117,0.9068133546330697,0.262067987871] |
|[5.7966385024365055,6.9189099611921625,0.7934616853039358,0.131033993935]|
|[5.192821991766037,6.9189099611921625,0.6234341813102354,0.1310339939351]|
|[7.004271523777445,9.22521328158955,0.6801100159748021,0.26206798787142] |
|[6.883508221643351,10.147734609748506,0.8501375199685027,0.524135975742] |
|[6.52121831524107,8.99458294954981,0.7367858506393691,0.52413597574284] |
|[6.158928408838787,8.072061621390857,0.7934616853039358,0.39310198180713]|
|[6.883508221643351,8.763952617510071,0.9634891892976364,0.39310198180713]|
|[6.158928408838787,8.763952617510071,0.8501375199685027,0.39310198180713]|
+-------------------------------------------------------------------------+
+-----------------------------------------+
|pcaFeatures |
+-----------------------------------------+
|[-1.7008636408214346,-9.798112476165109] |
|[-1.8783851549940478,-8.640880678324866] |
|[-1.597800192305247,-8.976683127367169] |
|[-1.6613406138855684,-8.720650458966217] |
|[-1.5770426874367196,-9.96661148272853] |
|[-1.8942207975522354,-10.80757533867312] |
|[-1.5202989381570455,-9.368410789070643] |
|[-1.7314610064823877,-9.540884243679617] |
|[-1.6237061774493644,-8.202607301741613] |
|[-1.7764763044699745,-8.846965954487347] |
|[-1.8015813990792064,-10.361118028393015]|
|[-1.6382374187586244,-9.452155017757546] |
|[-1.741187558292187,-8.587346593832775] |
|[-1.3269417814262463,-8.358947926562632] |
|[-1.7728726239179156,-11.177765120852797]|
|[-1.7138964933624494,-12.00737840334759] |
|[-1.7624485738747564,-10.80279308233496] |
|[-1.7624485738747564,-10.80279308233496] |
|[-1.7624485738747564,-10.80279308233496] |
|[-1.6257080769316516,-10.44826393443861] |
+-----------------------------------------+
Listing 4-11. Dimensionality reduction with PCA
As mentioned earlier, the Iris dataset contains three types of flowers (Iris setosa, Iris versicolor, and Iris virginica) with four attributes (sepal length, sepal width, petal length, and petal width). Let's plot the samples on the two principal components. As you can see in Figure 4-8, Iris setosa is well separated from the other two classes, while Iris versicolor and Iris virginica slightly overlap.
Figure 4-8. PCA projection of the Iris dataset
The explainedVariance method returns a vector containing the proportions of variance explained by each principal component. Our goal is to retain as much variance as possible in the new principal components.
pca.explainedVariance
res5: org.apache.spark.ml.linalg.DenseVector = [0.7277045209380264,0.23030523267679512]
Based on the output of the method, the first principal component explains 72.77% of the variance, while the second principal component explains 23.03% of the variance. Cumulatively, the two principal components explain 95.8% of the variance. As you can see, we lose some information when we reduce the dimensionality. This is generally an acceptable trade-off if there is a substantial improvement in training performance while maintaining good model accuracy.
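If you want to choose the number of components programmatically rather than by inspection, the sketch below (an illustrative addition, not part of Listing 4-11) fits PCA with all four components and keeps the smallest k whose cumulative explained variance crosses a chosen threshold; the 0.95 threshold is an arbitrary choice.
import org.apache.spark.ml.feature.PCA
// A minimal sketch: pick the smallest number of principal components whose
// cumulative explained variance reaches 95%. Assumes the scaled DataFrame
// `dataDF4` with a "scaledFeatures" column from Listing 4-11.
val fullPca = new PCA()
  .setInputCol("scaledFeatures")
  .setOutputCol("pcaFeaturesAll")
  .setK(4)                       // keep all components to inspect variance
  .fit(dataDF4)
val cumulative = fullPca.explainedVariance.toArray.scanLeft(0.0)(_ + _).tail
val bestK = cumulative.indexWhere(_ >= 0.95) + 1
println(s"Cumulative explained variance: ${cumulative.mkString(", ")}")
println(s"Number of components to retain: $bestK")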
Summary
We covered several unsupervised learning techniques and learned how to apply them to real-world business use cases. Unsupervised learning has experienced a resurgence in popularity in recent years with the advent of big data. Techniques such as cluster analysis, anomaly detection, and principal component analysis help make sense of the vast amount of unstructured data generated by mobile and IoT devices, sensors, social media, and more. They are powerful tools in the machine learning arsenal.