Normalizer (normalization)
class pyspark.ml.feature.Normalizer(p=2.0, inputCol=None, outputCol=None)
Scales a vector to unit norm under the given p-norm (p defaults to 2.0).
1-norm: $\|x\|_1 = |x_1| + |x_2| + \dots + |x_n|$
2-norm: $\|x\|_2 = \left(|x_1|^2 + |x_2|^2 + \dots + |x_n|^2\right)^{1/2}$
∞-norm: $\|x\|_\infty = \max(|x_1|, |x_2|, \dots, |x_n|)$
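The three definitions above can be checked with a few lines of plain Python. This is a minimal sketch, not Spark's implementation; the helper names `p_norm` and `normalize` are hypothetical and chosen only for illustration, and the handling of an all-zero vector is an assumption.

```python
import math

def p_norm(x, p):
    """Return the p-norm of a sequence of numbers (p >= 1, or math.inf)."""
    if p == math.inf:
        # Infinity-norm: largest absolute component.
        return max(abs(v) for v in x)
    # General case: (|x1|^p + ... + |xn|^p)^(1/p)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def normalize(x, p=2.0):
    """Divide every component by the p-norm so the result has unit p-norm."""
    norm = p_norm(x, p)
    if norm == 0:
        # Assumption: leave an all-zero vector unchanged rather than divide by zero.
        return list(x)
    return [v / norm for v in x]

x = [1.0, -2.0, 2.0]
print(p_norm(x, 1))         # 1 + 2 + 2
print(p_norm(x, 2))         # sqrt(1 + 4 + 4)
print(p_norm(x, math.inf))  # max(1, 2, 2)
print(normalize(x, 2.0))    # each component divided by the 2-norm
```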
01. Initialization:
from pyspark.sql import SparkSession
# Set the driver host to your own machine's address
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.4")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("Normalizer").master("local[*]").getOrCreate()
02. Create the data (one sparse vector and one dense vector)
from pyspark.ml.linalg import Vectors
svec = Vectors.sparse(4, {1: 4.0, 3: 3.0})
df = spark.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"])
df.show()
df.printSchema()
Output:
+----------+-------------------+
|     dense|             sparse|
+----------+-------------------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|
+----------+-------------------+
root
 |-- dense: vector (nullable = true)
 |-- sparse: vector (nullable = true)
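The `sparse` column prints as `(4,[1,3],[4.0,3.0])`: the vector size, the indices of the nonzero entries, and their values. A minimal plain-Python sketch of that layout follows; the helper `sparse_to_dense` is hypothetical and not part of PySpark.

```python
def sparse_to_dense(size, entries):
    """Expand {index: value} into a full list, filling missing slots with 0.0."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return dense

# Same size and entries as Vectors.sparse(4, {1: 4.0, 3: 3.0}) in the step above.
print(sparse_to_dense(4, {1: 4.0, 3: 3.0}))  # [0.0, 4.0, 0.0, 3.0]
```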
03. Normalize the dense vector using the 2-norm
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(p=2.0, inputCol="dense", outputCol="features")
normalizer.transform(df).show()
normalizer.transform(df).head(1)
Output:
+----------+-------------------+----------+
|     dense|             sparse|  features|
+----------+-------------------+----------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|[0.6,-0.8]|
+----------+-------------------+----------+
[Row(dense=DenseVector([3.0, -4.0]), sparse=SparseVector(4, {1: 4.0, 3: 3.0}), features=DenseVector([0.6, -0.8]))]
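The `features` value can be checked by hand from the 2-norm definition. A worked example for the dense vector x = (3, -4):

```latex
\[
\|x\|_2 = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5,
\qquad
\frac{x}{\|x\|_2} = \left(\tfrac{3}{5},\ \tfrac{-4}{5}\right) = (0.6,\ -0.8)
\]
```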
03. Normalize the sparse vector using the 2-norm
normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).show()
Output:
+----------+-------------------+-------------------+
|     dense|             sparse|              freqs|
+----------+-------------------+-------------------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|(4,[1,3],[0.8,0.6])|
+----------+-------------------+-------------------+
04. Override the norm order, column names, and other parameters at transform time, then normalize the dense vector using the 1-norm
params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"}
normalizer.transform(df, params).show()
normalizer.transform(df, params).head(1)
Output:
+----------+-------------------+--------------------+
|     dense|             sparse|              vector|
+----------+-------------------+--------------------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|[0.42857142857142...|
+----------+-------------------+--------------------+
[Row(dense=DenseVector([3.0, -4.0]), sparse=SparseVector(4, {1: 4.0, 3: 3.0}), vector=DenseVector([0.4286, -0.5714]))]