PySpark Feature Engineering: Normalizer

Normalizer (normalization)

class pyspark.ml.feature.Normalizer(p=2.0, inputCol=None, outputCol=None)

Normalizes a vector to have unit norm, using the given p-norm.

1-norm: $\|x\|_1 = |x_1| + |x_2| + \dots + |x_n|$

2-norm: $\|x\|_2 = \left(|x_1|^2 + |x_2|^2 + \dots + |x_n|^2\right)^{1/2}$

∞-norm: $\|x\|_\infty = \max(|x_1|, |x_2|, \dots, |x_n|)$
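The three norms above can be checked by hand with a few lines of plain Python. This is only a minimal sketch, not Spark's implementation: the helper names `p_norm` and `normalize` are made up for illustration, and leaving an all-zero vector unchanged is an assumption of the sketch.

```python
import math

def p_norm(values, p):
    """Return the p-norm of a sequence of floats (p >= 1, or math.inf)."""
    if p == math.inf:
        # Infinity norm: the largest absolute component.
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def normalize(values, p=2.0):
    """Divide every component by the p-norm so the result has unit p-norm."""
    norm = p_norm(values, p)
    if norm == 0.0:
        # Assumption of this sketch: return an all-zero vector unchanged
        # instead of dividing by zero.
        return list(values)
    return [v / norm for v in values]

x = [3.0, -4.0]
print(p_norm(x, 1.0), p_norm(x, 2.0), p_norm(x, math.inf))
print(normalize(x, 2.0))
```

The vector `[3.0, -4.0]` is the same dense vector used in the steps below, so the printed values can be compared with the Spark output.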

01. Initialization:

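from pyspark.sql import SparkSession

# Build a local SparkSession. The driver host IP is specific to the author's machine.
# Spark property names are case-sensitive, so the key is "spark.driver.host".
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.4")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("Normalizer").master("local[*]").getOrCreate()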

02. Create the data (one sparse vector and one dense vector)

from pyspark.ml.linalg import Vectors
svec = Vectors.sparse(4, {1: 4.0, 3: 3.0})
df = spark.createDataFrame([(Vectors.dense([3.0, -4.0]), svec)], ["dense", "sparse"])
df.show()
df.printSchema()

Output:

+----------+-------------------+
|     dense|             sparse|
+----------+-------------------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|
+----------+-------------------+

root
 |-- dense: vector (nullable = true)
 |-- sparse: vector (nullable = true)

03. Normalize the dense vector using the 2-norm

from pyspark.ml.feature import Normalizer
normalizer = Normalizer(p=2.0, inputCol="dense", outputCol="features")
normalizer.transform(df).show()
normalizer.transform(df).head(1)

Output:

+----------+-------------------+----------+
|     dense|             sparse|  features|
+----------+-------------------+----------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|[0.6,-0.8]|
+----------+-------------------+----------+

[Row(dense=DenseVector([3.0, -4.0]), sparse=SparseVector(4, {1: 4.0, 3: 3.0}), features=DenseVector([0.6, -0.8]))]

04. Normalize the sparse vector using the 2-norm

normalizer.setParams(inputCol="sparse", outputCol="freqs").transform(df).show()

Output:

+----------+-------------------+-------------------+
|     dense|             sparse|              freqs|
+----------+-------------------+-------------------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|(4,[1,3],[0.8,0.6])|
+----------+-------------------+-------------------+
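The sparse result keeps the same size and the same non-zero indices; only the stored values are scaled. A rough pure-Python sketch of that idea follows. It represents the sparse vector as a size plus an index-to-value dict, which is an illustrative stand-in rather than Spark's `SparseVector`, and `normalize_sparse` is a hypothetical helper name.

```python
import math

def normalize_sparse(size, entries, p=2.0):
    """Normalize a sparse vector given as (size, {index: value})."""
    if p == math.inf:
        norm = max((abs(v) for v in entries.values()), default=0.0)
    else:
        # Implicit zeros contribute nothing to the norm, so only stored values matter.
        norm = sum(abs(v) ** p for v in entries.values()) ** (1.0 / p)
    if norm == 0.0:
        # Assumption of this sketch: an all-zero vector is returned unchanged.
        return size, dict(entries)
    return size, {i: v / norm for i, v in entries.items()}

size, scaled = normalize_sparse(4, {1: 4.0, 3: 3.0})
print(size, scaled)
```

With the 2-norm, the stored values 4.0 and 3.0 are divided by 5, which matches the `freqs` column above.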

05. Override the normalizer's parameters (norm order, column names, etc.) and normalize the dense vector using the 1-norm

params = {normalizer.p: 1.0, normalizer.inputCol: "dense", normalizer.outputCol: "vector"}
normalizer.transform(df, params).show()
normalizer.transform(df, params).head(1)

Output:

+----------+-------------------+--------------------+
|     dense|             sparse|              vector|
+----------+-------------------+--------------------+
|[3.0,-4.0]|(4,[1,3],[4.0,3.0])|[0.42857142857142...|
+----------+-------------------+--------------------+

[Row(dense=DenseVector([3.0, -4.0]), sparse=SparseVector(4, {1: 4.0, 3: 3.0}), vector=DenseVector([0.4286, -0.5714]))]
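As a quick sanity check, the 1-norm result for the dense vector (3.0, -4.0) can be worked out by hand, using the 1-norm definition given at the top of the article:

```latex
\|x\|_1 = |3| + |-4| = 7, \qquad
\frac{x}{\|x\|_1} = \left(\tfrac{3}{7},\ -\tfrac{4}{7}\right) \approx (0.4286,\ -0.5714)
```

The absolute values of the normalized components sum to 1, as expected for a vector with unit 1-norm.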