VectorSizeHint
class pyspark.ml.feature.VectorSizeHint(inputCol=None, size=None, handleInvalid='error')
A feature transformer that adds size information to a vector column's metadata. VectorAssembler needs size information for its input columns and cannot be used on streaming DataFrames without this metadata.
VectorSizeHint modifies inputCol in place to include the size metadata; it has no outputCol.
Only vectors of the specified size can be used; how mismatched vectors are handled is controlled by handleInvalid.
01. Create the data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("VectorSizeHint").master("local[*]").getOrCreate()
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel
data = [(Vectors.dense([1., 2., 3.]), 4.)]
df = spark.createDataFrame(data, ["vector", "float"])
df.show()
Output:
+-------------+-----+
| vector|float|
+-------------+-----+
|[1.0,2.0,3.0]| 4.0|
+-------------+-----+
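Before the hint is applied, the vector column has no size metadata. A quick check (a minimal sketch, not part of the original walkthrough):
# The inferred vector column carries an empty metadata dict: no size information yet.
print(df.schema["vector"].metadata)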
02. Use VectorSizeHint with a Pipeline
from pyspark.ml.feature import VectorSizeHint, VectorAssembler
sizeHint = VectorSizeHint(inputCol="vector", size=3, handleInvalid="skip")
vecAssembler = VectorAssembler(inputCols=["vector", "float"], outputCol="assembled")
pipeline = Pipeline(stages=[sizeHint, vecAssembler])
pipelineModel = pipeline.fit(df)
pipelineModel.transform(df).show()
Output:
+-------------+-----+-----------------+
| vector|float| assembled|
+-------------+-----+-----------------+
|[1.0,2.0,3.0]| 4.0|[1.0,2.0,3.0,4.0]|
+-------------+-----+-----------------+
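To see what the hint's skip behavior actually does, here is a minimal sketch; the second, length-2 row is a hypothetical addition, not part of the original data:
bad_df = spark.createDataFrame(
    [(Vectors.dense([1., 2., 3.]), 4.),
     (Vectors.dense([1., 2.]), 5.)],  # wrong length for size=3
    ["vector", "float"])
# With size=3 and handleInvalid="skip", the sizeHint stage filters out the
# length-2 row before VectorAssembler runs, so only one assembled row remains.
pipelineModel.transform(bad_df).show()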
Note: the effect of VectorSizeHint is not visible here; on a batch DataFrame, VectorAssembler can infer the vector size from the data itself, so the assembled column looks the same with or without the hint.
03. Look at the effect of VectorSizeHint on its own:
sizeHint.transform(df).show()
Output:
+-------------+-----+
| vector|float|
+-------------+-----+
|[1.0,2.0,3.0]| 4.0|
+-------------+-----+
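The effect is not in the rows but in the column metadata. A minimal sketch to confirm this (the exact keys in the printed dict are Spark-internal ML attribute metadata):
hinted = sizeHint.transform(df)
# Same rows as before, but the column metadata now records the vector size (3).
print(hinted.schema["vector"].metadata)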
The displayed rows look unchanged because the hint only writes column metadata (see the check above); now try changing the size parameter:
sizeHint.setParams(size=4)
sizeHint.transform(df).show()
Output:
+------+-----+
|vector|float|
+------+-----+
+------+-----+
The output is empty because the actual vector length (3) does not match size=4, and handleInvalid="skip" filters out the mismatched row. Change the parameter again:
sizeHint.setParams(size=2)
sizeHint.transform(df).show()
Output:
+------+-----+
|vector|float|
+------+-----+
+------+-----+
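The result is empty again for the same reason: the only vector has length 3, so with size=2 and handleInvalid="skip" every row is dropped. For contrast, a minimal sketch (hypothetical, not in the original walkthrough) of handleInvalid="optimistic", which performs no length check at all:
optimistic = VectorSizeHint(inputCol="vector", size=2, handleInvalid="optimistic")
# Unlike "skip", no validation happens: the length-3 row is kept even though it
# does not match size=2, and the (incorrect) hint is still written to the
# metadata, which can cause problems further down a pipeline.
optimistic.transform(df).show()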