PySpark_ML_相关API特征工程处理

特征工程

spark的特征处理功能主要在 pyspark.ml.feature 模块中,包括以下一些功能。

  • 特征提取:Tf-idf, Word2Vec, CountVectorizer, FeatureHasher

  • 特征转换:OneHotEncoderEstimator, Normalizer, Imputer(缺失值填充), StandardScaler, MinMaxScaler, Tokenizer(构建词典),
    StopWordsRemover, SQLTransformer, Bucketizer, Interaction(交叉项), Binarizer(二值化), n-gram,……

  • 特征选择:VectorSlicer(向量切片), RFormula, ChiSqSelector(卡方检验)

  • LSH转换:局部敏感哈希广泛用于海量数据中求最邻近,聚类等算法。

1、CountVectorizer

CountVectorizer可以提取文本中的词频特征

from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

df = spark.createDataFrame([
  (0, ["a", "b", "c"]),
  (1, ["a", "b", "b", "c", "a"])],["id","words"])

cvModel = CountVectorizer() \
  .setInputCol("words") \
  .setOutputCol("features") \
  .setVocabSize(3) \
  .setMinDF(2) \
  .fit(df)

cvModel.transform(df).show()

2、Word2Vec

Word2Vec可以使用浅层神经网络提取文本中词的相似语义信息。

from pyspark.ml.feature import Word2Vec

df_document = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(df_document)

df_vector = model.transform(df_document)
for row in df_vector.collect():
    text, vector = row
    print("text: [%s] => \nvector: %s\n" % (", ".join(text), str(vector)))

text: [Hi, I, heard, about, Spark] => 
vector: [-0.03952452838420868,-0.019742850959300996,-0.04259629175066948]

text: [I, wish, Java, could, use, case, classes] => 
vector: [-0.017589610069990158,0.03303118874984128,-0.03793099456067596]

text: [Logistic, regression, models, are, neat] => 
vector: [-0.03930013366043568,0.08479443639516832,-0.025407366454601288]

3、OnHotEncoder

OneHotEncoder可以将类别特征转换成OneHot编码。

from pyspark.ml.feature import OneHotEncoder

df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
                                 outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()

+--------------+--------------+-------------+-------------+
|categoryIndex1|categoryIndex2| categoryVec1| categoryVec2|
+--------------+--------------+-------------+-------------+
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           1.0|           0.0|(2,[1],[1.0])|(2,[0],[1.0])|
|           2.0|           1.0|    (2,[],[])|(2,[1],[1.0])|
|           0.0|           2.0|(2,[0],[1.0])|    (2,[],[])|
|           0.0|           1.0|(2,[0],[1.0])|(2,[1],[1.0])|
|           2.0|           0.0|    (2,[],[])|(2,[0],[1.0])|
+--------------+--------------+-------------+-------------+
4、 MinMax标准化
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

scalerModel = scaler.fit(df)

df_scaled = scalerModel.transform(df)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
df_scaled.select("features", "scaledFeatures").show()

Features scaled to range: [0.000000, 1.000000]
+--------------+--------------+
|      features|scaledFeatures|
+--------------+--------------+
|[1.0,0.1,-1.0]|     (3,[],[])|
| [2.0,1.1,1.0]| [0.5,0.1,0.5]|
|[3.0,10.1,3.0]| [1.0,1.0,1.0]|
+--------------+--------------+

5、MaxAbsScaler标准化
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -8.0]),),
    (1, Vectors.dense([2.0, 1.0, -4.0]),),
    (2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])

scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")

scalerModel = scaler.fit(df)

df_rescaled = scalerModel.transform(df)

df_rescaled.select("features", "scaledFeatures").show()

+--------------+--------------------+
|      features|      scaledFeatures|
+--------------+--------------------+
|[1.0,0.1,-8.0]|[0.25,0.010000000...|
|[2.0,1.0,-4.0]|      [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]|       [1.0,1.0,1.0]|
+--------------+--------------------+

6、SQLTransformer

可以使用SQL语法将DataFrame进行转换,等效于注册表的作用。

但它可以用于Pipeline中作为Transformer.

from pyspark.ml.feature import SQLTransformer

df = spark.createDataFrame([
    (0, 1.0, 3.0),
    (2, 2.0, 5.0)
], ["id", "v1", "v2"])
sqlTrans = SQLTransformer(
    statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()
+---+---+---+---+----+
| id| v1| v2| v3|  v4|
+---+---+---+---+----+
|  0|1.0|3.0|4.0| 3.0|
|  2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+

7、 Imputer

Imputer转换器可以填充缺失值,缺失值可以用 float(“nan”)来表示。

from pyspark.ml.feature import Imputer

df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, float("nan")),
    (float("nan"), 3.0),
    (4.0, 4.0),
    (5.0, 5.0)
], ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)

model.transform(df).show()

+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Elvis_hui

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值