Feature Engineering APIs in Spark ML
Feature Engineering
Spark's feature processing functionality lives mainly in the pyspark.ml.feature module and covers the following areas.
- Feature extraction: TF-IDF, Word2Vec, CountVectorizer, FeatureHasher
- Feature transformation: OneHotEncoder (named OneHotEncoderEstimator in Spark 2.x), Normalizer, Imputer (missing-value imputation), StandardScaler, MinMaxScaler, Tokenizer (tokenization), StopWordsRemover, SQLTransformer, Bucketizer, Interaction (interaction terms), Binarizer (binarization), NGram, ...
- Feature selection: VectorSlicer (vector slicing), RFormula, ChiSqSelector (chi-squared feature selection)
- LSH transformation: locality-sensitive hashing, widely used for approximate nearest-neighbor search and clustering over massive datasets; a minimal sketch follows this list.
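As a quick illustration of the LSH transformers, here is a minimal sketch using BucketedRandomProjectionLSH (the Euclidean-distance LSH implementation) to find approximate nearest neighbors; the data and column names are made up for illustration:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
df_lsh = spark.createDataFrame([
    (0, Vectors.dense([1.0, 1.0])),
    (1, Vectors.dense([1.0, -1.0])),
    (2, Vectors.dense([-1.0, -1.0])),
    (3, Vectors.dense([-1.0, 1.0]))
], ["id", "features"])
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df_lsh)
# return the 2 rows closest to the query point [1.0, 0.0]
model.approxNearestNeighbors(df_lsh, Vectors.dense([1.0, 0.0]), 2).show()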
1. CountVectorizer
CountVectorizer extracts term-frequency features from text.
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer

# the examples in this article assume an active SparkSession named `spark`
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b", "b", "c", "a"])], ["id", "words"])

# keep at most 3 vocabulary terms; a term must appear in at least 2 documents (minDF)
cvModel = CountVectorizer() \
    .setInputCol("words") \
    .setOutputCol("features") \
    .setVocabSize(3) \
    .setMinDF(2) \
    .fit(df)
cvModel.transform(df).show()
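One useful follow-up: the fitted model's vocabulary attribute lists the retained terms, ordered by descending corpus frequency, which is how indices in the sparse output vectors map back to words:
print(cvModel.vocabulary)  # e.g. ['a', 'b', 'c']; terms tied in frequency may order differently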
2. Word2Vec
Word2Vec uses a shallow neural network to learn word vectors that capture semantic similarity between words in a corpus.
from pyspark.ml.feature import Word2Vec
df_document = spark.createDataFrame([
("Hi I heard about Spark".split(" "), ),
("I wish Java could use case classes".split(" "), ),
("Logistic regression models are neat".split(" "), )
], ["text"])
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(df_document)
df_vector = model.transform(df_document)
for row in df_vector.collect():
    text, vector = row
    print("text: [%s] => \nvector: %s\n" % (", ".join(text), str(vector)))
text: [Hi, I, heard, about, Spark] =>
vector: [-0.03952452838420868,-0.019742850959300996,-0.04259629175066948]
text: [I, wish, Java, could, use, case, classes] =>
vector: [-0.017589610069990158,0.03303118874984128,-0.03793099456067596]
text: [Logistic, regression, models, are, neat] =>
vector: [-0.03930013366043568,0.08479443639516832,-0.025407366454601288]
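The fitted model can also be queried directly for similar words. A sketch; the actual results depend on this tiny training corpus and on random initialization:
# find the 2 words whose vectors are closest to that of "Spark"
model.findSynonyms("Spark", 2).show()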
3. OneHotEncoder
OneHotEncoder converts categorical index features into one-hot encoded vectors.
from pyspark.ml.feature import OneHotEncoder
df = spark.createDataFrame([
(0.0, 1.0),
(1.0, 0.0),
(2.0, 1.0),
(0.0, 2.0),
(0.0, 1.0),
(2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])
encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()
+--------------+--------------+-------------+-------------+
|categoryIndex1|categoryIndex2| categoryVec1| categoryVec2|
+--------------+--------------+-------------+-------------+
| 0.0| 1.0|(2,[0],[1.0])|(2,[1],[1.0])|
| 1.0| 0.0|(2,[1],[1.0])|(2,[0],[1.0])|
| 2.0| 1.0| (2,[],[])|(2,[1],[1.0])|
| 0.0| 2.0|(2,[0],[1.0])| (2,[],[])|
| 0.0| 1.0|(2,[0],[1.0])|(2,[1],[1.0])|
| 2.0| 0.0| (2,[],[])|(2,[0],[1.0])|
+--------------+--------------+-------------+-------------+
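Note that category 2.0 encodes to the empty vector (2,[],[]): by default the encoder drops the last category to avoid collinearity. If a full one-hot vector per category is needed, dropLast can be disabled; a sketch reusing the df above:
encoder_full = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
                             outputCols=["categoryVec1", "categoryVec2"],
                             dropLast=False)  # keep all categories; vectors have length 3 here
encoder_full.fit(df).transform(df).show()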
4. MinMaxScaler (min-max normalization)
MinMaxScaler rescales each feature to a common range, [0, 1] by default, using the per-feature minimum and maximum observed in the training data.
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.1, -1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(df)
df_scaled = scalerModel.transform(df)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
df_scaled.select("features", "scaledFeatures").show()
Features scaled to range: [0.000000, 1.000000]
+--------------+--------------+
| features|scaledFeatures|
+--------------+--------------+
|[1.0,0.1,-1.0]| (3,[],[])|
| [2.0,1.1,1.0]| [0.5,0.1,0.5]|
|[3.0,10.1,3.0]| [1.0,1.0,1.0]|
+--------------+--------------+
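One display quirk worth noting: the first row scales to all zeros (each of its values is the column minimum), and Spark prints the all-zero vector in sparse form as (3,[],[]). A different target range can also be requested; a sketch on the same df:
# rescale features to [-1, 1] instead of the default [0, 1]
scaler_sym = MinMaxScaler(min=-1.0, max=1.0, inputCol="features", outputCol="scaledFeatures")
scaler_sym.fit(df).transform(df).show()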
5. MaxAbsScaler
MaxAbsScaler divides each feature by its maximum absolute value, rescaling it into [-1, 1] without shifting or centering the data, so sparsity is preserved.
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
(0, Vectors.dense([1.0, 0.1, -8.0]),),
(1, Vectors.dense([2.0, 1.0, -4.0]),),
(2, Vectors.dense([4.0, 10.0, 8.0]),)
], ["id", "features"])
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(df)
df_rescaled = scalerModel.transform(df)
df_rescaled.select("features", "scaledFeatures").show()
+--------------+--------------------+
| features| scaledFeatures|
+--------------+--------------------+
|[1.0,0.1,-8.0]|[0.25,0.010000000...|
|[2.0,1.0,-4.0]| [0.5,0.1,-0.5]|
|[4.0,10.0,8.0]| [1.0,1.0,1.0]|
+--------------+--------------------+
6. SQLTransformer
SQLTransformer transforms a DataFrame with a SQL statement, equivalent to registering it as a temporary table and querying it. Its advantage is that it can be used inside a Pipeline as a Transformer, as sketched after this example.
from pyspark.ml.feature import SQLTransformer
df = spark.createDataFrame([
(0, 1.0, 3.0),
(2, 2.0, 5.0)
], ["id", "v1", "v2"])
sqlTrans = SQLTransformer(
statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
+---+---+---+---+----+
| id| v1| v2| v3| v4|
+---+---+---+---+----+
| 0|1.0|3.0|4.0| 3.0|
| 2|2.0|5.0|7.0|10.0|
+---+---+---+---+----+
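Because SQLTransformer is an ordinary Transformer, it can be chained with other stages in a Pipeline. A minimal sketch; the VectorAssembler stage is added here purely for illustration:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
pipeline = Pipeline(stages=[
    sqlTrans,  # derive v3 and v4 via SQL
    VectorAssembler(inputCols=["v3", "v4"], outputCol="features")  # pack them into a vector
])
pipeline.fit(df).transform(df).show()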
7. Imputer
The Imputer transformer fills in missing values, which are represented as float("nan"). By default it replaces each missing value with the mean of its column.
from pyspark.ml.feature import Imputer
df = spark.createDataFrame([
(1.0, float("nan")),
(2.0, float("nan")),
(float("nan"), 3.0),
(4.0, 4.0),
(5.0, 5.0)
], ["a", "b"])
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)
model.transform(df).show()
+---+---+-----+-----+
| a| b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN| 1.0| 4.0|
|2.0|NaN| 2.0| 4.0|
|NaN|3.0| 3.0| 3.0|
|4.0|4.0| 4.0| 4.0|
|5.0|5.0| 5.0| 5.0|
+---+---+-----+-----+
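The imputation statistic is configurable; for example, the median can be used instead of the mean. A sketch on the same df:
# replace missing values with the per-column median rather than the mean
imputer_median = Imputer(strategy="median", inputCols=["a", "b"], outputCols=["out_a", "out_b"])
imputer_median.fit(df).transform(df).show()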