HashingTF
HashingTF encodes a document as a sparse vector of length numFeatures; the sum of all entries in that vector equals the length of the document (its number of terms).
Because the mapping is a one-way hash, HashingTF does not retain the original terms of the corpus.
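To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick. It uses zlib.crc32 as a stand-in hash function (Spark actually uses MurmurHash3 with a fixed seed), so the indices differ from Spark's, but the shape of the result is the same:
import zlib

def hashing_tf(terms, num_features):
    # index = hash(term) mod num_features; value = how many terms fell into that bucket
    vec = {}
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] = vec.get(index, 0.0) + 1.0
    return vec

print(hashing_tf(["I", "am", "zhangsan"], 9))
# The values always sum to 3 (the document length), and the original
# terms cannot be recovered from the bucket indices.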
01. Import the modules and create the SparkSession
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
.config("spark.ui.showConsoleProgress","false")\
.appName("HashingTF").master("local[*]").getOrCreate()
02. Create the data
data = spark.createDataFrame([
(["I","am","zhangsan"],),
(["spark","is","perfect"],),
(["I","want","to","lraen","spark"],)
],["text"])
data.show()
data.printSchema()
Output:
+--------------------+
| text|
+--------------------+
| [I, am, zhangsan]|
|[spark, is, perfect]|
|[I, want, to, lra...|
+--------------------+
root
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
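Note that HashingTF requires the input column to be array<string>, which is why text is created as an array of tokens above. If you start from raw sentences instead, Tokenizer produces the required column type (a hypothetical variant, not part of this example's data):
from pyspark.ml.feature import Tokenizer
raw = spark.createDataFrame([("I want to learn spark",)], ["sentence"])
tokens = Tokenizer(inputCol="sentence", outputCol="text").transform(raw)
tokens.printSchema()  # text: array<string>, ready for HashingTF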
03. Apply HashingTF and inspect the result
hashingTF = HashingTF(inputCol="text",outputCol="hashingTF_Res",numFeatures=9)
resHashTF = hashingTF.transform(data)
resHashTF.show()
Output:
+--------------------+--------------------+
| text| hashingTF_Res|
+--------------------+--------------------+
| [I, am, zhangsan]| (9,[1,6],[2.0,1.0])|
|[spark, is, perfect]|(9,[3,4,7],[1.0,1...|
|[I, want, to, lra...|(9,[0,4,6,7],[1.0...|
+--------------------+--------------------+
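Notice the first row: three terms produce only two indices, and index 1 holds 2.0, which means two of the three terms hash to the same bucket (with only 9 buckets, such collisions are likely). On Spark 3.0 or later you can check which bucket each term maps to with HashingTF.indexOf:
for term in ["I", "am", "zhangsan"]:
    print(term, hashingTF.indexOf(term))
# Two of the terms print the same index (1 here),
# which is where the 2.0 in the first row comes from.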
04. For comparison, run CountVectorizer on the same data and inspect its result
from pyspark.ml.feature import CountVectorizer
countVectorizer = CountVectorizer(inputCol="text",outputCol="countVectorizer_RES")
model = countVectorizer.fit(data)
resCountVectorizer = model.transform(data)
resCountVectorizer.show()
Output:
+--------------------+--------------------+
| text| countVectorizer_RES|
+--------------------+--------------------+
| [I, am, zhangsan]|(9,[1,6,7],[1.0,1...|
|[spark, is, perfect]|(9,[0,3,4],[1.0,1...|
|[I, want, to, lra...|(9,[0,1,2,5,8],[1...|
+--------------------+--------------------+
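Both vectors happen to have length 9 here, but for different reasons: HashingTF's length is the numFeatures we set, while CountVectorizer's length equals the vocabulary size, and this corpus contains exactly 9 distinct terms. Unlike HashingTF, CountVectorizer keeps the learned vocabulary on the fitted model, so every index can be mapped back to a term:
print(model.vocabulary)
# A list of 9 terms; index i in countVectorizer_RES counts occurrences of
# model.vocabulary[i]. The ordering depends on corpus term frequencies.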
05. Inspect the HashingTF result in detail:
resHashTF.head(3)
Output:
[Row(text=['I', 'am', 'zhangsan'], hashingTF_Res=SparseVector(9, {1: 2.0, 6: 1.0})),
Row(text=['spark', 'is', 'perfect'], hashingTF_Res=SparseVector(9, {3: 1.0, 4: 1.0, 7: 1.0})),
Row(text=['I', 'want', 'to', 'lraen', 'spark'], hashingTF_Res=SparseVector(9, {0: 1.0, 4: 2.0, 6: 1.0, 7: 1.0}))]
Explanation of the HashingTF result:
1. We set numFeatures to 9 by hand (note that 9 is not a power of two, which the docstring below advises using).
2. The docstring from the source code explains the algorithm:
Maps a sequence of terms to their term frequencies using the hashing trick.
Currently we use Austin Appleby’s MurmurHash 3 algorithm (MurmurHash3_x86_32)
to calculate the hash code value for the term object.
Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the numFeatures parameter;
otherwise the features will not be mapped evenly to the columns.
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
>>> hashingTF.transform(df).head().features
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTFPath = temp_path + "/hashing-tf"
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
.. versionadded:: 1.3.0
3. In short, each term is hashed into one of numFeatures buckets; the sparse vector stores each non-empty bucket's index together with the count of terms that landed in it, exactly as in the sketch at the top.