HashingTF
HashingTF encodes a document as a sparse vector of length numFeatures; the sum of all entries in that vector equals the length of the document (its number of terms).
Because the mapping is a one-way hash, HashingTF does not retain the original terms of the corpus.
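To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick. It uses zlib.crc32 as a stand-in hash function (Spark actually uses MurmurHash3 with a fixed seed), so the indices differ from Spark's, but the shape of the result is the same:
import zlib

def hashing_tf(terms, num_features):
    # index = hash(term) mod num_features; value = how many terms fell into that bucket
    vec = {}
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] = vec.get(index, 0.0) + 1.0
    return vec

print(hashing_tf(["I", "am", "zhangsan"], 9))
# The values always sum to 3 (the document length), and the original
# terms cannot be recovered from the bucket indices.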
01. Import the modules and create the SparkSession
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
.config("spark.ui.showConsoleProgress","false")\
.appName("HashingTF").master("local[*]").getOrCreate()
02. Create the data
data = spark.createDataFrame([
(["I","am","zhangsan"],),
(["spark","is","perfect"],),
(["I","want","to","lraen","spark"],)
],["text"])
data.show()
data.printSchema()
Output:
+--------------------+
| text|
+--------------------+
| [I, am, zhangsan]|
|[spark, is, perfect]|
|[I, want, to, lra...|
+--------------------+
root
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
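Note that HashingTF requires the input column to be array<string>, which is why text is created as an array of tokens above. If you start from raw sentences instead, Tokenizer produces the required column type (a hypothetical variant, not part of this example's data):
from pyspark.ml.feature import Tokenizer
raw = spark.createDataFrame([("I want to learn spark",)], ["sentence"])
tokens = Tokenizer(inputCol="sentence", outputCol="text").transform(raw)
tokens.printSchema()  # text: array<string>, ready for HashingTF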
03. Apply HashingTF and inspect the result
hashingTF = HashingTF(inputCol="text",outputCol="hashingTF_Res",numFeatures=9)
resHashTF = hashingTF.transform(data)
resHashTF.show()
Output:
+--------------------+--------------------+
| text| hashingTF_Res|
+--------------------+--------------------+
| [I, am, zhangsan]| (9,[1,6],[2.0,1.0])|
|[spark, is, perfect]|(9,[3,4,7],[1.0,1...|
|[I, want, to, lra...|(9,[0,4,6,7],[1.0...|
+--------------------+--------------------+
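Notice the first row: three terms produce only two indices, and index 1 holds 2.0, which means two of the three terms hash to the same bucket (with only 9 buckets, such collisions are likely). On Spark 3.0 or later you can check which bucket each term maps to with HashingTF.indexOf:
for term in ["I", "am", "zhangsan"]:
    print(term, hashingTF.indexOf(term))
# Two of the terms print the same index (1 here),
# which is where the 2.0 in the first row comes from.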
04. For comparison, run CountVectorizer on the same data and inspect its result
from pyspark.ml.feature import CountVectorizer
countVectorizer = CountVectorizer(inputCol="text",outputCol="countVectorizer_RES")
model = countVectorizer.fit(data)
resCountVectorizer = model.transform(data)
resCountVectorizer.show()
Output:
+--------------------+--------------------+
| text| countVectorizer_RES|
+--------------------+--------------------+
| [I, am, zhangsan]|(9,[1,6,7],[1.0,1...|
|[spark, is, perfect]|(9,[0,3,4],[1.0,1...|
|[I, want, to, lra...|(9,[0,1,2,5,8],[1...|
+--------------------+--------------------+
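Both vectors happen to have length 9 here, but for different reasons: HashingTF's length is the numFeatures we set, while CountVectorizer's length equals the vocabulary size, and this corpus contains exactly 9 distinct terms. Unlike HashingTF, CountVectorizer keeps the learned vocabulary on the fitted model, so every index can be mapped back to a term:
print(model.vocabulary)
# A list of 9 terms; index i in countVectorizer_RES counts occurrences of
# model.vocabulary[i]. The ordering depends on corpus term frequencies.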
05. Inspect the HashingTF result in detail:
resHashTF.head(3)
Output:
[Row(text=['I', 'am', 'zhangsan'], hashingTF_Res=SparseVector(9, {1: 2.0, 6: 1.0})),
Row(text=['spark', 'is', 'perfect'], hashingTF_Res=SparseVector(9, {3: 1.0, 4: 1.0, 7: 1.0})),
Row(text=['I', 'want', 'to', 'lraen', 'spark'], hashingTF_Res=SparseVector(9, {0: 1.0, 4: 2.0, 6: 1.0, 7: 1.0}))]
Explanation of the HashingTF result:
1. We set numFeatures to 9 by hand (note that 9 is not a power of two, which the docstring below advises using).
2. The docstring from the source code explains the algorithm:
Maps a sequence of terms to their term frequencies using the hashing trick.
Currently we use Austin Appleby’s MurmurHash 3 algorithm (MurmurHash3_x86_32)
to calculate the hash code value for the term object.
Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the numFeatures parameter;
otherwise the features will not be mapped evenly to the columns.
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
>>> hashingTF.transform(df).head().features
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTFPath = temp_path + "/hashing-tf"
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
.. versionadded:: 1.3.0
3. In short, each term is hashed into one of numFeatures buckets; the sparse vector stores each non-empty bucket's index together with the count of terms that landed in it, exactly as in the sketch at the top.