PySpark Feature Engineering -- HashingTF

HashingTF

HashingTF encodes a document as a sparse vector of length numFeatures, and the sum of all element values in that vector equals the length of the document (its total number of terms).

HashingTF does not keep the original terms of the corpus: the hashing is one-way, so a feature index cannot be mapped back to the term that produced it.
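To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick (illustrative only: Spark uses MurmurHash3, while Python's built-in hash is randomized per process, so the indices will differ from Spark's):

def hashing_tf(tokens, num_features):
    # hash each token into one of num_features buckets and count occurrences
    vec = [0.0] * num_features
    for t in tokens:
        vec[hash(t) % num_features] += 1.0
    return vec

print(hashing_tf(["I", "am", "zhangsan"], 9))  # element sum == 3, the document length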

01. Import modules and create a SparkSession

from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF

# spark.driver.host below is specific to the author's machine; adjust or drop it
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.4")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("HashingTF").master("local[*]").getOrCreate()

02. Create the data

# each row holds one pre-tokenized document
data = spark.createDataFrame([
    (["I","am","zhangsan"],),
    (["spark","is","perfect"],),
    (["I","want","to","lraen","spark"],)
],["text"])
data.show()
data.printSchema()

Output:

+--------------------+
|                text|
+--------------------+
|   [I, am, zhangsan]|
|[spark, is, perfect]|
|[I, want, to, lra...|
+--------------------+

root
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)

03. Apply HashingTF and view the result

# hash each document's terms into 9 buckets and count the term frequencies
hashingTF = HashingTF(inputCol="text", outputCol="hashingTF_Res", numFeatures=9)
resHashTF = hashingTF.transform(data)
resHashTF.show()

Output:

+--------------------+--------------------+
|                text|       hashingTF_Res|
+--------------------+--------------------+
|   [I, am, zhangsan]| (9,[1,6],[2.0,1.0])|
|[spark, is, perfect]|(9,[3,4,7],[1.0,1...|
|[I, want, to, lra...|(9,[0,4,6,7],[1.0...|
+--------------------+--------------------+
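As a quick check of the claim in the introduction, the values of each sparse vector sum to that document's token count. A minimal sketch using the resHashTF DataFrame from above:

for row in resHashTF.collect():
    # sum of term frequencies == number of tokens in the document
    assert row.hashingTF_Res.values.sum() == len(row.text)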

04. Compare with CountVectorizer and view its result

from pyspark.ml.feature import CountVectorizer

# unlike HashingTF, CountVectorizer builds an explicit vocabulary, so it needs fit()
countVectorizer = CountVectorizer(inputCol="text", outputCol="countVectorizer_RES")
model = countVectorizer.fit(data)
resCountVectorizer = model.transform(data)
resCountVectorizer.show()

Output:

+--------------------+--------------------+
|                text| countVectorizer_RES|
+--------------------+--------------------+
|   [I, am, zhangsan]|(9,[1,6,7],[1.0,1...|
|[spark, is, perfect]|(9,[0,3,4],[1.0,1...|
|[I, want, to, lra...|(9,[0,1,2,5,8],[1...|
+--------------------+--------------------+
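Unlike HashingTF, the fitted CountVectorizerModel retains its vocabulary, so each vector index maps back to an actual term (the list is ordered by descending term frequency across the corpus):

print(model.vocabulary)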

05. Examine the HashingTF result in detail:

resHashTF.head(3)

Output:

[Row(text=['I', 'am', 'zhangsan'], hashingTF_Res=SparseVector(9, {1: 2.0, 6: 1.0})),
 Row(text=['spark', 'is', 'perfect'], hashingTF_Res=SparseVector(9, {3: 1.0, 4: 1.0, 7: 1.0})),
 Row(text=['I', 'want', 'to', 'lraen', 'spark'], hashingTF_Res=SparseVector(9, {0: 1.0, 4: 2.0, 6: 1.0, 7: 1.0}))]

Explanation of the HashingTF result:

1. We set numFeatures to 9 by hand (the default is 2^18 = 262144).

2. The explanation from the docstring in the source code (note its advice to use a power of two for numFeatures; a short illustration follows the docstring):

Maps a sequence of terms to their term frequencies using the hashing trick.
Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32)
to calculate the hash code value for the term object.
Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the numFeatures parameter;
otherwise the features will not be mapped evenly to the columns.

>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
>>> hashingTF.transform(df).head().features
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 1: 1.0, 2: 1.0})
>>> hashingTFPath = temp_path + "/hashing-tf"
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True

 .. versionadded:: 1.3.0
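Following that advice, here is a hypothetical variant of the step-03 transformer with a power-of-two numFeatures (the names hashingTF16 and hashingTF_Res16 are illustrative, not from the original):

hashingTF16 = HashingTF(inputCol="text", outputCol="hashingTF_Res16", numFeatures=16)
hashingTF16.transform(data).show(truncate=False)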

3. Each term is hashed into a bucket; the vector records the bucket indices and, for each index, how many terms landed there. Note that ['I', 'am', 'zhangsan'] produces only two indices for three terms: two of the terms collided into bucket 1, which is why that entry is 2.0.
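To see the collision directly, HashingTF.indexOf returns the bucket a single term hashes to (available since PySpark 3.0; on older versions this method does not exist):

for term in ["I", "am", "zhangsan"]:
    # indexOf hashes the term and applies the modulo, just like transform()
    print(term, "->", hashingTF.indexOf(term))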
