PySpark Feature Engineering: IDF

IDF

Computes the inverse document frequency (IDF) of a given collection of documents.

class pyspark.ml.feature.IDF(minDocFreq=0, inputCol=None, outputCol=None)

minDocFreq: the minimum number of documents a term must appear in to be kept; terms below this threshold are filtered out (their IDF weight is set to 0)

IDF is an estimator that is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created by HashingTF or CountVectorizer) and scales each column. Intuitively, it down-weights columns that appear frequently in the corpus.

​ The input column to IDF holds sparse vectors.
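Spark computes IDF with the smoothed formula idf(t) = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents containing term t; any term with d(t) below minDocFreq gets a weight of 0. A minimal pure-Python sketch of this formula (the helper name spark_idf is ours, not part of the PySpark API):

```python
import math

def spark_idf(num_docs, doc_freq, min_doc_freq=0):
    """Smoothed IDF as used by Spark: log((m + 1) / (d(t) + 1)).

    Terms appearing in fewer than min_doc_freq documents are
    filtered out by giving them a weight of 0.
    """
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term found in 1 of 3 documents gets a high weight:
print(round(spark_idf(3, 1), 4))        # 0.6931
# A term found in every document is down-weighted to 0:
print(spark_idf(3, 3))                  # 0.0
# With min_doc_freq=4 and only 3 documents, every term is filtered:
print(spark_idf(3, 2, min_doc_freq=4))  # 0.0
```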

01. Import modules and create a SparkSession

from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, HashingTF, Tokenizer
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("IDF").master("local[*]").getOrCreate()

02. Create the data

data = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])
data.show()

​ Output:

+-----+--------------------+
|label|            sentence|
+-----+--------------------+
|    0|Hi I heard about ...|
|    0|I wish Java could...|
|    1|Logistic regressi...|
+-----+--------------------+

03. Use a Tokenizer to split each sentence string into individual words

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
data = tokenizer.transform(data)
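Tokenizer simply lowercases the input string and splits it on whitespace; the equivalent in plain Python:

```python
sentence = "Hi I heard about Spark"
# Tokenizer lowercases the text, then splits on whitespace
words = sentence.lower().split()
print(words)  # ['hi', 'i', 'heard', 'about', 'spark']
```

(For anything fancier, such as splitting on punctuation, Spark provides RegexTokenizer.)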

04. Use HashingTF to convert the words column into sparse term-frequency vectors

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(data)
featurizedData.show()

​ Output:

+-----+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|    0|Hi I heard about ...|[hi, i, heard, ab...|(20,[0,5,9,17],[1...|
|    0|I wish Java could...|[i, wish, java, c...|(20,[2,7,9,13,15]...|
|    1|Logistic regressi...|[logistic, regres...|(20,[4,6,13,15,18...|
+-----+--------------------+--------------------+--------------------+

​ Inspect the schema:

featurizedData.printSchema()

​ Output:

root
 |-- label: long (nullable = true)
 |-- sentence: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rawFeatures: vector (nullable = true)

​ Inspect a single row in detail:

featurizedData.head(1)

​ Output:

[Row(label=0, sentence='Hi I heard about Spark', words=['hi', 'i', 'heard', 'about', 'spark'], rawFeatures=SparseVector(20, {0: 1.0, 5: 1.0, 9: 1.0, 17: 2.0}))]
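Note that bucket 17 has a count of 2.0 even though no word in the sentence repeats: with only 20 buckets, two of the five words hashed to the same index (a collision), which is the trade-off of the hashing trick. The mechanics can be sketched in plain Python (using the built-in hash in place of Spark's MurmurHash3, so the resulting indices will not match Spark's):

```python
from collections import Counter

def hashing_tf(words, num_features=20):
    """Toy HashingTF: map each term to a bucket via hash modulo
    num_features, then count occurrences per bucket.

    Python's hash() stands in for Spark's MurmurHash3, so the
    bucket indices differ from Spark's output, but the idea is
    identical.
    """
    return Counter(hash(w) % num_features for w in words)

vec = hashing_tf(["hi", "i", "heard", "about", "spark"])
# Counts always sum to the number of tokens; fewer distinct
# buckets than tokens means a collision occurred.
print(sum(vec.values()))  # 5
```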

05. Apply IDF and view the result:

idf = IDF(inputCol="rawFeatures", outputCol="IDF", minDocFreq=4)
model = idf.fit(featurizedData)
IDFData = model.transform(featurizedData)
IDFData.show()

​ Output:

[Image missing: output of IDFData.show(), a table with columns label, sentence, words, rawFeatures, and IDF]

06. Compare the IDF input column with the transformed output column

IDFData.select("rawFeatures","IDF").head(3)

​ Output:

[Row(rawFeatures=SparseVector(20, {0: 1.0, 5: 1.0, 9: 1.0, 17: 2.0}), IDF=SparseVector(20, {0: 0.0, 5: 0.0, 9: 0.0, 17: 0.0})),
 Row(rawFeatures=SparseVector(20, {2: 1.0, 7: 1.0, 9: 3.0, 13: 1.0, 15: 1.0}), IDF=SparseVector(20, {2: 0.0, 7: 0.0, 9: 0.0, 13: 0.0, 15: 0.0})),
 Row(rawFeatures=SparseVector(20, {4: 1.0, 6: 1.0, 13: 1.0, 15: 1.0, 18: 1.0}), IDF=SparseVector(20, {4: 0.0, 6: 0.0, 13: 0.0, 15: 0.0, 18: 0.0}))]
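Every IDF value above is 0.0 because minDocFreq=4 requires a term to appear in at least 4 documents, while the corpus only has 3, so every bucket is filtered out. Reading the document frequencies off the three rawFeatures vectors, the first row's weights can be reproduced by hand, assuming Spark's documented smoothed formula tf * log((m + 1) / (df + 1)):

```python
import math

m = 3  # number of documents in the corpus
# Document frequency per hash bucket for the first row's buckets,
# read off the three rawFeatures vectors shown above (bucket 9
# occurs in two documents, the others in one).
doc_freq = {0: 1, 5: 1, 9: 2, 17: 1}

def idf(df, min_doc_freq):
    # Filtered buckets get weight 0, otherwise the smoothed log
    return 0.0 if df < min_doc_freq else math.log((m + 1) / (df + 1))

# Term frequencies of the first document: (20, {0: 1, 5: 1, 9: 1, 17: 2})
tf = {0: 1.0, 5: 1.0, 9: 1.0, 17: 2.0}

# minDocFreq=4: every df < 4, so all weights are 0 (matches the output)
print({i: tf[i] * idf(doc_freq[i], 4) for i in tf})
# minDocFreq=0 (the default): nonzero tf-idf weights
print({i: round(tf[i] * idf(doc_freq[i], 0), 4) for i in tf})
```

Under that assumption, rerunning the pipeline with minDocFreq=0 should yield these nonzero values in the IDF column for the first row.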
