IDF
Computes the inverse document frequency (IDF) over a collection of documents.
class pyspark.ml.feature.IDF(minDocFreq=0, inputCol=None, outputCol=None)
minDocFreq: the minimum number of documents a term must appear in to be kept; terms below this threshold are filtered out (their IDF weight is set to 0).
IDF is an estimator: it is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (typically created by HashingTF or CountVectorizer) and scales each column. Intuitively, it down-weights columns that appear frequently across the corpus.
The input column of IDF holds sparse vectors.
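Spark's IDFModel uses the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the term's document frequency. A minimal pure-Python sketch of that weighting:

```python
import math

def idf(num_docs, doc_freq):
    # Spark's smoothed IDF: log((m + 1) / (df + 1)),
    # where m is the corpus size and df the document frequency.
    return math.log((num_docs + 1) / (doc_freq + 1))

# A rare term (1 of 3 documents) gets a higher weight than a
# term that appears in every document, whose weight drops to 0.
print(idf(3, 1))  # log(4/2) ≈ 0.6931
print(idf(3, 3))  # log(4/4) = 0.0
```

The `+ 1` smoothing keeps the weight finite even for terms that appear in every document (or in none).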
01. Import modules and create a SparkSession
from pyspark.sql import SparkSession
from pyspark.ml.feature import IDF, HashingTF, Tokenizer
from pyspark.ml.linalg import DenseVector
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
.config("spark.ui.showConsoleProgress","false")\
.appName("IDF").master("local[*]").getOrCreate()
02. Create the data
data = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
], ["label", "sentence"])
data.show()
Output:
+-----+--------------------+
|label| sentence|
+-----+--------------------+
| 0|Hi I heard about ...|
| 0|I wish Java could...|
| 1|Logistic regressi...|
+-----+--------------------+
03. Use a Tokenizer to split each sentence into individual words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
data = tokenizer.transform(data)
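What Tokenizer does here can be approximated in plain Python: it lower-cases the text and splits it on whitespace (a simplified stand-in for the actual implementation):

```python
def tokenize(sentence):
    # Approximates pyspark.ml.feature.Tokenizer: lower-case the
    # text, then split on whitespace.
    return sentence.lower().split()

print(tokenize("Hi I heard about Spark"))
# ['hi', 'i', 'heard', 'about', 'spark']
```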
04. Use HashingTF to turn the words column into a sparse term-frequency vector
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(data)
featurizedData.show()
Output:
+-----+--------------------+--------------------+--------------------+
|label| sentence| words| rawFeatures|
+-----+--------------------+--------------------+--------------------+
| 0|Hi I heard about ...|[hi, i, heard, ab...|(20,[0,5,9,17],[1...|
| 0|I wish Java could...|[i, wish, java, c...|(20,[2,7,9,13,15]...|
| 1|Logistic regressi...|[logistic, regres...|(20,[4,6,13,15,18...|
+-----+--------------------+--------------------+--------------------+
Inspect the schema:
featurizedData.printSchema()
Output:
root
|-- label: long (nullable = true)
|-- sentence: string (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- rawFeatures: vector (nullable = true)
Inspect a single row in detail:
featurizedData.head(1)
Output:
[Row(label=0, sentence='Hi I heard about Spark', words=['hi', 'i', 'heard', 'about', 'spark'], rawFeatures=SparseVector(20, {0: 1.0, 5: 1.0, 9: 1.0, 17: 2.0}))]
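The hashing trick behind HashingTF can be sketched in plain Python: each word is hashed into one of numFeatures buckets, and the counts per bucket form the sparse vector. Spark uses MurmurHash3; the sketch below uses Python's built-in hash() as a stand-in, so the bucket indices will not match the rawFeatures column above:

```python
from collections import Counter

def hashing_tf(words, num_features=20):
    # Hash each word into one of num_features buckets and count
    # occurrences per bucket. Spark uses MurmurHash3; Python's
    # built-in hash() is only an illustrative stand-in.
    counts = Counter(hash(w) % num_features for w in words)
    return dict(sorted(counts.items()))

# Five words spread over 20 buckets. Collisions are possible: in
# Spark's output above, two of the five words land in bucket 17,
# which is why that index has the count 2.0.
print(hashing_tf(['hi', 'i', 'heard', 'about', 'spark']))
```

With only 20 buckets for an open-ended vocabulary, collisions are expected; a larger `numFeatures` reduces them at the cost of wider vectors.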
05. Apply IDF and view the result:
idf = IDF(inputCol="rawFeatures", outputCol="IDF", minDocFreq=4)
model = idf.fit(featurizedData)
IDFData = model.transform(featurizedData)
IDFData.show()
Output: (the original post embedded a screenshot of the IDFData.show() table here; the image is no longer available, but the IDF column values are shown in the next step)
06. Compare the raw feature column with the IDF-transformed column
IDFData.select("rawFeatures","IDF").head(3)
Output:
[Row(rawFeatures=SparseVector(20, {0: 1.0, 5: 1.0, 9: 1.0, 17: 2.0}), IDF=SparseVector(20, {0: 0.0, 5: 0.0, 9: 0.0, 17: 0.0})),
Row(rawFeatures=SparseVector(20, {2: 1.0, 7: 1.0, 9: 3.0, 13: 1.0, 15: 1.0}), IDF=SparseVector(20, {2: 0.0, 7: 0.0, 9: 0.0, 13: 0.0, 15: 0.0})),
Row(rawFeatures=SparseVector(20, {4: 1.0, 6: 1.0, 13: 1.0, 15: 1.0, 18: 1.0}), IDF=SparseVector(20, {4: 0.0, 6: 0.0, 13: 0.0, 15: 0.0, 18: 0.0}))]
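Note that every IDF value in the output above is 0.0. This follows from minDocFreq=4: the corpus has only 3 documents, so no term can reach a document frequency of 4, and every weight is filtered to 0. A minimal sketch of that filtering (assuming Spark's smoothed IDF formula):

```python
import math

def spark_idf(num_docs, doc_freq, min_doc_freq=0):
    # Terms that appear in fewer than min_doc_freq documents are
    # filtered out, i.e. assigned a weight of 0.
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# With 3 documents, no term can satisfy minDocFreq=4:
print(spark_idf(3, 1, min_doc_freq=4))  # 0.0
# Without the threshold, the same term would get a non-zero weight:
print(spark_idf(3, 1))  # log(4/2) ≈ 0.6931
```

Lowering minDocFreq (or leaving the default of 0) would therefore produce non-zero weights for the terms that do not appear in every document.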