1. This is the example code from the official Spark documentation:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
HashingTF exists for large data volumes: it hashes each word into a fixed number of buckets. For small datasets you don't actually need it; you can index the words directly by enumerating the vocabulary itself. The catch with hashing is that it is irreversible: the word-to-index correspondence is not retained, so you cannot recover which word produced a given feature index. Instead, you can use CountVectorizer to collect the vocabulary and then zip each word with its position to assign indices. In effect, all HashingTF does is pair an index with each word so that every word has an index, and you can implement that pairing yourself.

Once you have the word-index mapping, count the term frequencies within each sentence, look up each word's index (an Int) in the mapping, and pair it with the word's frequency (note that the frequency must be a Double). Assemble each sentence into a sparse vector, making sure the indices are sorted in ascending order before constructing the SparseVector, and then apply the IDF model.
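A minimal sketch of that idea (the toy data and column names here are illustrative, not from the original post): zip the vocabulary of a fitted CountVectorizerModel with its positions to get the word-index mapping, then build a document's TF vector by hand with the indices sorted ascending.

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vectors

// hypothetical toy data: (id, tokens)
val docs = spark.createDataFrame(Seq(
  (1, Seq("foo", "bar", "foo")), (2, Seq("baz"))
)).toDF("id", "tokens")

val cvModel = new CountVectorizer().setInputCol("tokens").setOutputCol("tf").fit(docs)

// explicit, reversible word <-> index mappings
val wordToIndex: Map[String, Int] = cvModel.vocabulary.zipWithIndex.toMap
val indexToWord: Map[Int, String] = wordToIndex.map(_.swap)

// one document's tokens -> (index, count) pairs -> sorted SparseVector
val tokens = Seq("foo", "bar", "foo")
val counts = tokens.groupBy(identity).map { case (w, ws) =>
  (wordToIndex(w), ws.size.toDouble) // index is Int, frequency is Double
}
val (indices, values) = counts.toArray.sortBy(_._1).unzip
val tfVector = Vectors.sparse(cvModel.vocabulary.length, indices, values)

The full-scale version of this, applied to real data, is the solution shown further below.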
The following quoted answer explains the problem; take a look if interested:
Well, you can’t. Since hashing is non-injective there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is impossible to tell which one is actually there.
If you’re using a large hash and the number of unique tokens is relatively low, then you can try to create a lookup table from bucket to possible tokens from your dataset. It is a one-to-many mapping, but if the above conditions are met the number of conflicts should be relatively low.
If you need a reversible transformation you can combine Tokenizer and StringIndexer and build a sparse feature vector manually.
See also: What hashing function does Spark use for HashingTF and how do I duplicate it?
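As a rough sketch of the lookup-table idea from the quote (the tokens, column names, and numFeatures below are illustrative assumptions): run each distinct token through the same HashingTF as a one-token document and record which bucket it lands in, which yields the one-to-many bucket-to-tokens map.

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.linalg.Vector

// must match the HashingTF configuration used for featurization
val htf = new HashingTF().setInputCol("tokens").setOutputCol("hashed").setNumFeatures(20)

// hypothetical set of distinct tokens collected from the dataset
val distinctTokens = Seq("foo", "bar", "baz", "foobar")

// hash each token as a one-token document and record its bucket
val tokenDF = spark.createDataFrame(distinctTokens.map(t => Tuple1(Seq(t)))).toDF("tokens")
val bucketToTokens: Map[Int, Seq[String]] = htf.transform(tokenDF)
  .collect()
  .map(r => (r.getAs[Vector]("hashed").toSparse.indices.head,
             r.getAs[Seq[String]]("tokens").head))
  .groupBy(_._1)
  .mapValues(pairs => pairs.map(_._2).toSeq)
  .toMap

If numFeatures is large relative to the vocabulary, most buckets map to a single token, so this table lets you approximately invert the hashed indices.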
Using CountVectorizer to get each word's index (Python)
from pyspark.ml.feature import CountVectorizer
df = sc.parallelize([
    (1, ["foo", "bar"]), (2, ["foo", "foobar", "baz"])
]).toDF(["id", "tokens"])
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)
vectorizer.vocabulary
## ('foo', 'baz', 'bar', 'foobar')
Using CountVectorizer to get each word's index (Scala)
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = sc.parallelize(Seq(
  (1, Seq("foo", "bar")), (2, Seq("foo", "foobar", "baz"))
)).toDF("id", "tokens")

val model: CountVectorizerModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(df)
model.vocabulary
// Array[String] = Array(foo, baz, bar, foobar)
My solution:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Fit CountVectorizer to obtain an explicit vocabulary
val model: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").fit(rawData)
val arr = model.vocabulary
val len = arr.length
val arr1 = arr.indices.toArray
val ind = arr.zip(arr1).toMap   // word -> index
val indx = arr1.zip(arr).toMap  // index -> word
val wordIndMap = sc.broadcast(ind)
val wordIndMapx = sc.broadcast(indx)
import spark.implicits._

// Build each user's TF vector by hand from the vocabulary indices
val rddf = rawData.rdd.map(x => (x.getString(0), {
  var map: Map[Int, Double] = Map[Int, Double]()
  x.getAs[Seq[String]](1).foreach { t =>
    val key = wordIndMap.value.getOrElse(t, -1)
    // skip words missing from the vocabulary: an index of -1
    // would make Vectors.sparse throw
    if (key >= 0) {
      map.get(key) match {
        case Some(cnt) => map += (key -> (cnt + 1.0)) // seen before: increment
        case None      => map += (key -> 1.0)         // first occurrence
      }
    }
  }
  map
})).map(x => (x._1, {
  // indices must be sorted ascending before building the SparseVector
  val t = x._2.toArray.sorted.unzip
  Vectors.sparse(len, t._1, t._2)
})).toDF("user_log_acct", "rawFeatures")

// Computing TF with HashingTF does not work here: the hashed indices
// cannot be mapped back to words, so per-word TF-IDF values are lost.
/*println("featurizedData----------------")
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(2000) // number of hash buckets, i.e. the feature dimension
val featurizedData = hashingTF.transform(rawData)
featurizedData.show(3)*/

// Compute IDF on the hand-built TF vectors
println("rescaledData----------------")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rddf)
var rescaledData = idfModel.transform(rddf)
rescaledData = rescaledData.select("user_log_acct", "features")

// Map feature indices back to words and flatten to (user, word, score) rows
val resdata = rescaledData.rdd.map(x => (x.getAs[String]("user_log_acct"), {
  val vec = x.getAs[Vector]("features").toSparse
  val indix = vec.indices.map(i => wordIndMapx.value.getOrElse(i, "NULL"))
  indix.zip(vec.values)
})).map(x => {
  val pin = x._1
  x._2.map(y => (pin, y._1, y._2))
}).flatMap(x => x).toDF("user_log_acct", "cate_cd", "score")

insertIntoHive(spark, resdata, getYesterday) // write the results into the Hive table
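As an aside: the hand-rolled TF step above recomputes what the fitted CountVectorizerModel's own transform already produces, namely term-count sparse vectors indexed by the vocabulary. A shorter variant is therefore possible; this is only a sketch, assuming the same rawData and column names as above.

// CountVectorizerModel.transform emits term-count sparse vectors indexed
// by the vocabulary, so it can stand in for the manual TF construction
val tfData = model.setInputCol("words").setOutputCol("rawFeatures").transform(rawData)
val idfModel2 = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tfData)
val rescaled2 = idfModel2.transform(tfData)

The index-to-word mapping through model.vocabulary then works exactly as before.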