How to extract the TF-IDF value of each word with Spark HashingTF / TF-IDF

1. This is the example code from the official Spark documentation:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()

HashingTF exists for the large-data case: it hashes each word into a fixed number of buckets, i.e. a fixed feature space. For smaller data you generally do not need it; you can index the words directly by the vocabulary itself rather than by their hash values. The problem with the hashing step is that it is not reversible: the correspondence between a word and its hashed index is not kept anywhere, so you cannot tell which word a given index stands for. Instead, you can use CountVectorizer to collect the vocabulary and then zip each word with its position to get a word-to-index mapping. That is essentially all HashingTF does (give every word an index), so it is easy to implement yourself.

Once you have the word/index mapping, count the term frequencies of each sentence, look each word up in the index table to get its integer index (note that the frequency must be a Double), and assemble each sentence into a SparseVector. The indices inside each SparseVector must be sorted in ascending order. Then feed these vectors to the IDF model. A minimal sketch of the idea follows.
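A minimal sketch of that idea, using names of my own choosing: cvModel stands for a fitted CountVectorizerModel like the ones shown further down, and tokens for one tokenized sentence.

import org.apache.spark.ml.linalg.Vectors

// word -> index mapping taken from a fitted CountVectorizerModel (here called cvModel)
val wordIndex: Map[String, Int] = cvModel.vocabulary.zipWithIndex.toMap

// term frequencies of one tokenized sentence
val tokens: Seq[String] = Seq("foo", "bar", "foo")
val termFreqs: Map[Int, Double] = tokens
  .flatMap(wordIndex.get)          // drop any word that is not in the vocabulary
  .groupBy(identity)
  .map { case (idx, occ) => idx -> occ.size.toDouble }

// SparseVector indices must be sorted in ascending order
val (indices, values) = termFreqs.toArray.sortBy(_._1).unzip
val tfVector = Vectors.sparse(wordIndex.size, indices, values)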

A quoted answer to this problem, worth reading if you are interested:
Well, you can't. Since hashing is non-injective there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is impossible to tell which one is actually there.

If you're using a large hash and the number of unique tokens is relatively low, then you can try to create a lookup table from bucket to possible tokens from your dataset. It is a one-to-many mapping, but if the above conditions are met the number of conflicts should be relatively low.

If you need a reversible transformation you can combine Tokenizer and StringIndexer and build a sparse feature vector manually.

See also: What hashing function does Spark use for HashingTF and how do I duplicate it?
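A rough sketch of the lookup-table idea from the quoted answer (this snippet is mine, not from the answer): assuming a DataFrame df with a tokens column of word sequences, as in the CountVectorizer examples below, transform every distinct token as a one-word document and record which bucket it lands in. One bucket may map to several tokens.

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{array, col, explode}

val hashingTF = new HashingTF()
  .setInputCol("tokens").setOutputCol("hashed").setNumFeatures(1 << 18)

// every distinct token of the dataset, wrapped as a one-word "document"
val singleTokens = df
  .select(explode(col("tokens")).as("token"))
  .distinct()
  .select(col("token"), array(col("token")).as("tokens"))

// bucket index -> all tokens that hash into that bucket (one-to-many)
val bucketToTokens: Map[Int, Seq[String]] = hashingTF.transform(singleTokens)
  .select("token", "hashed")
  .collect()
  .map(row => (row.getAs[Vector]("hashed").toSparse.indices.head, row.getString(0)))
  .groupBy(_._1)
  .mapValues(_.map(_._2).toSeq)
  .toMap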
Getting each word's index with CountVectorizer (Python)

from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
    (1, ["foo", "bar"]), (2, ["foo", "foobar", "baz"])
]).toDF(["id", "tokens"])

vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)
vectorizer.vocabulary
## ('foo', 'baz', 'bar', 'foobar')

Getting each word's index with CountVectorizer (Scala)

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = sc.parallelize(Seq(
    (1, Seq("foo", "bar")), (2, Seq("foo", "foobar", "baz"))
)).toDF("id", "tokens")

val model: CountVectorizerModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(df)

model.vocabulary
// Array[String] = Array(foo, baz, bar, foobar)

My solution:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF}
import org.apache.spark.ml.linalg.{Vector, Vectors}

val model: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(rawData)

// vocabulary and the word <-> index mappings in both directions
val arr = model.vocabulary
val len = arr.length
val arr1 = arr.indices.toArray
val ind = arr.zip(arr1).toMap   // word -> index
val indx = arr1.zip(arr).toMap  // index -> word

val wordIndMap = sc.broadcast(ind)
val wordIndMapx = sc.broadcast(indx)

import spark.implicits._
// count term frequencies per row and build a sorted SparseVector of raw TF values
val rddf = rawData.rdd.map(x => (x.getString(0), {
  var map: Map[Int, Double] = Map[Int, Double]()
  x.getAs[Seq[String]](1).foreach { t =>
    // every word is in the vocabulary because the model was fitted on rawData itself
    val key = wordIndMap.value.getOrElse(t, -1)
    map.get(key) match {
      case Some(word) =>
        map += (key -> (word + 1.0))
      case None => // word not seen in this row yet
        map += (key -> 1.0)
    }
  }
  map
})).map(x => (x._1, {
  // SparseVector requires the indices in ascending order
  val t = x._2.toArray.sorted.unzip
  Vectors.sparse(len, t._1, t._2)
})).toDF("user_log_acct", "rawFeatures")

// Computing TF with HashingTF does not work here: the hashed indices cannot be
// mapped back to words, so we could not report a TF-IDF value per word.
/*println("featurizedData----------------")
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(2000) // 2000 hash buckets, i.e. the feature dimension

val featurizedData = hashingTF.transform(rawData)
featurizedData.show(3)*/

// compute IDF and rescale the raw TF vectors
println("rescaledData----------------")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rddf)
var rescaledData = idfModel.transform(rddf)
rescaledData = rescaledData.select("user_log_acct", "features")

// map each index back to its word and flatten to (user, word, tf-idf score) rows
val resdata = rescaledData.rdd.map(x => (x.getAs[String]("user_log_acct"), {
  val vec = x.getAs[Vector]("features").toSparse
  val indix = vec.indices.map(x => wordIndMapx.value.getOrElse(x, "NULL"))
  indix.zip(vec.values)
})).map(x => {
  val pin = x._1
  x._2.map(y => (pin, y._1, y._2))
}).flatMap(x => x).toDF("user_log_acct", "cate_cd", "score")

insertIntoHive(spark, resdata, getYesterday) // insert the results into the Hive table
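As a quick, purely illustrative sanity check (this line is not part of the original solution), the per-word scores can be inspected before writing to Hive:

resdata.orderBy($"user_log_acct", $"score".desc).show(20, truncate = false)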