1. This is the example code from the official Spark documentation:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
HashingTF exists for large data volumes: it hashes each word into a fixed number of buckets. For small datasets you don't actually need it; you can index the words directly by enumerating the vocabulary itself. The catch with hashing is that it is irreversible: the word-to-index correspondence is not retained, so you cannot recover which word produced a given feature index. Instead, you can use CountVectorizer to collect the vocabulary and then zip each word with its position to assign indices. In effect, all HashingTF does is pair an index with each word so that every word has an index, and you can implement that pairing yourself.

Once you have the word-index mapping, count the term frequencies within each sentence, look up each word's index (an Int) in the mapping, and pair it with the word's frequency (note that the frequency must be a Double). Assemble each sentence into a sparse vector, making sure the indices are sorted in ascending order before constructing the SparseVector, and then apply the IDF model.
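A minimal sketch of that idea (the toy data and column names here are illustrative, not from the original post): zip the vocabulary of a fitted CountVectorizerModel with its positions to get the word-index mapping, then build a document's TF vector by hand with the indices sorted ascending.

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vectors

// hypothetical toy data: (id, tokens)
val docs = spark.createDataFrame(Seq(
  (1, Seq("foo", "bar", "foo")), (2, Seq("baz"))
)).toDF("id", "tokens")

val cvModel = new CountVectorizer().setInputCol("tokens").setOutputCol("tf").fit(docs)

// explicit, reversible word <-> index mappings
val wordToIndex: Map[String, Int] = cvModel.vocabulary.zipWithIndex.toMap
val indexToWord: Map[Int, String] = wordToIndex.map(_.swap)

// one document's tokens -> (index, count) pairs -> sorted SparseVector
val tokens = Seq("foo", "bar", "foo")
val counts = tokens.groupBy(identity).map { case (w, ws) =>
  (wordToIndex(w), ws.size.toDouble) // index is Int, frequency is Double
}
val (indices, values) = counts.toArray.sortBy(_._1).unzip
val tfVector = Vectors.sparse(cvModel.vocabulary.length, indices, values)

The full-scale version of this, applied to real data, is the solution shown further below.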
The following quoted answer explains the problem; take a look if interested:
Well, you can’t. Since hashing is non-injective there is no inverse function. In other words, an infinite number of tokens can map to a single bucket, so it is impossible to tell which one is actually there.
If you’re using a large hash and the number of unique tokens is relatively low, then you can try to create a lookup table from bucket to possible tokens from your dataset. It is a one-to-many mapping, but if the above conditions are met the number of conflicts should be relatively low.
If you need a reversible transformation you can combine Tokenizer and StringIndexer and build a sparse feature vector manually.
See also: What hashing function does Spark use for HashingTF and how do I duplicate it?
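As a rough sketch of the lookup-table idea from the quote (the tokens, column names, and numFeatures below are illustrative assumptions): run each distinct token through the same HashingTF as a one-token document and record which bucket it lands in, which yields the one-to-many bucket-to-tokens map.

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.linalg.Vector

// must match the HashingTF configuration used for featurization
val htf = new HashingTF().setInputCol("tokens").setOutputCol("hashed").setNumFeatures(20)

// hypothetical set of distinct tokens collected from the dataset
val distinctTokens = Seq("foo", "bar", "baz", "foobar")

// hash each token as a one-token document and record its bucket
val tokenDF = spark.createDataFrame(distinctTokens.map(t => Tuple1(Seq(t)))).toDF("tokens")
val bucketToTokens: Map[Int, Seq[String]] = htf.transform(tokenDF)
  .collect()
  .map(r => (r.getAs[Vector]("hashed").toSparse.indices.head,
             r.getAs[Seq[String]]("tokens").head))
  .groupBy(_._1)
  .mapValues(pairs => pairs.map(_._2).toSeq)
  .toMap

If numFeatures is large relative to the vocabulary, most buckets map to a single token, so this table lets you approximately invert the hashed indices.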
Using CountVectorizer to get each word's index (Python)
from pyspark.ml.feature import CountVectorizer
df = sc.parallelize([
    (1, ["foo", "bar"]), (2, ["foo", "foobar", "baz"])
]).toDF(["id", "tokens"])
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)
vectorizer.vocabulary
## ('foo', 'baz', 'bar', 'foobar')
Using CountVectorizer to get each word's index (Scala)
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = sc.parallelize(Seq(
  (1, Seq("foo", "bar")), (2, Seq("foo", "foobar", "baz"))
)).toDF("id", "tokens")

val model: CountVectorizerModel = new CountVectorizer()
  .setInputCol("tokens")
  .setOutputCol("features")
  .fit(df)
model.vocabulary
// Array[String] = Array(foo, baz, bar, foobar)
My solution:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Fit CountVectorizer to obtain an explicit vocabulary
val model: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").fit(rawData)
val arr = model.vocabulary
val len = arr.length
val arr1 = arr.indices.toArray
val ind = arr.zip(arr1).toMap   // word -> index
val indx = arr1.zip(arr).toMap  // index -> word
val wordIndMap = sc.broadcast(ind)
val wordIndMapx = sc.broadcast(indx)
import spark.implicits._

// Build each user's TF vector by hand from the vocabulary indices
val rddf = rawData.rdd.map(x => (x.getString(0), {
  var map: Map[Int, Double] = Map[Int, Double]()
  x.getAs[Seq[String]](1).foreach { t =>
    val key = wordIndMap.value.getOrElse(t, -1)
    // skip words missing from the vocabulary: an index of -1
    // would make Vectors.sparse throw
    if (key >= 0) {
      map.get(key) match {
        case Some(cnt) => map += (key -> (cnt + 1.0)) // seen before: increment
        case None      => map += (key -> 1.0)         // first occurrence
      }
    }
  }
  map
})).map(x => (x._1, {
  // indices must be sorted ascending before building the SparseVector
  val t = x._2.toArray.sorted.unzip
  Vectors.sparse(len, t._1, t._2)
})).toDF("user_log_acct", "rawFeatures")

// Computing TF with HashingTF does not work here: the hashed indices
// cannot be mapped back to words, so per-word TF-IDF values are lost.
/*println("featurizedData----------------")
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(2000) // number of hash buckets, i.e. the feature dimension
val featurizedData = hashingTF.transform(rawData)
featurizedData.show(3)*/

// Compute IDF on the hand-built TF vectors
println("rescaledData----------------")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rddf)
var rescaledData = idfModel.transform(rddf)
rescaledData = rescaledData.select("user_log_acct", "features")

// Map feature indices back to words and flatten to (user, word, score) rows
val resdata = rescaledData.rdd.map(x => (x.getAs[String]("user_log_acct"), {
  val vec = x.getAs[Vector]("features").toSparse
  val indix = vec.indices.map(i => wordIndMapx.value.getOrElse(i, "NULL"))
  indix.zip(vec.values)
})).map(x => {
  val pin = x._1
  x._2.map(y => (pin, y._1, y._2))
}).flatMap(x => x).toDF("user_log_acct", "cate_cd", "score")

insertIntoHive(spark, resdata, getYesterday) // write the results into the Hive table
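As an aside: the hand-rolled TF step above recomputes what the fitted CountVectorizerModel's own transform already produces, namely term-count sparse vectors indexed by the vocabulary. A shorter variant is therefore possible; this is only a sketch, assuming the same rawData and column names as above.

// CountVectorizerModel.transform emits term-count sparse vectors indexed
// by the vocabulary, so it can stand in for the manual TF construction
val tfData = model.setInputCol("words").setOutputCol("rawFeatures").transform(rawData)
val idfModel2 = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tfData)
val rescaled2 = idfModel2.transform(tfData)

The index-to-word mapping through model.vocabulary then works exactly as before.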