Test data source: 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), a collection of news articles spanning 20 topics. Here we use 20news-bydate-train as the test data; its structure is as follows.
Spark Task:
Extract feature keywords from a set of articles, for later use in retrieval and classification (each keyword is a single word).
Input file format:
(article_id,content...)
(article_id,content...)
(article_id,content...)
Required output format:
(article_id, top 20 feature keywords of the article)
The questions to be solved:
1. Although MLlib provides a TF-IDF implementation, the article id cannot be tracked through it. (Hint: use the wholeTextFiles and zip functions.)
2. MLlib outputs the TF-IDF weights of all words in an article as a vector, so the format must be converted.
Implementation code:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val stopwords = Set(
  "the","a","an","of","or","in","for","by","on","but", "is", "not", "with", "as", "was", "if",
  "they", "are", "this", "and", "it", "have", "from", "at", "my", "be", "that", "to","what","which"
)
val regexNum = "[^0-9]*".r
def tokenize(content: String): Seq[String] = {
  content.split("\\W+")
    .map(_.toLowerCase)
    .filter(regexNum.pattern.matcher(_).matches) // keep only words that contain no digits
    .filterNot(stopwords.contains)               // drop stop words
    .filter(_.length > 2)                        // drop words shorter than 3 characters
    .toSeq
}
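As a quick illustration, tokenize needs no Spark at all: it splits on non-word characters, lowercases, and drops digit-containing words, stop words, and very short words. The definitions are repeated below (with the same bodies as above) so the snippet runs standalone; the input sentence is an invented example.

```scala
// Standalone sketch of tokenize; definitions repeated from the article so this runs on its own.
val stopwords = Set(
  "the","a","an","of","or","in","for","by","on","but","is","not","with","as","was","if",
  "they","are","this","and","it","have","from","at","my","be","that","to","what","which"
)
val regexNum = "[^0-9]*".r
def tokenize(content: String): Seq[String] = {
  content.split("\\W+")
    .map(_.toLowerCase)
    .filter(regexNum.pattern.matcher(_).matches) // keep only words with no digits
    .filterNot(stopwords.contains)               // drop stop words
    .filter(_.length > 2)                        // drop words shorter than 3 characters
    .toSeq
}

val tokens = tokenize("The Spark 1.6 API is great, and tokenizing is easy!")
// tokens == Seq("spark", "api", "great", "tokenizing", "easy")
```

Note that "1" and "6" are removed by the digit filter, and "the", "is", "and" by the stop-word filter.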
def doTest() {
  val sc = new SparkContext()
  val path = "hdfs://1.185.74.124:9000/javis/20news-bydate-train/*"
  val rdd = sc.wholeTextFiles(path)
  val titles = rdd.map(_._1)
  val documents = rdd.map(_._2).map(tokenize)
  val hashingTF = new HashingTF()
  // record the hash index of every word so indices can be mapped back to words later
  val mapWords = documents.flatMap(x => x).map(w => (hashingTF.indexOf(w), w)).collect.toMap
  val tf = hashingTF.transform(documents)
  val bcWords = tf.context.broadcast(mapWords)
  tf.cache()
  val idf = new IDF(minDocFreq = 2).fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)
  val r = tfidf.map { case v: SparseVector =>
    val words = v.indices.map(index => bcWords.value.getOrElse(index, "null"))
    words.zip(v.values).sortBy(-_._2).take(20).toSeq
  }
  titles.zip(r).saveAsTextFile("hdfs://1.185.74.124:9000/20new_result_" + System.currentTimeMillis)
}
Code analysis:
1. wholeTextFiles reads every file under 20news-bydate-train, yielding records of the form (fileName, content).
2. From this RDD we derive the file names (titles) and the contents (documents). Each document is preprocessed: we extract its words, filter out stop words and very short words, and then convert the result to a Seq; see the tokenize function for details.
3. We record the mapping from every word's hash index to the word itself in mapWords: Map[Int, String].
4. MLlib's HashingTF and IDF produce the TF-IDF weights of each document as a Vector. Using mapWords we translate each index back to its concrete word, giving pairs of the form (word, tfidf weight).
5. We use zip to combine each title with its document's (word, tfidf weight) list. Since titles and r are both derived from the same RDD via map, their partitioning and per-partition ordering are preserved, so the zip is valid.
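The post-processing in steps 3–5 can be sketched without Spark. Everything below is invented stand-in data (the hash indices, weights, and title are not from the real job); it only demonstrates the index-to-word translation, the top-N selection by weight, and the final zip with titles.

```scala
// Hypothetical index -> word mapping, playing the role of mapWords (indices are invented).
val mapWords = Map(11 -> "unisql", 42 -> "tek", 7 -> "frip")

// One document's sparse TF-IDF result: hash indices and their weights (invented values).
val indices = Array(7, 11, 42)
val values  = Array(7.38, 12.31, 9.12)

// Step 4: translate indices back to words, pair each word with its weight,
// sort by descending weight, and keep the top 2 (the real job keeps 20).
val words = indices.map(i => mapWords.getOrElse(i, "null"))
val top = words.zip(values).sortBy(-_._2).take(2).toSeq
// top == Seq(("unisql", 12.31), ("tek", 9.12))

// Step 5: zip the titles with the per-document keyword lists,
// relying on both sequences sharing the same order, as with the two RDDs.
val titles = Seq("doc_102860")
val result = titles.zip(Seq(top))
```

On RDDs the same zip is only safe because both sides come from map over one parent RDD; with independently built RDDs, a keyed join would be needed instead.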
A sample of the final output:
(file:/home/javis/Documents/20news-bydate-train/rec.autos/102860,WrappedArray((unisql,12.311661455439385), (tek,9.126399867203945), (frip,7.387974409012324).....
(file:/home/javis/Documents/20news-bydate-train/rec.autos/103330,WrappedArray((cutting,21.306205491340666),(hou,18.22736406013847),(lehigh,15.872499870699261).....