Test data source: 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), a collection of news articles spanning 20 topics. Here we use 20news-bydate-train as the test data; its structure is as follows.
Spark Task:
Extract feature keywords from a set of articles, for later use in retrieval and classification (each keyword is a single word).
Input file format:
(article_id,content...)
(article_id,content...)
(article_id,content...)
Required output format:
(article_id, top 20 feature keywords of the article)
The questions to be solved:
1. Although MLlib provides a TF-IDF implementation, the article id cannot be tracked through it. (Hint: use the wholeTextFiles and zip functions.)
2. MLlib outputs the TF-IDF weights of all words in an article as a vector, so the format must be converted.
Implementation code:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val stopwords = Set(
  "the","a","an","of","or","in","for","by","on","but", "is", "not", "with", "as", "was", "if",
  "they", "are", "this", "and", "it", "have", "from", "at", "my", "be", "that", "to","what","which"
)
val regexNum = "[^0-9]*".r
def tokenize(content: String): Seq[String] = {
  content.split("\\W+")
    .map(_.toLowerCase)
    .filter(regexNum.pattern.matcher(_).matches) // keep only words that contain no digits
    .filterNot(stopwords.contains)               // drop stop words
    .filter(_.length > 2)                        // drop words shorter than 3 characters
    .toSeq
}
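As a quick illustration, tokenize needs no Spark at all: it splits on non-word characters, lowercases, and drops digit-containing words, stop words, and very short words. The definitions are repeated below (with the same bodies as above) so the snippet runs standalone; the input sentence is an invented example.

```scala
// Standalone sketch of tokenize; definitions repeated from the article so this runs on its own.
val stopwords = Set(
  "the","a","an","of","or","in","for","by","on","but","is","not","with","as","was","if",
  "they","are","this","and","it","have","from","at","my","be","that","to","what","which"
)
val regexNum = "[^0-9]*".r
def tokenize(content: String): Seq[String] = {
  content.split("\\W+")
    .map(_.toLowerCase)
    .filter(regexNum.pattern.matcher(_).matches) // keep only words with no digits
    .filterNot(stopwords.contains)               // drop stop words
    .filter(_.length > 2)                        // drop words shorter than 3 characters
    .toSeq
}

val tokens = tokenize("The Spark 1.6 API is great, and tokenizing is easy!")
// tokens == Seq("spark", "api", "great", "tokenizing", "easy")
```

Note that "1" and "6" are removed by the digit filter, and "the", "is", "and" by the stop-word filter.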
def doTest() {
  val sc = new SparkContext()
  val path = "hdfs://1.185.74.124:9000/javis/20news-bydate-train/*"
  val rdd = sc.wholeTextFiles(path)
  val titles = rdd.map(_._1)
  val documents = rdd.map(_._2).map(tokenize)
  val hashingTF = new HashingTF()
  // record the hash index of every word so indices can be mapped back to words later
  val mapWords = documents.flatMap(x => x).map(w => (hashingTF.indexOf(w), w)).collect.toMap
  val tf = hashingTF.transform(documents)
  val bcWords = tf.context.broadcast(mapWords)
  tf.cache()
  val idf = new IDF(minDocFreq = 2).fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)
  val r = tfidf.map { case v: SparseVector =>
    val words = v.indices.map(index => bcWords.value.getOrElse(index, "null"))
    words.zip(v.values).sortBy(-_._2).take(20).toSeq
  }
  titles.zip(r).saveAsTextFile("hdfs://1.185.74.124:9000/20new_result_" + System.currentTimeMillis)
}
Code analysis:
1. wholeTextFiles reads every file under 20news-bydate-train, yielding records of the form (fileName, content).
2. From this RDD we derive the file names (titles) and the contents (documents). Each document is preprocessed: we extract its words, filter out stop words and very short words, and then convert the result to a Seq; see the tokenize function for details.
3. We record the mapping from every word's hash index to the word itself in mapWords: Map[Int, String].
4. MLlib's HashingTF and IDF produce the TF-IDF weights of each document as a Vector. Using mapWords we translate each index back to its concrete word, giving pairs of the form (word, tfidf weight).
5. We use zip to combine each title with its document's (word, tfidf weight) list. Since titles and r are both derived from the same RDD via map, their partitioning and per-partition ordering are preserved, so the zip is valid.
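The post-processing in steps 3–5 can be sketched without Spark. Everything below is invented stand-in data (the hash indices, weights, and title are not from the real job); it only demonstrates the index-to-word translation, the top-N selection by weight, and the final zip with titles.

```scala
// Hypothetical index -> word mapping, playing the role of mapWords (indices are invented).
val mapWords = Map(11 -> "unisql", 42 -> "tek", 7 -> "frip")

// One document's sparse TF-IDF result: hash indices and their weights (invented values).
val indices = Array(7, 11, 42)
val values  = Array(7.38, 12.31, 9.12)

// Step 4: translate indices back to words, pair each word with its weight,
// sort by descending weight, and keep the top 2 (the real job keeps 20).
val words = indices.map(i => mapWords.getOrElse(i, "null"))
val top = words.zip(values).sortBy(-_._2).take(2).toSeq
// top == Seq(("unisql", 12.31), ("tek", 9.12))

// Step 5: zip the titles with the per-document keyword lists,
// relying on both sequences sharing the same order, as with the two RDDs.
val titles = Seq("doc_102860")
val result = titles.zip(Seq(top))
```

On RDDs the same zip is only safe because both sides come from map over one parent RDD; with independently built RDDs, a keyed join would be needed instead.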
A sample of the final output:
(file:/home/javis/Documents/20news-bydate-train/rec.autos/102860,WrappedArray((unisql,12.311661455439385), (tek,9.126399867203945), (frip,7.387974409012324).....
(file:/home/javis/Documents/20news-bydate-train/rec.autos/103330,WrappedArray((cutting,21.306205491340666),(hou,18.22736406013847),(lehigh,15.872499870699261).....