Data download address:
http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html
Download 20news-bydate.tar.gz and unzip it; inside there are many folders.
Loading the data:
val path = "/zhouxiaoke/20news-bydate-train/*"
val rdd = sc.wholeTextFiles(path)
rdd.take(1)
val newsgroups = rdd.map { case (file, text) => file.split("/").takeRight(2).head }
newsgroups extracts each file's parent directory name, i.e. the newsgroup topic (not the file name).
val text = rdd.map { case (file, text) => text }
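To see how takeRight(2).head pulls out the newsgroup, here is a sketch on a single made-up path (wholeTextFiles keys look like ".../20news-bydate-train/&lt;newsgroup&gt;/&lt;fileId&gt;"; the file id 53611 here is hypothetical):

```scala
// A made-up example path of the kind wholeTextFiles produces as a key.
val file = "/zhouxiaoke/20news-bydate-train/rec.sport.hockey/53611"

// Split on "/", keep the last two segments (directory, file id),
// and take the first of those: the newsgroup directory name.
val group = file.split("/").takeRight(2).head
// group is "rec.sport.hockey"
```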
val countByGroup = newsgroups.map(n => (n, 1)).reduceByKey(_ + _).collect().sortBy(-_._2).mkString("\n")
The output below shows the number of files under each topic:
(rec.sport.hockey,600)
(soc.religion.christian,599)
(rec.motorcycles,598)
(rec.sport.baseball,597)
(sci.crypt,595)
(rec.autos,594)
(sci.med,594)
(comp.windows.x,593)
...
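The map(n => (n, 1)).reduceByKey(_ + _) pattern above can be sketched on a plain Scala collection, with made-up group names, using groupBy in place of the RDD shuffle:

```scala
// Hypothetical sample of newsgroup labels, standing in for the RDD.
val groups = Seq("sci.med", "rec.autos", "sci.med")

val counts = groups
  .groupBy(identity)                  // groupBy plays the role of reduceByKey's grouping
  .map { case (k, v) => (k, v.size) } // count occurrences per group
  .toSeq
  .sortBy(-_._2)                      // descending by count, like the output above
// counts: (sci.med,2), (rec.autos,1)
```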
Now let's tokenize the text:
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
The tokens extracted this way still contain a lot of noise; next we remove numbers, punctuation, and tokens that occur only rarely.
val regex="[^0-9]*".r
val stopwords = Set( "the","a","an","of","or","in","for","by","on","but", "is", "not", "with", "as", "was", "if", "they", "are", "this", "and", "it", "have", "from", "at", "my", "be", "that", "to" )
val tokenCounts = whiteSpaceSplit.map(t => (t, 1)).reduceByKey(_ + _)
val tokenCountsFilteredStopwords = tokenCounts.filter { case (k, v) => !stopwords.contains(k) }
val tokenCountsFilteredSize = tokenCountsFilteredStopwords.filter { case (k, v) => k.size >= 2 }
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.map(_._1).collect().toSet
val tokenCountsFilteredAll = tokenCountsFilteredSize.filter { case (k, v) => !rareTokens.contains(k) }
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    .filter(token => regex.pattern.matcher(token).matches)
    .filterNot(token => stopwords.contains(token))
    .filterNot(token => rareTokens.contains(token))
    .filter(token => token.size >= 2)
    .toSeq
}
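Apart from rareTokens (which is collected from the RDD), tokenize is pure Scala, so it can be tried out locally. A self-contained sketch on a made-up line, with an empty rareTokens set and a shortened stand-in for the stopword list:

```scala
// Same digit-excluding regex as above; a token matches only if it
// contains no characters 0-9.
val regex = "[^0-9]*".r
val stopwords = Set("the", "a", "of", "to", "and") // shortened stand-in
val rareTokens = Set.empty[String]                 // empty so this runs without Spark

def tokenize(line: String): Seq[String] =
  line.split("""\W+""")                            // split on non-word characters
    .map(_.toLowerCase)
    .filter(token => regex.pattern.matcher(token).matches) // drop tokens containing digits
    .filterNot(stopwords.contains)
    .filterNot(rareTokens.contains)
    .filter(_.size >= 2)                           // drop single-character tokens
    .toSeq

val sample = tokenize("The year 2020 brought a new model to the lab!")
// sample: year, brought, new, model, lab
```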
val tokens = text.map(doc => tokenize(doc))
tokens now holds the cleaned, tokenized text data. What to do with it after this... I'll share that next time... off to eat...