https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Wordcount.html
In this example, we take lines of text and split them into words. Next, we count the number of occurrences of each word in the set using a variety of Spark APIs.
>dbutils.fs.put("/home/spark/1.6/lines","""
Hello hello world
Hello how are you world
""", true)
Wrote 43 bytes.
res0: Boolean = true
>
import org.apache.spark.sql.functions._
// Load a text file and interpret each line as a java.lang.String
val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]
val result = ds
  .flatMap(_.split(" "))               // Split on whitespace
  .filter(_ != "")                     // Filter empty words
  .toDF()                              // Convert to DataFrame to perform aggregation / sorting
  .groupBy($"value")                   // Count the number of occurrences of each word
  .agg(count("*") as "numOccurrences")
  .orderBy($"numOccurrences" desc)     // Show most common words first
display(result)
value	numOccurrences
world	2
Hello	2
are	1
hello	1
how	1
you	1
It is also possible to perform the aggregation in pure Scala instead of switching to DataFrames. In the following example, we perform the same word count, normalizing the case of each word (i.e. grouping "hello" and "Hello" together).
>
val wordCount =
  ds
    .flatMap(_.split(" "))
    .filter(_ != "")
    .groupBy(_.toLowerCase()) // Instead of grouping on a column expression (i.e. $"value"), we pass a lambda function
    .count()
display(wordCount.toDF())
are 1
hello 3
how 1
world 2
you 1
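Outside a Databricks notebook, `display` is not available, but the logic of the normalized word count can be sanity-checked without Spark using plain Scala collections. The following sketch (an illustration, not part of the original example) applies the same split / filter / case-normalized groupBy pipeline to the two input lines written by `dbutils.fs.put` above:

```scala
// Plain-Scala sanity check of the normalized word count, using the same
// two lines of input as the Dataset example above.
val lines = Seq("Hello hello world", "Hello how are you world")

val wordCount: Map[String, Int] =
  lines
    .flatMap(_.split(" "))  // split each line on whitespace
    .filter(_ != "")        // drop empty words
    .groupBy(_.toLowerCase) // normalize case, as in the Dataset version
    .map { case (word, occurrences) => word -> occurrences.size }

// e.g. wordCount("hello") == 3, wordCount("world") == 2
```

Because `Seq#groupBy` returns a `Map[String, Seq[String]]`, taking the size of each group gives the per-word count directly, mirroring `.groupBy(_.toLowerCase()).count()` on the Dataset.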