https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Wordcount.html
In this example, we take lines of text and split them into words. Next, we count the number of occurrences of each word in the set using a variety of Spark APIs.
>dbutils.fs.put("/home/spark/1.6/lines","""
Hello hello world
Hello how are you world
""", true)
Wrote 43 bytes.
res0: Boolean = true
>
import org.apache.spark.sql.functions._
// Load a text file and interpret each line as a java.lang.String
val ds = sqlContext.read.text("/home/spark/1.6/lines").as[String]
val result = ds
  .flatMap(_.split(" "))               // Split on whitespace
  .filter(_ != "")                     // Filter empty words
  .toDF()                              // Convert to DataFrame to perform aggregation / sorting
  .groupBy($"value")                   // Count the number of occurrences of each word
  .agg(count("*") as "numOccurrences")
  .orderBy($"numOccurrences" desc)     // Show most common words first
display(result)
value	numOccurrences
world	2
Hello	2
are	1
hello	1
how	1
you	1
It is also possible to perform the aggregation in pure Scala instead of switching to DataFrames. In the following example, we perform the same word count, normalizing the case of each word (i.e. grouping "hello" and "Hello" together).
>
val wordCount =
  ds
    .flatMap(_.split(" "))
    .filter(_ != "")
    .groupBy(_.toLowerCase()) // Instead of grouping on a column expression (i.e. $"value"), we pass a lambda function
    .count()
display(wordCount.toDF())
are 1
hello 3
how 1
world 2
you 1
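Outside a Databricks notebook, `display` is not available, but the logic of the normalized word count can be sanity-checked without Spark using plain Scala collections. The following sketch (an illustration, not part of the original example) applies the same split / filter / case-normalized groupBy pipeline to the two input lines written by `dbutils.fs.put` above:

```scala
// Plain-Scala sanity check of the normalized word count, using the same
// two lines of input as the Dataset example above.
val lines = Seq("Hello hello world", "Hello how are you world")

val wordCount: Map[String, Int] =
  lines
    .flatMap(_.split(" "))  // split each line on whitespace
    .filter(_ != "")        // drop empty words
    .groupBy(_.toLowerCase) // normalize case, as in the Dataset version
    .map { case (word, occurrences) => word -> occurrences.size }

// e.g. wordCount("hello") == 3, wordCount("world") == 2
```

Because `Seq#groupBy` returns a `Map[String, Seq[String]]`, taking the size of each group gives the per-word count directly, mirroring `.groupBy(_.toLowerCase()).count()` on the Dataset.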