1. Create a SparkSession
val sparkSession = SparkSession.builder()
.appName("SparkWordCount")
.master("local[2]")
.getOrCreate()
2. Load the data and process it as a Dataset
read.textFile returns a Dataset[String], a higher-level abstraction than an RDD.
It yields a single column named value:
+------------+
| value|
+------------+
|hello tunter|
| hello tony|
| tony hunter|
+------------+
val datas: Dataset[String] =
  sparkSession.read.textFile("hdfs://192.168.252.121:9000/words.txt")
3. Split the data
// import the session implicits (they provide the Encoder that flatMap needs)
import sparkSession.implicits._
val word: Dataset[String] = datas.flatMap(_.split(" "))
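For intuition, flatMap on a Dataset applies the same per-element logic as flatMap on an ordinary Scala collection: each line is split into words and the results are concatenated. A purely local sketch (no Spark involved):

```scala
// Local illustration of what datas.flatMap(_.split(" ")) does per element.
val lines = Seq("hello tunter", "hello tony", "tony hunter")
val words = lines.flatMap(_.split(" "))
// words: Seq("hello", "tunter", "hello", "tony", "tony", "hunter")
```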
4. Register a temporary view
word.createTempView("wc_t")
The view is scoped to this SparkSession; use createOrReplaceTempView if the name may already exist.
5. Run the SQL
val df: DataFrame = sparkSession.sql("select value, count(*) sum from wc_t group by value order by sum desc")
df.show()
sparkSession.stop()
Result:
+------+---+
| value|sum|
+------+---+
| hello| 2|
| tony| 2|
|hunter| 1|
|tunter| 1|
+------+---+
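The SQL above is a group-by count followed by a descending sort. The same logic can be mirrored on plain Scala collections for intuition (this is a local-only illustration, not the distributed Spark execution):

```scala
// Mirror of: select value, count(*) sum from wc_t group by value order by sum desc
val words = Seq("hello", "tunter", "hello", "tony", "tony", "hunter")
val counts = words
  .groupBy(identity)                    // group by word (the "value" column)
  .map { case (w, ws) => (w, ws.size) } // count(*) per group
  .toSeq
  .sortBy(-_._2)                        // order by sum desc
// counts.toMap == Map("hello" -> 2, "tony" -> 2, "hunter" -> 1, "tunter" -> 1)
```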
Complete code:
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SparkSqlWordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkSession
    val sparkSession = SparkSession.builder()
      .appName("SparkSqlWordCount")
      .master("local[2]")
      .getOrCreate()
    // 2. Load the data as a Dataset
    val datas: Dataset[String] = sparkSession.read
      .textFile("hdfs://192.168.252.121:9000/words.txt")
    // 3. Split the data into words (the implicits provide the Encoder for flatMap)
    import sparkSession.implicits._
    val word: Dataset[String] = datas.flatMap(_.split(" "))
    // 4. Register a temporary view
    word.createTempView("wc_t")
    // 5. Run the SQL query
    val frame: DataFrame = sparkSession.sql("select value, count(*) sum from wc_t group by value order by sum desc")
    frame.show()
    sparkSession.stop()
  }
}