Word Count
Shell mode (spark-shell)
Read a local file
val lineRdd = sc.textFile("file:/opt/spark-2.4.5/aa.txt")
Split each line into words (flatMap flattens everything into one RDD of words)
val wordRdd = lineRdd.flatMap(line => line.split(" "))
For comparison, map keeps each line's words grouped in a separate array; see the sketch below
lineRdd.map(line => line.split(" ")).collect
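A minimal sketch of the map/flatMap difference, using a hypothetical two-line input (not the contents of aa.txt):
// Hypothetical input built directly in the shell
val demo = sc.parallelize(Seq("hello spark", "hello scala"))
// map: one array per line -> Array(Array(hello, spark), Array(hello, scala))
demo.map(line => line.split(" ")).collect
// flatMap: flattened words -> Array(hello, spark, hello, scala)
demo.flatMap(line => line.split(" ")).collect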
Convert each word to a key-value pair
val pairRdd=wordRdd.map(word => (word,1))
Group by key and accumulate the counts
val reduceRdd = pairRdd.reduceByKey((x, y) => x + y)
Output the result
reduceRdd.collect
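A quick sketch of what reduceByKey does to the pairs, again with hypothetical data:
// Hypothetical pairs, as produced by the map step above
val pairs = sc.parallelize(Seq(("hello", 1), ("spark", 1), ("hello", 1)))
// Values sharing a key are merged pairwise -> Array((hello,2), (spark,1))
pairs.reduceByKey((x, y) => x + y).collect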
//Lazy evaluation: transformations (textFile, flatMap, map, reduceByKey) only build the pipeline; an action actually runs it
xxx.collect / xxx.collect() (the parentheses are optional; both forms are equivalent)
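A sketch that makes the laziness visible, using a hypothetical nonexistent path:
// This line succeeds even though the file does not exist...
val ghost = sc.textFile("file:/no/such/file.txt").flatMap(_.split(" "))
// ...the error only surfaces when an action forces execution
// ghost.collect()  // fails here (input-path error), not at the definition above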
All of the above as one chained expression
sc.textFile("file:/opt/spark-2.4.5/aa.txt").flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey((x,y) => x+y).collect()
Simplified version, using Scala placeholder syntax: _.split(" ") stands for line => line.split(" "), (_,1) for word => (word,1), and _+_ for (x,y) => x+y
sc.textFile("file:/opt/spark-2.4.5/aa.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
IDEA mode:
package com.spark.core

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Word count
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    //For local runs on Windows, point Spark at a Hadoop install first:
    //System.setProperty("hadoop.home.dir", "D:\\soft\\hadoop\\hadoop-2.9.2")
    //1. Create the SparkContext, the entry point of Spark Core
    val conf = new SparkConf()/*.setMaster("local")*/.setAppName("wordcount")
    val sc = new SparkContext(conf)
    //2. Word count
    sc.textFile(args(0))
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((x, y) => x + y)
      .saveAsTextFile(args(1)) //output path
      //.foreach(println)      //or print to the console instead
    //3. Shut down the environment
    sc.stop()
  }
}
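To build this in IDEA the project needs the spark-core dependency; a minimal build.sbt sketch (the Scala version here is an assumption chosen to match Spark 2.4.5):
// build.sbt — minimal sketch; Spark 2.4.5 ships for Scala 2.11/2.12
scalaVersion := "2.11.12"
// "provided": the cluster supplies Spark at runtime, so it is not packaged into the jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"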
Hadoop: a path without a file:/// prefix defaults to HDFS.
Note: to run Spark on its own (without Hadoop), comment out this line in /opt/spark-2.4.5/conf/spark-env.sh: export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
Spark: a path without a prefix then defaults to the local filesystem; to access HDFS, add an hdfs:// prefix.
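A sketch of the two URI forms (the HDFS namenode host/port below are assumptions):
val local  = sc.textFile("file:/opt/spark-2.4.5/aa.txt")  // local filesystem
val remote = sc.textFile("hdfs://linux-star:9000/aa.txt") // HDFS, assumed namenode address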
Standalone mode: Spark must be running (Hadoop is not needed). Note that the output directory passed as args(1) must not already exist, or saveAsTextFile will fail:
bin/spark-submit --master spark://linux-star:7077 --class com.spark.core.WordCount spark_demo-1.0-SNAPSHOT.jar file:/opt/spark-2.4.5/aa.txt file:/opt/spark-2.4.5/output
YARN mode: Hadoop must be running and the standalone Spark cluster stopped:
bin/spark-submit --master yarn --class com.spark.core.WordCount spark_demo-1.0-SNAPSHOT.jar file:/opt/spark-2.4.5/aa.txt file:/opt/spark-2.4.5/out