Implementing WordCount in Spark

This article presents two ways to implement WordCount with Apache Spark. The first sets up the connection to the Spark framework, reads the input files, splits the lines with flatMap, groups the words with groupBy, and then uses map to turn each group into a count. The second also reads and splits the data, but combines map with reduceByKey, which performs both grouping and aggregation. Both versions run in local mode and print their results to the console.

First implementation:

import org.apache.spark.{SparkConf, SparkContext}

object Spark01_WordCount {
  def main(args: Array[String]): Unit = {
    // Part 1: TODO set up the connection to the Spark framework
    val sparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(sparkConf)
    // Part 2: TODO run the business logic
    // 1. Read the files and get the data line by line: "hello world"
    val lines = sc.textFile("datas")
    // 2. Split each line into words (flatten): "hello world" => hello, world
    val words = lines.flatMap(_.split(" "))
    // 3. Group the words so they can be counted: (hello, (hello, hello, hello)), (world, (world, world))
    val wordGroup = words.groupBy(word => word)
    // 4. Turn each group into a (word, count) pair: (hello, 3), (world, 2)
    val wordToCount = wordGroup.map {
      case (word, list) => {
        (word, list.size)
      }
    }
    // 5. Print the result on the console
    wordToCount.collect().foreach(println)
    // Part 3: TODO close the connection
    sc.stop()
  }
}
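For comparison with the groupByKey operator mentioned in the next section, here is a minimal sketch (not part of the original program) of steps 3 to 5 rewritten around groupByKey: each word is first mapped to a (word, 1) pair, then the 1s are grouped per key and summed. It assumes the words RDD defined in the code above.

    // Sketch only: the counting steps of the first implementation via groupByKey
    // (word, 1) pairs: (hello,1), (world,1), (hello,1), ...
    val wordToOne = words.map(word => (word, 1))
    // groupByKey collects the 1s per word: (hello, (1, 1, 1)), (world, (1, 1))
    val wordGrouped = wordToOne.groupByKey()
    // Summing the 1s in each group gives the count per word: (hello, 3), (world, 2)
    val wordCounts = wordGrouped.mapValues(_.sum)
    wordCounts.collect().foreach(println)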

Second implementation:

reduceByKey is a transformation operator on RDDs: like groupByKey it groups the data by key, but reduceByKey also aggregates the values within each group.

import org.apache.spark.{SparkConf, SparkContext}

object Spark02_WordCount {
  def main(args: Array[String]): Unit = {
    // Part 1: TODO set up the connection to the Spark framework
    val sparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(sparkConf)
    // Part 2: TODO run the business logic
    // 1. Read the files and get the data line by line: "hello world"
    val lines = sc.textFile("datas")
    // 2. Split each line into words (flatten): "hello world" => hello, world
    val words = lines.flatMap(_.split(" "))
    //********************************************
    // 3. Map each word to a pair: hello, world => (hello,1), (world,1)
    val wordToOne = words.map(
      word => (word, 1)
    )
    // 4. reduceByKey: groups by key and aggregates the values within each group
    val wordToCount = wordToOne.reduceByKey(_ + _)
    //*********************************************
    // 5. Print the result on the console
    wordToCount.collect().foreach(println)
    // Part 3: TODO close the connection
    sc.stop()
  }
}
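The _ + _ passed to reduceByKey is Scala shorthand for a two-argument function; written out explicitly (purely for readability, same behavior), step 4 is:

    // Equivalent expanded form of reduceByKey(_ + _):
    // x is the running count for a key, y is the next value (a 1) for that key
    val wordToCount = wordToOne.reduceByKey((x, y) => x + y)

Compared with groupByKey, reduceByKey also combines values on each partition before the data is shuffled, so less data moves across the network, which is why it is generally preferred for aggregations like this.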

Printing only error logs: Spark's console output is very verbose by default. A log4j.properties file like the one below, placed on the classpath (for example under src/main/resources), limits the console to ERROR-level messages:

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
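If editing log4j.properties is not convenient, the log level can also be lowered from inside the application itself; a minimal sketch, assuming the SparkContext sc created in the examples above:

    // Show only ERROR messages for this application (overrides the log4j default)
    sc.setLogLevel("ERROR")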
