Spark: Implementing WordCount in Scala and Java


To write Scala in IDEA, I installed and configured the IntelliJ IDEA development environment today. IDEA is excellent; once you learn it, it is very comfortable to work with. For how to set up the Scala + IDEA development environment, see the references at the end of this post.

WordCount is implemented in both Scala and Java. The Java version, JavaWordCount, is the example bundled with Spark ($SPARK_HOME/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java).

1. Environment

  • OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
  • Hadoop: 2.4.1
  • JDK: 1.7.0_60
  • Spark: 1.1.0
  • Scala: 2.11.2
  • IDE: IntelliJ IDEA 13.1.3

Note: on the client Windows machine you also need to install IDEA, Scala, and the JDK, and install the Scala plugin for IDEA.

2. Word Count in Scala

package com.hq

/**
 * User: hadoop
 * Date: 2014/10/10
 * Time: 18:59
 */
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

/**
 * Count word occurrences
 */
object WordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val line = sc.textFile(args(0))

    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)

    sc.stop()
  }
}
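The whole computation is the single transformation chain applied to the RDD returned by textFile. The following is a minimal sketch, not part of the original program, that breaks that chain into named steps; the intermediate names are illustrative, and it assumes the same sc and args(0) as above.

// Break the one-line pipeline into named intermediate RDDs (names are illustrative)
val lines  = sc.textFile(args(0))            // RDD[String]: one element per line of input
val words  = lines.flatMap(_.split(" "))     // RDD[String]: one element per word
val pairs  = words.map(word => (word, 1))    // RDD[(String, Int)]: each word paired with a 1
val counts = pairs.reduceByKey(_ + _)        // RDD[(String, Int)]: the 1s summed per distinct word
counts.collect().foreach(println)            // bring the results back to the driver and print them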

3. Word Count in Java

The Java version is Spark's bundled JavaWordCount example (see the path in the introduction above), so its code is not reproduced here.

4. Packaging and Running with IDEA

4.1 IDEA project structure

Create a Scala project in IDEA and add the Spark API jar (spark-assembly-1.1.0-hadoop2.4.0.jar, found under $SPARK_HOME/lib/) to the project as a library.

 

4.2 Building the jar

File ---> Project Structure 

 

After the artifact is configured, choose Build -> Build Artifacts... from the menu bar and run the Build action to package it. When packaging finishes, the status bar shows a "Compilation completed successfully..." message, and the jar can be found under the artifact's output path.

ScalaTest1848.jar is the jar produced from our project; it contains three classes: HelloWord, WordCount, and JavaWordCount.

This jar can be used to run either the Java or the Scala word-count program on the Spark cluster.

4.3 Running word count on the Spark cluster in standalone mode

Upload the jar to the server and place it at /home/ebupt/test/WordCount.jar.

Upload a plain-text file to HDFS to serve as the word-count input: hdfs://eb170:8020/user/ebupt/text (the file's contents are not reproduced here).

Submit the job with the spark-submit command; for detailed usage, see spark-submit --help:

[ebupt@eb174 bin]$ spark-submit --help
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Usage: spark-submit [options] <app jar | python file> [app options]
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise                 If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.

① Submit the Scala word count:

[ebupt@eb174 test]$ spark-submit --master spark://eb174:7077 --name WordCountByscala --class com.hq.WordCount --executor-memory 1G --total-executor-cores 2 ~/test/WordCount.jar hdfs://eb170:8020/user/ebupt/text 

② Submit the Java word count:

[ebupt@eb174 test]$ spark-submit --master spark://eb174:7077 --name JavaWordCountByHQ --class com.hq.JavaWordCount --executor-memory 1G --total-executor-cores 2 ~/test/WordCount.jar hdfs://eb170:8020/user/ebupt/text

③ The two runs produce similar output, so only one is described here: each line printed on the driver is a (word, count) pair emitted by foreach(println). The full console output is not reproduced.
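Because the counts are only printed on the driver, they appear in the spark-submit console and are not persisted anywhere. A common variant, not in the original post and shown here only as a sketch (the output path is hypothetical), writes the result back to HDFS with saveAsTextFile instead of collecting it:

// Illustrative variant: persist the counts to HDFS instead of printing them on the driver.
// The output directory must not already exist; the path below is an assumption.
line.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://eb170:8020/user/ebupt/wordcount_output")

Each resulting part file then contains one (word, count) tuple per line, the same format that foreach(println) prints.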
