Spark wordcount examples

Project GitHub: bitcarmanlee easy-algorithm-interview-and-practice
Stars and comments are welcome; let's learn and improve together.

Once the Spark cluster is up, it naturally needs a test run. The first program in the big-data world is wordcount, much like "hello world" is the first program when picking up a new programming language. Below we implement wordcount in Spark in several different ways.

1. Prepare the data

First prepare a simple file named aaa and put it onto HDFS.
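Assuming aaa already exists in the current local directory, the upload could look something like this:

hadoop fs -put aaa /wanglei/data/aaa

Checking its contents on HDFS: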

[root@namenodetest01 spark-1.6.0-bin-hadoop2.4]# hadoop fs -cat /wanglei/data/aaa
hello world
aaa hello
bbb world
ccc hello

2. Using spark-shell

First, let's implement the wordcount logic interactively in spark-shell.

Start by loading the input:

scala> val file=sc.textFile("/wanglei/data/aaa")
16/07/21 18:10:38 INFO storage.MemoryStore: Block broadcast_11 stored as values in memory (estimated size 179.7 KB, free 1064.3 KB)
16/07/21 18:10:38 INFO storage.MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 17.5 KB, free 1081.8 KB)
16/07/21 18:10:38 INFO storage.BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:23917 (size: 17.5 KB, free: 511.4 MB)
16/07/21 18:10:38 INFO spark.SparkContext: Created broadcast 11 from textFile at <console>:27
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at textFile at <console>:27

Then run the wordcount logic: flatMap splits each line into words, map turns each word into a (word, 1) pair, and reduceByKey sums the counts for each word:

scala> val count=file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
16/07/21 18:11:10 INFO mapred.FileInputFormat: Total input paths to process : 1
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[28] at reduceByKey at <console>:29

Finally, look at the result:

scala> count.collect()

A pile of log messages scrolls by; the result is on the last line:

res3: Array[(String, Int)] = Array((bbb,1), (hello,3), (world,2), (ccc,1), (aaa,1))

3. Submitting a Python script

[root@namenodetest01 spark-1.6.0-bin-hadoop2.4]# ./bin/spark-submit --master local[2] /data/wanglei/soft/spark-1.6.0-bin-hadoop2.4/examples/src/main/python/wordcount.py /wanglei/data/aaa

A quick explanation of the command above:
spark-submit submits the job; --master local[2] runs it in local mode with two worker threads; the long path after that is the wordcount.py script shipped with Spark's examples; and the final argument is the input data path.
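The shipped script is not reproduced here, but a minimal PySpark wordcount along the same lines looks roughly like the sketch below (an illustration, not a verbatim copy of examples/src/main/python/wordcount.py):

from __future__ import print_function

import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonWordCount")

    # Same pipeline as the spark-shell version: split lines into words,
    # pair each word with 1, then sum the counts per word.
    lines = sc.textFile(sys.argv[1])
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    # collect() pulls the results back to the driver, which prints them
    # in the "word: count" format seen in the output below.
    for word, count in counts.collect():
        print("%s: %i" % (word, count))

    sc.stop()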

Once the job runs, it again prints a large amount of log output; here is the part that contains the actual results:

16/07/21 18:17:23 INFO scheduler.DAGScheduler: ResultStage 1 (collect at /data/wanglei/soft/spark-1.6.0-bin-hadoop2.4/examples/src/main/python/wordcount.py:35) finished in 0.089 s
16/07/21 18:17:23 INFO scheduler.DAGScheduler: Job 0 finished: collect at /data/wanglei/soft/spark-1.6.0-bin-hadoop2.4/examples/src/main/python/wordcount.py:35, took 0.719143 s
world: 2
aaa: 1
hello: 3
ccc: 1
bbb: 1
16/07/21 18:17:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}

4. Submitting a Java class

[root@namenodetest01 spark-1.6.0-bin-hadoop2.4]# ./bin/run-example JavaWordCount /wanglei/data/aaa

First, the result portion of the output:

16/07/21 18:24:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/07/21 18:24:18 INFO scheduler.DAGScheduler: Job 0 finished: collect at JavaWordCount.java:68, took 0.879351 s
bbb: 1
hello: 3
ccc: 1
aaa: 1
world: 2
16/07/21 18:24:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}

Let's take a quick look at the run-example script itself:

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

EXAMPLES_DIR="${SPARK_HOME}"/examples

. "${SPARK_HOME}"/bin/load-spark-env.sh

if [ -n "$1" ]; then
  EXAMPLE_CLASS="$1"
  shift
else
  echo "Usage: ./bin/run-example <example-class> [example-args]" 1>&2
  echo "  - set MASTER=XX to use a specific master" 1>&2
  echo "  - can use abbreviated example class name relative to com.apache.spark.examples" 1>&2
  echo "     (e.g. SparkPi, mllib.LinearRegression, streaming.KinesisWordCountASL)" 1>&2
  exit 1
fi

if [ -f "${SPARK_HOME}/RELEASE" ]; then
  JAR_PATH="${SPARK_HOME}/lib"
else
  JAR_PATH="${EXAMPLES_DIR}/target/scala-${SPARK_SCALA_VERSION}"
fi

JAR_COUNT=0

for f in "${JAR_PATH}"/spark-examples-*hadoop*.jar; do
  if [[ ! -e "$f" ]]; then
    echo "Failed to find Spark examples assembly in ${SPARK_HOME}/lib or ${SPARK_HOME}/examples/target" 1>&2
    echo "You need to build Spark before running this program" 1>&2
    exit 1
  fi
  SPARK_EXAMPLES_JAR="$f"
  JAR_COUNT=$((JAR_COUNT+1))
done

if [ "$JAR_COUNT" -gt "1" ]; then
  echo "Found multiple Spark examples assembly jars in ${JAR_PATH}" 1>&2
  ls "${JAR_PATH}"/spark-examples-*hadoop*.jar 1>&2
  echo "Please remove all but one jar." 1>&2
  exit 1
fi

export SPARK_EXAMPLES_JAR

EXAMPLE_MASTER=${MASTER:-"local[*]"}

if [[ ! $EXAMPLE_CLASS == org.apache.spark.examples* ]]; then
  EXAMPLE_CLASS="org.apache.spark.examples.$EXAMPLE_CLASS"
fi

exec "${SPARK_HOME}"/bin/spark-submit \
  --master $EXAMPLE_MASTER \
  --class $EXAMPLE_CLASS \
  "$SPARK_EXAMPLES_JAR" \
  "$@"

It is easy to see that the run-example script also ends up calling spark-submit: it takes the example class name as its first argument, locates the examples assembly jar, prefixes the class name with org.apache.spark.examples when needed, and hands everything off to spark-submit. In other words, the Spark distribution already ships a Java implementation of wordcount for us.
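So the run-example invocation above is roughly equivalent to calling spark-submit directly, along these lines (the exact name of the examples assembly jar depends on the build, so the glob below is only illustrative):

./bin/spark-submit --master "local[*]" --class org.apache.spark.examples.JavaWordCount lib/spark-examples-*hadoop*.jar /wanglei/data/aaa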
