(1) First, prepare a document named test.txt with the following content:
Apple Apple Orange
Banana Grape Grape
(2) Upload the document
Upload the document to the Linux system using SecureCRT. Once the upload finishes, check that the file is present:
zhang@Desktop1:~$ ls | grep 'test.txt'
test.txt
(3) View the contents
zhang@Desktop1:~$ cat test.txt
Apple Apple Orange
Banana Grape Grape
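This confirms that the document was uploaded successfully.
(4) Run start-dfs.sh to start Hadoop (the HDFS daemons)
(5) Run spark-shell to enter the Spark interactive shell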
zhang@Desktop1:~$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/10 20:57:25 WARN util.Utils: Your hostname, Desktop1 resolves to a loopback address: 127.0.1.1; using 192.168.8.3 instead (on interface ens33)
17/05/10 20:57:25 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://192.168.8.3:4040
Spark context available as 'sc' (master = local[*], app id = local-1494421049820).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
(6) Read the test.txt text file
// To read a file from the distributed file system instead, write sc.textFile("hdfs://......")
scala> val textfile=sc.textFile("file:/home/zhang/test.txt")
17/05/10 20:59:52 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
textfile: org.apache.spark.rdd.RDD[String] = file:/home/zhang/test.txt MapPartitionsRDD[1] at textFile at <console>:24
scala>
(7) Use flatMap to split each line on spaces, so that every word becomes its own element
scala> val stringRDD=textfile.flatMap(t=>t.split(" "))
stringRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala>
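To see why flatMap is used here rather than map, the following is a minimal sketch that uses plain Scala collections as a stand-in for the RDD (an assumption made for illustration; the object name FlatMapDemo is hypothetical). It needs no Spark installation, and the same lambda shape applies to textfile.flatMap above.

```scala
object FlatMapDemo {
  // The two lines of test.txt, held in an ordinary List instead of an RDD
  // (assumption for illustration only).
  val lines: List[String] = List("Apple Apple Orange", "Banana Grape Grape")

  // map keeps one element per input line: each element is an Array of words.
  val nested: List[Array[String]] = lines.map(line => line.split(" "))

  // flatMap flattens those arrays into a single list with one element per word.
  val words: List[String] = lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    println(nested.length) // 2
    println(words)         // List(Apple, Apple, Orange, Banana, Grape, Grape)
  }
}
```

With map the result still has two elements (one array per line), while flatMap yields the six individual words that the counting step needs.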
(8) Use map and reduceByKey to count how many times each word occurs
scala> val countsRDD=stringRDD.map(word=>(word,1)).reduceByKey(_ + _)
countsRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:28
scala>
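The map and reduceByKey steps can be mimicked with plain Scala collections to make the logic visible. This is only a sketch under the assumption that an in-memory List stands in for the RDD; groupBy followed by a reduce is used here as a local analogue of reduceByKey, not as Spark's actual implementation, and the object name ReduceByKeyDemo is hypothetical.

```scala
object ReduceByKeyDemo {
  // The words produced by the flatMap step, as an ordinary List
  // (assumption for illustration only).
  val words: List[String] =
    List("Apple", "Apple", "Orange", "Banana", "Grape", "Grape")

  // Step 1 (map): pair every word with the count 1.
  val pairs: List[(String, Int)] = words.map(word => (word, 1))

  // Step 2 (local analogue of reduceByKey): group the pairs by word,
  // then combine the 1s of each group with the same _ + _ function.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) =>
      (word, group.map(_._2).reduce(_ + _))
    }

  def main(args: Array[String]): Unit =
    // Sort by word only so that the printed order is deterministic.
    counts.toList.sortBy(_._1).foreach(println)
}
```

The function passed to reduceByKey only ever sees two counts for the same key at a time, which is why a simple addition is enough to produce the totals.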
(9) Save the result
scala> countsRDD.saveAsTextFile("file:/home/zhang/output")
scala>
(10) Exit spark-shell
scala> :q
(11) View the output
zhang@Desktop1:~$ ls
derby.log log4j-slf4j-impl-2.4.1.jar test.txt 公共的 图片 音乐
examples.desktop mysql-connector-java-5.1.41-bin.jar VMwareTools-9.6.2-1688356.tar.gz 模板 文档 桌面
filemacsn.txt output vmware-tools-distrib 视频 下载
An output folder now exists in the user's home directory. cd into it and list the files it contains.
zhang@Desktop1:~/output$ ls
part-00000 _SUCCESS
The file part-00000 holds the result. View its contents:
zhang@Desktop1:~/output$ cat part-00000
(Grape,2)
(Orange,1)
(Apple,2)
(Banana,1)
As shown, the output is fully consistent with the contents of test.txt, namely:
Grape appears 2 times
Orange appears 1 time
Apple appears 2 times
Banana appears 1 time
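Each line of part-00000 is simply the text form of a (word, count) pair, so the saved lines can be parsed back into pairs if they need further processing. The following is a minimal sketch in plain Scala; the object name ParseOutputDemo and the helper parse are hypothetical, and the input lines are copied from the listing above rather than read from disk.

```scala
object ParseOutputDemo {
  // Lines copied from part-00000 above (hard-coded for illustration).
  val saved: List[String] =
    List("(Grape,2)", "(Orange,1)", "(Apple,2)", "(Banana,1)")

  // Strip the surrounding parentheses and split on the last comma,
  // e.g. "(Grape,2)" becomes ("Grape", 2).
  def parse(line: String): (String, Int) = {
    val inner = line.stripPrefix("(").stripSuffix(")")
    val idx = inner.lastIndexOf(',')
    (inner.substring(0, idx), inner.substring(idx + 1).toInt)
  }

  // Order by descending count, breaking ties alphabetically by word.
  val sorted: List[(String, Int)] =
    saved.map(parse).sortBy { case (word, n) => (-n, word) }

  def main(args: Array[String]): Unit = sorted.foreach(println)
}
```

Splitting on the last comma rather than the first keeps the parse correct for a word that itself contains a comma.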