I. Environment preparation (hadoop-2.8.0 / spark-2.1.0 / scala-2.12.1)
Hadoop installation / Scala installation
II. Installation and configuration
1. Check the settings in /etc/profile
export JAVA_HOME=/opt/jdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export SCALA_HOME=/home/sulei/文档/scala-2.12.1
export PATH=${JAVA_HOME}/bin:$PATH
export PATH="$SCALA_HOME/bin:$PATH"
2. Edit conf/spark-env.sh
export JAVA_HOME=/opt/jdk
export SCALA_HOME=/home/sulei/文档/scala-2.12.1
export SPARK_WORKER_MEMORY=1G
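SPARK_WORKER_MEMORY only takes effect for standalone worker daemons. To have a local standalone master and worker running (so there is something to see in the web UI of the next step), the stock scripts can start and stop them — a sketch, assuming passwordless ssh to localhost is configured:
sulei@sulei:/opt/spark-2.1.0$ sbin/start-all.sh
sulei@sulei:/opt/spark-2.1.0$ sbin/stop-all.sh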
3. Check the web UI (the standalone master UI listens on port 8080 by default; a running application's UI is on port 4040)
4. Launch a shell from the Spark directory, e.g. bin/pyspark for Python; the tests in the next section use the Scala shell, bin/spark-shell
III. Testing with a simple program
**Supplement: check the mounted filesystems and free space**
sulei@sulei:/opt/spark-2.1.0$ df -lh
Filesystem      Size  Used Avail Use% Mounted on
udev 3.4G 0 3.4G 0% /dev
tmpfs 694M 9.4M 685M 2% /run
/dev/sda11 40G 16G 22G 42% /
tmpfs 3.4G 588K 3.4G 1% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 3.4G 0 3.4G 0% /sys/fs/cgroup
/dev/sda2 256M 33M 224M 13% /boot/efi
tmpfs 694M 76K 694M 1% /run/user/1000
/dev/sda9 310G 272G 38G 88% /media/sulei/32B03CC6B03C9279
scala> val textFile=sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
#Nothing shows up in the web UI yet; the reason is lazy evaluation: textFile() only defines the RDD, and no job runs until an action is called
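Because of this laziness, even a wrong path does not fail at textFile(); the error only surfaces when the first action tries to read the file. A small illustration with a made-up filename, run in the same spark-shell:
scala> val bad = sc.textFile("no-such-file.md")  // returns an RDD immediately, nothing is read yet
scala> bad.count()                               // only now Spark touches the path and fails with an "Input path does not exist" error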
#The path above turned out to be wrong (an action on it cannot find README.md from the shell's working directory); the corrected path is used below
scala> val textFile=sc.textFile("../README.md")
textFile: org.apache.spark.rdd.RDD[String] = ../README.md MapPartitionsRDD[7] at textFile at <console>:24
scala> textFile.count()
res4: Long = 104
The result as shown in the web UI (screenshot)
scala> textFile.first()
res5: String = # Apache Spark
scala> textFile.take(10)
res6: Array[String] = Array(# Apache Spark, "", Spark is a fast and general cluster computing system for Big Data. It provides, high-level APIs in Scala, Java, Python, and R, and an optimized engine that, supports general computation graphs for data analysis. It also supports a, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, MLlib for machine learning, GraphX for graph processing,, and Spark Streaming for stream processing., "", <http://spark.apache.org/>)
scala> textFile.filter(line => line.contains("Spark")).count()
res7: Long = 20
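Transformations and actions can also be chained. As one more exercise on the RDD already defined above, this sketch (the map/reduce example from the Spark quick start) computes the number of words in the longest line:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)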
IV. The WordCount program
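A minimal word-count sketch for this section, runnable in the same spark-shell session: it reuses the textFile RDD (i.e. ../README.md) from above, and the take(10) call is just there to trigger the job and show a few (word, count) pairs:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCounts.take(10)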