1. Preparation
- Tool preparation: I write the code directly in spark-shell. At the command line, enter

spark-shell

to start it. This assumes the environment variables are configured; if not, run bin/spark-shell from your Spark directory.
- File preparation: I use the

README.md

file that ships with Spark. If your Spark is a single-node setup, just start spark-shell from the Spark directory and reference the file by a relative path. I run in spark-on-yarn mode, so my default file system is HDFS, and I have already put the file to the corresponding HDFS path. Keep this difference in mind.
Once inside spark-shell, note these two lines:
Spark context available as 'sc'.
# The SparkContext can be accessed as 'sc'
Spark session available as 'spark'.
# The SparkSession can be accessed as 'spark'
2. Writing the code
First approach: read the file with the SparkContext, using the textFile method.
scala> val text = sc.textFile("README.md")
text: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> text.collect
res2: Array[String] = Array(# Apache Spark, "", Spark is a fast and general cluster computing system for Big Data. It provides, high-level APIs in Scala, Java, Python, and R, and an optimized engine that, supports general computation graphs for data analysis. It also supports a, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, MLlib for machine learning, GraphX for graph processing,, and Spark Streaming for stream processing., "", <http://spark.apache.org/>, "", "", ## Online Documentation, "", You can find the latest Spark documentation, including a programming, guide, on the [project web page](http://spark.apache.org/documentation.html)., This README file only contains basic setup instructions., "", ## Building Spark, "", Spark is built using [Apache Maven](...
// Read the file, producing an RDD[String]
scala> val flatMaped = text.flatMap(line => line.split(" "))
flatMaped: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:25
scala> flatMaped.collect
res3: Array[String] = Array(#, Apache, Spark, "", Spark, is, a, fast, and, general, cluster, computing, system, for, Big, Data., It, provides, high-level, APIs, in, Scala,, Java,, Python,, and, R,, and, an, optimized, engine, that, supports, general, computation, graphs, for, data, analysis., It, also, supports, a, rich, set, of, higher-level, tools, including, Spark, SQL, for, SQL, and, DataFrames,, MLlib, for, machine, learning,, GraphX, for, graph, processing,, and, Spark, Streaming, for, stream, processing., "", <http://spark.apache.org/>, "", "", ##, Online, Documentation, "", You, can, find, the, latest, Spark, documentation,, including, a, programming, guide,, on
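The transcript stops after flatMap; the usual next steps are to map each word to a (word, 1) pair and reduce by key. A hedged sketch of the rest (the name `counts` and the local mirror below are mine, not from the transcript):

```scala
// In spark-shell the remaining steps would look like:
//   val counts = flatMaped.map(word => (word, 1)).reduceByKey(_ + _)
//   counts.collect
// Below, the same pipeline on a plain Scala collection, so it runs without
// a cluster; groupMapReduce (Scala 2.13+) plays the role of reduceByKey.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                    // like text.flatMap(line => line.split(" "))
      .filter(_.nonEmpty)                       // drop the empty "" tokens seen above
      .groupMapReduce(identity)(_ => 1)(_ + _)  // like map(w => (w, 1)).reduceByKey(_ + _)
}
```

The shape is the same as the RDD version: flatMap splits lines into words, and the per-key reduction sums the ones, yielding one count per distinct word.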