1. Reading a local Linux file with Spark
val textFile = spark.read.textFile("file:///usr/spark-2.3.1-bin-hadoop2.7/README.md")
[root@master spark-2.3.1-bin-hadoop2.7]# ./bin/spark-shell
2019-01-06 21:48:02 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = local[*], app id = local-1546782513279).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val textFile = spark.read.textFile("file:///usr/spark-2.3.1-bin-hadoop2.7/README.md")
2019-01-06 21:49:51 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.first()
res0: String = # Apache Spark
scala> textFile.count()
res1: Long = 103
scala>
2. Reading an HDFS file with Spark
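The file first has to be put into HDFS. In the session below, the first read attempt apparently ran before the upload and failed with "Path does not exist"; after the upload, the same relative path succeeded.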
hadoop fs -put README.md /user/root/
scala> val textFile = spark.read.textFile("README.md")
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://master:9000/user/root/README.md;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:715)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:732)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:702)
... 49 elided
scala> val textFile2 = spark.read.textFile("README.md")
textFile2: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile2.first()
res2: String = # Apache Spark
scala> val textFile3 = spark.read.textFile("/user/root/README.md")
textFile3: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val textFile3 = spark.read.textFile("/user/README.md")
org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://master:9000/user/README.md;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:715)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:732)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:702)
... 49 elided
scala>
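The results above are consistent with a path that has no scheme being resolved against the default file system (hdfs://master:9000 in the error messages), with relative paths placed under the user's HDFS home directory (/user/root). The sketch below only mimics that resolution with java.net.URI; it is not Hadoop's actual implementation, and the names PathResolutionSketch, defaultFs, homeDir, and qualify are hypothetical.

```scala
import java.net.URI

object PathResolutionSketch {
  // Assumed values, taken from the error messages in the session above.
  val defaultFs = "hdfs://master:9000"
  val homeDir = "/user/root/"

  // Mimic how a path string becomes a fully qualified URI:
  //  - a path that carries a scheme (file://, hdfs://) is kept as-is
  //  - an absolute path is placed on the default file system
  //  - a relative path is resolved against the home directory
  def qualify(path: String): String =
    new URI(defaultFs + homeDir).resolve(path).toString

  def main(args: Array[String]): Unit = {
    Seq(
      "README.md",
      "/user/root/README.md",
      "/user/README.md",
      "file:///usr/spark-2.3.1-bin-hadoop2.7/README.md"
    ).foreach(p => println(s"$p -> ${qualify(p)}"))
  }
}
```

Under these assumptions, "README.md" and "/user/root/README.md" name the same HDFS file, while "/user/README.md" points one directory higher, which matches the second "Path does not exist" error.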
3. groupByKey(identity)
identity (the identity function) is equivalent to a => a: a function that returns its argument unchanged.
scala> val wordCounts=textFile.flatMap(line=>line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
scala> val wordCounts=textFile.flatMap(line=>line.split(" ")).groupByKey(a=>a).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
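The equivalence of identity and a => a can also be checked without a cluster. The following is a minimal sketch on plain Scala collections, assuming that groupBy plus a size count is an acceptable stand-in for the Dataset's groupByKey(...).count(); it is an analogy, not how Spark executes the query. The names IdentityGroupingSketch and countWith are hypothetical.

```scala
object IdentityGroupingSketch {
  // Split each line into words, group equal words together,
  // then count the size of each group.
  // `key` decides which group a word falls into; passing `identity`
  // (Predef.identity, i.e. x => x) groups each word under itself.
  def countWith(lines: Seq[String], key: String => String): Map[String, Long] =
    lines
      .flatMap(line => line.split(" "))
      .groupBy(key)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    val lines = Seq("apache spark", "spark shell")
    val viaIdentity = countWith(lines, identity)
    val viaLambda = countWith(lines, a => a)
    println(viaIdentity)
    // Both key functions produce the same grouping.
    println(viaIdentity == viaLambda)
  }
}
```

Both calls yield the same map because the two key functions return the same value for every input word.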