Spark reads local files by prefixing the path with `file://`, for example `sc.textFile("/opt/software/spark1.4/README.md")`. I ran into a problem: my Spark installation lives under /opt/software, and reading files under /opt/software works fine. But when I tried to work through the examples in the book *Machine Learning with Spark*, I downloaded the MovieLens dataset and put it under /home/hadoop/downloads/ml-100k/ml-100k. Calling `val rswData=sc.textFile("file:///home/hadoop/downloads/ml-100k/ml-100k/u.data")` then failed. The error is as follows:
scala> val rswData=sc.textFile("file:///home/hadoop/downloads/ml-100k/ml-100k/u.data")
rswData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at :21
scala> rswData.first()
16/10/27 02:54:22 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4, 192.168.1.112): java.io.FileNotFoundException: File file:/home/hadoop/downloads/ml-100k/ml-100k/u.data does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
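One detail worth noting in the trace above is that the failing task ran on a worker node (192.168.1.112), not on the driver. A `file://` path is resolved against the local disk of whichever machine executes the task, so a file that exists only on the driver can still trigger `FileNotFoundException` on executors. The sketch below is a small driver-side sanity check (the helper `existsLocally` is a hypothetical name, not a Spark API): it only proves the file exists on the machine running the shell, which is exactly why it can pass while the cluster read still fails.

```scala
import java.nio.file.{Files, Paths}

object LocalPathCheck {
  // Strip the file:// scheme and test existence on *this* machine's disk.
  // In a cluster, running this on the driver says nothing about the workers:
  // each executor resolves the file:// path against its own filesystem.
  def existsLocally(uri: String): Boolean =
    Files.exists(Paths.get(uri.stripPrefix("file://")))

  def main(args: Array[String]): Unit = {
    // Create a temp file so the positive case is guaranteed on this host.
    val tmp = Files.createTempFile("u", ".data")
    println(existsLocally("file://" + tmp.toString)) // true: just created here
    println(existsLocally("file:///no/such/path/u.data")) // false
    Files.deleteIfExists(tmp)
  }
}
```

If the check passes on the driver but `rswData.first()` still fails, the file most likely needs to be present at the same path on every worker node as well (or be placed on a shared filesystem such as HDFS instead).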