Spark version: 1.6
Scala version: 2.10
JDK version: 7
Straight to the code:
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

import org.apache.spark.input.PortableDataStream

val dataAndPortableRDD = sc.binaryFiles("zipData path")

val dataRDD = dataAndPortableRDD.flatMap { case (name: String, content: PortableDataStream) =>
  val zis = new ZipInputStream(content.open)
  Stream.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .flatMap { _ =>
      val br = new BufferedReader(new InputStreamReader(zis))
      Stream.continually(br.readLine()).takeWhile(_ != null)
    }
}

dataRDD.take(10).foreach(println)
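The core of the snippet above is the `ZipInputStream` + `Stream.continually` reading pattern, which has nothing Spark-specific about it. Here is a minimal local sketch of the same pattern that can be run without a cluster; the object name `ZipReadDemo`, the helper `makeSampleZip`, and the sample entries are all made up for illustration:

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipReadDemo {
  // Build a small two-entry zip in memory, so the example needs no files on disk.
  def makeSampleZip(): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(baos)
    zos.putNextEntry(new ZipEntry("a.txt"))
    zos.write("hello\nworld\n".getBytes("UTF-8"))
    zos.closeEntry()
    zos.putNextEntry(new ZipEntry("b.txt"))
    zos.write("spark\n".getBytes("UTF-8"))
    zos.closeEntry()
    zos.close()
    baos.toByteArray
  }

  // Same entry/line reading pattern as the Spark snippet above,
  // but against a plain byte array instead of a PortableDataStream.
  def readZipLines(bytes: Array[Byte]): List[String] = {
    val zis = new ZipInputStream(new ByteArrayInputStream(bytes))
    try {
      Stream.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap { _ =>
          // ZipInputStream returns -1 at each entry boundary, so readLine
          // yields null at the end of every entry, then getNextEntry advances.
          val br = new BufferedReader(new InputStreamReader(zis))
          Stream.continually(br.readLine()).takeWhile(_ != null)
        }
        .toList // force evaluation while zis is still open
    } finally {
      zis.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(readZipLines(makeSampleZip()).mkString("|")) // prints hello|world|spark
  }
}
```

Note that `readLine` returning `null` at the end of each entry (rather than reading across entries) is what makes wrapping the shared `zis` in a fresh `BufferedReader` per entry safe.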
Reference: https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark/
One more note:
Reading multiple zip files under the same directory fails with:

java.io.EOFException: Unexpected end of ZLIB input stream

If any expert passing by knows the cause, please enlighten me~
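I can't reproduce that exact EOFException locally, but one commonly suspected culprit with this pattern is the laziness of `Stream`: nothing in the `flatMap` forces the zip to be fully read while the underlying stream is guaranteed to be open, so decompression can happen after the source has gone away. The sketch below shows that failure mode locally (it surfaces here as an IOException on a closed stream, not the exact ZLIB error) and the `.toList` workaround that forces eager reads; the object name `LazyStreamPitfall` and the sample data are made up for illustration:

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object LazyStreamPitfall {
  // In-memory single-entry zip, just for the demonstration.
  def makeZip(): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(baos)
    zos.putNextEntry(new ZipEntry("a.txt"))
    zos.write("line1\nline2\nline3\n".getBytes("UTF-8"))
    zos.closeEntry()
    zos.close()
    baos.toByteArray
  }

  // Build the same lazy Stream of lines as in the Spark snippet,
  // but deliberately do NOT consume it yet.
  def lazyLines(zis: ZipInputStream): Stream[String] =
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }

  def main(args: Array[String]): Unit = {
    // Consuming the stream only after the source is closed blows up,
    // because most of the bytes are pulled lazily, on demand.
    val zis1 = new ZipInputStream(new ByteArrayInputStream(makeZip()))
    val deferred = lazyLines(zis1)
    zis1.close()
    val failed =
      try { deferred.toList; false }
      catch { case _: java.io.IOException => true }
    println(s"lazy read after close failed: $failed")

    // Forcing evaluation with .toList before closing reads everything safely.
    val zis2 = new ZipInputStream(new ByteArrayInputStream(makeZip()))
    val eager = lazyLines(zis2).toList
    zis2.close()
    println(s"eager read got ${eager.size} lines")
  }
}
```

If this is indeed the cause, appending `.toList` to the inner `flatMap` result in the Spark snippet (so each file is fully read while its `PortableDataStream` is open) would be the corresponding fix; I haven't verified this on a cluster, so treat it as a hypothesis.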