I am writing a Spark/Scala program to read ZIP files, unzip them, and write the contents to a set of new files. I can get this to work when writing to the local file system, but I wondered if there is a way to write the output files to a distributed file system such as HDFS. The code is shown below:
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import java.io._

var i = 1
sc.binaryFiles("file:///d/tmp/zips/").flatMap((file: (String, PortableDataStream)) => {
  val zipStream = new ZipInputStream(file._2.open)
  val entry = zipStream.getNextEntry
  val iter = scala.io.Source.fromInputStream(zipStream).getLines
  val fname = f"/d/tmp/myfile$i.txt"
  i = i + 1
  val xx = iter.mkString
  val writer = new PrintWriter(new File(fname))
  writer.write(xx)
  writer.close()
  iter
}).collect()
Solution
You can easily write data to HDFS using the hadoop-common library (if you are using sbt as your dependency management tool, add that library to your dependencies; a sample sbt line is shown after the snippet below). With it you can create a FileSystem object:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FSDataOutputStream, Path}

// Builds the FileSystem client from whatever Hadoop configuration is on the classpath
private val fs = {
  val conf = new Configuration()
  FileSystem.get(conf)
}
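For reference, if you are pulling the library in through sbt, the dependency line might look like the following; the version shown is an assumption and should match the Hadoop version of your cluster:

// build.sbt — the version number here is an assumption
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.6"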
Be sure to configure the FileSystem with your Hadoop cluster information (core-site.xml, etc.).
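If the cluster configuration files are not on the classpath, one option is to set the default file system programmatically, as sketched below; the NameNode URI is a placeholder you would replace with your own:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Sketch: point the HDFS client at the cluster explicitly.
// "hdfs://namenode-host:8020" is a placeholder, not a real address.
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode-host:8020")
val fs = FileSystem.get(conf)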
Then you can write, for example, a String to a path on HDFS (in your case you should work with streams) as follows:
@throws[IOException]
def writeAsString(hdfsPath: String, content: String): Unit = {
  val path: Path = new Path(hdfsPath)
  // Overwrite any existing file at the target path
  if (fs.exists(path)) {
    fs.delete(path, true)
  }
  val dataOutputStream: FSDataOutputStream = fs.create(path)
  val bw: BufferedWriter = new BufferedWriter(new OutputStreamWriter(dataOutputStream, "UTF-8"))
  bw.write(content)
  bw.close()
}
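To connect this back to your job, a hedged sketch of how each ZIP's contents could end up in its own HDFS file is shown below. The output directory /tmp/output is an assumption, and the FileSystem handle is created inside the closure because it is not serializable and cannot be captured from the driver:

import java.util.zip.ZipInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

sc.binaryFiles("file:///d/tmp/zips/").foreach { case (name, stream) =>
  val fs = FileSystem.get(new Configuration())
  val zipStream = new ZipInputStream(stream.open)
  // Read the first entry's text, mirroring the original code's assumption
  // that each archive holds a single text entry.
  val content = Option(zipStream.getNextEntry)
    .map(_ => scala.io.Source.fromInputStream(zipStream).getLines().mkString("\n"))
    .getOrElse("")
  zipStream.close()
  // One output file per input archive, named after the archive itself
  val out = fs.create(new Path(s"/tmp/output/${new Path(name).getName}.txt"))
  out.write(content.getBytes("UTF-8"))
  out.close()
}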