A Collection of Common Spark Scala Programming Tips
I. Reading and Writing HDFS
1. Find the latest valid directory by timestamp and parse line-delimited JSON
(1) Obtain a FileSystem
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// 1. Create a FileSystem instance for the given path
def getHdfs(path: String): FileSystem = {
  val conf = new Configuration()
  FileSystem.newInstance(URI.create(path), conf)
}
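Note that FileSystem.newInstance returns a fresh, uncached instance on every call, so the caller is responsible for closing it (FileSystem.get, by contrast, returns a shared cached instance that should not be closed casually). A minimal usage sketch; the hdfs:// URI below is a made-up placeholder:

```scala
val fs = getHdfs("hdfs://namenode:8020/data")  // placeholder URI, adjust to your cluster
try {
  // ... work with fs (listStatus, open, etc.) ...
} finally {
  fs.close()  // safe to close: newInstance means no other code shares this object
}
```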
(2) Find the latest directory by timestamp
import org.apache.hadoop.fs.Path

def findCandidate(fileSystem: FileSystem, fsPath: String): Path = {
  val statusArray = fileSystem.listStatus(new Path(fsPath))
  // Sort directories by their numeric (timestamp) names, newest first
  val sortedArray = statusArray.sortWith((s, t) => s.getPath.getName.toLong > t.getPath.getName.toLong)
  sortedArray.head.getPath
}
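Since only the newest entry is needed, sorting the whole array is more work than necessary; maxBy selects the maximum in a single pass. A self-contained sketch of that selection logic, using plain strings in place of Hadoop FileStatus entries (directory names are assumed to be epoch-style timestamps):

```scala
// Pick the newest directory name, treating each name as a numeric timestamp.
// maxBy scans once instead of sorting the whole collection.
def newestName(names: Seq[String]): String =
  names.maxBy(_.toLong)

// The largest timestamp wins regardless of input order
val latest = newestName(Seq("1684800000", "1684886400", "1684713600"))
// latest == "1684886400"
```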
(3) Read all valid data files under the latest directory
spark.read.text(finalPath) reads the matching files into a DataFrame, one row per line.
// Find the latest directory
val validPath = findCandidate(getHdfs(path), path)
println("validFilePath: " + validPath)
val finalPath = validPath.toString.concat("/part-*")
println("finalPath: " + finalPath)
val result = spark.read.text(finalPath)
(4) Parse the JSON stored one object per line and store the results in a new DataFrame
import com.alibaba.fastjson.{JSON, JSONObject}

// Note: collect() brings every row to the driver, so this only suits small files
val list = result.collect()
for (row <- list) {
  val json: JSONObject = JSON.parseObject(row.getString(0))
  // map keeps the ids in their original order (prepending with :: would reverse them)
  val adIds = json.getJSONArray("ad_id").toArray.map(_.asInstanceOf[Number].longValue)
  val features = json.getJSONArray("feature").toArray
  val imgId = json.getString("img_id")
  val imgUrl = json.getString("img_url")
  val width = json.getIntValue("width")
  val height = json.getIntValue("height")
  val date = json.getString("date")
  val isImg = json.getString("type")
  val extention = json.getString("extention")
  val path = json.getString("path")
  val source = json.getString("source")
  dataList.add(Row(adIds, features, imgId, imgUrl, width, height, date, isImg, extention, path, source))
}
Here, dataList must be created in advance, together with the schema describing each Row, as shown below:
import java.util
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("s_ad_id", ArrayType(LongType, true), true),
  StructField("feature", ArrayType(StringType, true), true),
  StructField("img_id", StringType, true),
  StructField("img_url", StringType, true),
  StructField("width", IntegerType, true),
  StructField("height", IntegerType, true),
  StructField("date", StringType, true),
  StructField("format_type", StringType, true),
  StructField("extention", StringType, true),
  StructField("path", StringType, true),
  StructField("source", StringType, true)
))
val dataList = new util.ArrayList[Row]()
(5) Create a new DataFrame from dataList
val df2 = spark.createDataFrame(dataList, schema)
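Since each line is a complete JSON object, an alternative (assuming Spark 2.x or later) is to let spark.read.json parse the lines directly. This keeps the parsing distributed and avoids the collect() round-trip to the driver; the schema is then inferred from the data rather than declared by hand:

```scala
// Alternative sketch: distributed JSON parsing with an inferred schema.
// finalPath is the same "/part-*" glob built in step (3).
val parsed = spark.read.json(finalPath)
parsed.printSchema()
```

One caveat: inferred types may differ from the hand-written schema (for example, JSON integers are inferred as LongType, not IntegerType), so cast columns explicitly if exact types matter downstream.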
PS: To be continued.