A Collection of Common Spark/Scala Programming Tips

I. Reading and Writing HDFS

1. Find the latest valid directory by timestamp and parse line-delimited JSON

(1) Obtain a FileSystem

//1. Create a FileSystem instance
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

  def getHdfs(path: String): FileSystem = {
    val conf = new Configuration()
    FileSystem.newInstance(URI.create(path), conf)
  }
(2) Get the latest directory by timestamp

def findCandidate(fileSystem: FileSystem, fsPath: String): Path = {
    val statusArray = fileSystem.listStatus(new Path(fsPath))
    //Sort directories by their timestamp names, newest first
    val sortedArray = statusArray.sortWith((s, t) => s.getPath.getName.toLong > t.getPath.getName.toLong)
    sortedArray.head.getPath
  }
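The core of findCandidate is just a descending numeric sort on directory names. A minimal sketch of that logic, detached from HDFS and assuming each directory name is an epoch-seconds timestamp string (the sample values here are made up):

```scala
// Sketch of the timestamp ordering used by findCandidate, assuming
// directory names are pure numeric timestamps (sample values invented).
val dirNames = Seq("1700000000", "1700003600", "1699996400")

// Descending numeric sort: the newest timestamp sorts first
val sorted = dirNames.sortWith((a, b) => a.toLong > b.toLong)
val latest = sorted.head

println(latest) // 1700003600
```

Note that `toLong` throws NumberFormatException if a directory name is not numeric, so in practice it may be worth filtering out non-numeric names before sorting.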
(3) Read all valid data files under the latest directory

spark.read.text(finalPath) reads the files into a DataFrame with a single string column named "value".

//Get the latest directory
    val validPath = findCandidate(getHdfs(path), path)
    println("validFilePath: " + validPath)
    val finalPath = validPath.toString.concat("/part-*")
    println("finalPath: " + finalPath)
    val result = spark.read.text(finalPath)
(4) Parse the JSON stored one object per line, and collect the parsed rows into a new DataFrame

//Parsing below uses fastjson (com.alibaba.fastjson.{JSON, JSONObject})
//Caution: collect() pulls the entire DataFrame to the driver, so this only works for data that fits in driver memory
val list = result.collect()
    for (row <- list) {
      val json: JSONObject = JSON.parseObject(row.getString(0))
      val adIds = json.getJSONArray("ad_id").toArray
      var adIdsList: scala.List[Long] = List()
      for (id <- adIds) {
        // :: prepends, so the list is built in reverse order; restored with .reverse below
        adIdsList = adIdsList.::(id.asInstanceOf[Number].longValue)
      }
      val features = json.getJSONArray("feature").toArray
      val imgId = json.getString("img_id")
      val imgUrl = json.getString("img_url")
      val width = json.getIntValue("width")
      val height = json.getIntValue("height")
      val date = json.getString("date")
      val isImg = json.getString("type")
      val extention = json.getString("extention")
      val path = json.getString("path")
      val source = json.getString("source")
      dataList.add(Row(adIdsList.reverse.toArray, features, imgId, imgUrl, width, height, date, isImg, extention, path, source))
    }
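One subtlety in the loop above: `List.::` prepends, so the ad IDs come out in reverse input order unless the list is reversed afterward. A self-contained sketch (the sample values are illustrative):

```scala
// Building a List[Long] from untyped JSON-array elements, as in the loop above.
// Elements arrive boxed (AnyRef), hence the cast through Number.
val adIds: Array[AnyRef] =
  Array(java.lang.Long.valueOf(1L), java.lang.Long.valueOf(2L), java.lang.Long.valueOf(3L))

var adIdsList: List[Long] = List()
for (id <- adIds) {
  // `::` prepends, so each new element goes to the front of the list
  adIdsList = adIdsList.::(id.asInstanceOf[Number].longValue)
}

println(adIdsList)         // List(3, 2, 1) -- reverse of input order
println(adIdsList.reverse) // List(1, 2, 3) -- original order restored
```

If element order matters downstream, either call .reverse at the end (as here) or append with `:+` instead; note that `:+` on an immutable List is O(n) per append.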

Here, dataList must be created in advance, along with the schema of its Rows, as shown below:

import java.util
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
      StructField("s_ad_id", ArrayType(LongType, true), true),
      StructField("feature", ArrayType(StringType, true), true),
      StructField("img_id", StringType, true),
      StructField("img_url", StringType, true),
      StructField("width", IntegerType, true),
      StructField("height", IntegerType, true),
      StructField("date", StringType, true),
      StructField("format_type", StringType, true),
      StructField("extention", StringType, true),
      StructField("path", StringType, true),
      StructField("source", StringType, true)
    ))
    val dataList = new util.ArrayList[Row]()
(5) Create a new DataFrame from dataList

val df2 = spark.createDataFrame(dataList, schema)

PS: To be continued
