A Collection of Common Spark Scala Programming Tips
I. Reading and Writing HDFS
1. Find the latest valid directory by timestamp and parse line-delimited JSON
(1) Obtain a FileSystem
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// 1. Create a FileSystem instance for the given path
def getHdfs(path: String): FileSystem = {
  val conf = new Configuration()
  FileSystem.newInstance(URI.create(path), conf)
}
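Note that FileSystem.newInstance returns a fresh, uncached instance on every call, so the caller is responsible for closing it (FileSystem.get, by contrast, returns a shared cached instance that should not be closed casually). A minimal usage sketch; the hdfs:// URI below is a made-up placeholder:

```scala
val fs = getHdfs("hdfs://namenode:8020/data")  // placeholder URI, adjust to your cluster
try {
  // ... work with fs (listStatus, open, etc.) ...
} finally {
  fs.close()  // safe to close: newInstance means no other code shares this object
}
```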
(2) Find the latest directory by timestamp
import org.apache.hadoop.fs.Path

def findCandidate(fileSystem: FileSystem, fsPath: String): Path = {
  val statusArray = fileSystem.listStatus(new Path(fsPath))
  // Sort directories by their numeric (timestamp) names, newest first
  val sortedArray = statusArray.sortWith((s, t) => s.getPath.getName.toLong > t.getPath.getName.toLong)
  sortedArray.head.getPath
}
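Since only the newest entry is needed, sorting the whole array is more work than necessary; maxBy selects the maximum in a single pass. A self-contained sketch of that selection logic, using plain strings in place of Hadoop FileStatus entries (directory names are assumed to be epoch-style timestamps):

```scala
// Pick the newest directory name, treating each name as a numeric timestamp.
// maxBy scans once instead of sorting the whole collection.
def newestName(names: Seq[String]): String =
  names.maxBy(_.toLong)

// The largest timestamp wins regardless of input order
val latest = newestName(Seq("1684800000", "1684886400", "1684713600"))
// latest == "1684886400"
```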
(3) Read all valid data files under the latest directory
spark.read.text(finalPath) reads the matching files into a DataFrame, one row per line.
// Find the latest directory
val validPath = findCandidate(getHdfs(path), path)
println("validFilePath: " + validPath)
val finalPath = validPath.toString.concat("/part-*")
println("finalPath: " + finalPath)
val result = spark.read.text(finalPath)
(4) Parse the JSON stored one object per line and store the results in a new DataFrame
import com.alibaba.fastjson.{JSON, JSONObject}

// Note: collect() brings every row to the driver, so this only suits small files
val list = result.collect()
for (row <- list) {
  val json: JSONObject = JSON.parseObject(row.getString(0))
  // map keeps the ids in their original order (prepending with :: would reverse them)
  val adIds = json.getJSONArray("ad_id").toArray.map(_.asInstanceOf[Number].longValue)
  val features = json.getJSONArray("feature").toArray
  val imgId = json.getString("img_id")
  val imgUrl = json.getString("img_url")
  val width = json.getIntValue("width")
  val height = json.getIntValue("height")
  val date = json.getString("date")
  val isImg = json.getString("type")
  val extention = json.getString("extention")
  val path = json.getString("path")
  val source = json.getString("source")
  dataList.add(Row(adIds, features, imgId, imgUrl, width, height, date, isImg, extention, path, source))
}
Here, dataList must be created in advance, together with the schema describing each Row, as shown below:
import java.util
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("s_ad_id", ArrayType(LongType, true), true),
  StructField("feature", ArrayType(StringType, true), true),
  StructField("img_id", StringType, true),
  StructField("img_url", StringType, true),
  StructField("width", IntegerType, true),
  StructField("height", IntegerType, true),
  StructField("date", StringType, true),
  StructField("format_type", StringType, true),
  StructField("extention", StringType, true),
  StructField("path", StringType, true),
  StructField("source", StringType, true)
))
val dataList = new util.ArrayList[Row]()
(5) Create a new DataFrame from dataList
val df2 = spark.createDataFrame(dataList, schema)
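Since each line is a complete JSON object, an alternative (assuming Spark 2.x or later) is to let spark.read.json parse the lines directly. This keeps the parsing distributed and avoids the collect() round-trip to the driver; the schema is then inferred from the data rather than declared by hand:

```scala
// Alternative sketch: distributed JSON parsing with an inferred schema.
// finalPath is the same "/part-*" glob built in step (3).
val parsed = spark.read.json(finalPath)
parsed.printSchema()
```

One caveat: inferred types may differ from the hand-written schema (for example, JSON integers are inferred as LongType, not IntegerType), so cast columns explicitly if exact types matter downstream.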
PS: To be continued.