Spark RDD Explained (What an RDD Is and Several Ways to Create One)

macaoyuan0527

Category: Big Data. Tags: mysql, hadoop, spark, big data, database

Article link: https://blog.csdn.net/qq_39211575/article/details/103980029

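What is an RDD?

An RDD (Resilient Distributed Dataset) is an immutable, partitioned collection of elements that can be processed in parallel (similar to an immutable collection in Scala). An RDD can be obtained from HDFS, from a Scala collection, by transforming another RDD, or from any external dataset that supports a Hadoop InputFormat. We can also ask Spark to persist an RDD in memory so that it can be reused very efficiently, and Spark can automatically recover the data when a compute node fails.
The RDD is the most basic data abstraction in Spark.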
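
As a minimal sketch of these properties (the object name `RddBasicsSketch`, the app name, and the sample data are illustrative assumptions, not part of the original article), the following shows that a transformation returns a new RDD and leaves its parent untouched, and that a persisted RDD can be reused by more than one action:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasicsSketch {

  // Returns (sum of the original RDD, sum of the derived RDD, size of the derived RDD)
  def run(sc: SparkContext): (Int, Int, Long) = {
    val numbers = sc.parallelize(1 to 10, 2)

    // map is a transformation: it describes a new RDD and does not modify `numbers`
    val doubled = numbers.map(_ * 2)

    // Ask Spark to keep the computed partitions in memory once an action has run
    doubled.persist(StorageLevel.MEMORY_ONLY)

    val doubledSum = doubled.reduce(_ + _) // first action: computes the partitions
    val doubledCount = doubled.count()     // second action: can reuse the persisted partitions
    val originalSum = numbers.reduce(_ + _) // the parent RDD still holds 1 to 10

    (originalSum, doubledSum, doubledCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-basics-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Persisting only pays off when the same RDD feeds several actions; for a single pass it is usually unnecessary.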

Ways to create an RDD:

Creating from a Scala collection

Method 1

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Build an RDD from a Scala collection
  */
object RDDCreatedByCollection {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("datasource").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val range: Range = 1 to 100
    // parallelize distributes the local collection across partitions
    val rdd = sc.parallelize(range)

    // Sum the elements with Spark
    val sum = rdd.reduce((v1: Int, v2: Int) => v1 + v2)

    println(s"sum=$sum  | number of partitions=${rdd.getNumPartitions}")

    sc.stop()
  }
}

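Method 2

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Method 2: makeRDD, here with an explicit partition count of 2
  */
def m2(): Unit = {
    val conf = new SparkConf().setAppName("datasource").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val range: Range = 1 to 100
    // For a plain Seq, makeRDD(seq, numSlices) delegates to parallelize(seq, numSlices)
    val rdd = sc.makeRDD(range, 2)

    // Sum the elements with Spark
    val sum = rdd.reduce((v1: Int, v2: Int) => v1 + v2)

    println(s"sum=$sum  | number of partitions=${rdd.getNumPartitions}")

    sc.stop()
}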
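
To see how the elements of a local collection end up in partitions, one option is `glom()`, which turns each partition into an array. The sketch below is only an illustration; the object name `PartitionLayoutSketch` and the sample range `1 to 10` are assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionLayoutSketch {

  // Returns one array per partition, in partition order
  def layout(sc: SparkContext, numSlices: Int): Array[Array[Int]] =
    sc.parallelize(1 to 10, numSlices).glom().collect()

  // makeRDD with an explicit slice count should report that same count
  def makeRddPartitions(sc: SparkContext, numSlices: Int): Int =
    sc.makeRDD(1 to 10, numSlices).getNumPartitions

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-layout-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      layout(sc, 3).zipWithIndex.foreach { case (elements, index) =>
        println(s"partition $index: ${elements.mkString(", ")}")
      }
      println(s"makeRDD partitions: ${makeRddPartitions(sc, 3)}")
    } finally sc.stop()
  }
}
```

When no slice count is passed, as in Method 1, Spark falls back to its default parallelism, which under `local[*]` depends on the number of cores of the machine.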

Creating from an external dataset

File systems

Local (local file system)

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Build an RDD from local files
  */
object RDDCreatedByLocalFS {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("datasource").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Method 1: read a directory or file; each line becomes one element of the RDD
    // val rdd = sc.textFile("file:///D://data", 5)
    // rdd.foreach((line: String) => println(line))

    // Method 2: read a directory or file; the file path is the key and the file content is the value
    // Tuple2(path, content)
    val rdd: RDD[(String, String)] = sc.wholeTextFiles("file:///D://data", 5)
    rdd
      .map(t2 => t2._2) // Tuple2 ---> content: String
      .flatMap((content: String) => content.split("\\n")) // content ---> lines
      .flatMap((line: String) => line.split("\\s"))       // line ---> words
      .map((_, 1L))
      .groupByKey()
      .map(t2 => (t2._1, t2._2.size)) // (word, number of occurrences)
      .foreach(println)

    sc.stop()
  }
}
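
The word count above groups every `(word, 1L)` pair by key and then takes the size of each group. A commonly used alternative is `reduceByKey`, which combines values within each partition before the shuffle. The following is a sketch under assumptions: the object name `WordCountSketch` is made up, in-memory lines stand in for the file content, and, unlike the original, it splits on `\\s+` and drops empty tokens.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {

  // Counts words in the given lines and returns the result to the driver as a Map
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Long] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+")) // line ---> words
      .filter(_.nonEmpty)       // drop empty tokens produced by leading whitespace
      .map((_, 1L))
      .reduceByKey(_ + _)       // sum the counts per word
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      countWords(sc, Seq("hello spark", "hello rdd")).foreach(println)
    } finally sc.stop()
  }
}
```

Collecting to the driver is only reasonable here because the sample is tiny; for real data the result would normally be written out with an action such as `saveAsTextFile`.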

HDFS

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Build an RDD from files on HDFS
  */
object RDDCreatedByHDFS {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("datasource").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Method 1: read a directory or file; each line becomes one element of the RDD
    /*
    val rdd = sc.textFile("hdfs://SparkOnStandalone:9000/data.txt", 5)
    rdd.foreach((line: String) => println(line))
     */

    // Method 2: read a directory or file; the file path is the key and the file content is the value
    // Tuple2(path, content)
    val rdd: RDD[(String, String)] = sc.wholeTextFiles("hdfs://SparkOnStandalone:9000/data.txt", 5)
    rdd
      .map(t2 => t2._2) // Tuple2 ---> content: String
      .flatMap((content: String) => content.split("\\n")) // content ---> lines
      .flatMap((line: String) => line.split("\\s"))       // line ---> words
      .map((_, 1L))
      .groupByKey()
      .map(t2 => (t2._1, t2._2.size)) // (word, number of occurrences)
      .foreach(println)
    sc.stop()
  }
}

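Relational databases

For example, MySQL:

First add the JDBC driver as a dependency. (Related: fixing "class not found" errors for dependencies when deploying to a remote cluster.)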

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
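
The DBInputFormat example in this section sets `INPUT_CLASS_PROPERTY` to `com.netzhuo.datasource.User`, but the article does not show that class. A class used this way has to implement Hadoop's `DBWritable`. The following is a hypothetical sketch only: the fields `id`, `name` and `sex` are inferred from the JdbcRDD example in the same section, and the real class may well differ.

```scala
import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}

import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.lib.db.DBWritable

// Hypothetical record type for the t_user table (assumed columns: id, name, sex).
// To match INPUT_CLASS_PROPERTY in the article it would live in package com.netzhuo.datasource.
// This sketch assumes the columns are never NULL.
class User(var id: Int, var name: String, var sex: String) extends Writable with DBWritable {

  // DBInputFormat instantiates the class reflectively, so a no-arg constructor is needed
  def this() = this(0, "", "")

  // DBWritable: bind the fields to an INSERT/UPDATE statement
  override def write(statement: PreparedStatement): Unit = {
    statement.setInt(1, id)
    statement.setString(2, name)
    statement.setString(3, sex)
  }

  // DBWritable: populate the fields from the current row of the query result
  override def readFields(resultSet: ResultSet): Unit = {
    id = resultSet.getInt(1)
    name = resultSet.getString(2)
    sex = resultSet.getString(3)
  }

  // Writable: Hadoop binary serialization
  override def write(out: DataOutput): Unit = {
    out.writeInt(id)
    out.writeUTF(name)
    out.writeUTF(sex)
  }

  // Writable: Hadoop binary deserialization, in the same field order as write
  override def readFields(in: DataInput): Unit = {
    id = in.readInt()
    name = in.readUTF()
    sex = in.readUTF()
  }

  override def toString: String = s"User($id, $name, $sex)"
}
```

The column order in `readFields(resultSet)` has to line up with the column order returned by the configured input query.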

import java.sql.DriverManager

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Build an RDD from a database
  */
object RDDCreatedByDB {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("datasource").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // m1(sc)
    m2(sc)

    sc.stop()
  }

  /**
    * Build the RDD with Hadoop's DBInputFormat
    *
    * @param sc the active SparkContext
    */
  def m2(sc: SparkContext): Unit = {

    val conf = sc.hadoopConfiguration
    conf.set(DBConfiguration.DRIVER_CLASS_PROPERTY, "com.mysql.jdbc.Driver")
    conf.set(DBConfiguration.URL_PROPERTY, "jdbc:mysql://192.168.197.1:3306/test")
    conf.set(DBConfiguration.USERNAME_PROPERTY, "root")
    conf.set(DBConfiguration.PASSWORD_PROPERTY, "1234")
    conf.set(DBConfiguration.INPUT_QUERY, "select * from t_user")
    // Count the rows of the table; the count is used to compute the partitions
    conf.set(DBConfiguration.INPUT_COUNT_QUERY, "select count(*) from t_user")
    // User is not shown in this article; it must implement DBWritable
    conf.set(DBConfiguration.INPUT_CLASS_PROPERTY, "com.netzhuo.datasource.User")


    val rdd = sc.newAPIHadoopRDD(conf, classOf[DBInputFormat[User]], classOf[LongWritable], classOf[User])
    rdd
      .foreach(t2 => {
        println("k=" + t2._1 + "    | v=" + t2._2)
      })
  }

  /**
    * Build the RDD with JdbcRDD. This approach is fairly limited: the SQL statement
    * must define a lower and an upper bound, otherwise it cannot be used.
    *
    * @param sc the active SparkContext
    */
  def m1(sc: SparkContext): Unit = {
    // Method 1
    val rdd = new JdbcRDD(
      sc,
      () => {
        Class.forName("com.mysql.jdbc.Driver")
        val connection = DriverManager.getConnection("jdbc:mysql://192.168.197.1:3306/test", "root", "1234")
        connection
      },
      "select * from t_user where id >= ? and id <= ?",
      1,  // lower bound
      4,  // upper bound
      2,  // number of partitions
      rs => {
        // Map one row of the ResultSet to a tuple
        val id = rs.getInt(1)
        val name = rs.getString(2)
        val sex = rs.getString(3)
        (id, name, sex)
      }
    )

    rdd
      .foreach(t3 => println(t3._1 + "\t" + t3._2 + "\t" + t3._3))
  }
}
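
The reason JdbcRDD needs both bounds is that it divides the closed range between them into one sub-range per partition and binds each sub-range to the two `?` placeholders. The plain-Scala sketch below mirrors that splitting as I understand it from Spark's implementation; treat it as an approximation for illustration. The object and method names are made up for this example.

```scala
object JdbcBoundsSketch {

  // Returns the inclusive (start, end) pair that would be bound to the
  // two `?` placeholders for each partition, in partition order.
  def partitionBounds(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[(Long, Long)] = {
    // Both bounds are inclusive, hence the + 1; BigInt avoids overflow on wide ranges
    val length = BigInt(upperBound) - lowerBound + 1
    (0 until numPartitions).map { i =>
      val start = BigInt(lowerBound) + (BigInt(i) * length) / numPartitions
      val end = BigInt(lowerBound) + (BigInt(i + 1) * length) / numPartitions - 1
      (start.toLong, end.toLong)
    }
  }

  def main(args: Array[String]): Unit =
    // Same bounds and partition count as the m1 example: 1, 4, 2
    partitionBounds(1, 4, 2).foreach { case (start, end) =>
      println(s"select * from t_user where id >= $start and id <= $end")
    }
}
```

With the values from `m1`, the first partition covers ids 1 to 2 and the second covers ids 3 to 4. Rows whose id falls outside the bounds are never read, and a skewed id distribution leads to unevenly sized partitions.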

Creating from HBase

Add the HBase third-party dependencies

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.4.10</version>
</dependency>

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.4.10</version>
</dependency>

<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>2.5.0</version>
</dependency>

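Note: prepare the HBase data in advance. This example uses a t_user table whose fields are id, name and sex (note that the code that follows actually reads the columns cf1:name and cf1:age).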

import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Build an RDD from HBase
  */
object RDDCreatedByHBase {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("datasource").setMaster("local[*]")
    // val conf = new SparkConf().setAppName("datasource").setMaster("spark://SparkOnStandalone:7077")
    val sc = new SparkContext(conf)

    val configuration = sc.hadoopConfiguration
    configuration.set(HConstants.ZOOKEEPER_QUORUM, "HadoopNode00")
    configuration.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    // HBase table name (namespace:table)
    configuration.set(TableInputFormat.INPUT_TABLE, "netzhuo2:t_user")
    // Columns to scan, space-separated, each as family:qualifier
    configuration.set(TableInputFormat.SCAN_COLUMNS, "cf1:name cf1:age")

    // Key: the row key; value: the Result holding the scanned cells of that row
    val rdd = sc.newAPIHadoopRDD(configuration, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

    rdd.foreach(t2 => {
      val rowkey = Bytes.toString(t2._1.get())
      val name = Bytes.toString(t2._2.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")))
      val age = Bytes.toString(t2._2.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("age")))
      println(rowkey + "\t" + name + "\t" + age)
    })

    sc.stop()
  }

}
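
The example above only prints inside `foreach`. If the rows are needed on the driver, or in later stages that shuffle data, it usually helps to convert them into plain Scala values first, because `ImmutableBytesWritable` and `Result` are not Java-serializable. The sketch below is one way to do that; `UserRow` and `HBaseRowMapping` are hypothetical names, and the column family `cf1` with qualifiers `name` and `age` is taken from the example.

```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical row type; a missing column becomes None instead of null
case class UserRow(rowKey: String, name: Option[String], age: Option[String])

object HBaseRowMapping {

  private val family: Array[Byte] = Bytes.toBytes("cf1")

  // getValue returns null when the column is absent, so wrap it in Option
  private def column(result: Result, qualifier: String): Option[String] =
    Option(result.getValue(family, Bytes.toBytes(qualifier))).map(bytes => Bytes.toString(bytes))

  def toUserRow(key: ImmutableBytesWritable, result: Result): UserRow =
    UserRow(Bytes.toString(key.copyBytes()), column(result, "name"), column(result, "age"))

  // Converts the raw HBase pairs into serializable case class instances
  def toUserRows(rdd: RDD[(ImmutableBytesWritable, Result)]): RDD[UserRow] =
    rdd.map { case (key, result) => toUserRow(key, result) }
}
```

A possible use would be `HBaseRowMapping.toUserRows(rdd).collect().foreach(println)`, which brings the converted rows back to the driver before printing.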
