Spark RDD
RDD Overview
RDD (Resilient Distributed Dataset) is Spark's most basic data abstraction. It represents an immutable, partitioned collection whose elements can be computed on in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, location-aware scheduling, and scalability. An RDD lets the user explicitly cache a working set in memory across multiple queries so that later queries can reuse it, which greatly speeds them up.
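As a quick sketch of that caching behavior (assuming the SparkContext sc created in the next section and the local test file used later):
val lines = sc.textFile("/Users/gaozhy/test.txt")
val words = lines.flatMap(_.split(" "))
words.cache() // mark the working set to be kept in memory once computed
println(words.count())            // first action: reads the file and populates the cache
println(words.distinct().count()) // second action: reuses the cached working set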
Building RDDs
Building an RDD from a parallelized collection
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("rdd demo")
val sc = new SparkContext(sparkConf)
val arr = Array(1, 2, 3, 4, 5)
// build an RDD from a parallelized collection, split into 2 partitions
val rdd = sc.parallelize(arr, 2)
println(rdd.reduce(_ + _)) // 15
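To verify that the second argument to parallelize took effect, a quick check:
println(rdd.getNumPartitions) // 2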
Building an RDD from an external dataset
External data (local filesystem)
// external data (local filesystem)
val rdd = sc.textFile("/Users/gaozhy/test.txt")
rdd.flatMap(line => line.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)
External data (HDFS, S3, etc.)
// external data (HDFS, S3, etc.)
val rdd = sc.textFile("hdfs://spark:9000/test.txt")
rdd.flatMap(line => line.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)
Note: the HDFS client dependency must be added:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.6</version>
</dependency>
External data (MySQL)
Requires the MySQL driver JAR as a dependency.
Method 1
import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

def createConnection(): Connection = {
  Class.forName("com.mysql.cj.jdbc.Driver").newInstance()
  DriverManager.getConnection("jdbc:mysql://spark:3306/test", "root", "root")
}
// map each row of the result set to a tuple
def query(rs: ResultSet) = {
  (rs.getInt(1), rs.getString(2), rs.getString(3))
}
// the query must contain exactly two '?' placeholders; JdbcRDD splits the
// id range [1, 5] across 2 partitions and binds the bounds to the placeholders
val rdd = new JdbcRDD(sc, createConnection, "select * from t_user where ? <= id and id <= ?", 1, 5, 2, query)
rdd.collect().foreach(println)
Method 2
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat}

val conf = new Configuration()
conf.set(DBConfiguration.DRIVER_CLASS_PROPERTY, "com.mysql.cj.jdbc.Driver")
conf.set(DBConfiguration.URL_PROPERTY, "jdbc:mysql://spark:3306/test")
conf.set(DBConfiguration.USERNAME_PROPERTY, "root")
conf.set(DBConfiguration.PASSWORD_PROPERTY, "root")
conf.set(DBConfiguration.INPUT_QUERY, "select * from t_user")
conf.set(DBConfiguration.INPUT_COUNT_QUERY, "select count(*) from t_user")
conf.set(DBConfiguration.INPUT_TABLE_NAME_PROPERTY, "t_user")
// note: the fully qualified name of the entity class must be configured
conf.set(DBConfiguration.INPUT_CLASS_PROPERTY, "rdd.User")
val rdd = sc.newAPIHadoopRDD(conf, classOf[DBInputFormat[User]], classOf[LongWritable], classOf[User])
rdd.foreach(i => println(i._2))
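The entity class named by INPUT_CLASS_PROPERTY must implement both Writable and DBWritable. A minimal sketch of what rdd.User could look like, assuming the table columns are id, name, and password (the tuple read in Method 1):
package rdd

import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.lib.db.DBWritable

// hypothetical entity matching "select * from t_user"; adjust the fields to the real schema
class User extends Writable with DBWritable {
  var id: Int = _
  var name: String = _
  var password: String = _

  // DBWritable: populate the object from a JDBC row
  override def readFields(rs: ResultSet): Unit = {
    id = rs.getInt(1)
    name = rs.getString(2)
    password = rs.getString(3)
  }

  // DBWritable: bind the object to a PreparedStatement (only needed for output jobs)
  override def write(ps: PreparedStatement): Unit = {
    ps.setInt(1, id)
    ps.setString(2, name)
    ps.setString(3, password)
  }

  // Writable: Hadoop serialization
  override def write(out: DataOutput): Unit = {
    out.writeInt(id)
    out.writeUTF(name)
    out.writeUTF(password)
  }

  override def readFields(in: DataInput): Unit = {
    id = in.readInt()
    name = in.readUTF()
    password = in.readUTF()
  }

  override def toString: String = s"User($id, $name, $password)"
}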
Effective solutions to the JAR dependency problem
If the Spark application is submitted externally with spark-submit, an exception is thrown because the MySQL driver class cannot be found: the running application needs the driver on its classpath to build the RDD from the MySQL data source. Two ways to resolve the dependency:
- extraClassPath
Create a JAR extension directory and copy the JARs the application depends on into it; the directory must exist on every machine in the Spark cluster.
mkdir ~/software/externalJars
cp ~/mysql-connector-java-8.0.13.jar ~/software/externalJars
vim $SPARK_HOME/conf/spark-defaults.conf
# declare the jar extension directory in the config file
spark.executor.extraClassPath=/Users/gaozhy/software/externalJars/*
spark.driver.extraClassPath=/Users/gaozhy/software/externalJars/*
# submit the job
spark-submit --class rdd.SparkRDD02 --master spark://spark:7077 /Users/gaozhy/workspace/20180429/spark01/target/spark01-1.0-SNAPSHOT.jar
Visit http://spark:8080 to check the execution result.
- Place the dependency JARs in the $SPARK_HOME/jars directory.
External data (HBase)
Environment
- HBase version: 1.4.10
- Spark version: 2.4.3
Maven dependencies
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
<!--<scope>provided</scope>-->
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.4.10</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.4.10</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.5.0</version>
</dependency>
</dependencies>
Code
package rdd

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.spark.{SparkConf, SparkContext}

object SparkRDDWithHBase {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("rdd demo")
    val sc = new SparkContext(sparkConf)

    // HBase connection settings (ZooKeeper quorum and client port)
    val conf = HBaseConfiguration.create()
    conf.set(HConstants.ZOOKEEPER_QUORUM, "spark")
    conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    // table to scan and the space-separated list of columns to fetch
    conf.set(TableInputFormat.INPUT_TABLE, "baizhi:t_user")
    conf.set(TableInputFormat.SCAN_COLUMNS, "cf1:name cf1:age cf2:hobbies")

    // each element of the RDD is a (row key, Result) pair
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println(rdd.count)
    rdd.foreach(i => {
      val result = i._2
      val name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")))
      val age = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("age")))
      val hobbies = Bytes.toString(result.getValue(Bytes.toBytes("cf2"), Bytes.toBytes("hobbies")))
      println(name + " | " + age + " | " + hobbies)
    })
    sc.stop()
  }
}
Result
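If the values are needed for further transformations rather than just printing, the (row key, Result) pairs can first be mapped into a plain tuple RDD; a sketch using the same column layout:
val users = rdd.map { case (rowKey, result) =>
  val key = Bytes.toString(rowKey.copyBytes())
  val name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")))
  val age = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("age")))
  (key, name, age)
}
users.collect().foreach(println)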
Running externally (spark-submit)
1) Add the Maven assembly plugin
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- replace with the class that contains the jar's main method -->
<mainClass>rdd.SparkRDDWithHBase</mainClass>
</manifest>
<manifestEntries>
<Class-Path>.</Class-Path>
</manifestEntries>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- run the jar assembly during the package phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
2) Package with the Maven package goal, which produces the jar spark02-1.0-SNAPSHOT-jar-with-dependencies.jar
3) Submit the job with spark-submit
spark-submit --class rdd.SparkRDDWithHBase --master spark://127.0.0.1:7077 /Users/gaozhy/workspace/20180429/spark02/target/spark02-1.0-SNAPSHOT-jar-with-dependencies.jar
4) Execution result