$$\text{DataFrame} = \text{RDD} + \text{Schema}$$
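Here the Schema is a StructType object, which holds the list of StructField entries (field name and data type) describing every column. In a Dataset, the records of the underlying RDD are usually stored as instances of a case class. A DataFrame differs in that its element type is not a specific class but Row, a type built into Spark; Spark determines the data type of each field dynamically from the information stored with the Row objects.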
$$\text{DataFrame} = \text{Dataset}[\text{Row}]$$
Both DataFrame and Dataset therefore hold an underlying RDD, and in either case we can obtain that RDD object through the parameterless member rdd.
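The two identities above can be checked directly in code. The following is a minimal sketch, assuming a local master (`local[*]`) rather than a cluster; the object name `SchemaSketch`, the column names `name` and `age`, and the sample rows are all made up for illustration. It builds a DataFrame from an RDD of Row plus an explicit StructType, then assigns it to a Dataset[Row] without any conversion.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical example object; names and sample data are illustrative only.
object SchemaSketch {
  // The schema is a StructType: an ordered list of StructField entries.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))

  // DataFrame = RDD[Row] + Schema
  def build(spark: SparkSession): DataFrame = {
    val rows: RDD[Row] =
      spark.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally instead of against a cluster.
    val spark = SparkSession.builder()
      .appName("schema-sketch")
      .master("local[*]")
      .getOrCreate()

    val df: DataFrame = build(spark)
    df.printSchema()

    // DataFrame is a type alias for Dataset[Row], so no conversion is needed.
    val ds: Dataset[Row] = df
    ds.show()

    spark.stop()
  }
}
```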
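1.3 Conversions
Note: all code below is assumed to run on a pseudo-distributed Hadoop cluster with Spark in single-machine mode.
1.3.1 RDD to DataFrame | Dataset
In short, this comes down to two methods, toDF() and toDS(). Both become available on an RDD once spark.implicits._ has been imported.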
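As a rough illustration of these two methods, the sketch below converts one RDD both ways. It assumes a local master, and the case class `Person` and the object `RddToDfDs` are hypothetical names introduced only for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object RddToDfDs {
  // Returns the same data once as an untyped DataFrame and once as a typed Dataset.
  def convert(spark: SparkSession): (DataFrame, Dataset[Person]) = {
    // Brings toDF()/toDS() and the required encoders into scope.
    import spark.implicits._
    val rdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    (rdd.toDF(), rdd.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally instead of against a cluster.
    val spark = SparkSession.builder()
      .appName("rdd-to-df-ds")
      .master("local[*]")
      .getOrCreate()

    val (df, ds) = convert(spark)
    df.printSchema() // columns are derived from the case class fields
    ds.show()

    spark.stop()
  }
}
```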
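1.3.2 DataFrame to Dataset
In short, use the **as[Bean]** method. Because a DataFrame generalizes the Bean into a Row object, the Bean type has to be specified explicitly when converting a DataFrame to a Dataset; the reverse direction only needs toDF().
The Bean here is simply a case class.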
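A minimal sketch of the as[Bean] conversion follows, assuming a local master. The case class `Person`, the object `DfToDs`, and the helper `toTyped` are hypothetical; the only requirement the sketch relies on is that the DataFrame columns match the case class fields by name and type.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical Bean: field names and types must match the DataFrame columns.
case class Person(name: String, age: Int)

object DfToDs {
  def toTyped(spark: SparkSession, df: DataFrame): Dataset[Person] = {
    // Provides the implicit Encoder[Person] that as[...] needs.
    import spark.implicits._
    df.as[Person]
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally instead of against a cluster.
    val spark = SparkSession.builder()
      .appName("df-to-ds")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // An untyped DataFrame whose column names match the case class fields.
    val df: DataFrame = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    val ds: Dataset[Person] = toTyped(spark, df)

    // Fields are now ordinary, compile-time-checked members of Person.
    ds.map(p => p.name.toUpperCase).show()

    spark.stop()
  }
}
```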
1.3.3 DataFrame | Dataset to RDD
Because both DataFrame and Dataset hold the underlying RDD object, we only need to call the parameterless member rdd.
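The difference between the two cases is the element type of the resulting RDD, which the sketch below tries to make visible. It assumes a local master; `Person`, `ToRdd`, `names`, and `totalAge` are hypothetical names used only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object ToRdd {
  // A DataFrame yields RDD[Row]; fields are looked up by name at runtime.
  def names(df: DataFrame): Seq[String] = {
    val rowRdd: RDD[Row] = df.rdd
    rowRdd.map(_.getAs[String]("name")).collect().toSeq
  }

  // A Dataset[Person] yields RDD[Person]; fields are regular members.
  def totalAge(ds: Dataset[Person]): Int = {
    val personRdd: RDD[Person] = ds.rdd
    personRdd.map(_.age).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally instead of against a cluster.
    val spark = SparkSession.builder()
      .appName("to-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("alice", 30), Person("bob", 25)).toDS()
    println(names(ds.toDF()).mkString(","))
    println(totalAge(ds))

    spark.stop()
  }
}
```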
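1.3.4 Dataset to DataFrame
As noted above, toDF() is all that is needed. However, this operation drops the Bean type and turns the elements into Row: the resulting DataFrame is a Dataset of Row objects rather than of the original Bean, so the Bean type must be supplied again with as[Bean] to get a typed Dataset back.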
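The round trip can be sketched as follows, assuming a local master; `Person`, `DsToDf`, and `roundTrip` are hypothetical names. Note that toDF() keeps the column names and types in the schema; what is lost is only the static element type, which is why field access on the DataFrame goes through Row.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object DsToDf {
  // Dataset[Person] -> DataFrame (Dataset[Row]) -> Dataset[Person]
  def roundTrip(spark: SparkSession, ds: Dataset[Person]): Dataset[Person] = {
    import spark.implicits._
    val df: DataFrame = ds.toDF() // the static element type is now Row
    df.as[Person]                 // the Bean type has to be named again
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally instead of against a cluster.
    val spark = SparkSession.builder()
      .appName("ds-to-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("alice", 30)).toDS()
    val df: DataFrame = ds.toDF()

    // On the DataFrame, a field is fetched from a Row by name at runtime.
    val first: Row = df.head()
    println(first.getAs[String]("name"))

    // After as[Person], the same field is a compile-time-checked member.
    println(roundTrip(spark, ds).head().name)

    spark.stop()
  }
}
```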
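2. Importing Data into a DataFrame
Here we do not work in spark-shell; instead, we submit Spark jobs from our own Java program connected to the Spark cluster.
2.1 Preparation
pom.xml
First, we set up the dependency coordinates in the POM. Note that connecting Spark to MySQL requires mysql-connector, and connecting to Hive requires the spark-hive integration package in addition to hive-metastore.
Note also that Spark logs through the slf4j API, so the POM pulls in slf4j-nop as the slf4j binding (a no-op implementation that discards log output).
Finally, so that the Scala sources compile, we add the Scala compiler plugin for Maven (maven-scala-plugin) under the build section.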
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>spark-test</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<spark.version>3.5.0</spark.version>
<scala.version>2.13.8</scala.version>
<hive.version>3.1.3</hive.version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.13</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.13</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-reflect -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-nop -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-nop</artifactId>
<version>2.0.12</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.27</version>
</dependency>
<!-- spark hive compilation -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.13</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>${hive.version}</version>
</dependency>
</dependencies>
<build>
<finalName>${project.artifactId}</finalName>
<outputDirectory>target/classes</outputDirectory>
<testOutputDirectory>target/test-classes</testOutputDirectory>
<sourceDirectory>src/main/scala</sourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<!-- Scala was originally built with sbt (similar to Maven for Java); this plugin allows Scala to be compiled in a Maven build -->
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<includes>
<include>**/*.scala</include>
</includes>
<scalaVersion>2.13.8</scalaVersion>
<args>
<arg>-target:jvm-1.8</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
log4j.properties
Next, we configure log4j.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to INFO. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=INFO
log4j.logger.org.apache.spark.sql=INFO
#Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN
log4j.logger.org.apache.parquet=WARN
log4j.logger.parquet=WARN
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
2.2 Converting an RDD to a DataFrame
Two ways of converting an RDD to a DataFrame are described below.
2.2.1 Approach 1