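Notes (bigdata): using Scala and Spark to read data from a specified partitioned table in the ODS layer and clean it (fill missing fields, deduplicate), continuing from the previous article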

wuzxu

Last modified 2022-04-20 09:47:16

Category: liunx  Tags: spark scala

First published 2022-04-07 18:06:07

Article link: https://blog.csdn.net/qq_43009048/article/details/124022572

If you have questions, feel free to message me privately.

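Here I finish the code first, then package it and send it to the cluster to run.

The previous article extracted data from MySQL into the ODS layer in Hive.

This article cleans the ODS-layer tables and writes the result to the DWD layer.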

1. Create a Maven project in IDEA

The pom configuration is as follows

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.tledu</groupId>
  <artifactId>llll</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderful scala app</description>
  <inceptionYear>2018</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
 
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.11.11</scala.version>
    <scala.compat.version>2.11</scala.compat.version>
    <spec2.version>4.2.0</spec2.version>
  </properties>
 
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
 
 
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.compat.version}</artifactId>
      <version>2.3.2</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.compat.version}</artifactId>
      <version>2.3.2</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.compat.version}</artifactId>
      <version>2.3.2</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>8.0.23</version>
    </dependency>
 
 
 
    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.compat.version}</artifactId>
      <version>3.0.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-core_${scala.compat.version}</artifactId>
      <version>${spec2.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-junit_${scala.compat.version}</artifactId>
      <version>${spec2.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
 
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.3.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.21.0</version>
        <configuration>
          <!-- Tests will be run with scalatest-maven-plugin instead -->
          <skipTests>true</skipTests>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <version>2.0.0</version>
        <configuration>
          <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
          <junitxml>.</junitxml>
          <filereports>TestSuiteReport.txt</filereports>
          <!-- Comma separated list of JUnit test class names to execute -->
          <jUnitClasses>samples.AppTest</jUnitClasses>
        </configuration>
        <executions>
          <execution>
            <id>test</id>
            <goals>
              <goal>test</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
 
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Coding steps

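import org.apache.spark.sql.SparkSession

// CleanOrders is a placeholder name; pass your own class name to --class when submitting
object CleanOrders {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession (with Hive support) for the cleaning job
    val spark = SparkSession
      .builder()
      .appName("Data cleaning service")
      .enableHiveSupport()
      .getOrCreate()

    // Switch to the ods database in Hive
    spark.sql("use ods")

    // Read the columns to be cleaned from the orders table
    val df = spark.table("orders").select("orderkey", "custkey", "orderstatus",
      "totalprice", "orderdate", "orderpriority", "clerk", "shippriority",
      "comment", "etldate")

    // df2 is the deduplicated data
    val df2 = df.distinct()

    // Register df2 as a temporary view
    df2.createOrReplaceTempView("data_temp")

    // Convert the date fields. Note that to_unix_timestamp returns epoch seconds (bigint),
    // and this result lives in its own DataFrame; the insert below does not use it.
    val dates = spark.sql("select to_unix_timestamp(orderdate,'yyyy-MM-dd'),to_unix_timestamp(etldate,'yyyy-MM-dd') from data_temp")

    // Switch to the dwd database and write the data cleaned from ods into it
    spark.sql("use dwd")
    spark.sql(
      """
        |insert overwrite table dwdorders
        |select * from data_temp
        |""".stripMargin)
    spark.close()
  }
}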
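
The title mentions reading a specified partition and filling missing fields, so here is a minimal sketch of how those steps could look with the DataFrame API, assuming Spark 2.2 or later. The column names come from the orders table above; the object name CleaningSketch, the partition value "20220406", the fill values, and the sample rows are all assumptions for illustration, and local[1] is used only so the sketch can be tried without a cluster. It is not necessarily how the dwdorders table expects its columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}

object CleaningSketch {
  // Keep one partition, drop exact duplicate rows, fill missing fields,
  // and turn orderdate from a 'yyyy-MM-dd' string into a timestamp column.
  def clean(ods: DataFrame, etlDate: String): DataFrame =
    ods
      .where(col("etldate") === etlDate)
      .distinct()
      // Fill values are assumptions; pick defaults that suit your data
      .na.fill(Map("comment" -> "unknown", "totalprice" -> 0.0))
      .withColumn("orderdate", to_timestamp(col("orderdate"), "yyyy-MM-dd"))

  def main(args: Array[String]): Unit = {
    // local[1] is only for trying the sketch without a cluster
    val spark = SparkSession.builder()
      .appName("cleaning-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for ods.orders
    val ods = Seq(
      (1L, Some(10.5), "2022-04-01", Some("ok"), "20220406"),
      (1L, Some(10.5), "2022-04-01", Some("ok"), "20220406"), // exact duplicate
      (2L, None, "2022-04-02", None, "20220406"),             // missing fields
      (3L, Some(7.0), "2022-04-03", Some("old"), "20220405")  // other partition
    ).toDF("orderkey", "totalprice", "orderdate", "comment", "etldate")

    val cleaned = clean(ods, "20220406")
    cleaned.printSchema()
    cleaned.show(false)
    spark.stop()
  }
}
```

In the real job, the result of clean could be registered as the temporary view that feeds the insert, as long as its column types match the target table.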

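For how to bring in the required jars, see my other article:

Using Scala and Spark to extract data from a specified MySQL table into a table in the Hive ODS layer (wuzxu's blog, CSDN)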

2. Run on the cluster

Upload the jar with rz -bye

Launch the jar

I use a script

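#!/bin/bash
# HADOOP_CONF_DIR must point to your Hadoop configuration directory
export HADOOP_CONF_DIR=/usr/hdp/3.1.0.0-78/hadoop/conf
# Adjust the spark-submit path below to match your own installation
/usr/hdp/3.1.0.0-78/spark2/bin/spark-submit \
--class <the class you want to run> \
--master local[2] \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 2 \
/<path to your jar>
# The jar path is absolute, so keep the leading /
# With --master local[2] the job runs in a single local JVM, so --num-executors has no effect there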
