Learning Spark

诗酒醉云长

Last modified 2023-07-08 21:38:42

Tags: learning, spark

First published 2022-06-24 11:58:25

Article link: https://blog.csdn.net/u013131379/article/details/125360994

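Three steps, two categories:

Three steps: input (collection + storage) -> process (computation) -> output (sync + application)

Two categories: offline + real-time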

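	Collection	Storage	Computation	Application
Offline: other sources -> data warehouse -> sync -> application	Data sync (other sources -> data warehouse): Sqoop (1. relational DB -> 2. data warehouse)	File storage (data warehouse, read often and written rarely -> offline computation): Hadoop HDFS, Tachyon, KFS	Offline computation (2. data warehouse -> 3. data sync / K-V, NoSQL): MapReduce, Spark SQL	Data mining, machine learning: Mahout, Spark MLlib
Real-time: logs -> message system -> sync -> application	Log collection (real-time logs -> message system): Flume, Scribe, Logstash, Kibana	Message system (immediate consumption -> streaming, real-time computation / data warehouse): Kafka, StormMQ, ZeroMQ, RabbitMQ	Streaming, real-time computation (message system -> data sync / K-V, NoSQL): Storm, Spark Streaming, S4, Heron; K-V and NoSQL databases: HBase, Redis, MongoDB, ES (Elasticsearch)	Data mining, machine learning: Mahout, Spark MLlib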

Run sysdm.cpl to edit the system environment variables (the commands that follow rely on SPARK_HOME).

(1) Open spark-shell

cd /d %SPARK_HOME%/bin

spark-shell

spark.version

(2) Start the history server

cd /d %SPARK_HOME%/bin
spark-class.cmd org.apache.spark.deploy.history.HistoryServer

Instructions given to the driver are called transformations; they are lazy and only describe the computation. An action is what triggers the actual execution.
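
To make the transformation/action split concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: the class `LazyDataset` and the helper `double` are hypothetical names invented for illustration. Transformations only record a plan; the action `collect()` is what runs it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, data, plan=None):
        self._data = list(data)
        self._plan = list(plan or [])  # recorded transformations, not yet executed

    # Transformations: return a new dataset and do no work.
    def map(self, fn):
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    # Action: executes the recorded plan and returns the result.
    def collect(self):
        rows = self._data
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:  # "filter"
                rows = [r for r in rows if fn(r)]
        return rows


calls = []  # records every time the mapped function really runs


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(len(calls))    # 0: nothing has executed yet
print(ds.collect())  # [6, 8]: the action runs the whole plan
print(len(calls))    # 4
```

In real Spark the plan is also optimized and distributed across executors; the sketch only shows the deferred-execution idea.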

application
<- jobs (one job per action)
<- stages (task sets split at wide transformations such as groupBy, reduceByKey)
<- tasks (one task per partition)
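
The hierarchy above can be sketched as a small counting model. This is a simplification under stated assumptions: the functions `split_into_stages` and `count_tasks` are hypothetical, the set of wide transformations is limited to the two named in the notes, and the partition count is assumed to stay the same after a shuffle (in real Spark it can change).

```python
# Wide (shuffle) transformations named in the notes above; the set is illustrative.
WIDE = {"groupBy", "reduceByKey"}


def split_into_stages(transformations):
    """Cut a chain of transformation names into stages at each wide transformation."""
    stages = [[]]
    for name in transformations:
        if name in WIDE:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(name)
    return stages


def count_tasks(transformations, num_partitions):
    """One task per partition per stage (assumes the partition count never changes)."""
    return len(split_into_stages(transformations)) * num_partitions


job = ["map", "filter", "reduceByKey", "map"]  # hypothetical job triggered by one action
print(split_into_stages(job))  # [['map', 'filter'], ['reduceByKey', 'map']]
print(count_tasks(job, 4))     # 8
```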

At the time of writing, JDK 11 is the highest supported version.
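
A quick way to check an installed JDK against that limit is to parse its version string. The sketch below is only an illustration: `java_major`, `is_supported` and `MAX_SUPPORTED` are hypothetical names, and the limit of 11 is taken from the note above, so it depends on the Spark release in use.

```python
MAX_SUPPORTED = 11  # from the note above; adjust for the Spark release in use


def java_major(version):
    """Return the major Java version from a java.version-style string."""
    parts = version.split(".")
    # Legacy scheme up to Java 8: "1.8.0_292" means Java 8.
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    # Modern scheme: "11.0.15", "17", "21-ea".
    return int(parts[0].split("-")[0].split("+")[0])


def is_supported(version, max_major=MAX_SUPPORTED):
    """True if the given Java version does not exceed the assumed limit."""
    return java_major(version) <= max_major


for v in ["1.8.0_292", "11.0.15", "17.0.3"]:
    print(v, java_major(v), is_supported(v))
```

The version string itself can be read from the output of `java -version`.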

Note that you can create only one active SparkContext per JVM. You should stop() the active SparkContext before creating a new one.

//SparkContext stop() method
spark.sparkContext.stop()

SparkSession -> DataFrame (entry point for the DataFrame/SQL API)

SparkContext -> RDD (entry point for the RDD API)

<?xml version="1.0" encoding="utf-8"?> 
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" 
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
          xsi:schemaLocation="     http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd"> 
          <!-- Local repository: the path of the build system's local repository. Defaults to ~/.m2/repository. -->
    <localRepository>D:\m2\repository</localRepository> 

          <!-- Whether Maven should interact with the user for input. Set to true if it should, false otherwise. Defaults to true.
    <interactiveMode>true</interactiveMode> 
          -->
          <mirrors> 
    <!-- mirror | Specifies a repository mirror site to use instead of a given 
      repository. The repository that | this mirror serves has an ID that matches 
      the mirrorOf element of this mirror. IDs are used | for inheritance and direct 
      lookup purposes, and must be unique across the set of mirrors. | --> 
    <mirror> 
          <id>nexus-aliyun</id> 
          <mirrorOf>central</mirrorOf> 
          <name>Nexus aliyun</name> 
          <url>https://maven.aliyun.com/nexus/content/groups/public/</url> 
        </mirror> 
    <mirror> 
          <id>net-cn</id> 
          <mirrorOf>central</mirrorOf> 
          <name>Nexus net</name> 
          <url>http://maven.net.cn/content/groups/public/</url> 
        </mirror> 
  </mirrors> 
        
          <profiles> 
    <profile> 
          <!-- must match the activeProfile entry in activeProfiles -->
          <id>nexus</id> 
         
          <repositories> 
                <repository> 
                      <id>nexus</id> 
                      <name>local private nexus</name> 
                      <url>https://maven.aliyun.com/nexus/content/groups/public/</url> 
                      <releases> 
                            <enabled>true</enabled> 
                          </releases> 
                      <snapshots> 
                            <enabled>false</enabled> 
                          </snapshots> 
                    </repository> 
              </repositories> 
          <pluginRepositories> 
                <pluginRepository> 
                      <id>nexus</id> 
                      <name>local private nexus</name> 
                      <url>https://maven.aliyun.com/nexus/content/groups/public/</url> 
                      <releases> 
                            <enabled>true</enabled> 
                          </releases> 
                      <snapshots> 
                            <enabled>false</enabled> 
                          </snapshots> 
                    </pluginRepository> 
              </pluginRepositories> 
        </profile> 
  </profiles> 
          <!-- --> 
  <activeProfiles>
    <activeProfile>nexus</activeProfile> 
  </activeProfiles> 
        </settings>
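
Two common pitfalls in a settings.xml like the one above are an activeProfile that names no declared profile id, and plain-http repository URLs, which Maven 3.8.1 and later block by default. The following sketch checks for both using only the Python standard library. The function `check_settings` and the `SAMPLE` fragment are hypothetical and deliberately broken for illustration; they are not part of Maven.

```python
import xml.etree.ElementTree as ET

SETTINGS_NS = "http://maven.apache.org/SETTINGS/1.0.0"
NS = {"m": SETTINGS_NS}


def check_settings(xml_text):
    """Return a list of human-readable problems found in a Maven settings.xml string."""
    root = ET.fromstring(xml_text.strip())
    problems = []

    # Every <activeProfile> should name a <profile> that declares the same <id>.
    profile_ids = {
        el.text.strip()
        for el in root.findall("m:profiles/m:profile/m:id", NS)
        if el.text
    }
    for el in root.findall("m:activeProfiles/m:activeProfile", NS):
        name = (el.text or "").strip()
        if name not in profile_ids:
            problems.append(f"activeProfile '{name}' matches no profile id")

    # Flag plain-http URLs anywhere in the file (mirrors and repositories).
    for el in root.iter(f"{{{SETTINGS_NS}}}url"):
        url = (el.text or "").strip()
        if url.startswith("http://"):
            problems.append(f"insecure URL: {url}")
    return problems


# Hypothetical fragment: the profile has no <id>, and the repository uses http.
SAMPLE = """
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <profiles>
    <profile>
      <repositories>
        <repository>
          <id>nexus</id>
          <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>nexus</activeProfile>
  </activeProfiles>
</settings>
"""

for problem in check_settings(SAMPLE):
    print(problem)
```

Note that the `<id>` inside `<repository>` does not count as a profile id, which is why the sample is reported as having no matching profile.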

Fixing the "Could not transfer artifact org.apache.hadoop" error when creating a Maven project in IDEA (码农教程)

Fixing the Maven build error "Blocked mirror for repositories" (Menardღ's blog, CSDN)

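What a beginner data engineer needs to know

: fundamentals (A) -> frameworks -> underlying source code

A. A learning guide for big data beginners (worth bookmarking)

: input -> save & process -> output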

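Task scheduling and monitoring systems:
1. Resource management: YARN, Mesos
2. Distributed coordination service: ZooKeeper
3. Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager
4. Task scheduling: Oozie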