Three steps, two categories:
Three steps: input (collection + storage) -> process (compute) -> output (sync + application)
Two categories: offline + real-time
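Type (pipeline) | Collection | Storage | Compute | Application |
---|---|---|---|---|
Offline: other sources -> data warehouse -> sync -> application | Data sync (other sources -> data warehouse): Sqoop (1 relational DB -> 2 data warehouse) | File storage (data warehouse, read-many/write-few -> offline compute): Hadoop HDFS, Tachyon, KFS | Offline compute (2 data warehouse -> 3 data sync / K-V, NoSQL): MR, Spark SQL | Data mining, machine learning: Mahout, Spark MLlib |
Real-time: logs -> message system -> sync -> application | Log collection (real-time logs -> message system): Flume, Scribe, Logstash, Kibana | Message system (immediate consumption -> streaming/real-time compute or data warehouse): Kafka, StormMQ, ZeroMQ, RabbitMQ | Streaming/real-time compute (message system -> data sync / K-V, NoSQL): Storm, Spark Streaming, S4, Heron; K-V and NoSQL databases: HBase, Redis, MongoDB, ES | Data mining, machine learning: Mahout, Spark MLlib |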
Run sysdm.cpl to edit the system environment variables (e.g. SPARK_HOME)
(1) Open spark-shell
cd /d %SPARK_HOME%/bin
spark-shell
spark.version
(2) Start the history server
cd /d %SPARK_HOME%/bin
spark-class.cmd org.apache.spark.deploy.history.HistoryServer
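The history server only lists applications that wrote event logs, so a minimal sketch of an application that enables them may help here. The directory below is a hypothetical path (it must exist beforehand), and the history server has to be pointed at the same location through spark.history.fs.logDirectory; the object and app names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object EventLogDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical directory; create it before starting the application.
    val eventLogDir = "file:///D:/tmp/spark-events"

    val spark = SparkSession.builder()
      .appName("event-log-demo")
      .master("local[*]")
      // Write event logs so the history server can replay this application.
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", eventLogDir)
      .getOrCreate()

    // Run one small job so there is something to show in the UI.
    spark.range(0, 1000).count()

    spark.stop()
  }
}
```

After the application finishes, it should appear in the history server UI, which listens on port 18080 by default.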
Transformations are lazy instructions that the driver only records; an action triggers the actual execution.
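A small sketch of the difference, meant to be pasted into spark-shell (where `spark` is already defined). The variable names are illustrative only.

```scala
// Paste into spark-shell; `spark` is the session the shell creates.
val numbers = spark.sparkContext.parallelize(1 to 10)

// Transformations: nothing runs yet, Spark only records the lineage.
val evens = numbers.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// Action: this call submits a job and computes the result.
val total = squared.reduce(_ + _)
println(total) // 4 + 16 + 36 + 64 + 100 = 220
```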
application
<-jobs (one per action)
<-stages (task sets, split at wide transformations such as groupByKey, reduceByKey)
<-tasks (one task per partition)
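The hierarchy above can be observed with a sketch like the following in spark-shell. It assumes the shell's `spark` session; the sample data and the partition count of 3 are arbitrary choices for illustration.

```scala
// Paste into spark-shell; `spark` is the session the shell creates.
val sc = spark.sparkContext

// 3 partitions => the first stage runs 3 tasks.
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)

// map is narrow (same stage); reduceByKey is wide (shuffle => new stage).
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// The lineage printout marks the shuffle boundary with an indentation step.
println(counts.toDebugString)
println(words.getNumPartitions)

// collect() is the action: one job, made of two stages.
val result = counts.collect().toMap
println(result)
```

The job, its two stages, and the tasks per stage can then be checked in the Spark UI for the shell session.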
Currently the highest supported JDK is 11 (this depends on the Spark version).
Note that you can create only one active SparkContext per JVM. You should stop() the active SparkContext before creating a new one.
//SparkContext stop() method
spark.sparkContext.stop()
SparkSession -> DataFrame
SparkContext -> RDD
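As a rough sketch of how the two entry points relate, the standalone program below builds a SparkSession, reaches the SparkContext through it, converts an RDD to a DataFrame and back, and stops the session at the end. The object name, column names, and sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SessionAndContext {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for DataFrames.
    val spark = SparkSession.builder()
      .appName("session-and-context")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The session wraps the single active SparkContext, the entry point for RDDs.
    val sc = spark.sparkContext
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    // RDD -> DataFrame (needs spark.implicits._ for toDF and $"...").
    val df = rdd.toDF("key", "value")
    df.filter($"value" > 1).show()

    // DataFrame -> RDD[Row].
    val rows = df.rdd
    println(rows.count())

    // Stops the session and its underlying SparkContext.
    spark.stop()
  }
}
```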
<?xml version="1.0" encoding="utf-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation=" http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
<!-- Local repository: the path of the build system's local repository. Defaults to ~/.m2/repository. -->
<localRepository>D:\m2\repository</localRepository>
<!-- Whether Maven may interact with the user to get input. Set to true if Maven needs to prompt the user, false otherwise. Defaults to true.
<interactiveMode>true</interactiveMode>
-->
<mirrors>
<!-- mirror | Specifies a repository mirror site to use instead of a given
repository. The repository that | this mirror serves has an ID that matches
the mirrorOf element of this mirror. IDs are used | for inheritance and direct
lookup purposes, and must be unique across the set of mirrors. | -->
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>central</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</mirror>
<mirror>
<id>net-cn</id>
<mirrorOf>central</mirrorOf>
<name>Nexus net</name>
<url>http://maven.net.cn/content/groups/public/</url>
</mirror>
</mirrors>
<profiles>
<profile>
<id>nexus</id>
<repositories>
<repository>
<id>nexus</id>
<name>local private nexus</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>nexus</id>
<name>local private nexus</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</pluginRepository>
</pluginRepositories>
</profile>
</profiles>
<!-- -->
<activeProfiles>
<activeProfile>nexus</activeProfile>
</activeProfiles>
</settings>
Fixing the "Could not transfer artifact org.apache.hadoop" error when creating a Maven project in IDEA - 码农教程
Fixing the Maven build error "Blocked mirror for repositories" - Menardღ's blog, CSDN blog
:Basics A->frameworks->underlying source code
:Input->save&process->output
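Task scheduling and monitoring systems:
1. Resource management: YARN, Mesos
2. Distributed coordination service: Zookeeper
3. Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager
4. Task scheduling: Oozie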