Spark Basics and Practice

1. Installing Spark

With Hadoop already installed, download Spark from the official website, extract it to the target directory, and then edit the ~/hadoop/etc/hadoop/yarn-site.xml configuration file (a minimal sketch follows below).
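Exactly which properties belong in yarn-site.xml depends on your Hadoop setup; a minimal sketch, assuming the standard shuffle auxiliary service is all you need:

<configuration>
  <property>
    <!-- enable the MapReduce shuffle auxiliary service on each NodeManager -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>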

For the later steps you can follow a Spark 2.0 distributed cluster setup guide such as "Spark分布式集群环境搭建".
Cluster setup mainly involves two configuration files, conf/spark-env.sh and conf/slaves; after editing them, distribute the modified Spark directory to the worker nodes (a sketch of both files follows below).
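A minimal sketch of the two files; all paths and hostnames here are placeholders for your own environment:

# conf/spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # JDK used by Spark
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop  # lets Spark find the HDFS/YARN configs
export SPARK_MASTER_HOST=master                      # hostname of the standalone master
export SPARK_WORKER_MEMORY=1g                        # memory each worker may use
export SPARK_WORKER_CORES=1                          # cores each worker may use

# conf/slaves -- one worker hostname per line
slave01
slave02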

2. Installing IDEA

  • Set up the Scala environment and install IntelliJ IDEA
  • Add the Scala plugin: download and install it, taking care to match your IDEA version
  • Install the Scala SDK. In my experience this download is optional: if you build the project with sbt, you can pick a Scala version and sbt will download it while resolving dependencies.
  • Add Spark's jars to the project via File –> Project Structure –> Libraries –> +

3. Starting Spark

Start the Hadoop cluster:

$cd /usr/local/hadoop/
$sbin/start-all.sh

Start the Spark cluster:

$cd /usr/local/spark
$sbin/start-all.sh

Check that the cluster is up, e.g. with jps on each node or through the master's web UI (the standalone master serves one on port 8080 by default).
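A quick sketch of the jps check; the worker hostname slave01 is a placeholder:

$ jps               # on the master: Master plus Hadoop daemons such as NameNode and ResourceManager
$ ssh slave01 jps   # on a worker: Worker plus DataNode and NodeManager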

4. Spark Theory

At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by reading a file on HDFS (or any other Hadoop-supported file system), by parallelizing an existing Scala collection in the driver program, or by transforming another RDD. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
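A minimal pyspark illustration of these creation paths (assuming the shell's built-in SparkContext sc and a README.md in the working directory):

>>> lines = sc.textFile("README.md")        # RDD from a file on HDFS or the local FS
>>> nums = sc.parallelize([1, 2, 3, 4])     # RDD from a driver-side collection
>>> doubled = nums.map(lambda x: x * 2)     # RDD derived by a transformation
>>> lines.persist()                         # ask Spark to keep this RDD in memory
>>> lines.count()                           # the first action materializes the cache; later actions reuse it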
The second abstraction Spark provides is shared variables that can be used in parallel operations. By default, when Spark runs a function as a set of tasks on different nodes, it ships a copy of each variable used in the function to every task. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two kinds of shared variables: broadcast variables, which cache a value in memory on all nodes, and accumulators, which can only be added to, such as counters and sums.
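A small sketch of both kinds of shared variables in pyspark (the lookup table is made up for illustration):

>>> lookup = sc.broadcast({"a": 1, "b": 2})   # broadcast: read-only value cached on every node
>>> total = sc.accumulator(0)                 # accumulator: tasks may only add to it
>>> sc.parallelize(["a", "b", "c"]).foreach(lambda w: total.add(lookup.value.get(w, 0)))
>>> total.value                               # only the driver can read the result
3

Note that accumulator updates are only guaranteed to be applied once when performed inside an action such as foreach, not inside lazy transformations.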

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, the SparkContext sends tasks to the executors to run.
(Figure: Spark cluster architecture — the driver, the cluster manager, and the executors.)
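A minimal driver-side sketch of this handshake; the app name and master URL are placeholders:

# demo.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)              # the driver connects to the cluster manager
print(sc.parallelize(range(100)).sum())   # 4950, computed by tasks on the executors
sc.stop()

Submit it with ./bin/spark-submit demo.py; spark-submit can also override the master with its --master flag.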

5. Practice

  • ./bin/run-example SparkPi 10 runs one of the bundled examples; under the hood it calls spark-submit to submit the example script.
  • ./bin/spark-shell --master local[2] starts the interactive Scala Spark shell with a local master (a distributed cluster's master URL also works) and two worker threads. The shell lets you interact with data distributed across the memory or disks of many machines; run spark-shell --help to see the options.
  • ./bin/spark-submit examples/src/main/python/pi.py 10 submits a Python example through spark-submit.
  • ./bin/pyspark starts the Python version of the shell.
  • ./bin/sparkR --master local[2] starts the interactive R shell.

The following session runs in pyspark, started from the command line:
>>> textFile = spark.read.text("README.md")
>>> textFile.count() # number of rows in the DataFrame
105                                                                             
>>> textFile.first() # first row of the DataFrame
Row(value='# Apache Spark')
>>> linesWithSpark = textFile.filter(textFile.value.contains("Spark")) # keep only the lines containing "Spark"
>>> textFile.filter(textFile.value.contains("Spark")).count()
20
>>> from pyspark.sql.functions import *
# split each line on whitespace, name the per-line word count "numWords", then take the maximum "numWords"
# `select` and `agg` both take a `Column`; `df.colName` retrieves a column from the DataFrame
>>> textFile.select(size(split(textFile.value,"\s+")).name("numWords")).agg(max(col("numWords"))).collect()
[Row(max(numWords)=22)]
>>> textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
DataFrame[word: string, count: bigint]
# `explode` turns the Dataset of lines into a Dataset of words; `groupBy` and `count` then compute the per-word counts
>>> wordCounts = textFile.select(explode(split(textFile.value, "\s+")).alias("word")).groupBy("word").count()
# fetch the counts with `collect()`
>>> wordCounts.collect()
# (collect() output truncated; the full result lists every distinct word in README.md)
[Row(word='online', count=1), Row(word='graphs', count=1), Row(word='["Parallel', count=1), Row(word='["Building', count=1), Row(word='thread', count=1), Row(word='documentation', count=3), Row(word='command,', count=2), ..., Row(word='the', count=24), Row(word='Spark', count=16), Row(word='And', count=1), Row(word='distribution', count=1)]
# Spark also supports pulling datasets into a cluster-wide in-memory cache. This is useful when data is accessed repeatedly, e.g. when querying a small "hot" dataset or running an iterative algorithm like PageRank. Below we mark our linesWithSpark dataset to be cached:
>>> linesWithSpark.cache()
DataFrame[value: string]
>>> linesWithSpark.count()
20
>>> linesWithSpark.count()
20

These functions work even on very large datasets, including ones partitioned across tens or hundreds of nodes.
Using Spark from Python in more depth is covered in the follow-up post, Python实战Spark.
See also the official Spark documentation.

For problems you may run into while setting things up or running jobs, see the Spark common-issues roundup (Spark常见问题汇总).
