Learning Spark Study Notes, Chapter 2: Downloading Spark and Getting Started


There are few books on Spark, and almost none in Chinese; the translations that do exist are a mess. So, despite my poor English, I am working through the English edition with a dictionary open, starting with Learning Spark. These notes simply record my learning process, writing down the useful bits so they are easy to look up later.


Chapter 2 covers downloading Spark and getting started: download the release and extract the tarball. Spark ships two interactive shells, the Spark shell and the PySpark shell, corresponding to the two languages Scala and Python. Since Spark itself is written in Scala, and I have studied neither language, I am naturally starting with Scala.

From the extracted Spark directory, run:

bin/pyspark to start the Python shell

bin/spark-shell to start the Scala shell

Startup prints a lot of log output. To make the shell easier to read, suppress the unnecessary logging:

Copy conf/log4j.properties.template to conf/log4j.properties, find the line log4j.rootCategory=INFO, console, and change it to log4j.rootCategory=WARN, console.
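Concretely, from the Spark root directory (a minimal sketch of the steps just described):

cp conf/log4j.properties.template conf/log4j.properties

# then edit conf/log4j.properties, changing
#   log4j.rootCategory=INFO, console
# to
#   log4j.rootCategory=WARN, console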


An example in the Scala shell:

scala> val lines = sc.textFile("README.md")

// Create an RDD named lines from the file README.md (RDDs, an important Spark concept, are covered in a later chapter). Each line of README.md becomes one item in a dataset, and we can run a series of parallel computations over that dataset.

scala> lines.count()

// Count the items in the RDD lines, i.e. the number of lines in README.md. The result:

res0: Long = 127

scala> lines.first()

// first() returns the first item of the RDD, here the first line of README.md. The result:

res1: String = # Apache Spark

That is a simple, complete Spark workflow. Exit the shell with Ctrl-D.


The figure above shows the program running on a cluster. The driver program is a key part of a Spark application: it contains your application's main function, defines distributed datasets (RDDs) on the cluster, and applies a series of operations to them. All of this goes through a SparkContext object (the sc in the session above), which connects to the compute cluster. The driver typically runs on the name node, while the parallel tasks are spread across the data nodes for computation. Once you have sc, you can use it to create RDDs. Consider the count() on lines: if README.md were a very large file and Spark were deployed on Hadoop YARN, the file would be split into many blocks distributed across different data nodes, and the driver program would coordinate the executors on those data nodes, each computing over its own block of the data.

We can also run operations such as the following on lines:

scala> val pythonLines = lines.filter(line => line.contains("Python")) // returns a new RDD containing only the lines with the word "Python" (see a Scala tutorial if the syntax is unfamiliar)
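The filtered result is itself an RDD, so the same operations apply to it. As a quick check (the exact output depends on the contents of your README.md, so no results are shown here):

scala> pythonLines.count()

scala> pythonLines.first()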


Next is the standalone word count program. How to write, build, package, and submit it is not covered here; the focus of later chapters is cluster mode. The word count Scala code follows and is easy to understand (wrapped here as a runnable application that takes the input and output paths from the command line):

// Imports needed for a standalone application.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Input and output paths from the command line.
    val inputFile = args(0)
    val outputFile = args(1)
    // Create a Scala Spark Context.
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    // Load our input data.
    val input = sc.textFile(inputFile)
    // Split it up into words.
    val words = input.flatMap(line => line.split(" "))
    // Transform into pairs and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // Save the word count back out to a text file, causing evaluation.
    counts.saveAsTextFile(outputFile)
  }
}
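For reference (packaging and submission are deliberately skipped in these notes), a jar built from this code would typically be run with spark-submit; the jar and path names below are hypothetical:

bin/spark-submit --class WordCount wordcount.jar input.txt output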


Working through this chapter made me realize my English still needs a lot of work: I can understand the text, but my attempts to translate it come out a mess. My writing needs practice!

