Machine Learning With Spark Study Notes

These are notes I took while reading Machine Learning With Spark. Corrections are welcome wherever the translation is inaccurate or mistaken.

Spark Clusters

A Spark cluster consists of two kinds of processes: one driver program and multiple executors. In local mode, all of these processes run inside the same Java virtual machine; on a cluster, they typically run across multiple nodes.

For example, a cluster running in Spark's standalone mode has the following characteristics:
1. A master node that runs the Spark standalone master process as well as the driver program.
2. A set of worker nodes, each running an executor process.

The Spark Programming Model

SparkContext

SparkContext (JavaSparkContext in Java) is the starting point for writing a Spark program. A SparkContext is initialized with an instance of a SparkConf object, which holds the configuration for the Spark cluster, such as the URL of the master node.

Once it is initialized, we can use the many methods of the SparkContext object to create and manipulate distributed datasets and shared variables.

SparkShell

The Spark shell takes care of initializing the context for you. Here is an example of creating a context in Scala in local mode:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test Spark App").setMaster("local[4]")
val sc = new SparkContext(conf)

This creates a context running in local mode with four threads, with the application name Test Spark App.

Spark Shell

From the Spark root directory, run ./bin/spark-shell to start the Spark shell.
To use the Python shell instead, run ./bin/pyspark.

Resilient Distributed Datasets (RDDs)

An RDD is a collection of records; strictly speaking, it is a collection of objects of some type, distributed or partitioned across the many nodes of a cluster.

RDDs in Spark are fault tolerant: if a node or task fails for a reason other than incorrect code, such as a hardware fault or lost communication, the RDD can be reconstructed automatically on the remaining nodes and the job will still complete.

Creating RDDs

An RDD can be created from a collection, as follows:

val collection = List("a", "b", "c", "d", "e")
// distribute the local collection across the cluster as an RDD
val rddFromCollection = sc.parallelize(collection)

RDDs can also be created from Hadoop-based input sources, including the local filesystem, HDFS, and so on.
Hadoop-based RDDs can use any data format that implements Hadoop's InputFormat interface, including text files, other standard Hadoop formats, HBase, Cassandra, and more. Creating an RDD from a local text file looks like this:

val rddFromTextFile = sc.textFile("LICENSE")
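
The same textFile method accepts any Hadoop-supported URI; the HDFS path below is a hypothetical example for illustration only:

val rddFromHdfs = sc.textFile("hdfs://namenode:9000/path/to/data.txt") // hypothetical HDFS URI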

Spark Operations

Once we have created an RDD, we have a distributed dataset we can operate on. In Spark's programming model, operations are split into transformations and actions. Broadly speaking, a transformation applies some function to the dataset to produce transformed data, while an action runs some computation or aggregation and returns the result to the driver program where the SparkContext is running.

The most common operation in Spark is map, which maps the input to some other form of output by applying a function to each record, as follows:

val intsFromStringRDD = rddFromTextFile.map(line => line.size)

The left side of => is the input, and the right side is the output.

Generally, Spark operations (apart from most actions) return a new RDD, so we can chain operations together, which keeps programs concise and clear. For example:

val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).
  sum / rddFromTextFile.count
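
For comparison, here is the same computation without chaining, as a minimal sketch that names the intermediate RDD explicitly:

val lineLengths = rddFromTextFile.map(line => line.size) // transformation: RDD of per-line lengths
val aveLengthOfRecord = lineLengths.sum / rddFromTextFile.count // actions: sum and count trigger computation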

Spark's transformations are lazy: calling a transformation does not trigger computation immediately. Instead, transformations are chained together and only computed when an action is called. This allows Spark to be more efficient, returning results to the driver program only when necessary, so that most operations run in parallel on the cluster.

This means that if a Spark program never calls an action, no actual computation is ever triggered and no result is returned.
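
A minimal sketch of this behavior, reusing the rddFromTextFile created earlier:

val lengths = rddFromTextFile.map(line => line.size) // transformation only: nothing runs yet
val totalLength = lengths.sum // action: triggers evaluation of the whole chain and returns a result to the driver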

Caching RDDs

One of Spark's most powerful features is the ability to cache data in memory across a cluster, which is done by calling the cache method on an RDD.
Calling cache tells Spark to keep the RDD in memory. The first time an action is called, the computation is kicked off, the data is read from its source, and it is stored in memory. So the time this first call takes is largely determined by the time spent reading the data from the source. The next time the same data is accessed, for example in the queries used by iterative analyses in machine learning, it can be read directly from memory, avoiding expensive I/O and speeding up the computation.
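
A minimal sketch of this pattern:

rddFromTextFile.cache() // mark the RDD for in-memory caching; no computation happens yet
rddFromTextFile.count() // first action: reads from the source and populates the cache
rddFromTextFile.count() // second action: served from memory, no disk I/O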

Broadcast Variables and Accumulators

Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators.

A broadcast variable is a read-only variable that makes a variable in the driver program, where the SparkContext object lives, available to the nodes for their computations.
This is useful when the same data needs to be delivered efficiently to the worker nodes, as in many machine learning algorithms. In Spark, creating a broadcast variable is as simple as calling a method on SparkContext, as follows:

val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
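
On the worker side, the broadcast data is read through the value accessor. A minimal usage sketch:

val rdd = sc.parallelize(List(1, 2, 3))
// each task reads the shared list via .value instead of shipping a copy inside every closure
val results = rdd.map(i => broadcastAList.value.size + i).collect()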

An accumulator is also a variable that is broadcast to the worker nodes, but unlike a broadcast variable, which is read-only, an accumulator can be added to. This comes with a restriction: the addition must be an associative operation, so that the globally accumulated value can be computed correctly in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can read the global value.
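
A minimal sketch, using the accumulator API from the Spark 1.x era this book is written against (newer versions expose sc.longAccumulator instead):

val acc = sc.accumulator(0) // created on the driver
sc.parallelize(1 to 100).foreach(x => acc += x) // worker tasks may only add to it
println(acc.value) // only the driver can read the global total: 5050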
