Introduction into Apache Spark


Last time we reviewed the wonderful Vowpal Wabbit tool, which is useful when you have to train on samples that do not fit into RAM. Recall that the distinguishing feature of that tool is that it builds, first of all, linear models (which, by the way, have good generalizing ability), while high quality is achieved through feature selection and generation, regularization, and other additional techniques. Today we will look at a tool that is more popular and designed for processing large amounts of data: Apache Spark.

We will not go into the details of the history of this tool or its internal structure; let's focus on practical things. In this article we will look at the basic operations and concepts of Spark, and next time we will take a closer look at the MLlib machine learning library, as well as GraphX for processing graphs (the author of this post mainly uses Spark for the latter: this is exactly the case when a graph often needs to be kept in RAM on a cluster, while Vowpal Wabbit is often enough for machine learning). There will not be a lot of code in this tutorial, because it discusses the basic concepts and philosophy of Spark. In the next articles (about MLlib and GraphX) we will take some dataset and take a closer look at Spark in practice.

Let's say right away that Spark natively supports Scala, Python, and Java. We will consider examples in Python, since it is very convenient to work directly in an IPython Notebook: pulling a small part of the data from the cluster and processing it, for example, with the Pandas package turns out to be a pretty convenient combination.

So, let's start with the fact that the main concept in Spark is the RDD (Resilient Distributed Dataset), which is a dataset on which you can perform two types of operations, transformations and actions (and, accordingly, all work with these structures consists of a sequence of these two kinds of operations).


Transformations

The result of applying this kind of operation to an RDD is a new RDD. As a rule, these are operations that transform the elements of a given dataset in some way. Here is an incomplete list of the most common transformations, each of which returns a new dataset (RDD):

.map(function): applies function to each element of the dataset

.filter(function): returns all elements of the dataset for which function returned a true value

.distinct([numTasks]): returns a dataset that contains the unique elements of the original dataset

It is also worth noting the operations on sets, whose meaning is clear from their names:

.union(otherDataset)

.intersection(otherDataset)

.cartesian(otherDataset): the new dataset contains all possible pairs (A, B), where the first element belongs to the original dataset and the second to the argument dataset
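
Here is a minimal sketch of these transformations, assuming sc is an existing SparkContext (in the PySpark shell it is available automatically, as discussed below) and using two small made-up datasets. Note that transformations are lazy: nothing is actually computed until one of the actions described next is applied.

numbers = sc.parallelize([1, 2, 2, 3, 4, 5, 6])
others = sc.parallelize([4, 5, 6, 7])

squares = numbers.map(lambda x: x * x)           # square every element
evens = numbers.filter(lambda x: x % 2 == 0)     # keep only even elements
unique = numbers.distinct()                      # drop the duplicate 2
both = numbers.union(others)                     # all elements of both datasets
common = numbers.intersection(others)            # elements present in both
pairs = numbers.cartesian(others)                # all pairs (A, B)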

Actions

Actions are used when it is necessary to materialize the result: as a rule, to save data to disk or to output part of the data to the console. Here is a list of the most common actions that can be applied to an RDD:

.saveAsTextFile(path): saves the data to a text file (on HDFS, on the local machine, or on any other supported file system; see the documentation for the complete list)

.collect(): returns the elements of the dataset as an array. As a rule, this is used when the dataset already holds little data (various filters and transformations have been applied) and visualization or further analysis is needed, for example with the Pandas package

.take(n): returns the first n elements of the dataset as an array

.count(): returns the number of items in the dataset

.reduce(function): a familiar operation for anyone who knows MapReduce. It follows from the mechanics of this operation that function (which takes 2 arguments as input and returns one value) must be commutative and associative
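
A minimal sketch of these actions, continuing with the squares RDD from the previous sketch (the output path below is just a placeholder):

print(squares.take(3))                         # first 3 elements as a local list
print(squares.collect())                       # the whole (small!) dataset as a local list
print(squares.count())                         # number of elements
print(squares.reduce(lambda a, b: a + b))      # sum: '+' is commutative and associative
squares.saveAsTextFile("hdfs:///tmp/squares")  # hypothetical output path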

These are the basics that you need to know when working with the tool. Now let's do a little practice and show how to load data into Spark and do simple calculations with it.

When Spark starts, the first thing to do is to create a SparkContext (in simple terms, this is the object responsible for the lower-level operations on the cluster; see the documentation for details), which at startup of the Spark shell is created automatically and is available immediately as the sc object.
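
For completeness, here is a minimal sketch of creating a SparkContext yourself in a standalone PySpark program; the application name and master URL are placeholder assumptions:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark-intro").setMaster("local[*]")  # placeholder settings
sc = SparkContext(conf=conf)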

Loading data

There are two ways to load data into Spark:

a) Directly from a local program, using the .parallelize(data) function

localData = [5,7,1,12,10,25]
ourFirstRDD = sc.parallelize(localData)

b) From supported storage systems (e.g. HDFS), using the .textFile(path) function

ourSecondRDD = sc.textFile("path to some data on the cluster")

At this point, it is important to note one feature of data storage in Spark, and at the same time its most useful function, .cache() (partly thanks to which Spark has become so popular), which allows you to cache data in RAM (subject to the availability of the latter). This makes it possible to perform iterative computations in memory and thereby get rid of the IO overhead. This is especially important for machine learning and graph computations, since most algorithms are iterative, ranging from gradient methods to algorithms such as PageRank.
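
A small sketch of how .cache() is typically used, assuming ourSecondRDD from above holds text lines; the parsing logic is purely illustrative:

parsed = ourSecondRDD.map(lambda line: line.split(";"))
parsed.cache()    # keep the parsed RDD in memory once it has been computed
parsed.count()    # the first action reads from disk and fills the cache
parsed.take(5)    # later actions reuse the cached partitions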

Working with data

After loading the data into an RDD, we can apply to it the various transformations and actions mentioned above. For example:

Let's look at the first few elements:

for item in ourRDD.top(10):
    print(item)

Or we can immediately load these elements into Pandas and work with a DataFrame:

import pandas as pd

# here ourRDD is assumed to hold text lines with ";"-separated fields
pd.DataFrame(ourRDD.map(lambda x: x.split(";")).top(10))

In general, as you can see, Spark is so convenient that there is probably no point in writing out many further examples; we can simply leave this as an exercise for the reader: many computations can be written literally in a few lines.
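
Still, as one small illustration of "a few lines", here is the classic word count, a sketch assuming ourSecondRDD was loaded with sc.textFile() as above (flatMap and reduceByKey are additional built-in RDD operations not listed earlier):

wordCounts = (ourSecondRDD
              .flatMap(lambda line: line.split())   # split lines into words
              .map(lambda word: (word, 1))          # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(wordCounts.take(10))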

Finally, we will show just one more example, namely computing the maximum and minimum elements of our dataset. As you might guess, this can be done, for example, using the .reduce() function:

localData = [5,7,1,12,10,25]
ourRDD = sc.parallelize(localData)
print(ourRDD.reduce(max))
print(ourRDD.reduce(min))

So, we have covered the basic concepts necessary to work with the tool. We did not consider working with SQL, nor working with <key, value> pairs (which is easy: it is enough to first apply to the RDD, for example, a map that extracts a key, and then use the built-in functions such as sortByKey, countByKey, join, etc.); the reader is invited to get acquainted with this on his own (a short sketch follows below), and to write in the comments if any questions arise. As already noted, next time we will look in detail at MLlib and, separately, at GraphX.
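
Here is a small sketch of those <key, value> operations, using a made-up RDD of (user, value) pairs:

purchases = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 7)])
visits = sc.parallelize([("alice", 10), ("bob", 2)])

print(purchases.sortByKey().collect())   # pairs ordered by key
print(dict(purchases.countByKey()))      # {'alice': 2, 'bob': 1}
print(purchases.join(visits).collect())  # e.g. [('alice', (3, 10)), ('alice', (7, 10)), ('bob', (5, 2))]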

Source: https://habr.com/ru/


Translated from: https://medium.com/swlh/introduction-into-apache-spark-8382874b682f
