How to Connect a Jupyter Notebook to a Remote Spark Cluster and Run Spark Jobs Every Day?

As a data scientist, you are developing notebooks that use Spark to process data too large to fit on your laptop. What would you do? This is not a trivial problem.

Let’s start with the most naive solutions, which don’t require installing anything on your laptop:

  1. “No notebook”: SSH into the remote cluster and use the Spark shell there (a quick sketch of this workflow follows the list).
  2. “Local notebook”: downsample the data and pull it down to your laptop.
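
To make option 1 concrete, here is a minimal sketch of a remote Spark shell session. The SSH host and HDFS path are hypothetical placeholders, not details from the article:

```python
# On your laptop, SSH into the cluster and start the shell, e.g.:
#   ssh you@cluster-master
#   pyspark --master yarn
# Inside the pyspark shell, a SparkSession named `spark` already exists.
df = spark.read.parquet("hdfs:///data/events")  # hypothetical path
df.groupBy("event_type").count().show()         # output is plain terminal text
```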

The problem with “No notebook” is that the developer experience in the Spark shell is unacceptable:

  1. You cannot easily change the code and see the results printed the way you can in a Jupyter or Zeppelin notebook.
  2. It is hard to show images/charts from a shell.
  3. It is painful to do version control with git on a remote machine, because you have to set everything up from scratch and run git operations like git diff over SSH.

The second option is “Local notebook”: you have to downsample the data and pull it to your laptop (downsampling: if you have 100GB of data on your cluster, you shrink it to, say, 1GB without losing too much important information). Then you can process the data in your local Jupyter notebook.
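
As a rough illustration (my sketch, not code from the article), the downsampling step in PySpark could look like this; the paths, the 1% fraction, and the seed are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("downsample").getOrCreate()

# Read the full dataset on the cluster (hypothetical path), keep ~1% of
# the rows, and write the sample to a single file you can pull down.
full_df = spark.read.parquet("hdfs:///data/events")
sample_df = full_df.sample(withReplacement=False, fraction=0.01, seed=42)
sample_df.coalesce(1).write.mode("overwrite").parquet("hdfs:///tmp/events_sample")
```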

But it creates a few new painful problems:

  1. You have to write extra code to downsample the data.
  2. Downsampling can lose vital information about the data, especially when you are working on visualizations or machine learning models.
  3. You have to spend extra hours making sure your code also works on the original data. If it doesn’t, it takes even more hours to figure out what’s wrong.
  4. You have to guarantee that the local development environment is the same as the remote cluster. If it isn’t, the setup is error-prone and may cause data issues that are hard to detect.

OK, “No notebook” and “Local notebook” are obviously not the best approaches. What if your data team has access to the cloud, e.g. AWS? Yes, AWS provides Jupyter notebooks on its EMR clusters and in SageMaker. The notebook server is accessed through the AWS web console and is ready to use as soon as the clusters are.

This approach is called “Remote notebook on a cloud”.

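Inside such a cloud-hosted notebook the Spark connection is preconfigured by the platform, so your first cell can typically use the cluster right away. A hedged sketch, assuming a preprovisioned PySpark kernel and a hypothetical S3 path:

```python
from pyspark.sql import SparkSession

# On an EMR/SageMaker-style notebook the kernel is already wired to the
# cluster; getOrCreate() returns the session the platform set up (or
# builds one if you are running elsewhere).
spark = SparkSession.builder.appName("remote-notebook-demo").getOrCreate()

df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical bucket
df.groupBy("event_type").count().show()
```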
