Integrate Apache Spark with latest IPython Notebook (Jupyter 4.x)

转载 2016年05月30日 10:51:21

Posted on December 24, 2015 | Topics:python, spark , ipython , jupyter , spark-redshift

As you may already know, Apache Spark is possibly the most popular engine right now for large-scale data processing, while IPython Notebook is a prominent front-end for, among other things, sharable exploratory data analysis. However, getting them to work with each other requires some additional step, especially since latest IPython Notebook has been moved under the Jupyter umbrella and doesn’t support profile anymore.

Table of Contents

  • I. Prerequisites
  • II. PySpark with IPython Shell
  • III. PySpark with Jupyter Notebook
  • IV. Bonus: Spark Redshift

I. Prerequisites

This setup has been tested with the following software:

  • Apache Spark 1.5.x
  • IPython 4.0.x (the interactive IPython shell & a kernel for Jupyter)
  • Jupyter 4.0.x (a web notebook on top of IPython kernel)
$ pyspark --version
$ ipython --version
$ jupyter --version

You will need to set an environment variable as follows:

  • SPARK_HOME : this is where the spark executables reside. For example, if you are on OSX and install Spark via homebrew, add this to your .bashrc or whatever.rc you use. This path differs between environment and installation, so if you don’t know where it is, Google is your friend.
$ echo "export SPARK_HOME='/usr/local/Cellar/apache-spark/1.5.2/libexec/'" >> ~/.bashrc

II. PySpark with IPython Shell

The following is adapted from Cloudera . I have removed some unnecessary steps if you just want to get up and running very quickly.

1. Step 1: Create an ipython profile

$ ipython profile create pyspark

# Possible outputs
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/'
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/'

2. Step 2: Create a startup file for this profile

The goal is to have a startup file for this profile so that everytime you launch an IPython interactive shell session, it loads the spark context for you.

$ touch ~/.ipython/profile_spark/startup/

A minimal working version of this file is

import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/'))
execfile(os.path.join(spark_home, 'python/pyspark/'))

Verify that this works

$ ipython --profile=spark

You should see a welcome screen similar to this with the SparkContext object pre-created

Fig. 1 | Spark welcome screen

III. PySpark with Jupyter Notebook

After getting spark to work with IPython interactive shell, the next step is to get it to work with the Jupyter Notebook. Unfortunately, since the big split , Jupyter Notebook doesn’t support IPython profile out of the box anymore. To reuse the profile we created earlier, we are going to provide a modified IPython kernel for any spark-related notebook. The strategy is described here but it has some unnecessary boilerplates/outdated information, so here is an improved version:

1. Preparing the kernel spec

IPython kernel specs reside in ~/.ipython/kernels , so let’s create a spec for spark:

$ mkdir -p ~/.ipython/kernels/spark
$ touch ~/.ipython/kernels/spark/kernel.json

with the following content:

    "display_name": "PySpark (Spark 1.5.2)",
    "language": "python",
    "argv": [

Some notes: If you are using a virtual environment, change the python entry point to your virtualenvironment’s, e.g. mine is ~/.virtualenvs/machine-learning/bin/python

2. Profit

Now simply launch the notebook with

$ jupyter notebook
# ipython notebook works too

When creating a new notebook, select the PySpark kernel and go wild :)

Fig. 2 | Select PySpark kernel for a new Jupyter Notebook
Fig. 3 | Example of spark interacting with matplotlib

IV. Bonus: Spark-Redshift

Amazon Redshift is a popular choice for Data Warehousing and Analytics Database. Now you can easily load data from your Redshfit cluster into Spark’s native DataFrame using a spark package called Spark-Redshift . To hook it up with our jupyter notebook setup, add this to the kernel file

    "env": {
        "PYSPARK_SUBMIT_ARGS": "--jars </path/to/redshift/jdbc.jar> --packages com.databricks:spark-redshift_2.10:0.5.2,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"

Please note that you need a JDBC drive for Redshfit, which can be downloadedhere . The last tricky thing to note is that the package uses Amazon S3 as the transportation medium, so you will need to configure the spark context object with your AWS credentials. This could be done on top of your notebook with:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

Whew! I admit it’s a bit long but totally worth the trouble. Spark + IPython on top of Redshift is a very formidable combination for exploring your data at scale.



Linux安装远程ipython notebook

配置服务器的ipython,这样就可以通过浏览器连接远程ipython进行数据分析和其他的操作了。 这里以虚拟机中的ubuntu为例,用virtualbox安装ubuntu,安装ssh,xshell...

Jupyter Notebook远程服务器配置

Jupyter Notebook远程服务器配置

精选:深入理解 Docker 内部原理及网络配置

网络绝对是任何系统的核心,对于容器而言也是如此。Docker 作为目前最火的轻量级容器技术,有很多令人称道的功能,如 Docker 的镜像管理。然而,Docker的网络一直以来都比较薄弱,所以我们有必要深入了解Docker的网络知识,以满足更高的网络需求。

远程链接Ipython Notebook配置

工具帖(昨天为了测试spark性能,打算跑一些数据分析,远程安装ipython的时候不小心在home下rm -rf *,真的是… due前五个小时… 更糟糕的是我们正好这一次忘记把代码push到git...

[pySpark][笔记]spark tutorial from spark official site在ipython notebook 下学习pySpark

+ Spark Tutorial: Learning Apache SparkThis tutorial will teach you how to use Apache Spark, a frame...



pyspark使用 jupyter ,matplotlib, ipython

export PYSPARK_DRIVER_PYTHON=jupyter export IPYTHON=1 export PYSPARK_DRIVER_PYTHON_OPTS="jupyter not...

在Jupyter notebook中配置和使用spark

步骤1:安装jupyter 这里安装集成环境包Anaconda 下载地址及安装方法: 步骤2: 下载spark http://sp...


效果图 简介 Spark Kernel的安装 Spark Kernel旧的项目 Toree新项目 Spark组件单独安装 Scala Kernel的安装 PySpark的安装效果图无图无真相,以下是运...

详解 jupyter notebook 集成 spark 环境安装

来自: 代码大湿 代码大湿1 相关介绍 jupyter notebook是一个Web应用程序,允许你创建和分享,包含活的代码,方程的文件,可视化和解释性文字。用途包括:数据的...


转自: 我们知道在Linux下网卡被称为eth0,eth1,eth2.....,所有网卡的...