Integrate Apache Spark with latest IPython Notebook (Jupyter 4.x)

Posted on December 24, 2015 | Topics: python, spark, ipython, jupyter, spark-redshift

As you may already know, Apache Spark is possibly the most popular engine right now for large-scale data processing, while IPython Notebook is a prominent front-end for, among other things, sharable exploratory data analysis. However, getting them to work with each other requires some additional steps, especially since the latest IPython Notebook has been moved under the Jupyter umbrella and doesn’t support profiles anymore.

Table of Contents

  • I. Prerequisites
  • II. PySpark with IPython Shell
  • III. PySpark with Jupyter Notebook
  • IV. Bonus: Spark-Redshift

I. Prerequisites

This setup has been tested with the following software:

  • Apache Spark 1.5.x
  • IPython 4.0.x (the interactive IPython shell & a kernel for Jupyter)
  • Jupyter 4.0.x (a web notebook on top of IPython kernel)
$ pyspark --version
$ ipython --version
$ jupyter --version

You will need to set an environment variable as follows:

  • SPARK_HOME: this is where the Spark executables reside. For example, if you are on OS X and installed Spark via Homebrew, add this to your .bashrc or whichever rc file you use. This path differs between environments and installations, so if you don’t know where it is, Google is your friend.
$ echo "export SPARK_HOME='/usr/local/Cellar/apache-spark/1.5.2/libexec/'" >> ~/.bashrc
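Before going further, it can help to confirm that SPARK_HOME points at a real Spark installation. The following is a minimal sketch and not part of the original setup: the helper name check_spark_home is made up for illustration, and it assumes the usual Spark layout with a bin/pyspark launcher and a python directory under SPARK_HOME.

```python
import os


def check_spark_home(spark_home):
    """Return a list of problems found with the given SPARK_HOME value."""
    if not spark_home:
        return ["SPARK_HOME is not set"]
    if not os.path.isdir(spark_home):
        return ["SPARK_HOME is not a directory: %s" % spark_home]
    problems = []
    # Assumed layout: <SPARK_HOME>/bin/pyspark and <SPARK_HOME>/python
    for relative in (os.path.join("bin", "pyspark"), "python"):
        if not os.path.exists(os.path.join(spark_home, relative)):
            problems.append("missing %s under %s" % (relative, spark_home))
    return problems


# An empty list means the checks above found nothing wrong.
print(check_spark_home(os.environ.get("SPARK_HOME")))
```

Run it in a fresh terminal so that the value exported in your rc file is the one being checked.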

II. PySpark with IPython Shell

The following is adapted from Cloudera. I have removed some steps that are unnecessary if you just want to get up and running quickly.

1. Step 1: Create an ipython profile

$ ipython profile create spark

# Possible outputs
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/'
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/'

2. Step 2: Create a startup file for this profile

The goal is to have a startup file for this profile so that every time you launch an IPython interactive shell session, it loads the Spark context for you.

$ touch ~/.ipython/profile_spark/startup/

A minimal working version of this file is

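import os
import sys

# Locate the Spark installation through the SPARK_HOME environment variable
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# Make Spark's Python sources and bundled libraries importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/'))
# Run the PySpark startup script, which creates the SparkContext
execfile(os.path.join(spark_home, 'python/pyspark/'))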
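Note that execfile exists only in Python 2. If the profile runs under Python 3, one possible stand-in is sketched below. run_script is a hypothetical helper name, not something IPython or Spark provides.

```python
def run_script(path, namespace=None):
    """Execute the Python file at `path`, similar to Python 2's execfile."""
    if namespace is None:
        namespace = {}
    with open(path) as handle:
        source = handle.read()
    # Compile with the real filename so tracebacks point at the script.
    exec(compile(source, path, "exec"), namespace)
    return namespace
```

In the startup file, the execfile(...) call would then become run_script(<same path>, globals()), so that names created by the script, such as sc, end up in the interactive namespace.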

Verify that this works:

$ ipython --profile=spark

You should see a welcome screen similar to this, with the SparkContext object pre-created:

Fig. 1 | Spark welcome screen

III. PySpark with Jupyter Notebook

After getting Spark to work with the IPython interactive shell, the next step is to get it to work with the Jupyter Notebook. Unfortunately, since the big split, Jupyter Notebook doesn’t support IPython profiles out of the box anymore. To reuse the profile we created earlier, we are going to provide a modified IPython kernel for any Spark-related notebook. The strategy is described here, but it has some unnecessary boilerplate and outdated information, so here is an improved version:

1. Preparing the kernel spec

IPython kernel specs reside in ~/.ipython/kernels, so let’s create a spec for spark:

$ mkdir -p ~/.ipython/kernels/spark
$ touch ~/.ipython/kernels/spark/kernel.json

with the following content:

    "display_name": "PySpark (Spark 1.5.2)",
    "language": "python",
    "argv": [

Some notes: if you are using a virtual environment, change the python entry point to your virtual environment’s, e.g. mine is ~/.virtualenvs/machine-learning/bin/python
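For reference, a complete kernel spec can also be generated with a short script. This is a sketch under assumptions rather than the original file: it assumes the kernel is started through the ipykernel module with the spark profile passed as --profile=spark, and the helper names build_kernel_spec and write_kernel_spec, the default interpreter path, and the exact argument list are all illustrative.

```python
import json
import os


def build_kernel_spec(python_path="python", profile="spark",
                      display_name="PySpark (Spark 1.5.2)"):
    """Build a kernel spec dict that starts ipykernel with an IPython profile."""
    return {
        "display_name": display_name,
        "language": "python",
        # Assumed launch command: run the ipykernel module with the chosen
        # profile; Jupyter substitutes {connection_file} when it starts the kernel.
        "argv": [
            python_path, "-m", "ipykernel",
            "--profile=%s" % profile,
            "-f", "{connection_file}",
        ],
    }


def write_kernel_spec(kernel_dir, spec):
    """Write `spec` as kernel.json inside `kernel_dir`, creating it if needed."""
    if not os.path.isdir(kernel_dir):
        os.makedirs(kernel_dir)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=4)
    return path
```

Calling write_kernel_spec(os.path.expanduser("~/.ipython/kernels/spark"), build_kernel_spec()) would write the file to the location used above. Pass your virtual environment's interpreter as python_path if you use one.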

2. Profit

Now simply launch the notebook with

$ jupyter notebook
# ipython notebook works too

When creating a new notebook, select the PySpark kernel and go wild :)

Fig. 2 | Select PySpark kernel for a new Jupyter Notebook
Fig. 3 | Example of spark interacting with matplotlib

IV. Bonus: Spark-Redshift

Amazon Redshift is a popular choice for data warehousing and analytics. Now you can easily load data from your Redshift cluster into Spark’s native DataFrame using a Spark package called Spark-Redshift. To hook it up with our Jupyter notebook setup, add this to the kernel file:

    "env": {
        "PYSPARK_SUBMIT_ARGS": "--jars </path/to/redshift/jdbc.jar> --packages com.databricks:spark-redshift_2.10:0.5.2,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"

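The PYSPARK_SUBMIT_ARGS value is one long string, so it can be easier to assemble it from its parts before pasting it into the kernel file. Below is a small sketch. build_submit_args is an illustrative helper, and the JDBC jar path in the usage is a placeholder.

```python
def build_submit_args(jars, packages):
    """Assemble a PYSPARK_SUBMIT_ARGS value from jar paths and package coordinates."""
    parts = []
    if jars:
        parts += ["--jars", ",".join(jars)]
    if packages:
        parts += ["--packages", ",".join(packages)]
    # The value shown in the kernel file ends with "pyspark-shell", so keep it last.
    parts.append("pyspark-shell")
    return " ".join(parts)


args = build_submit_args(
    jars=["/path/to/redshift/jdbc.jar"],  # placeholder path
    packages=[
        "com.databricks:spark-redshift_2.10:0.5.2",
        "com.amazonaws:aws-java-sdk-pom:1.10.34",
        "org.apache.hadoop:hadoop-aws:2.6.0",
    ],
)
print(args)
```

Keeping the package coordinates in a list makes it harder to lose a comma or a space when a version number changes.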
Please note that you need a JDBC driver for Redshift, which can be downloaded here. The last tricky thing to note is that the package uses Amazon S3 as the transport medium, so you will need to configure the SparkContext object with your AWS credentials. This can be done at the top of your notebook with:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
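Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a sketch, the helpers below read the credentials from environment variables and then load a table through spark-redshift. The helper names, the environment variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and the JDBC URL, table, and S3 bucket in the usage comment are all placeholders. The com.databricks.spark.redshift format with its url, dbtable, and tempdir options follows the package's documented usage, and sqlContext is assumed to be available as it is in the PySpark shell.

```python
import os


def set_s3_credentials(sc, environ=os.environ):
    """Copy AWS keys from environment variables into the Hadoop configuration."""
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", environ["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", environ["AWS_SECRET_ACCESS_KEY"])


def load_redshift_table(sql_context, jdbc_url, table, tempdir):
    """Load a Redshift table into a DataFrame, staging the data in S3."""
    return (sql_context.read
            .format("com.databricks.spark.redshift")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .option("tempdir", tempdir)
            .load())


# Usage inside the notebook (all values are placeholders):
# set_s3_credentials(sc)
# df = load_redshift_table(sqlContext,
#                          "jdbc:redshift://host:5439/db?user=USER&password=PASS",
#                          "my_table", "s3n://my-bucket/tmp/")
```

A missing environment variable raises a KeyError right away, which is easier to diagnose than a failed S3 request later on.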

Whew! I admit it’s a bit long but totally worth the trouble. Spark + IPython on top of Redshift is a very formidable combination for exploring your data at scale.
