Integrate Apache Spark with latest IPython Notebook (Jupyter 4.x)

Reposted on May 30, 2016, 10:51:21

Posted on December 24, 2015 | Topics: python, spark, ipython, jupyter, spark-redshift

As you may already know, Apache Spark is possibly the most popular engine right now for large-scale data processing, while IPython Notebook is a prominent front-end for, among other things, sharable exploratory data analysis. However, getting them to work with each other requires some additional steps, especially since the latest IPython Notebook has been moved under the Jupyter umbrella and no longer supports profiles.

Table of Contents

  • I. Prerequisites
  • II. PySpark with IPython Shell
  • III. PySpark with Jupyter Notebook
  • IV. Bonus: Spark-Redshift

I. Prerequisites

This setup has been tested with the following software:

  • Apache Spark 1.5.x
  • IPython 4.0.x (the interactive IPython shell & a kernel for Jupyter)
  • Jupyter 4.0.x (a web notebook on top of the IPython kernel)

To check which versions you have installed:

$ pyspark --version
$ ipython --version
$ jupyter --version

You will need to set an environment variable as follows:

  • SPARK_HOME: this is where the Spark executables reside. For example, if you are on OS X and installed Spark via Homebrew, add this to your .bashrc or whichever rc file you use. This path differs between environments and installations, so if you don’t know where it is, Google is your friend.

$ echo "export SPARK_HOME='/usr/local/Cellar/apache-spark/1.5.2/libexec/'" >> ~/.bashrc
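
Before moving on, it can help to confirm that the variable is actually visible to Python (a freshly edited .bashrc only takes effect in a new shell or after sourcing it). A minimal sketch, assuming a Spark distribution keeps its Python bindings in a python directory under SPARK_HOME; the helper name check_spark_home is made up for illustration:

```python
import os


def check_spark_home(environ=os.environ):
    """Return the Spark home directory, or raise if it looks unusable."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set in this environment")
    # Assumption: the PySpark sources live in <SPARK_HOME>/python.
    python_dir = os.path.join(spark_home, "python")
    if not os.path.isdir(python_dir):
        raise RuntimeError("no 'python' directory under %s" % spark_home)
    return spark_home
```

Running check_spark_home() in a plain Python session either returns the configured path or fails with a message that says which part is missing.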

II. PySpark with IPython Shell

The following is adapted from Cloudera. I have removed some unnecessary steps so that you can get up and running quickly.

1. Step 1: Create an ipython profile

$ ipython profile create spark

# Possible outputs
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/'
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/'

2. Step 2: Create a startup file for this profile

The goal is to have a startup file for this profile so that every time you launch an IPython interactive shell session, it loads the Spark context for you.

$ touch ~/.ipython/profile_spark/startup/

A minimal working version of this file is

import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/'))
execfile(os.path.join(spark_home, 'python/pyspark/'))
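
If you want the startup file to survive a Spark upgrade, a slightly more defensive variant is sketched below. It is not the file from the Cloudera post: the helper names are invented here, it assumes the bundled py4j archive matches py4j-*-src.zip under python/lib, and it assumes Spark's interactive bootstrap script is python/pyspark/shell.py. It also swaps execfile, which exists only on Python 2, for an exec-based equivalent that runs on both Python 2 and 3.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the sys.path entries PySpark needs for a given Spark home."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: the py4j archive name carries a version, so match it by pattern.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise IOError("no py4j archive found matching %s" % pattern)
    return [python_dir, py4j_zips[-1]]


def run_file(path, namespace):
    """Execute a Python file inside the given namespace (execfile stand-in)."""
    with open(path) as handle:
        exec(compile(handle.read(), path, "exec"), namespace)


def bootstrap(spark_home, namespace):
    """Put PySpark on sys.path, then run Spark's shell bootstrap script."""
    sys.path[:0] = pyspark_paths(spark_home)
    # Assumption: the bootstrap script is <SPARK_HOME>/python/pyspark/shell.py.
    run_file(os.path.join(spark_home, "python", "pyspark", "shell.py"), namespace)
```

In the startup file itself, the only call needed would then be bootstrap(os.environ["SPARK_HOME"], globals()).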

Verify that this works

$ ipython --profile=spark

You should see a welcome screen similar to this with the SparkContext object pre-created

Fig. 1 | Spark welcome screen

III. PySpark with Jupyter Notebook

After getting Spark to work with the IPython interactive shell, the next step is to get it to work with the Jupyter Notebook. Unfortunately, since the big split, Jupyter Notebook doesn’t support IPython profiles out of the box anymore. To reuse the profile we created earlier, we are going to provide a modified IPython kernel for any Spark-related notebook. The strategy is described here but it has some unnecessary boilerplate and outdated information, so here is an improved version:

1. Preparing the kernel spec

IPython kernel specs reside in ~/.ipython/kernels, so let’s create a spec for Spark:

$ mkdir -p ~/.ipython/kernels/spark
$ touch ~/.ipython/kernels/spark/kernel.json

with the following content:

    "display_name": "PySpark (Spark 1.5.2)",
    "language": "python",
    "argv": [

Some notes: If you are using a virtual environment, change the Python entry point to your virtual environment’s, e.g. mine is ~/.virtualenvs/machine-learning/bin/python
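
For reference, here is one way such a spec could be generated from Python instead of typed by hand. Treat it as a sketch under assumptions rather than the exact file used in this post: the function name write_kernel_spec is made up, and the argv layout (the interpreter, -m ipykernel, a --profile option, and the {connection_file} placeholder) is the conventional shape of an IPython kernel spec.

```python
import json
import os
import sys


def write_kernel_spec(kernel_dir, python=sys.executable, profile="spark",
                      display_name="PySpark (Spark 1.5.2)"):
    """Write a kernel.json that starts the IPython kernel with a profile."""
    spec = {
        "display_name": display_name,
        "language": "python",
        "argv": [
            python,                  # interpreter; may be a virtualenv's python
            "-m", "ipykernel",       # run the IPython kernel module
            "--profile=" + profile,  # assumption: reuse the profile from section II
            "-f", "{connection_file}",  # placeholder that Jupyter fills in at launch
        ],
    }
    if not os.path.isdir(kernel_dir):
        os.makedirs(kernel_dir)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=4)
    return path
```

Calling write_kernel_spec(os.path.expanduser("~/.ipython/kernels/spark")) would write the file into the directory created above; pass python="..." to point it at a virtual environment's interpreter.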

2. Profit

Now simply launch the notebook with

$ jupyter notebook
# ipython notebook works too

When creating a new notebook, select the PySpark kernel and go wild :)

Fig. 2 | Select PySpark kernel for a new Jupyter Notebook
Fig. 3 | Example of spark interacting with matplotlib

IV. Bonus: Spark-Redshift

Amazon Redshift is a popular choice for data warehousing and analytics. Now you can easily load data from your Redshift cluster into Spark’s native DataFrame using a Spark package called Spark-Redshift. To hook it up with our Jupyter notebook setup, add this to the kernel file:

    "env": {
        "PYSPARK_SUBMIT_ARGS": "--jars </path/to/redshift/jdbc.jar> --packages com.databricks:spark-redshift_2.10:0.5.2,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"

Please note that you need a JDBC driver for Redshift, which can be downloaded here. The last tricky thing to note is that the package uses Amazon S3 as the transport medium, so you will need to configure the Spark context object with your AWS credentials. This can be done at the top of your notebook with:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
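
The pieces above can be wrapped in a few small helpers. This is a sketch under assumptions, not part of the spark-redshift package: the function names are invented, the credentials are read from the conventional AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables so they stay out of the notebook, and the reader options (url, dbtable, tempdir) follow the usage documented in the spark-redshift README.

```python
import os


def build_submit_args(jars, packages):
    """Compose a PYSPARK_SUBMIT_ARGS value from jar paths and Maven coordinates."""
    parts = []
    if jars:
        parts += ["--jars", ",".join(jars)]
    if packages:
        parts += ["--packages", ",".join(packages)]
    # The value ends with the pyspark-shell token, as in the kernel file entry.
    parts.append("pyspark-shell")
    return " ".join(parts)


def configure_s3_credentials(sc, environ=os.environ):
    """Copy AWS credentials from the environment into Spark's Hadoop config."""
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", environ["AWS_ACCESS_KEY_ID"])
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", environ["AWS_SECRET_ACCESS_KEY"])


def load_redshift_table(sql_context, jdbc_url, table, tempdir):
    """Load a Redshift table as a DataFrame, staging the data under tempdir on S3."""
    return (sql_context.read
            .format("com.databricks.spark.redshift")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .option("tempdir", tempdir)
            .load())
```

In a notebook this might look like configure_s3_credentials(sc) followed by df = load_redshift_table(sqlContext, jdbc_url, "my_table", "s3n://my-bucket/tmp"), where the JDBC URL, table name, and bucket are placeholders.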

Whew! I admit it’s a bit long but totally worth the trouble. Spark + IPython on top of Redshift is a very formidable combination for exploring your data at scale.
