Integrate Apache Spark with latest IPython Notebook (Jupyter 4.x)


Posted on December 24, 2015 | Topics: python, spark, ipython, jupyter, spark-redshift

As you may already know, Apache Spark is possibly the most popular engine right now for large-scale data processing, while IPython Notebook is a prominent front-end for, among other things, sharable exploratory data analysis. However, getting them to work with each other requires a few additional steps, especially since the latest IPython Notebook has moved under the Jupyter umbrella and no longer supports profiles.

Table of Contents

  • I. Prerequisites
  • II. PySpark with IPython Shell
  • III. PySpark with Jupyter Notebook
  • IV. Bonus: Spark-Redshift

I. Prerequisites

This setup has been tested with the following software:

  • Apache Spark 1.5.x
  • IPython 4.0.x (the interactive IPython shell & a kernel for Jupyter)
  • Jupyter 4.0.x (a web notebook on top of IPython kernel)
$ pyspark --version
$ ipython --version
$ jupyter --version

You will need to set an environment variable as follows:

  • SPARK_HOME : this is where the Spark executables reside. For example, if you are on OS X and installed Spark via Homebrew, add this to your .bashrc or whatever .rc file you use. This path differs between environments and installations, so if you don't know where it is, Google is your friend.
$ echo "export SPARK_HOME='/usr/local/Cellar/apache-spark/1.5.2/libexec/'" >> ~/.bashrc

II. PySpark with IPython Shell

The following is adapted from Cloudera. I have removed some unnecessary steps so you can get up and running quickly.

1. Create an IPython profile

$ ipython profile create spark

# Possible outputs
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/ipython_config.py'
# [ProfileCreate] Generating default config file: u'/Users/lim/.ipython/profile_spark/ipython_kernel_config.py'

2. Create a startup file for this profile

The goal is to have a startup file for this profile so that every time you launch an IPython interactive shell session, it loads the Spark context for you.

$ touch ~/.ipython/profile_spark/startup/00-spark-setup.py

A minimal working version of this file is:

import os
import sys

# Locate the Spark installation; fail loudly if SPARK_HOME is missing
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Add PySpark and its bundled py4j to the Python path
# (the py4j version must match the one shipped with your Spark)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Run Spark's shell bootstrap, which creates the SparkContext `sc`
# (execfile is Python 2 only, matching the Python 2 kernel used below)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Verify that this works

$ ipython --profile=spark

You should see a welcome screen similar to this, with the SparkContext object pre-created:

Fig. 1 | Spark welcome screen
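
For a quick smoke test of the pre-created SparkContext, run a trivial job in the shell (a minimal sketch; `sc` is the object created by pyspark/shell.py, as shown in the banner):

rdd = sc.parallelize(range(100))
print(rdd.sum())

# Possible output
# 4950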

III. PySpark with Jupyter Notebook

After getting Spark to work with the IPython interactive shell, the next step is to get it to work with the Jupyter Notebook. Unfortunately, since the big split, Jupyter Notebook doesn't support IPython profiles out of the box anymore. To reuse the profile we created earlier, we are going to provide a modified IPython kernel for any Spark-related notebook. The strategy is described here but it has some unnecessary boilerplate/outdated information, so here is an improved version:

1. Preparing the kernel spec

IPython kernel specs reside in ~/.ipython/kernels, so let's create a spec for spark:

$ mkdir -p ~/.ipython/kernels/spark
$ touch ~/.ipython/kernels/spark/kernel.json

with the following content:

{
    "display_name": "PySpark (Spark 1.5.2)",
    "language": "python",
    "argv": [
        "/usr/bin/python2",
        "-m",
        "ipykernel",
        "--profile=spark"
        "-f",
        "{connection_file}"
    ]
}

Some notes: if you are using a virtual environment, change the Python entry point to your virtual environment's, e.g. mine is ~/.virtualenvs/machine-learning/bin/python
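
To confirm that Jupyter picked up the new spec, list the installed kernels (Jupyter 4.x still scans ~/.ipython/kernels for backwards compatibility; the exact paths in the output will differ on your machine):

$ jupyter kernelspec list

# Possible output
# Available kernels:
#   python2    .../ipykernel/resources
#   spark      /Users/lim/.ipython/kernels/spark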

2. Profit

Now simply launch the notebook with

$ jupyter notebook
# ipython notebook works too

When creating a new notebook, select the PySpark kernel and go wild :)

Fig. 2 | Select PySpark kernel for a new Jupyter Notebook
Fig. 3 | Example of spark interacting with matplotlib
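
As a rough sketch of what Fig. 3 shows (assuming the notebook is running the PySpark kernel, so `sc` already exists):

%matplotlib inline
import matplotlib.pyplot as plt

# Compute a histogram over the RDD in Spark; only the small
# (buckets, counts) summary is shipped back to the notebook
rdd = sc.parallelize(range(1000))
buckets, counts = rdd.histogram(10)

plt.bar(buckets[:-1], counts, width=buckets[1] - buckets[0])
plt.title('Histogram computed in Spark, drawn with matplotlib')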

IV. Bonus: Spark-Redshift

Amazon Redshift is a popular choice for data warehousing and analytics. Now you can easily load data from your Redshift cluster into Spark's native DataFrame using a Spark package called spark-redshift. To hook it up with our Jupyter notebook setup, add this to the kernel file:

{
    ...
    "env": {
        "PYSPARK_SUBMIT_ARGS": "--jars </path/to/redshift/jdbc.jar> --packages com.databricks:spark-redshift_2.10:0.5.2,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"
    }
    ...
}

Please note that you need a JDBC driver for Redshift, which can be downloaded here. The last tricky thing to note is that the package uses Amazon S3 as the transport medium, so you will need to configure the Spark context object with your AWS credentials. This can be done at the top of your notebook with:

# Configure the underlying Hadoop S3 connector with your AWS credentials
# so spark-redshift can stage data through S3
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

Whew! I admit it's a bit long, but totally worth the trouble. Spark + IPython on top of Redshift is a formidable combination for exploring your data at scale.
