Apache Spark is a great tool for large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python. After a discussion, a coworker and I were curious whether PySpark could run from within an IPython Notebook. It turns out that this is fairly straightforward once you set up an IPython profile.
Here’s the tl;dr summary:
- Install Spark
- Create PySpark profile for IPython
- Some config
- Simple word count example
The steps below were successfully executed using Mac OS X 10.10.2 and Homebrew. The majority of the steps should be similar for non-Windows environments. For demonstration purposes, Spark will run in local mode, but the configuration can be updated to submit code to a cluster.
Many thanks to my coworker Steve Wampler who did much of the work.
Installing Spark
- Download the source for the latest Spark release
- Unzip source to ~/spark-1.2.0/ (or wherever you wish to install Spark)
- From the CLI, type:
cd ~/spark-1.2.0/
- Install the Scala build tool:
brew install sbt
- Build Spark:
sbt assembly
(Takes a while)
Create PySpark Profile for IPython
After Spark is installed, let’s start by creating a new IPython profile for PySpark.
ipython profile create pyspark
To avoid port conflicts with other IPython profiles, I updated the default port to 42424 within ~/.ipython/profile_pyspark/ipython_notebook_config.py:
c = get_config()
# Simply find this line and change the port value
c.NotebookApp.port = 42424
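If you are not sure whether the port you picked is already taken, a quick check from Python can help. This is only a rough sketch: the helper name port_is_free is made up for illustration, and it assumes the notebook server listens on the local loopback address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Binding fails if another process already holds the port
        sock.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        sock.close()


print(port_is_free(42424))
```

A result of False suggests choosing a different value for c.NotebookApp.port.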
Set the following environment variables in .bashrc or .bash_profile:
# Set this to wherever you installed Spark
export SPARK_HOME="$HOME/spark-1.2.0"
# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[2]"
Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py containing the following:
# Configure the necessary Spark environment
import os
import sys
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
# (execfile is a Python 2 built-in; it does not exist in Python 3)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
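Instead of hard-coding the py4j version, one option is to look the zip file up with a glob pattern. This is a minimal sketch: the helper name find_py4j_zip is hypothetical, and it assumes the zip lives under python/lib and follows the py4j-<version>-src.zip naming used in the startup file.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of a py4j source zip under spark_home/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise IOError("No py4j source zip found matching " + pattern)
    # With several zips present this picks the last one in lexicographic
    # order, which is not necessarily the newest version.
    return matches[-1]
```

In the startup file, the hard-coded line could then become sys.path.insert(0, find_py4j_zip(spark_home)).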
Now we are ready to launch a notebook using the PySpark profile:
ipython notebook --profile=pyspark
Word Count Example
Make sure the ipython pyspark profile created a SparkContext by typing sc within the notebook. You should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.
Next, load a text file into a Spark RDD. For example, load the Spark README file:
import os
spark_home = os.environ.get('SPARK_HOME', None)
text_file = sc.textFile(spark_home + "/README.md")
The word count script below is quite simple. It takes the following steps:
- Split each line from the file into words
- Map each word to a tuple containing the word and an initial count of 1
- Sum up the count for each word
word_counts = text_file \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
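To make the three steps concrete, here is a plain-Python sketch of the same logic that runs without Spark. The function name word_count and the sample lines are made up for illustration. It shows what the pipeline computes, not how Spark executes it.

```python
from collections import defaultdict


def word_count(lines):
    # flatMap: split every line and flatten the results into one stream of words
    words = (word for line in lines for word in line.split())
    # map: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


sample_lines = ["spark is fast", "spark is fun"]
print(word_count(sample_lines))
```

The difference in Spark is that each step runs in parallel over partitions of the file, and reduceByKey combines the partial sums from each partition.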
At this point, nothing has been computed: Spark evaluates transformations such as flatMap, map, and reduceByKey lazily. To actually count the words, call an action such as collect() on the pipeline:
word_counts.collect()
Here’s a portion of the output:
[(u'all', 1),
(u'when', 1),
(u'"local"', 1),
(u'including', 3),
(u'computation', 1),
(u'Spark](#building-spark).', 1),
(u'using:', 1),
(u'guidance', 3),
...
(u'spark://', 1),
(u'programs', 2),
(u'documentation', 3),
(u'It', 2),
(u'graphs', 1),
(u'./dev/run-tests', 1),
(u'first', 1),
(u'latest', 1)]
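The pairs come back in no particular order. To see the most frequent words first, one simple option is to sort the collected list in plain Python. The sketch below uses a few pairs copied from the output above; the helper name top_words is made up for illustration.

```python
# A few (word, count) pairs copied from the output above
sample_counts = [("all", 1), ("including", 3), ("programs", 2), ("It", 2), ("first", 1)]


def top_words(pairs, n):
    """Return the n pairs with the highest counts; ties are ordered by the word itself."""
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))[:n]


print(top_words(sample_counts, 3))
```

In the notebook this would be called as top_words(word_counts.collect(), 10). For larger data it is probably better to let Spark do the ordering, for example with the RDD takeOrdered method.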