In this tutorial I will show you how I installed Apache Spark on Windows and how I set up IPython Notebook to work with it.
Before installing Spark, I installed Python. I used the Anaconda Python distribution, which ships with most of the popular Python packages. You can download the distribution from the link below:
https://www.continuum.io/downloads
I downloaded the Anaconda installer for Python 3.5.
Okay, once Anaconda was installed, I chose IPython as the environment to work in.
To install it, open the Anaconda prompt and enter the following command:
conda install jupyter
Note: if you already have Jupyter installed, you might want to update it with the following command:
conda update jupyter
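To confirm the install worked, you can print the version from the same prompt:
jupyter --version
If that prints a version number, Jupyter is ready to go.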
Next, I downloaded Spark from
http://spark.apache.org/downloads.html
I chose the Spark 1.6.0 package, pre-built for Hadoop 2.6 and later.
After downloading it, I copied the archive to the C:\ drive, unzipped it, and renamed the folder to spark, so everything lives under C:\spark.
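For reference, the top level of C:\spark in the 1.6.0 pre-built package looks roughly like this (exact contents can vary slightly between releases):
C:\spark
    bin\
    conf\
    lib\
    python\
    sbin\
    README.md
    RELEASE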
I renamed the file log4j.properties.template in the conf directory to log4j.properties and opened it in Notepad++, which is a nice and clean text editor; plain Notepad works too.
I changed the line where it says
log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
Doing this reduces the flood of INFO messages we see on the console. If you want to quiet it even further, you can show only ERROR messages by changing it to
log4j.rootCategory=ERROR, console
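For reference, after the edit the console section of log4j.properties should look roughly like the sketch below; the appender lines come straight from the template Spark ships, so yours may differ slightly:
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n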
After saving the file, I downloaded the Hadoop binary winutils.exe. Even though Spark runs independently of Hadoop, on Windows it still looks for winutils.exe (a helper binary Hadoop needs) and throws an error when it cannot find it.
I downloaded the file from the link below:
http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
I created a folder named winutils in C:\, created a bin directory inside it, and placed the winutils.exe file there. The file location is as follows:
C:\winutils\bin\winutils.exe
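As a quick sanity check (an optional step, not something the setup requires), you can run the binary from a command prompt with no arguments; it should print its usage text rather than fail outright:
C:\winutils\bin\winutils.exe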
Next, I opened System Properties to add the environment variables. You can open it by pressing WIN + R, which brings up the Run dialog, and entering sysdm.cpl.
I then clicked on the Advanced tab and then on Environment Variables, clicked New under the user variables, and added the following:
variable name HADOOP_HOME with the value C:\winutils
variable name SPARK_HOME with the value C:\spark
I also selected the Path variable and appended %SPARK_HOME%\bin at the end.
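If you prefer the command line, the same user variables can be set with the built-in setx command (a sketch of the equivalent commands; setx persists user variables, and you need a fresh prompt afterwards for them to take effect):
setx HADOOP_HOME C:\winutils
setx SPARK_HOME C:\spark
Editing Path itself is safer through the GUI as described above, since setx would merge the user and system Path into a single value.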
Now, to launch Spark, I just opened the command prompt and entered pyspark to start the Spark shell.
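To check that the shell is healthy, you can run a tiny job. In the pyspark shell the SparkContext is already created for you as sc, so this one-liner (just a smoke test) should print 45:
sc.parallelize(range(10)).sum()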
Now, if you want to use IPython, add the following user variables in System Properties, just like HADOOP_HOME was set:
variable name PYSPARK_DRIVER_PYTHON with the value ipython
variable name PYSPARK_DRIVER_PYTHON_OPTS with the value notebook
Now, when you open the command prompt and enter pyspark, you will notice a Jupyter notebook being launched instead of the plain shell.
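In a new notebook, sc is pre-defined just as in the console shell, so a first cell like the sketch below (the path assumes the C:\spark layout from earlier) is an easy way to confirm everything is wired up:
# count the lines in Spark's bundled README that mention Spark
lines = sc.textFile("C:/spark/README.md")
lines.filter(lambda line: "Spark" in line).count()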