Tutorial on how to install Apache Spark on Windows

In this tutorial I will show you how I installed Apache Spark on Windows and how I set up the IPython (Jupyter) notebook to work with it.

Before installing Spark, I installed Python. I used the Anaconda Python distribution, which bundles most of the popular Python packages. You can download the distribution from the link below:

https://www.continuum.io/downloads

I downloaded the Anaconda installer for Python 3.5.

Once Anaconda was installed, I chose IPython (Jupyter) as the interface.

To install it, open the Anaconda Prompt and enter the following command:

conda install jupyter

Note: if you already have IPython/Jupyter installed, you may want to update it with the following command:

conda update jupyter

Next, I downloaded and installed Spark from

http://spark.apache.org/downloads.html

I chose Spark 1.6.0, with the package pre-built for Hadoop 2.6 and later.

After downloading it, I copied the file to the C:\ drive, unzipped it, and renamed the folder to spark.

In the conf directory, I renamed the file log4j.properties.template to log4j.properties and opened it in Notepad++, which is a nice, clean text editor; plain Notepad works too.

I changed the line that reads

log4j.rootCategory=INFO, console

to

log4j.rootCategory=WARN, console

Doing this reduces the number of INFO messages printed to the console. If you want to quiet it down even further, you can show only ERROR messages by changing it to

log4j.rootCategory=ERROR, console

After saving the file, I downloaded the Hadoop binary winutils.exe. Even though Spark runs independently of Hadoop, on Windows it looks for winutils.exe (a Hadoop utility) and throws an error if the file is missing.

I downloaded the file from the link below:

http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

I created a folder named winutils in C:\, created a bin directory inside it, and placed the winutils.exe file there. The file location is as follows:

C:\winutils\bin\winutils.exe
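As a quick sketch of where Hadoop's Windows shim expects to find the binary, the path above is simply HADOOP_HOME (set in the next step) joined with bin\winutils.exe. Python's ntpath module applies Windows path rules on any OS, so you can illustrate this anywhere:

```python
import ntpath  # Windows path-joining rules, usable from any OS

# Hadoop's Windows shim expects winutils.exe under %HADOOP_HOME%\bin.
hadoop_home = r"C:\winutils"
winutils = ntpath.join(hadoop_home, "bin", "winutils.exe")
print(winutils)  # C:\winutils\bin\winutils.exe
```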

Next, I opened the System Properties dialog to add the environment variables. You can open it by pressing WIN + R (which opens the Run dialog) and entering sysdm.cpl.

I then clicked on the Advanced tab, then on Environment Variables, clicked New under the user variables, and added the following:

variable name HADOOP_HOME with value C:\winutils
variable name SPARK_HOME with value C:\spark

I also edited the Path variable and appended %SPARK_HOME%\bin at the end.
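On Windows these variables are set once in System Properties, but the following sketch sets them in-process just to show exactly what PySpark looks for (the values mirror the folders used above):

```python
import os
import ntpath  # Windows path semantics, available on any OS

# Illustration only: normally set once via System Properties on Windows.
os.environ["HADOOP_HOME"] = r"C:\winutils"  # folder containing bin\winutils.exe
os.environ["SPARK_HOME"] = r"C:\spark"      # the unzipped Spark folder
# Equivalent of appending %SPARK_HOME%\bin to Path:
os.environ["PATH"] += os.pathsep + ntpath.join(os.environ["SPARK_HOME"], "bin")

print(os.environ["SPARK_HOME"])  # C:\spark
```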

Now, to launch Spark, I just opened the command prompt and entered pyspark to open the Spark shell.

Now, if you want to use IPython, add the following user variables in System Properties, just as HADOOP_HOME was set:

variable name PYSPARK_DRIVER_PYTHON with value ipython
variable name PYSPARK_DRIVER_PYTHON_OPTS with value notebook

Now, when you open the command prompt and enter pyspark, you will see a Jupyter notebook being launched.
