How to Install and Run PySpark in Jupyter Notebook on Windows

When I write PySpark code, I use Jupyter Notebook to test it before submitting a job to the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I've tested this guide on a dozen Windows 7 and 10 PCs in different languages.

A. Items needed

  • Spark distribution from spark.apache.org

  • Python and Jupyter Notebook installed (for example, via the Anaconda distribution)

  • The winutils.exe Hadoop binary for Windows; a commonly used source is the steveloughran/winutils repository on GitHub

  • The findspark Python module, which can be installed by running python -m pip install findspark in either the Windows command prompt or Git Bash once the Python from item 2 is installed (see the verification snippet at the end of this section). You can find the command prompt by searching cmd in the search box.


  • If you don't have Java, or your Java version is 7.x or lower, download and install Java from Oracle. I recommend getting the latest JDK (current version 9.0.1); Spark 2.x requires Java 8 or newer.


  • If you don't know how to unpack a .tgz file on Windows, download and install 7-Zip; you can then unpack the .tgz file from the Spark distribution in item 1 by right-clicking the file icon and selecting 7-Zip > Extract Here.

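With the items above in place, you can sanity-check them from the command prompt before moving on. A minimal check, assuming python and java are already on your PATH:

rem Check the Java version (Spark 2.x needs Java 8 or newer)
java -version

rem Check the Python version from item 2
python --version

rem Install the findspark module from item 4
python -m pip install findspark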

B. Installing PySpark

After getting all the items in section A, let’s set up PySpark.

  • Unpack the .tgz file. For example, I unpacked it with 7-Zip from step A6 and put mine under D:\spark\spark-2.2.1-bin-hadoop2.7. Note that a .tgz unpacks in two passes: extracting the .tgz yields a .tar file, which you extract again (see the command-line sketch below).

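If you prefer the command line to the right-click menu, 7-Zip's 7z.exe can do the same two-pass extraction. A sketch, assuming 7z.exe is on your PATH and the archive was downloaded to D:\spark:

cd /d D:\spark
rem Pass 1: strip the gzip layer, producing a .tar file
7z x spark-2.2.1-bin-hadoop2.7.tgz
rem Pass 2: unpack the .tar into the spark-2.2.1-bin-hadoop2.7 folder
7z x spark-2.2.1-bin-hadoop2.7.tar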

  • Move the winutils.exe downloaded in step A3 to the \bin folder of the Spark distribution. For example, D:\spark\spark-2.2.1-bin-hadoop2.7\bin\winutils.exe

  • Add environment variables: these let Windows find the relevant files when we start the PySpark kernel. You can find the environment variable settings by typing "environ…" into the search box.

The variables to add are, in my example:

Name                        Value
SPARK_HOME                  D:\spark\spark-2.2.1-bin-hadoop2.7
HADOOP_HOME                 D:\spark\spark-2.2.1-bin-hadoop2.7
PYSPARK_DRIVER_PYTHON       jupyter
PYSPARK_DRIVER_PYTHON_OPTS  notebook


  • In the same environment variable settings window, look for the Path or PATH variable, click Edit, and add D:\spark\spark-2.2.1-bin-hadoop2.7\bin to it. In Windows 7, you need to separate the values in Path with a semicolon ;.
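
If you'd rather script these variables than click through the GUI, setx sets user-level environment variables from the command prompt. A minimal sketch using my example paths (the two PYSPARK_DRIVER_PYTHON variables tell the pyspark launcher to start Jupyter Notebook as its driver; setx only affects newly opened windows):

setx SPARK_HOME "D:\spark\spark-2.2.1-bin-hadoop2.7"
setx HADOOP_HOME "D:\spark\spark-2.2.1-bin-hadoop2.7"
setx PYSPARK_DRIVER_PYTHON "jupyter"
setx PYSPARK_DRIVER_PYTHON_OPTS "notebook"

I'd still edit Path in the GUI as described above: appending with setx PATH "%PATH%;..." mixes the user and system values and can truncate long paths.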

  • (Optional: only needed if you see a Java-related error in step C.) Find the installed Java JDK folder from step A5, for example D:\Program Files\Java\jdk1.8.0_121, and add the following environment variable:

Name        Value
JAVA_HOME   D:\Progra~1\Java\jdk1.8.0_121

If the JDK is installed under \Program Files (x86), replace the Progra~1 part with Progra~2 instead. In my experience, this error only occurs on Windows 7, and I think it's because Spark couldn't parse the space in the folder name.
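
If you're unsure of the 8.3 short name for your Program Files folder, the dir /x command shows it next to each entry. A quick check from the command prompt, assuming the JDK lives on drive D:

rem The extra column lists short (8.3) names, e.g. PROGRA~1 for "Program Files"
dir /x D:\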

Edit (1/23/19): You might also find Gerard’s comment helpful: http://disq.us/p/1z5qou4

C. Running PySpark in Jupyter Notebook

To run Jupyter Notebook, open the Windows command prompt or Git Bash and run jupyter notebook. If you open Jupyter Notebook from Anaconda Navigator instead, you might see a "Java gateway process exited before sending the driver its port number" error from PySpark. Fall back to the Windows command prompt if that happens.

Once inside Jupyter Notebook, open a Python 3 notebook.


In the notebook, run the following code:

import findspark
findspark.init()  # locate Spark using the SPARK_HOME environment variable

import pyspark  # only run after findspark.init()
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession as the entry point
spark = SparkSession.builder.getOrCreate()

# A one-row DataFrame to confirm SQL queries work end to end
df = spark.sql('''select 'spark' as hello''')
df.show()

When you press run, it might trigger a Windows firewall pop-up. I pressed Cancel on the pop-up, as blocking the connection doesn't affect PySpark running locally.


If you see the following output, then you have installed PySpark on your Windows system!

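The df.show() call prints the one-row DataFrame as a small ASCII table:

+-----+
|hello|
+-----+
|spark|
+-----+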

Reposted from: https://www.cnblogs.com/chenxiangzhen/p/10706258.html
