1 Introduction to Spark
2 Installing Spark
Experiment environment:
Host OS: Windows 10
Virtualization software: VMware Workstation
Guest OS: Ubuntu 20.04 LTS
2.1 Installing Scala
# Download the deb package
hadoop@hadoop1:~$ wget https://downloads.lightbend.com/scala/2.13.4/scala-2.13.4.deb
# Install it
hadoop@hadoop1:~$ sudo dpkg --install scala-2.13.4.deb
2.2 Installing Spark
# Download and unpack Spark
hadoop@hadoop1:~$ wget https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
hadoop@hadoop1:~$ tar -zxvf spark-3.0.3-bin-hadoop3.2.tgz
hadoop@hadoop1:~$ mv spark-3.0.3-bin-hadoop3.2 spark-3.0.3
# Add environment variables
hadoop@hadoop1:~$ sudo vi /etc/profile.d/spark.sh
Add:
export SPARK_HOME=/home/hadoop/spark-3.0.3
export PATH=$PATH:$SPARK_HOME/bin
hadoop@hadoop1:~$ source /etc/profile.d/spark.sh
# Configure Spark
hadoop@hadoop1:~$ cd spark-3.0.3/conf/
# Configure slaves
hadoop@hadoop1:~$ cp slaves.template slaves
hadoop@hadoop1:~$ vi slaves
Change it to (add more hostnames for a multi-node cluster):
hadoop1
# Configure spark-env.sh
hadoop@hadoop1:~$ cp spark-env.sh.template spark-env.sh
hadoop@hadoop1:~$ vi spark-env.sh
Add:
export JAVA_HOME=/home/hadoop/jdk1.8.0_321/
export HADOOP_HOME=/home/hadoop/hadoop-3.2.3/
export HADOOP_CONF_DIR=/home/hadoop/hadoop-3.2.3/etc/hadoop
export SCALA_HOME=/home/hadoop/scala-2.13.4/
export SPARK_MASTER_HOST=hadoop1
export SPARK_PID_DIR=/home/hadoop/spark-3.0.3/data
export SPARK_LOCAL_DIR=/home/hadoop/spark-3.0.3
export SPARK_EXECUTOR_MEMORY=512M
export SPARK_WORKER_MEMORY=4G
# Configure spark-defaults.conf
hadoop@hadoop1:~$ cp spark-defaults.conf.template spark-defaults.conf
hadoop@hadoop1:~$ vi spark-defaults.conf
Add:
spark.master spark://hadoop1:7077
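Besides spark.master, other commonly tuned properties can go in the same file; a minimal sketch (the event-log directory below is a hypothetical path that must exist before the first job runs):

```
spark.eventLog.enabled         true
# hypothetical path; create it first with: mkdir -p /home/hadoop/spark-3.0.3/eventlog
spark.eventLog.dir             file:///home/hadoop/spark-3.0.3/eventlog
spark.serializer               org.apache.spark.serializer.KryoSerializer
```

Enabling the event log lets the web UI show completed applications, not just running ones.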
# Start the services
hadoop@hadoop1:~$ $SPARK_HOME/sbin/start-all.sh
hadoop@hadoop1:~$ jps
The output should include:
35568 Master (master process)
35733 Worker (worker process)
# The Spark web UI is now reachable at: http://192.168.17.100:8080/
# Stop the services
hadoop@hadoop1:~$ $SPARK_HOME/sbin/stop-all.sh
2.3 Running a Test Program
# Run the bundled example
hadoop@hadoop1:~$ $SPARK_HOME/bin/run-example SparkPi 10
# View the result in the browser at: http://hadoop1:8080/cluster
# Start spark-shell
hadoop@hadoop1:~$ $SPARK_HOME/bin/spark-shell
scala> val textFile = sc.textFile("file:///home/hadoop/spark-3.0.3/README.md")
scala> textFile.count()
The output is:
res3: Long = 108, which indicates success
# Exit spark-shell
scala> :quit
2.4 Installing Jupyter Notebook and Python 3
Install Jupyter on Linux:
# Update the package lists
hadoop@hadoop1:~$ sudo apt update
# Install Python
hadoop@hadoop1:~$ sudo apt install python3-pip python3-dev
# Create a virtual environment for Jupyter
hadoop@hadoop1:~$ sudo -H pip3 install --upgrade pip
hadoop@hadoop1:~$ sudo -H pip3 install virtualenv
hadoop@hadoop1:~$ mkdir ~/my_project_dir
hadoop@hadoop1:~$ cd ~/my_project_dir
hadoop@hadoop1:~/my_project_dir$ virtualenv my_project_env
# Activate the virtual environment
hadoop@hadoop1:~/my_project_dir$ source my_project_env/bin/activate
# Install Jupyter
(my_project_env)hadoop@hadoop1:~/my_project_dir$ pip install jupyter
# Set a login password
(my_project_env)hadoop@hadoop1:~/my_project_dir$ jupyter notebook --generate-config
(my_project_env)hadoop@hadoop1:~/my_project_dir$ jupyter notebook password
# Run Jupyter
(my_project_env)hadoop@hadoop1:~/my_project_dir$ jupyter notebook
Then connect from Windows with PuTTY over SSH to use Jupyter Notebook:
Open PuTTY and enter the host address:
On the left, click SSH -> Tunnels and enter the tunnel settings:
Click Add, then click Open.
You can then open localhost:8000 in a Windows browser to access Jupyter running on Linux.
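The same tunnel can be opened with the OpenSSH client instead of PuTTY; a sketch, assuming Jupyter listens on its default port 8888 and the VM is at the address used above:

```
ssh -L 8000:localhost:8888 hadoop@192.168.17.100
```

While this session stays open, requests to localhost:8000 on Windows are forwarded to the notebook server on the VM.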
Reference: How To Set Up Jupyter Notebook with Python 3 on Ubuntu 20.04
If an error occurs while installing pip, the following commands resolve it:
sudo apt-get remove python-pip-whl
sudo apt -f install
sudo apt update && sudo apt dist-upgrade
sudo apt install python3-pip
2.5 Installing PySpark
First activate the virtual environment, then run:
(my_project_env)hadoop@hadoop1:~/my_project_dir$ pip install pyspark==3.0.3
# The pyspark version must match the installed Spark version, otherwise pyspark will not run properly in Jupyter
(my_project_env)hadoop@hadoop1:~/my_project_dir$ pyspark
Reference: Installation
3 Using Spark
3.1 Using the Python Spark Shell
(my_project_env)hadoop@hadoop1:~/my_project_dir$ pyspark
>>> # the pyspark shell already provides a SparkContext as `sc`;
>>> # creating another one here would raise "Cannot run multiple SparkContexts"
>>> txt = sc.textFile('file:///YourFileDirectory/input/try1.txt')
>>> print(txt.count())
>>> as_lines = txt.filter(lambda line: 'as' in line.lower())
>>> print(as_lines.count())
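The filter above keeps every line whose lowercased text contains the substring 'as'. The predicate itself is plain Python and can be sanity-checked without Spark; a small sketch with made-up sample lines standing in for try1.txt:

```python
# Same predicate as in the RDD filter above.
def contains_as(line):
    return 'as' in line.lower()

# Hypothetical sample lines for illustration only.
lines = [
    "Spark runs As a standalone cluster",  # matches ("As" lowercases to "as")
    "Hadoop stores the data",              # no match
    "basics of RDDs",                      # matches ("basics" contains "as")
]
matches = [line for line in lines if contains_as(line)]
print(len(matches))  # -> 2
```

This mirrors what `filter` does per element across the RDD's partitions.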
3.2 Using spark-submit
Write the Python code and save it as try1.py:
from pyspark import SparkContext
sc = SparkContext('local[*]')
sc.setLogLevel('WARN')
txt = sc.textFile('file:///YourFileDirectory/input/try1.txt')
print(txt.count())
as_lines = txt.filter(lambda line: 'as' in line.lower())
print(as_lines.count())
Then run it as a batch job from the command line with spark-submit:
(my_project_env)hadoop@hadoop1:~/my_project_dir$ spark-submit try1.py
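The script above hard-codes the `local[*]` master, so it runs on the local machine only. To run it on the standalone cluster started in 2.2, the master can instead be passed on the command line; a sketch, assuming the `SparkContext('local[*]')` call in the script is changed to a plain `SparkContext()` (settings made in code take precedence over spark-submit flags):

```
spark-submit --master spark://hadoop1:7077 try1.py
```

The running application then also appears in the web UI at http://hadoop1:8080/.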
3.3 Using Jupyter Notebook
Following the steps in 2.4, connect to Jupyter Notebook remotely and run the Python code by pasting it into an .ipynb file.