I recently needed to run Spark on my local machine, so I installed it again. I remember spending several hours getting Spark set up back in 2017; I hadn't expected the installation process to be this simple now.
1. Download the package
http://spark.apache.org/downloads.html
I chose spark-2.4.7-bin-hadoop2.7.tgz here.
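If you prefer the command line, the same archive can be fetched directly. This is just a sketch; it assumes the Apache release archive still serves files under its usual directory layout:
# hedged example: fetch the chosen release into ~/Downloads (adjust the URL if the layout has changed)
wget -P ~/Downloads https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz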
2. Install
cd /usr/local
mv ~/Downloads/spark-2.4.7-bin-hadoop2.7.tgz ./ # the downloaded Spark archive is in ~/Downloads
tar -zxvf spark-2.4.7-bin-hadoop2.7.tgz
vim ~/.bash_profile
# add the following lines to ~/.bash_profile
export SPARK_HOME=/usr/local/spark-2.4.7-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
# reload the profile so the new variables take effect
source ~/.bash_profile
conda activate test # switch to your own conda environment
pip install pyspark # install the PySpark Python package
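At this point the Spark launch scripts should be on the PATH. A quick sanity check (the exact output depends on the package you unpacked; spark-submit ships with the standard distribution):
which pyspark            # should resolve to the conda env's pyspark or $SPARK_HOME/bin/pyspark
spark-submit --version   # prints the Spark version banner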
3. Verify
Open PyCharm (or Jupyter, or another IDE) and run the following code to check that the output is correct.
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

# start a local SparkSession
spark = SparkSession \
    .builder \
    .appName("BinarizerExample") \
    .getOrCreate()

# a small DataFrame with one continuous feature column
continuousDataFrame = spark.createDataFrame([
    (0, 1.1),
    (1, 8.5),
    (2, 5.2)
], ["id", "feature"])

# values strictly greater than the threshold become 1.0, the rest 0.0
binarizer = Binarizer(threshold=5.1, inputCol="feature", outputCol="binarized_feature")
binarizedDataFrame = binarizer.transform(continuousDataFrame)

print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show()

spark.stop()
Expected output:
Binarizer output with Threshold = 5.100000
+---+-------+-----------------+
| id|feature|binarized_feature|
+---+-------+-----------------+
| 0| 1.1| 0.0|
| 1| 8.5| 1.0|
| 2| 5.2| 1.0|
+---+-------+-----------------+
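The same check can also be run outside an IDE. This is only a sketch; the file name binarizer_example.py is a hypothetical choice, not from the original setup:
# save the code above as binarizer_example.py (hypothetical name), then run either of:
python binarizer_example.py          # uses the pip-installed pyspark
spark-submit binarizer_example.py    # uses the distribution under $SPARK_HOME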