1. Configure the PyCharm environment (a remote Python interpreter is required here)
Configure the address, username, and password of the master node
2. Install pyspark and pyspark-stubs in PyCharm
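If you prefer the terminal over PyCharm's package manager, the same two packages can be installed with pip against the remote interpreter (pyspark-stubs only provides type stubs for IDE completion and is not needed at runtime). A minimal sketch, assuming pip points at that remote interpreter:

    pip install pyspark pyspark-stubs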
3. Configure the Python environment variables
HADOOP_HOME=/usr/local/hadoop
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64
SPARK_HOME=/usr/local/spark
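These variables must be visible to the remote interpreter that launches the Spark driver. A minimal sketch, assuming a bash shell on the remote host (append to ~/.bashrc and log in again, or add the same three variables to the environment of the PyCharm run configuration):

    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.232.b09-0.el7_7.x86_64
    export SPARK_HOME=/usr/local/spark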
4. Test the code
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Connect to the standalone cluster master. Executor memory must be set
    # before the session starts, so it is passed through the builder rather
    # than via spark.conf.set() afterwards (which would have no effect).
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .master("spark://node1:7077") \
        .config("spark.executor.memory", "500m") \
        .getOrCreate()

    sc = spark.sparkContext
    a = sc.parallelize([1, 2, 3])
    b = a.flatMap(lambda x: (x, x ** 2))  # each element yields itself and its square
    print(a.collect())  # [1, 2, 3]
    print(b.collect())  # [1, 1, 2, 4, 3, 9]

    spark.stop()
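If the connection to the master succeeds, the first print outputs [1, 2, 3] and the second outputs [1, 1, 2, 4, 3, 9]: flatMap flattens the (x, x ** 2) tuple produced for each element into individual items instead of keeping one tuple per element.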