Setting Up a PySpark Development Environment in PyCharm

Installing Spark
  • Download Spark

Download page: http://spark.apache.org/downloads.html

This guide uses the prebuilt binaries from the official site. If you need to build Spark yourself, follow the official build instructions at http://spark.apache.org/docs/latest/building-spark.html

  • Verify that Spark was installed successfully
(spark_demo)  shylin  ~/Desktop/work/spark_demo  cd ~/Downloads/spark-2.4.0-bin-hadoop2.7/bin
(spark_demo)  shylin  ~/Downloads/spark-2.4.0-bin-hadoop2.7/bin  ./pyspark 
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2019-05-24 10:55:19 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018 19:50:54)
SparkSession available as 'spark'.
>>> 
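With the shell running, sc and spark are already defined (note the "SparkSession available as 'spark'" line above), so a one-line job is enough to confirm Spark itself works. A minimal check; any small computation will do:

>>> sc.parallelize(range(100)).sum()
4950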

Setting up the PySpark development environment
  • Open PyCharm, create a new project, and set up a fresh virtual environment for it.

  • Configure the environment variables at the project level, so you don't have to set them again every time you create a new .py file:


SPARK_HOME=/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7 (the directory the Spark download was unpacked into)

  • Add the two bundled packages, pyspark.zip and the py4j zip, to the project. They live under /Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib; a programmatic alternative is sketched below.
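If you prefer not to rely on the PyCharm configuration, the same two steps can be done at the top of a script instead. A minimal sketch, assuming the install path shown above and the py4j version bundled with Spark 2.4.0:

import os
import sys

SPARK_HOME = "/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7"  # adjust to your unpack directory

# Equivalent of the SPARK_HOME variable in the PyCharm run configuration
os.environ["SPARK_HOME"] = SPARK_HOME

# Equivalent of adding the two zips to the project: makes pyspark importable
sys.path.insert(0, os.path.join(SPARK_HOME, "python/lib/pyspark.zip"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python/lib/py4j-0.10.7-src.zip"))

from pyspark.sql import SparkSession  # should now resolve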
  • Run the word count example
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# word_count.py

from pyspark.sql import SparkSession

if __name__ == '__main__':

    spark = SparkSession.builder.appName("test").getOrCreate()
    sc = spark.sparkContext
    # file:// marks the path as a local file; without it, Spark looks for the file on HDFS
    counts = sc.textFile('file:///Users/shylin/Desktop/work/spark_demo/test.txt') \
            .flatMap(lambda line: line.split(" ")) \
            .map(lambda x: (x, 1)) \
            .reduceByKey(lambda a, b: a + b)

    output = counts.collect()

    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()
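For reference, the script reads a plain, whitespace-separated text file. A hypothetical test.txt consistent with the counts in the output below would be:

hello world
hello python
hello spark hadoop
hello python world
python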
  • Output
/Users/shylin/.virtualenvs/spark_demo/bin/python /Users/shylin/Desktop/work/spark_demo/word_count.py
2019-05-24 11:23:05 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 0:>                                                          (0 + 2) / 2]/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:60: UserWarning: Please install psutil to have better support with spilling
/Users/shylin/Downloads/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:60: UserWarning: Please install psutil to have better support with spilling
world: 2
python: 3
hadoop: 1
hello: 4
spark: 1

Process finished with exit code 0

The run above prints a warning about psutil; running pip install psutil in the project's virtualenv makes it go away.

  • Submitting the Spark job on a server
sudo -uhive spark-submit --master local[4] word_count.py
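On a real cluster you would replace local[4] with the cluster's master URL. The master can also be hard-coded when building the session; note that values set in code take precedence over spark-submit flags, so scripts meant for submission usually leave .master() out. A sketch, not part of the original script:

from pyspark.sql import SparkSession

# Hard-coded equivalent of `--master local[4]`; omit .master() when the
# master should come from spark-submit instead
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("word_count") \
    .getOrCreate()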

Shylin
