PySpark WordCount

Write a PySpark WordCount program in Python, then use spark-submit to run it in both local and YARN modes.

1.1. Create the test files

  • Local file
$ cd ~/pyspark/PythonProject
$ mkdir data
$ cd data/
$ vim word.txt
$ tail word.txt 
hadoop spark hive
hive java python
spark perl hadoop
python RDD spark
RDD 
  • HDFS file
$ cd ~/pyspark/PythonProject
$ hadoop fs -put data /user/input/
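Before writing the job, it helps to check that both copies are readable. A minimal sketch from an interactive pyspark shell (the hdfs://node:9000 address matches the one used in wordcount.py below):
$ pyspark
>>> sc.textFile("file:/home/hadoop/pyspark/PythonProject/data/word.txt").count()
>>> sc.textFile("hdfs://node:9000/user/input/data/word.txt").count()
Both calls should return the same line count for word.txt.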

1.2. Write the Spark WordCount program

  • Write the wordcount program
$ vim wordcount.py 

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pyspark import SparkContext, SparkConf

def CreateSparkContext():
    """Create the SparkConf and SparkContext, and set the application name."""
    conf = SparkConf().setAppName("WordCount").set("spark.ui.showConsoleProgress", "false")
    sc = SparkContext(conf=conf)
    SetLogger(sc)
    SetPath(sc)
    return sc

def SetLogger(sc):
    """Silence noisy log output by raising the log level to ERROR."""
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
    logger.LogManager.getRootLogger().setLevel(logger.Level.ERROR)

def SetPath(sc):
    """Set the global input path prefix depending on the master (local vs. YARN)."""
    global Path
    if sc.master.startswith("local"):
        # trailing slash is required, since Path is concatenated with "data/..."
        Path = "file:/home/hadoop/pyspark/PythonProject/"
    else:
        Path = "hdfs://node:9000/user/input/"

if __name__ == "__main__":
    print("Starting wordcount...")
    sc = CreateSparkContext()
    print("Reading input file...")
    textFile = sc.textFile(Path + "data/word.txt")
    print("Running map/reduce...")
    stringRDD = textFile.flatMap(lambda line: line.split(" "))
    countsRDD = stringRDD.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
    print("Saving results...")
    try:
        countsRDD.saveAsTextFile(Path + "data/output")
    except Exception:
        print("Output directory already exists, please delete it first!")
    print("Done...")
    sc.stop()
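For reference, on Spark 2.x and later the same job can also be written against the SparkSession entry point. This is a minimal sketch, not the original program; it assumes the HDFS layout created above:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # SparkSession wraps SparkContext since Spark 2.0
    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext
    counts = (sc.textFile("hdfs://node:9000/user/input/data/word.txt")
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda x, y: x + y))
    for word, n in counts.collect():
        print(word, n)
    spark.stop()

getOrCreate() returns an existing session if one is already running, which makes the script safe to re-run inside notebooks.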

1.3. Run the program with spark-submit

1.3.1. Run with spark-submit in local mode

  • Run the command
$ spark-submit wordcount.py
  • Check the results
$ cd ~/pyspark/PythonProject/data/
$ tree
.
├── output
│   ├── part-00000
│   └── _SUCCESS
└── word.txt

1 directory, 3 files
$ tail output/part-00000 
('hadoop', 2)
('spark', 3)
('hive', 2)
('java', 1)
('python', 2)
('perl', 1)
('RDD', 2)
('', 1)
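The ('', 1) entry is not a counting bug: the last line of word.txt ends with a trailing space ("RDD "), and split(" ") turns that into an empty token. If it is unwanted, a one-line change in wordcount.py filters empty strings out before counting:

stringRDD = textFile.flatMap(lambda line: line.split(" ")).filter(lambda word: word != "")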

1.3.2. Run with spark-submit in YARN mode

  • Run the command
$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount.py 
  • YARN application status
$ yarn application -list -appStates ALL

Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
                Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1530328746140_0001             WordCount                   SPARK        hadoop     default            FINISHED           SUCCEEDED             100%                                 N/A
  • Check the results
$ hadoop fs -cat /user/input/data/output/part-0000*
('python', 2)
('', 1)
('hadoop', 2)
('hive', 2)
('java', 1)
('spark', 3)
('perl', 1)
('RDD', 2)
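saveAsTextFile writes one part-* file per partition of countsRDD, which is why the wildcard part-0000* is used above. If a single output file is preferred, a minor variant of the save line coalesces the RDD to one partition first:

countsRDD.coalesce(1).saveAsTextFile(Path + "data/output")

This trades away write parallelism, so it is only sensible for small results like this one.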

Reposted from: https://blog.51cto.com/balich/2132267
