WordCount is the classic MapReduce example program.
Environment: Linux + Spark 1.6.2 + PyCharm
Reference documentation: http://spark.apache.org/docs/1.6.2/api/python/pyspark.html
Preparation: install Java, Maven, and related tools first, then download the Spark release (I used spark-1.6.2.tgz) and extract it to /data/work/spark-1.6.2.
Build it with Maven:
cd spark-1.6.2
mvn -DskipTests clean package
The whole build is quite painful and takes a long time.
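If the build dies with out-of-memory errors, the Spark 1.6 building guide suggests raising Maven's memory limits before running the command above, roughly like this:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"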
Once the build finishes, set up the development environment in PyCharm.
In the run configuration, set the environment variables as follows:
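(The original screenshot of these settings is not reproduced here; as a rough sketch, the two variables normally needed so that PyCharm can find the pyspark package are SPARK_HOME and PYTHONPATH. The exact py4j zip name depends on the release, so check $SPARK_HOME/python/lib.)
SPARK_HOME=/data/work/spark-1.6.2
PYTHONPATH=/data/work/spark-1.6.2/python:/data/work/spark-1.6.2/python/lib/py4j-0.9-src.zip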
The program source code is as follows:
#!/usr/bin/env python
# encoding: utf-8
# Description: PySpark WordCount example
# Reference:
# http://spark.apache.org/docs/1.6.2/api/python/pyspark.html
import logging
from operator import add
from pyspark import SparkContext
"""
@version:
@software: PyCharm
@file: test_python_word_count.py
@time: 16-7-4 上午10:39
"""
logging.basicConfig(format='%(message)s', level=logging.INFO)
test_file_name = "/data/work/python-workspace/hualv/spark/test-data.txt"
out_file_name = "/data/work/python-workspace/hualv/spark/spark-out"
# Word Count
sc = SparkContext("local", "Simple App")
# Load the input file as an RDD of lines
text_file = sc.textFile(test_file_name)
# Split each line into words, map each word to (word, 1), then sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# Note: saveAsTextFile fails if the output directory already exists
counts.saveAsTextFile(out_file_name)
sc.stop()
# flatMap: map each element first, then flatten. Return a new RDD by first applying
# a function to all elements of this RDD, and then flattening the results.
# rdd = sc.parallelize([2, 3, 4])
# print(rdd.flatMap(lambda x: range(1, x)).collect())
# map: apply the function to each element one-to-one.
# rdd = sc.parallelize(["b", "a", "c"])
# print(rdd.map(lambda x: (x, 1)).collect())
# reduceByKey: merge the values for each key using an associative reduce function.
# rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
# print(rdd.reduceByKey(add).collect())
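To run the script, assuming SPARK_HOME and PYTHONPATH are set as above, you can either execute it directly with the Python interpreter (the master is hard-coded to "local") or hand it to spark-submit; the counts are written as part-NNNNN text files under the spark-out directory:
python test_python_word_count.py
/data/work/spark-1.6.2/bin/spark-submit test_python_word_count.py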
The run results are as follows: