A small PySpark getting-started example (word count)
Import the required dependencies
import sys
from pyspark import SparkConf, SparkContext
Set up the entry point and validate the command-line arguments
if __name__ == '__main__':
    # Expect exactly two arguments: the input file and the output directory
    if len(sys.argv) != 3:
        print("Usage: wordcount <input> <output>", file=sys.stderr)
        sys.exit(-1)
Configure Spark and create the SparkContext
    conf = SparkConf()
    sc = SparkContext(conf=conf)
Define a method that prints the word counts to the console
    def printresult():
        # Split each line into words, map each word to (word, 1), then sum the counts per word
        counts = sc.textFile(sys.argv[1]).flatMap(lambda x: x.split(" ")) \
            .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
        # Collect once, then print the full list and each pair on its own line
        output = counts.collect()
        print(output)
        for (i, j) in output:
            print("%s:%s" % (i, j))
Define a method that saves the word counts to the output path
    def save_file():
        # Same word-count pipeline, but write the result to the output path instead of printing it
        sc.textFile(sys.argv[1]).flatMap(lambda x: x.split(" ")) \
            .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y) \
            .saveAsTextFile(sys.argv[2])
Write out the final word counts (call printresult() as well if you also want them on the console)
    save_file()
Shut down the program and release resources
    sc.stop()
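Before submitting the job, you can sanity-check the same flatMap/map/reduceByKey pipeline interactively. The snippet below is a minimal sketch, separate from the script above: it assumes a local master ("local[2]"), an arbitrary app name ("wordcount_test"), and a couple of made-up input lines fed in through parallelize, so no files or command-line arguments are needed.

from pyspark import SparkConf, SparkContext

# Local sanity check: run the word-count pipeline on a small in-memory RDD
conf = SparkConf().setMaster("local[2]").setAppName("wordcount_test")
sc = SparkContext(conf=conf)
lines = sc.parallelize(["hello spark", "hello world"])
counts = lines.flatMap(lambda x: x.split(" ")) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)
print(counts.collect())  # e.g. [('hello', 2), ('spark', 1), ('world', 1)], order may vary
sc.stop()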
Submit the script on the server with spark-submit
./spark-submit --master local[4] --name pyspark1006 /opt/pyspark_scripty/py_wc.py file:///opt/hello.txt file:///opt/pyspark_scripty/wc
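Note that saveAsTextFile writes a directory (file:///opt/pyspark_scripty/wc here) containing part-* files rather than a single file, and by default the job fails if that directory already exists, so remove it before re-running the submit command.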