Configuring the Spark environment
1) Download Spark: http://spark.apache.org/downloads.html
2) Change into spark-1.6.1-bin-hadoop2.4 and use it as the current directory
Open the Python Spark shell:
[root@Master spark-1.6.1-bin-hadoop2.4]# ./bin/pyspark
Read a file and create an RDD from it
>>> textFile = sc.textFile("README.md")
Print some information about the RDD
>>> textFile.count() # Number of items in this RDD
126
>>> textFile.first() # First item in this RDD
u'# Apache Spark'
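
The shell session above relies on the `sc` variable that ./bin/pyspark creates automatically. Outside the shell, a script has to build its own SparkContext. The following is a minimal sketch of that, not a definitive setup: the helper names `summarize` and `run_with_spark`, the application name "readme-summary", and the "local[*]" master are assumptions made for illustration, and it presumes the `pyspark` package is importable (for example when the script is launched with ./bin/spark-submit).

```python
def summarize(rdd):
    """Return the item count and the first item of an RDD.

    Works with any object that offers RDD-style count() and first().
    """
    return {"count": rdd.count(), "first": rdd.first()}


def run_with_spark(path="README.md"):
    """Create a local SparkContext, read `path` as an RDD of lines, summarize it."""
    # Imported lazily so that summarize() stays usable without Spark installed.
    from pyspark import SparkContext

    # "local[*]" and the app name are illustrative choices, not from the original notes.
    sc = SparkContext("local[*]", "readme-summary")
    try:
        return summarize(sc.textFile(path))
    finally:
        # Always release the context, even if reading the file fails.
        sc.stop()
```

Calling run_with_spark("README.md") from the Spark directory should return the same two values that the shell session prints with count() and first().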