1. Install the standalone prebuilt package:
tar zxvf spark-1.3.1-bin-cdh4.tgz
Note: the Hadoop version the package was built against must match the Hadoop version deployed on this machine, and the Java environment must already be set up.
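The locally deployed Hadoop version can be checked with:
hadoop version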
2. Declare the environment variable
vi .bashrc
Add the following line:
export SPARK_HOME=/home/hadoop/spark-1.3.1-bin-cdh4
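Optionally also put $SPARK_HOME/bin on the PATH, so spark-submit and pyspark can be invoked without their full paths:
export PATH=$SPARK_HOME/bin:$PATH
Then reload the file so the variables take effect in the current shell:
source ~/.bashrc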
3. To use pyspark, copy everything under the distribution's python subdirectory into the local Python site-packages directory:
sudo cp -r python/* /usr/lib/python2.6/site-packages/
So that pyspark can find the py4j it depends on, go into site-packages (the copy above leaves py4j inside a build subdirectory) and copy it up one level:
cd /usr/lib/python2.6/site-packages/build
sudo cp -r py4j/ ..
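A quick sanity check that both packages are now importable:
python -c "import pyspark, py4j; print pyspark.__file__"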
4. Run the example program, whose code is as follows:
from pyspark import SparkContext
sc = SparkContext("local[2]", "First Spark App")
# we take the raw data in CSV format and convert it into a set of records of the form (user, product, price)
data = sc.textFile("data/UserPurchaseHistory.csv").map(lambda line: line.split(",")).map(lambda record: (record[0], record[1], record[2]))
# let's count the number of purchases
numPurchases = data.count()
# let's count how many unique users made purchases
uniqueUsers = data.map(lambda record: record[0]).distinct().count()
# let's sum up our total revenue
totalRevenue = data.map(lambda record: float(record[2])).sum()
# let's find our most popular product
products = data.map(lambda record: (record[1], 1.0)).reduceByKey(lambda a, b: a + b).collect()
mostPopular = sorted(products, key=lambda x: x[1], reverse=True)[0]
# Finally, print everything out
print "Total purchases: %d" % numPurchases
print "Unique users: %d" % uniqueUsers
print "Total revenue: %2.2f" % totalRevenue
print "Most popular product: %s with %d purchases" % (mostPopular[0], mostPopular[1])
# stop the SparkContext
sc.stop()
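The script expects data/UserPurchaseHistory.csv to hold one user,product,price record per line; illustrative (hypothetical) contents:
John,iPhone Cover,9.99
John,Headphones,5.49
Jack,iPhone Cover,9.99
Jill,Samsung Galaxy Cover,8.95
Bob,iPad Cover,5.49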
Run it with:
python ***.py
The preferred way to run the script, though, is through spark-submit:
[hadoop@HADOOP-206 python-spark-app]$ /home/hadoop/spark-1.3.1-bin-cdh4/bin/spark-submit firstpy.py
5. The run above failed with the following error:
Input path does not exist: hdfs://HADOOP-206:19000/user/hadoop/data/UserPurchaseHistory.csv
By default Spark resolves the path against HDFS; to read a local file instead, pass an explicit file:// URI:
data = sc.textFile("file:///home/hadoop/spark_code/Chapter01/python-spark-app/data/UserPurchaseHistory.csv")
This is the correct way to access a local file.
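Conversely, to read the file from HDFS, upload it first and pass an explicit hdfs:// URI (reusing the NameNode address from the error message above); a sketch:
hadoop fs -mkdir -p /user/hadoop/data
hadoop fs -put data/UserPurchaseHistory.csv /user/hadoop/data/
data = sc.textFile("hdfs://HADOOP-206:19000/user/hadoop/data/UserPurchaseHistory.csv")
Once the file is uploaded there, the original relative path would also resolve.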
6. Before any of the above, to verify that Spark itself works, run the bundled example:
./run-example org.apache.spark.examples.SparkPi
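For reference, SparkPi estimates π by Monte Carlo sampling. A minimal pyspark equivalent of what it computes (a sketch, not the bundled Scala code; the sample count n is chosen arbitrarily here):

from pyspark import SparkContext
import random

sc = SparkContext("local[2]", "SparkPi sketch")
n = 100000  # number of random points to sample

def inside(_):
    # draw a point uniformly in the unit square and test whether it
    # falls inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(xrange(n)).filter(inside).count()
# the fraction of points inside the quarter circle approximates pi/4
print "Pi is roughly %f" % (4.0 * count / n)
sc.stop()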
Its run, traced from the log output, breaks down as follows:
1. spark.SparkContext (starts Spark)
2. spark.SecurityManager (sets up ACLs)
3. spark.SparkEnv (registers MapOutputTracker and BlockManagerMaster)
4. storage.DiskBlockManager, storage.MemoryStore
5. spark.HttpFileServer, spark.HttpServer
6. server.Server (Jetty), server.AbstractConnector
SparkUI started successfully on port 4040:
http://HADOOP-206:4040
7. spark.SparkContext (adds the job's JAR)
8. executor.Executor
starts the executor
9. util.AkkaUtils
10. netty.NettyBlockTransferService (creates a server on port 10654)
11. storage.BlockManagerMaster (tries to register the BlockManager)
12. storage.BlockManagerMasterActor (registers the BlockManager at localhost:10654)
13. storage.BlockManagerMaster (BlockManager registered successfully)
14. spark.SparkContext (starts the job)
15. scheduler.DAGScheduler (submits Stage 0)
16. storage.MemoryStore (stores block broadcast_0_piece0 in memory)
17. storage.BlockManagerInfo (records the storage info from step 16)
18. storage.BlockManagerMaster (updates the info for block broadcast_0_piece0)
19. spark.SparkContext (creates the broadcast in DAGScheduler)
20. scheduler.DAGScheduler (submits the two missing tasks of Stage 0)
21. scheduler.TaskSchedulerImpl (adds the task set)
22. scheduler.TaskSetManager (starts the tasks)
23. executor.Executor (runs and finishes the tasks)
24. scheduler.TaskSetManager (marks the tasks finished)
25. scheduler.TaskSchedulerImpl (removes the task set)
26. scheduler.DAGScheduler (Stage 0 and job 0 finished)
27. scheduler.DAGScheduler: stops the DAGScheduler
spark.MapOutputTrackerMasterActor: stops the MapOutputTrackerActor
storage.MemoryStore: clears the MemoryStore
storage.BlockManager: stops the BlockManager
storage.BlockManagerMaster: stops the BlockManagerMaster
spark.SparkContext: SparkContext stopped successfully