The previous article covered how to set up a PySpark environment on Linux; if you need that, see the earlier post on building a PySpark environment in a Linux virtual machine. This article explains how to run an already-written .py file directly in that environment.
File Sharing
The virtual machine and the host need a shared folder. First, install the VirtualBox Guest Additions. Before installing, run the following commands to put the required build tools in place:
yum update
yum install gcc
yum install gcc-c++
yum install make
yum install kernel-headers
yum install kernel-devel
The Guest Additions installer is compressed with bzip2, so install that in advance as well:
yum -y install bzip2*
If any of the commands above fail with a permission error, prefix them with sudo.
Then start installing the Guest Additions: in the VirtualBox menu, click Devices, then choose Insert Guest Additions CD image.
If a mount-related error appears, eject the previously inserted ISO first: choose Devices -> Optical Drives -> Remove disk from virtual drive, then repeat the previous step.
After the image is inserted, SSH into the virtual machine and change into the /mnt/tmp directory:
cd /mnt/tmp
sudo ./VBoxLinuxAdditions.run
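If the inserted CD does not show up under /mnt/tmp on your system, it can be mounted by hand before running the installer. A minimal sketch; the device name /dev/cdrom and the mount point /mnt/cdrom are assumptions, adjust them for your distribution:

```shell
# Mount the Guest Additions CD manually (device/mount point may differ)
sudo mkdir -p /mnt/cdrom
sudo mount /dev/cdrom /mnt/cdrom
cd /mnt/cdrom
# Run the installer from the mounted disc
sudo ./VBoxLinuxAdditions.run
```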
Once the installation succeeds, you can configure the shared folder. In the VirtualBox menu, choose Devices -> Shared Folders -> Shared Folders Settings.
After configuration, the shared path appears under /media/ in the guest. Copy files into the corresponding path on the host and they show up under the shared path inside the virtual machine. The shared folder is now set up.
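Shared folders mounted by the Guest Additions are owned by the vboxsf group, so a non-root user usually has to join that group before it can read them. A sketch; the share name sparkVM and the login user spark are hypothetical examples:

```shell
# Shared folders appear under /media with an sf_ prefix, e.g. /media/sf_sparkVM
ls /media/
# Grant a non-root user access (takes effect after re-login)
sudo usermod -aG vboxsf spark
```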
Running the py File
Create a new py file and add the following code:
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Create a tiny DataFrame, first without and then with column names
l = [('Alice', 1)]
spark.createDataFrame(l).collect()
data = spark.createDataFrame(l, ['name', 'age']).collect()
print(data)
print('hello spark')
Then run it from the command line:
spark-submit test1.py
The output will look similar to this:
[root@localhost sf_sparkVM]# spark-submit test1.py
2018-07-01 23:46:28 WARN Utils:66 - Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.0.104 instead (on interface enp0s3)
2018-07-01 23:46:28 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-07-01 23:46:29 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-01 23:46:29 INFO SparkContext:54 - Running Spark version 2.3.1
2018-07-01 23:46:29 INFO SparkContext:54 - Submitted application: Word Count
2018-07-01 23:46:29 INFO SecurityManager:54 - Changing view acls to: root
2018-07-01 23:46:29 INFO SecurityManager:54 - Changing modify acls to: root
2018-07-01 23:46:29 INFO SecurityManager:54 - Changing view acls groups to:
2018-07-01 23:46:29 INFO SecurityManager:54 - Changing modify acls groups to:
2018-07-01 23:46:29 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-07-01 23:46:30 INFO Utils:54 - Successfully started service 'sparkDriver' on port 35293.
2018-07-01 23:46:30 INFO SparkEnv:54 - Registering MapOutputTracker
2018-07-01 23:46:30 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-07-01 23:46:30 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-07-01 23:46:30 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-07-01 23:46:30 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-84bec309-d89d-48f6-88a3-1a0b880a71d7
2018-07-01 23:46:30 INFO MemoryStore:54 - MemoryStore started with capacity 413.9 MB
2018-07-01 23:46:30 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-07-01 23:46:30 INFO log:192 - Logging initialized @3342ms
2018-07-01 23:46:30 INFO Server:346 - jetty-9.3.z-SNAPSHOT
2018-07-01 23:46:30 INFO Server:414 - Started @3525ms
2018-07-01 23:46:30 INFO AbstractConnector:278 - Started ServerConnector@62115e4f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-01