Running PySpark on Linux, and How to Use PySpark

Configuring PySpark in PyCharm
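PySpark needs no special PyCharm plugin; the interpreter just has to be able to import the pyspark and py4j packages bundled inside a Spark distribution. Below is a minimal sketch of one common approach, done at the top of the script itself. The SPARK_HOME path is illustrative (it matches the Spark 2.3.1 distribution that appears in the log further down) and must be adapted to your machine; the same variables can also be set in the PyCharm run configuration instead of in code.

import os
import sys

# Illustrative path -- on Linux this might be /opt/spark-2.3.1-bin-hadoop2.7.
SPARK_HOME = r"E:\hadoop-common\spark-2.3.1-bin-hadoop2.7"
os.environ["SPARK_HOME"] = SPARK_HOME
# Run the workers with the same interpreter that runs this script;
# otherwise Spark falls back to whatever "python" is on the PATH.
os.environ["PYSPARK_PYTHON"] = sys.executable

# Make the bundled pyspark and py4j packages importable.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.7-src.zip"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("SparkTest").getOrCreate()
print(spark.version)
spark.stop()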

On Windows the following error occurs; on Linux it should run normally.

C:\ProgramData\Anaconda3\envs\tensorflow\python.exe E:/github/data-analysis/tf/SparkTest.py
2018-07-19 10:35:41 ERROR Shell:397 - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
2018-07-19 10:35:41 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Binarizer output with Threshold = 5.100000
[Stage 0:> (0 + 1) / 1]2018-07-19 10:35:51 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.<init>(Unknown Source)
    at java.lang.ProcessImpl.start(Unknown Source)
    ... 35 more
2018-07-19 10:35:51 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.<init>(Unknown Source)
    at java.lang.ProcessImpl.start(Unknown Source)
    ... 35 more
2018-07-19 10:35:51 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "E:/github/data-analysis/tf/SparkTest.py", line 21, in <module>
    binarizedDataFrame.show()
  File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 350, in show
    print(self._jdf.showString(n, 20, vertical))
  File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.<init>(Unknown Source)
    at java.lang.ProcessImpl.start(Unknown Source)
    ... 35 more
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at
Caused by: java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。

Process finished with exit code

This failure is specific to the Windows operating system: winutils.exe is missing from the Hadoop binaries, and the executors cannot spawn a "python" process (系统找不到指定的文件。 is the Windows-localized "The system cannot find the file specified").
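If you do want the script to run on Windows, both failures can usually be worked around by telling Spark where winutils.exe and the Python interpreter live before the SparkSession is created. A hedged sketch; the HADOOP_HOME path is hypothetical, and its bin subdirectory must contain a winutils.exe matching your Hadoop version:

import os
import sys

# Works around "Failed to locate the winutils binary": HADOOP_HOME\bin
# must contain winutils.exe. The path below is hypothetical.
os.environ["HADOOP_HOME"] = r"E:\hadoop-common"
# Works around 'Cannot run program "python"': spawn workers with this interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable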

Using PySpark in a Jupyter Notebook with Anaconda

Open Anaconda Navigator and install the pyspark package from there. It then runs as-is, with no Hadoop configuration required.
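For reference, here is a minimal script along the lines of what SparkTest.py apparently ran, reconstructed from the log above (a Binarizer with threshold 5.1, followed by binarizedDataFrame.show()); the input data is made up for illustration. It runs directly in a notebook once pyspark is installed:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.master("local[*]").appName("BinarizerExample").getOrCreate()

# Toy data: values above the threshold become 1.0, the rest 0.0.
df = spark.createDataFrame([(0, 0.1), (1, 8.0), (2, 0.2)], ["id", "feature"])

binarizer = Binarizer(threshold=5.1, inputCol="feature", outputCol="binarized_feature")
binarizedDataFrame = binarizer.transform(df)

print("Binarizer output with Threshold = %f" % binarizer.getThreshold())
binarizedDataFrame.show()

spark.stop()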


