Installing standalone Spark on Ubuntu 20

Go to the Spark official website homepage and scroll down to choose from the historical releases.


Choose the Spark 3.0.1 .tgz package (spark-3.0.1-bin-hadoop2.7.tgz).
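Alternatively, the same archive can be fetched from the command line; a minimal sketch, assuming the standard Apache release-archive URL layout:

wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz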

After the download completes, go to the directory containing the downloaded file:

Move the file to the /usr/local directory and extract it there:

sudo mv spark-3.0.1-bin-hadoop2.7.tgz /usr/local

cd /usr/local

sudo tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz

Enter the extracted directory and start the Spark shell:

cd spark-3.0.1-bin-hadoop2.7

cd ./bin

./spark-shell

From the startup output you can see that Spark 3.0.1 depends on Scala version 2.12.10.

Therefore, you need to download the Scala 2.12.10 package and install it locally.

Go to the official Scala website, download it, and extract it to the /usr/local directory.
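A minimal command-line sketch of that step, assuming the standard Lightbend download URL for Scala 2.12.10:

wget https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz

sudo tar -zxvf scala-2.12.10.tgz -C /usr/local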

After extracting, add one line to the current user's environment variables:

vim ~/.bashrc

export PATH=$PATH:/usr/local/scala-2.12.10/bin

Then reload the configuration so the change takes effect in the current shell:

source ~/.bashrc

Run scala -version; if it prints the version normally, the installation succeeded.
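To confirm the PATH change took effect, a quick check (the path is the install location assumed above):

which scala    # should print /usr/local/scala-2.12.10/bin/scala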


A small example of using Spark to compute word frequencies (over the README.md file in the Spark installation directory):

Enter the pyspark environment (from the Spark installation directory, run ./bin/pyspark):

Enter the following three lines of code:

lines = sc.textFile("file:///usr/local/(your Spark directory, e.g. spark-3.0.1-bin-hadoop2.7-hive1.2)/README.md")    # A nonexistent path does not raise an error here: Spark builds RDDs lazily, so the failure surfaces only when the data is actually read
lines.count()   # number of lines in README.md
lines.first()   # first line of README.md
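The three lines above only count and preview lines; for the promised word frequencies, the usual pattern continues from the same lines RDD. A minimal sketch (showing the ten most frequent words with takeOrdered is an illustrative choice, not part of the original example):

words = lines.flatMap(lambda line: line.split())    # split each line into words
pairs = words.map(lambda w: (w, 1))                 # pair every word with a count of 1
counts = pairs.reduceByKey(lambda a, b: a + b)      # sum the counts for each word
counts.takeOrdered(10, key=lambda kv: -kv[1])       # ten most frequent (word, count) pairs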


If the file does not exist, an error like the following is reported:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 1141, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 1132, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 1003, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 889, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/sql/utils.py", line 128, in deco
    return f(*a, **kw)
  File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark-3.0.1-bin-hadoop2-hive1.2/README.md
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:55)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:168)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
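Because of the lazy evaluation noted above, this error only appears once count() forces a job to run. If you want to fail fast instead, you can check the path on the driver before building the RDD; a minimal sketch, using the same hypothetical install directory as above:

import os

path = "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/README.md"
assert os.path.exists(path), "README.md not found at " + path    # fails immediately, not at count()
lines = sc.textFile("file://" + path)    # path starts with /, so this yields file:///usr/local/...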