Open the Spark website homepage, scroll down, and go to the archived releases.
Choose the Spark 3.0.1 .tgz package.
Once the download finishes, change into the download directory.
Enter the following commands to move the file to /usr/local and extract it there:
sudo mv spark-3.0.1-bin-hadoop2.7.tgz /usr/local
cd /usr/local
sudo tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz
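Because the archive was extracted with sudo, the directory belongs to root. Optionally, hand ownership to the current user so the later steps do not need sudo (a convenience step, not strictly required):
sudo chown -R $USER /usr/local/spark-3.0.1-bin-hadoop2.7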
Enter the extracted directory and start the Spark shell:
cd spark-3.0.1-bin-hadoop2.7/bin
./spark-shell
The output shows that Spark 3.0.1 depends on Scala 2.12.10 (the startup banner prints the Scala version).
Therefore, download the Scala 2.12.10 binary package and install it locally.
Get it from the official Scala website and extract it to /usr/local.
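A minimal sketch of the extraction step, assuming the downloaded archive is named scala-2.12.10.tgz and sits in the current directory:
sudo tar -zxvf scala-2.12.10.tgz -C /usr/local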
After extraction, add one line to the current user's environment variables:
vim ~/.bashrc
export PATH=$PATH:/usr/local/scala-2.12.10/bin
Run source ~/.bashrc to apply the change, then run scala -version; normal version output indicates a successful installation.
A small word-count example with Spark (computing word frequencies for the README.md file in the Spark installation directory):
Enter the pyspark environment:
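The PySpark shell is started from the same bin directory of the Spark installation:
./pyspark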
Enter the following three lines of code:
lines = sc.textFile("file:///usr/local/(your Spark directory, e.g. spark-3.0.1-bin-hadoop2.7-hive1.2)/README.md")  # a bad path does not fail here: RDD transformations are lazily evaluated, so the error only surfaces when an action uses the data
lines.count()
lines.first()
If the file does not exist, the following error is reported:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 1141, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 1132, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 1003, in fold
vals = self.mapPartitions(func).collect()
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/rdd.py", line 889, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/pyspark/sql/utils.py", line 128, in deco
return f(*a, **kw)
File "/usr/local/spark-3.0.1-bin-hadoop2.7-hive1.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark-3.0.1-bin-hadoop2-hive1.2/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:55)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:168)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
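The three lines above only count lines and show the first one. To compute the actual word frequencies this example promises, the classic flatMap/map/reduceByKey pipeline can be used once the path is correct. A minimal sketch to run inside the same pyspark session (the path below is an assumption; adjust it to your installation):

# Inside pyspark, `sc` (the SparkContext) is already defined.
# The path is an assumption; replace it with your actual Spark directory.
lines = sc.textFile("file:///usr/local/spark-3.0.1-bin-hadoop2.7/README.md")
word_counts = (lines
    .flatMap(lambda line: line.split())  # split each line into words
    .map(lambda word: (word, 1))         # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b))    # sum the counts per word
# print the 10 most frequent words
for word, count in word_counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)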