1 Prerequisites
Download and install Scala (2.12.7 is used here):
https://www.scala-lang.org/download/all.html
2 Install Spark
Download and install Spark:
http://spark.apache.org/downloads.html
3 Configuration
- First configure the system environment variables (a minimal sketch follows this list)
- Then proceed with the installation
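A minimal sketch of the environment variables, assuming Spark is extracted to D:\spark-2.3.2 and Hadoop to D:\hadoop-3.1.1 (the directories used in the commands that follow; adjust the paths to your own layout):
SPARK_HOME = D:\spark-2.3.2
HADOOP_HOME = D:\hadoop-3.1.1
PATH = %PATH%;%SPARK_HOME%\bin;%HADOOP_HOME%\bin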
Start Hadoop with start-all.cmd, and first confirm that /tmp/hive exists:
D:\hadoop-3.1.1\bin>hadoop fs -ls /tmp/hive
Found 2 items
drwx-wx-wx - hawkzy supergroup 0 2018-10-24 16:32 /tmp/hive/_resultscache_
drwx------ - hawkzy supergroup 0 2018-10-24 16:38 /tmp/hive/hawkzy
Change its permissions:
D:\hadoop-3.1.1\bin>hadoop fs -chmod 777 /tmp/hive
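To confirm the change took effect, you can list the directory entry itself; the mode should now read drwxrwxrwx:
D:\hadoop-3.1.1\bin>hadoop fs -ls -d /tmp/hive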
Test spark-shell:
D:\spark-2.3.2\bin>spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-RNHEU8M:4040
Spark context available as 'sc' (master = local[*], app id = local-1540559586045).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :quit
4 PySpark
If you install directly with pip install pyspark, it can be very slow.
It is recommended to download the tar.gz package directly from PyPI instead:
Download PySpark:
https://pypi.org/project/pyspark/#files
Alternatively, install from one of the domestic (China) mirrors; commonly used ones are:
Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Aliyun: https://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/
The install command; since I want it installed for Python 3, I use pip3:
pip3 install -i https://mirrors.aliyun.com/pypi/simple pyspark
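Optionally, if you want the pip-installed PySpark to match the Spark 2.3.2 installation above, you can pin the version (assuming that release is available on the mirror):
pip3 install -i https://mirrors.aliyun.com/pypi/simple pyspark==2.3.2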
If you downloaded the tar.gz package instead, extract it, change into the extracted PySpark directory, and install it from there:
Run python setup.py install
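Either way, a quick sanity check that the package is importable and reports the expected version:
python -c "import pyspark; print(pyspark.__version__)"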
Create spark_test.py:
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    # Create a SparkContext for this word-count job
    sc = SparkContext(appName="PythonWordCount")
    # Read the input file from the current working directory
    lines = sc.textFile('words.txt')
    # Split each line into words, map each word to (word, 1), then sum the counts per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
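Note that besides calling it from the pyspark shell via os.system as shown below, the script can also be run directly with spark-submit from the directory containing words.txt, for example (the workspace path is the one used later in this article):
D:\Python\workspace>spark-submit spark_test.py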
Create words.txt:
old new history
hadoop spark hive
good spark mlib
cool spark bad
spark good hive bad
Run pyspark:
D:\Python\workspace>pyspark
Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
Using Python version 3.7.0 (default, Jun 28 2018 08:04:48)
SparkSession available as 'spark'.
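Before running the script, a quick one-liner confirms the session works; it uses the spark session the shell already provides (the count of a 10-element range is deterministic, so the output shown is what you should see):
>>> spark.range(10).count()
10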
Try the script prepared earlier (run import os first if it is not already available in the session; the trailing 0 below is the return code from os.system):
>>> os.system(r"python d:\python\workspace\spark_test.py")
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
new: 1
hadoop: 1
hive: 2
good: 2
old: 1
history: 1
spark: 4
mlib: 1
cool: 1
bad: 2
0
>>> SUCCESS: The process with PID 13308 (a child process of PID 11872) has been terminated.
SUCCESS: The process with PID 11872 (a child process of PID 13340) has been terminated.
SUCCESS: The process with PID 13340 (a child process of PID 15552) has been terminated.
Done!