Reference: the official site for Lin Ziyu's *Spark Programming Basics* (《Spark编程基础》)
My machine is on its last legs and I'll probably replace it soon, so I'm writing down the basic commands here to make the next install faster.
Installing Anaconda
Download the installer from the Tsinghua University Anaconda mirror:
Anaconda3-2020.02-Linux-x86_64.sh
$ cd ~/下载  # the Downloads directory
$ bash Anaconda3-2020.02-Linux-x86_64.sh
Page through the license, then answer yes.
Press Enter to accept the default install path.
When asked to initialize conda, answer yes. Don't press Enter while the installer is still running, or the prompt will default to no.
$ conda -V
$ anaconda -V
$ conda config --set auto_activate_base false  # stop auto-activating the base environment
$ vim ~/.bashrc  # no sudo needed for your own ~/.bashrc
export PATH=$PATH:/home/hadoop/anaconda3/bin
$ source ~/.bashrc
$ anaconda -V
anaconda Command line client (version 1.7.2)
Configuring Jupyter Notebook
$ conda install jupyter notebook
$ jupyter notebook --generate-config
$ cd /home/hadoop/anaconda3/bin
$ ./python  # enter the Python interpreter
>>> from notebook.auth import passwd
>>> passwd()
'sha1:4b2678fa7669:037692fc089b07c56f10b5b50e11e00e5a87c4b3'
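For reference, the `'sha1:salt:digest'` string that `passwd()` prints can be reproduced with plain `hashlib`. This is just a sketch of the format (the real function draws a random salt; the salt and passphrase below are made up):

```python
import hashlib

def make_passwd_hash(passphrase: str, salt: str) -> str:
    """Build a hash in the same 'algo:salt:digest' layout that
    notebook.auth.passwd() emits: sha1 over passphrase + salt."""
    digest = hashlib.sha1(passphrase.encode("utf-8") + salt.encode("ascii")).hexdigest()
    return f"sha1:{salt}:{digest}"

def check_passwd(passphrase: str, hashed: str) -> bool:
    """Recompute the digest from the stored salt and compare."""
    algo, salt, digest = hashed.split(":")
    h = hashlib.new(algo, passphrase.encode("utf-8") + salt.encode("ascii"))
    return h.hexdigest() == digest

token = make_passwd_hash("secret", "4b2678fa7669")
print(check_passwd("secret", token))   # True
print(check_passwd("wrong", token))    # False
```

Knowing the layout makes it obvious why the whole string, including the `sha1:` prefix, has to be copied into the config file.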
$ vim ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.ip = '*'  # allow access from any IP
c.NotebookApp.password = 'sha1:4b2678fa7669:037692fc089b07c56f10b5b50e11e00e5a87c4b3'  # the hash copied above
c.NotebookApp.open_browser = False  # don't auto-open a browser
c.NotebookApp.port = 8888  # port
c.NotebookApp.notebook_dir = '/home/hadoop/jupyternotebook'  # directory Notebook starts in
$ cd /home/hadoop
$ mkdir jupyternotebook
$ jupyter notebook
Open localhost:8888 in a browser and enter the password.
Using PySpark from Jupyter Notebook
$ vim ~/.bashrc
# delete the old line: export PYSPARK_PYTHON=python3
export PYSPARK_PYTHON=/home/hadoop/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda3/bin/python
$ source ~/.bashrc
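A quick sanity check you can run in a notebook cell (or plain Python) after reloading `.bashrc`: the interpreter and `PYSPARK_PYTHON` should both point at the Anaconda binary.

```python
import os
import sys

# sys.executable is the interpreter actually running this cell;
# PYSPARK_PYTHON is what Spark will hand to its executors.
print(sys.executable)
print(os.environ.get("PYSPARK_PYTHON"))
```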
Test
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
logFile = "file:///usr/local/spark/README.md"
logData = sc.textFile(logFile, 2).cache()
numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()
print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))
# Lines with a: 62, Lines with b: 31
# make sure the file path is right
A bit slow; be patient.
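The filter/count logic itself can be sanity-checked without Spark on a tiny in-memory sample (hypothetical lines, not the real README):

```python
# Same predicate as the Spark job, applied to a plain Python list.
lines = [
    "Apache Spark",
    "a fast and general engine",
    "built on top of Hadoop",
    "README",
]
num_as = sum(1 for line in lines if "a" in line)
num_bs = sum(1 for line in lines if "b" in line)
print(f"Lines with a: {num_as}, Lines with b: {num_bs}")
# Lines with a: 3, Lines with b: 1
```

Note the check is case-sensitive, which is why "README" counts for neither.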
Ah, the run failed because HDFS wasn't started.
Whenever a file:// or hdfs:// path shows up, start HDFS first.