Setting up Python 3.5 + Jupyter with SparkR, Scala, and PySpark in an Anaconda environment

Multi-user JupyterHub + Kubernetes authentication: https://my.oschina.net/u/2306127/blog/1837196

https://ojerk.cn/Ubuntu%E4%B8%8B%E5%A4%9A%E7%94%A8%E6%88%B7%E7%89%88jupyterhub%E9%83%A8%E7%BD%B2/

Ubuntu 16.04

curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh

This installer ships Python 3.5.2.

We can now verify the integrity of the installer with a SHA-256 checksum, using the sha256sum command with the script's filename:

sha256sum Anaconda3-4.2.0-Linux-x86_64.sh

You’ll receive output that looks similar to this:

73b51715a12b6382dd4df3dd1905b531bd6792d4aa7273b2377a0436d45f0e78  Anaconda3-4.2.0-Linux-x86_64.sh

You should check the output against the hashes available on the Anaconda with Python 3 on 64-bit Linux page for your Anaconda version. As long as your output matches the hash displayed in the sha256 row, you're good to go.
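To avoid comparing the hash by eye, you can let sha256sum do the check itself (a small sketch; substitute the hash published for your installer version):

echo "73b51715a12b6382dd4df3dd1905b531bd6792d4aa7273b2377a0436d45f0e78  Anaconda3-4.2.0-Linux-x86_64.sh" | sha256sum -c -
# prints "Anaconda3-4.2.0-Linux-x86_64.sh: OK" when the file is intact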

Now we can run the script:

bash Anaconda3-4.2.0-Linux-x86_64.sh

Anaconda3 will now be installed into this location:

/opt/anaconda3 
[/opt/anaconda3] >>> 

Add Anaconda to your PATH (a bare PATH=/opt/anaconda3/bin would clobber the existing PATH):

export PATH=/opt/anaconda3/bin:$PATH

conda create -n jupyter_py352_env python=3.5.2

Installing JupyterHub in single-server, multi-user mode

First, install Python 3 or newer.

source activate jupyter_py352_env

Run the following commands:

sudo apt-get install gcc
sudo apt-get install openssl
sudo apt-get install libssl-dev

On CentOS:

sudo yum install openssl-devel


# Install Python 3 and pip (JupyterHub depends on Python 3)
sudo apt-get install python3-pip

# Install npm and nodejs-legacy
sudo apt-get install npm nodejs-legacy

1.2 Installing Node.js

Install nodejs and npm:

apt install nodejs-legacy
apt install npm

Update them (recommended; the versions from apt are old and cause many errors):

npm install -g n    # install the n version manager
n stable            # update nodejs to the stable release
npm install -g npm  # update npm itself
# Install the hub and the proxy
conda install jupyterhub
npm install -g configurable-http-proxy
# needed if running the notebook servers locally
pip install jupyter-conda
conda install notebook 

Verify the installation:

jupyterhub -h

configurable-http-proxy -h

jupyterhub --no-ssl

Add users and a group for logging in:

  • groupadd jupyter_usergroup
  • sudo useradd -c "jupyter user test" -g jupyter_usergroup -d /home/jupyter_user2 -m jupyter_user2
  • ls /home/
  • useradd jupyter_user2

    Set its password (passwd does not accept the password as an argument):
    echo 'jupyter_user2:123456' | chpasswd    # or interactively: passwd jupyter_user2

    Example: add user jupyter_user2 to the jupyter_usergroup group:
    usermod -g jupyter_usergroup jupyter_user2

  • List all users in the jupyterhub group:
    GID=`grep 'jupyterhub' /etc/group | awk -F':' '{print $3}'`
    awk -F':' '{print $1"\t"$4}' /etc/passwd | grep $GID
  • passwd <username>  # change a user's password (run as root)

Generate a configuration file with:

jupyterhub --generate-config

Edit the configuration file:


## The number of threads to allocate for encryption
# c.CryptKeeper.n_threads = 8
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
c.PAMAuthenticator.encoding = 'utf-8'
c.LocalAuthenticator.create_system_users = True
c.LocalAuthenticator.group_whitelist = {'jupyterhub'}
c.Authenticator.whitelist = {'ubuntu','jupyter_user1','jupyter_user2','test01'}
c.JupyterHub.admin_users = {'ubuntu'}
c.JupyterHub.statsd_prefix = 'jupyterhub'

This adds the whitelisted users and the admin user.

Log in at serverip:8000.

Installing JupyterLab

Reference: 云服务器搭建神器JupyterLab(多图)_D介子的博客-CSDN博客
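If you want JupyterHub to open JupyterLab for each user instead of the classic notebook, one commonly used setting (assuming jupyterlab is installed into the same environment) is to add this to jupyterhub_config.py:

c.Spawner.default_url = '/lab'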

(ljt_env) ubuntu@node1:~/ljt_test$ python

Python 3.5.4 |Anaconda, Inc.| (default, Feb 19 2018, 10:59:04)

[GCC 7.2.0] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> from notebook.auth import passwd

>>> passwd()

Enter password:

Verify password:

'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'
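The same hash can also be generated non-interactively (a one-liner sketch; replace 'yourpassword' with your own):

python -c "from notebook.auth import passwd; print(passwd('yourpassword'))"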

vim /home/ubuntu/.jupyter/jupyter_notebook_config.py

Append the following at the end of the file:

c.NotebookApp.allow_root = True

c.NotebookApp.ip = '0.0.0.0'

c.NotebookApp.notebook_dir = u'/home/ubuntu/ljt_test/jupyterhubHome'

c.NotebookApp.open_browser = False

c.NotebookApp.password = u'sha1:25f46ecf43f0:8f778092033e870fec6718189eaeba118aec807a'

c.NotebookApp.port = 8000

(ljt_env) ubuntu@node1:~/ljt_test$ jupyter-lab --version

0.34.9

Starting JupyterHub

1. Activate the Python 3.5 virtual environment:

 cd /home/ubuntu/ljt_test

source activate ljt_env

2. Check the JupyterHub service:

ps -ef | grep jupyterhub

lsof -i:8000

kill -9 45654   # kill the stale process by the PID found above

tail -fn200 /home/ubuntu/ljt_test/jupyterhub.log

3. If the JupyterHub service is not running, start it:

nohup jupyterhub --no-ssl > jupyterhub.log &

nohup jupyterhub -f /etc/jupyterhub_config.py --no-ssl > jupyterhub.log &

nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --ssl-key /home/ubuntu/ljt_test/mykey.key --ssl-cert /home/ubuntu/ljt_test/mycert.pem  > jupyterhub.log &
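If you do not yet have a key/certificate pair for the --ssl-key/--ssl-cert variant, a self-signed pair can be generated like this (a sketch; the file names match the command above):

openssl req -x509 -newkey rsa:2048 -nodes -days 365 -keyout /home/ubuntu/ljt_test/mykey.key -out /home/ubuntu/ljt_test/mycert.pem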

4. Test access

Visit IP:port in a browser.

5. User management

Whitelisted users are created automatically, but with no password; a password must be set before they can log in.

Add a new user: sudo useradd -d /home/ubuntu/jupyter_user2 -m jupyter_user2

Add the user to the group: sudo adduser jupyter_user2 jupyterhub

Change a user's password: echo crxis:crxis|chpasswd

Troubleshooting:

PAM Authentication failed (test01@222.180.208.234): [PAM Error 7] Authentication failure

Authenticator overview:

PAMAuthenticator  — the default, built-in authenticator
OAuthenticator    — OAuth + JupyterHub authenticator = OAuthenticator
LdapAuthenticator — a simple LDAP authenticator plugin for JupyterHub
kdcAuthenticator  — a Kerberos authenticator plugin for JupyterHub
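For example, switching from the default PAM to LDAP might look like this in jupyterhub_config.py (a minimal sketch, assuming pip install jupyterhub-ldapauthenticator and a hypothetical server ldap.example.com):

c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = 'ldap.example.com'
c.LDAPAuthenticator.bind_dn_template = ['uid={username},ou=people,dc=example,dc=com']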

https://blog.chairco.me/posts/2018/06/how%20to%20build%20a%20jupytre-hub%20for%20team.html

Note: on my server the default login user is already the administrator, so everything below was done as root. If the current user is not an administrator, all kinds of odd problems appear (my guess is that this is an authentication issue; the blog above documents the PAM authentication setup in detail, which is worth trying).

Quickstart — JupyterHub documentation

如何在非安全的CDH集群中部署多用户JupyterHub服务并集成Spark2 - 腾讯云开发者社区-腾讯云

JupyterHub与OpenLDAP集成 - 腾讯云开发者社区-腾讯云

  • jupyterhub -f ./jupyterhub_config.py --ssl-key ./mykey.key --ssl-cert ./mycert.pem 

References:

How To Install the Anaconda Python Distribution on Ubuntu 16.04 | DigitalOcean

Installation of Jupyterhub on remote server · jupyterhub/jupyterhub Wiki · GitHub

JupyterHub的安装与配置——让Jupyter支持多用户 - crxis - 博客园

[远程使用Jupyter]本地使用服务器端运行的Jupyter Notebook_Papageno2018的博客-CSDN博客

https://49.4.6.110:8000/

云服务器搭建神器JupyterLab(多图)_D介子的博客-CSDN博客

记一次在服务器上配置 Jupyterhub 作为系统服务

https://blog.huzicheng.com/2018/01/04/jupyter-as-a-service/

Configuring JupyterHub:

配置 jupyterhub – 一方的天地

Making PySpark work inside JupyterHub

conda install -c conda-forge pyspark==2.2.0 

vim /etc/profile 

export JAVA_HOME=/usr/java/jdk1.8.0_181
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:${JAVA_HOME}/lib
export PATH=$PATH:$HADOOP_HOME/bin   # assumes HADOOP_HOME is already set elsewhere
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export PATH=${SPARK_HOME}/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

source /etc/profile
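A quick way to confirm the environment is wired up (a sketch):

python -c "import pyspark; print(pyspark.__version__)"
spark-submit --version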

How to integrate JupyterHub with the existing Cloudera cluster

How to integrate JupyterHub with the existing Cloudera cluster · Issue #2116 · jupyterhub/jupyterhub · GitHub

export PYSPARK_DRIVER_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

vim /home/ubuntu/anaconda3/share/jupyter/kernels/python3/kernel.json

{
	"argv": ["/home/ljt/anaconda3/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
	"display_name": "Python 3.6+Pyspark2.4.3",
	"language": "python",
	"env": {
		"HADOOP_CONF_DIR": "/mnt/e/hadoop/3.1.1/conf",
		"PYSPARK_PYTHON": "/home/ljt/anaconda3/bin/python",
		"SPARK_HOME": "/mnt/f/spark/2.4.3",
		"WRAPPED_SPARK_HOME": "/etc/spark",
		"PYTHONPATH": "/mnt/f/spark/2.4.3/python/:/mnt/f/spark/2.4.3/python/lib/py4j*src.zip",
		"PYTHONSTARTUP": "/mnt/f/spark/2.4.3/python/pyspark/shell.py",
		"PYSPARK_SUBMIT_ARGS": "--master yarn  --deploy-mode client pyspark-shell" 
	}
}
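With this kernel selected, PYTHONSTARTUP runs pyspark's shell.py, so the first cell should already see spark and sc. A minimal smoke test (assuming the kernel started cleanly against YARN):

# `sc` and `spark` are created by pyspark/shell.py via PYTHONSTARTUP
print(sc.parallelize(range(100)).sum())   # expect 4950
print(spark.range(10).count())            # expect 10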

jupyterhub -f jupyterhub_config.py

nohup jupyterhub -f /home/ubuntu/ljt_test/jupyterhub_config.py --no-ssl > jupyterhub.log &

nohup jupyter lab --config=/home/ubuntu/.jupyter/jupyter_notebook_config.py   > jupyter.log &

Scala

https://medium.com/@bogdan.cojocar/how-to-run-scala-and-spark-in-the-jupyter-notebook-328a80090b3b

Step 1: install the package

pip install spylon-kernel

Step 2: create a kernel spec

This will allow us to select the Scala kernel in the notebook.

python -m spylon_kernel install

Step 3: start the Jupyter notebook

jupyter notebook

In the notebook we select New -> spylon-kernel. This will start our Scala kernel.

Step 4: testing the notebook

Let's write some Scala code:

val x = 2
val y = 3
x+y
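spylon-kernel also lets you configure the underlying Spark session from the first cell via its %%init_spark magic (a sketch; the values are assumptions for a local run):

%%init_spark
launcher.master = "local[*]"
launcher.conf.spark.executor.memory = "2g"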

R kernel

conda install -c r r-irkernel 

Notebook :: Anaconda.org

conda install -c r r-essentials

conda create -n my-r-env -c r r-essentials

GitHub - sparklyr/sparklyr: R interface for Apache Spark

SparkR (R on Spark) - Spark 2.2.0 Documentation

SparkR安装 - lalabola - 博客园

export LD_LIBRARY_PATH="/usr/java/jdk1.8.0_181/jre/lib/amd64/server"

rJava

Installing RJava (Ubuntu) · hannarud/r-best-practices Wiki · GitHub

The Java environment needs to live under /usr/lib/jvm.
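After installing the JDK, a commonly recommended step is to re-run R's Java configuration so rJava can find it, then install the package (the repos URL is an assumption; any CRAN mirror works):

sudo R CMD javareconf
R -e 'install.packages("rJava", repos="https://cloud.r-project.org")'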

Installing and configuring SparkR on YARN

SparkR on Yarn 安装配制 - 简书

conda create -p /home/ubuntu/anaconda3/envs/r_env --copy -y -q r-essentials -c r

sparkR --executor-memory 2g --total-executor-cores 10 --master spark://node1.sc.com:7077

http://cleverowl.uk/2016/10/15/installing-jupyter-with-the-pyspark-and-r-kernels-for-spark-development/

sparklyr

Sys.setenv(SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark')

.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))

library(SparkR)

sc <- sparkR.init(master='yarn-client', sparkPackages="com.databricks:spark-csv_2.11:1.5.0")

sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "cars.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")

head(df)

configurable-http-proxy --ip **.*.6.110 --port 8000 --log-level=debug
openssl rand -hex 32
0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3
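The generated hex string is the shared secret between the hub and the proxy. A sketch of wiring it up when running configurable-http-proxy separately (token value reused from the example above; 8081 is JupyterHub's default hub API port):

export CONFIGPROXY_AUTH_TOKEN=0b042f8d651fb8126537d1ec98507b093653d1ffe4b909f053616062184b1db3
configurable-http-proxy --ip 0.0.0.0 --port 8000 --default-target http://127.0.0.1:8081
# JupyterHub reads the same token from CONFIGPROXY_AUTH_TOKEN (or c.JupyterHub.proxy_auth_token)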

sparklyr:

sparklyr - Configuring Spark Connections

sparklyr: a test drive on YARN | R-bloggers

Ubuntu 下安装sparklyr 并连接远程spark集群_The_One_is_all的博客-CSDN博客

# .R script showing capabilities of sparklyr R package
# Prerequisites before running this R script: 
# Ubuntu 16.04.3 LTS 64-bit, r-base (version 3.4.1 or newer), 
# RStudio 64-bit version, libssl-dev, libcurl4-openssl-dev, libxml2-dev
install.packages("httr")
install.packages("xml2")
# New features in sparklyr 0.6:
# https://blog.rstudio.com/2017/07/31/sparklyr-0-6/
install.packages("sparklyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
library(sparklyr)
library(dplyr)
library(ggplot2)
library(tidyr)
set.seed(100)
# sparklyr cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/sparklyr.pdf
# dplyr+tidyr: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
# sparklyr currently (2017-08-19) only supports Apache Spark version 2.2.0 or older
# Install Spark locally:
sc_version <- "2.2.0"
spark_install(sc_version)
config <- spark_config()
# number of CPU cores to use:
config$spark.executor.cores <- 6
# amount of RAM to use for Apache Spark executors:
config$spark.executor.memory <- "4G"
# Connect to local version:
sc <- spark_connect(master = "local", config = config, version = sc_version)
# Copy data to Spark memory:
import_iris <- sdf_copy_to(sc, iris, "spark_iris", overwrite = TRUE) 
# partition data:
partition_iris <- sdf_partition(import_iris,training=0.5, testing=0.5) 
# Create a hive metadata for each partition:
sdf_register(partition_iris,c("spark_iris_training","spark_iris_test")) 
# Create reference to training data in Spark table
tidy_iris <- tbl(sc,"spark_iris_training") %>% select(Species, Petal_Length, Petal_Width) 
# Spark ML Decision Tree Model
model_iris <- tidy_iris %>% ml_decision_tree(response="Species", features=c("Petal_Length","Petal_Width")) 
# Create reference to test data in Spark table
test_iris <- tbl(sc,"spark_iris_test") 
# Bring predictions data back into R memory for plotting:
pred_iris <- sdf_predict(model_iris, test_iris) %>% collect
pred_iris %>%
 inner_join(data.frame(prediction=0:2,
 lab=model_iris$model.parameters$labels)) %>%
 ggplot(aes(Petal_Length, Petal_Width, col=lab)) +
 geom_point() 
spark_disconnect(sc)