Spark for Python: Linux

Ubuntu 14.04
Download software [if the connection is refused, see http://blog.csdn.net/houxiaoqin/article/details/54096175]:
Anaconda2-4.2.0-Linux-x86_64.sh 【 http://continuum.io/downloads#all 】
jdk-8u111-linux-x64.tar.gz 【 http://www.oracle.com/technetwork/java/javase/downloads 】
spark-1.5.2-bin-hadoop2.6.tgz 【 http://spark.apache.org/downloads 】


1. Installing Java 8
cd Downloads
ls
sudo mkdir -p /usr/lib/jvm
sudo mv jdk-8u111-linux-x64.tar.gz /usr/lib/jvm
cd /usr/lib/jvm
sudo tar xzvf jdk-8u111-linux-x64.tar.gz
sudo ln -s jdk1.8.0_111 java-8

Set the environment variables:
vi ~/.bashrc
Append at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Save and quit with :wq
source ~/.bashrc     # make the changes above take effect
java -version

【Note: if :$PATH is left out, misspelled, or written in lowercase, the original PATH is overwritten outright, the system can no longer find its original directories, and basic commands such as ls stop working.】
【If you accidentally overwrite the original PATH, run export PATH=/usr/bin:/bin at the prompt, then open vi ~/.bashrc to find and fix the mistake.】

Update the default JDK:
# update-alternatives --install /usr/bin/java java /usr/lib/jvm/java-8/bin/java 300
# update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/java-8/bin/javac 300
# update-alternatives --config java


2. Installing Anaconda with Python 2.7
Command: bash Anaconda2-4.2.0-Linux-x86_64.sh
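
After the installer finishes, reopen the terminal (or source ~/.bashrc) so the Anaconda directory is picked up on PATH. A minimal sanity check from the interpreter, assuming a default install:

python
>>> import sys
>>> print(sys.version)    # should report 2.7.x and mention Anaconda 4.2.0
>>> exit()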

3. Installing Spark
cd Downloads
ls
mkdir -p /home/<username>/spark
tar -xf spark-1.5.2-bin-hadoop2.6.tgz
sudo mv spark-1.5.2-bin-hadoop2.6 /home/<username>/spark
cd /home/<username>/spark/spark-1.5.2-bin-hadoop2.6
./bin/pyspark
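
The pyspark shell starts with a ready-made SparkContext bound to the name sc. As a minimal smoke test, counting words in the README.md that ships in the Spark home directory, run at the >>> prompt:

>>> lines = sc.textFile("README.md")            # RDD of lines from the bundled README
>>> lines.count()                               # number of lines in the file
>>> words = lines.flatMap(lambda l: l.split())  # split each line into words
>>> words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).take(5)   # first few (word, count) pairs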

4. Enabling Jupyter Notebook
Command: jupyter notebook

To connect Jupyter to Spark:

Command: PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark --master local[*]

where $SPARK_HOME is an environment variable pointing to the Spark home directory, e.g. export SPARK_HOME=/home/<username>/spark/spark-1.5.2-bin-hadoop2.6
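
Launched this way, each new notebook already has the SparkContext available as sc (the PySpark startup script runs when the kernel starts), so no extra setup code is needed. A minimal sketch of a verification cell:

print(sc.version)                           # should print 1.5.2
print(sc.parallelize(range(100)).sum())     # distributed sum of 0..99 -> 4950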


5. Virtualizing the environment with Docker
$ sudo apt-get update
$ sudo apt-get install curl
$ curl -fsSL https://get.docker.com/ | sh
If this reports dpkg: error processing package oracle-java8-installer (--configure), the cached file /var/cache/oracle-jdk8-installer/jdk-8u111-linux-x64.tar.gz is most likely incomplete 【check with ls -lht】.
sudo mv /usr/lib/jvm/jdk-8u111-linux-x64.tar.gz /var/cache/oracle-jdk8-installer/
curl -fsSL https://get.docker.com/ | sh
sudo docker version






Paperback: 146 pages
Publisher: Packt Publishing - ebooks Account (February 4, 2016)
Language: English
ISBN-10: 1784399698
ISBN-13: 978-1784399696

Key Features:
Set up real-time streaming and batch data-intensive infrastructure using Spark and Python
Deliver insightful visualizations in a web app using Spark (PySpark)
Inject live data using Spark Streaming with real-time events

Book Description:
Looking for a cluster computing system that provides high-level APIs? Apache Spark is your answer: an open source, fast, and general-purpose cluster computing system. Spark's multi-stage memory primitives provide performance up to 100 times faster than Hadoop, and it is also well suited for machine learning algorithms.
Are you a Python developer inclined to work with the Spark engine? If so, this book will be your companion as you create data-intensive apps using Spark as a processing engine, Python visualization libraries, and web frameworks such as Flask.
To begin with, you will learn the most effective way to install the Python development environment powered by Spark, Blaze, and Bokeh. You will then find out how to connect with data stores such as MySQL, MongoDB, Cassandra, and Hadoop. You'll expand your skills throughout, getting familiarized with the various data sources (GitHub, Twitter, Meetup, and blogs), their data structures, and solutions to effectively tackle complexities. You'll explore datasets using IPython Notebook and will discover how to optimize the data models and pipeline. Finally, you'll get to know how to create training datasets and train the machine learning models.
By the end of the book, you will have created a real-time and insightful trend-tracking data-intensive app with Spark.