Contents
I. Installing Anaconda on Linux
- Download Anaconda
https://www.anaconda.com/distribution/
- Install Anaconda from the command line; answer yes to every prompt except the VS Code one, which should be no
bash Anaconda3-5.1.0-Linux-x86_64.sh
# Spark integration
# Install Anaconda: answer yes (or press Enter) to every prompt; VS Code is not needed, so answer no there
bash /opt/software/Anaconda3-5.0.1-Linux-x86_64.sh
# Configure the Anaconda3 environment (the Spark integration assumes Spark is already installed)
echo 'export SPARK_CONF_DIR=$SPARK_HOME/conf' >> /etc/profile
echo 'export ANACONDA_HOME=/root/anaconda3' >> /etc/profile
echo 'export PATH=$PATH:$ANACONDA_HOME/bin' >> /etc/profile
echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /etc/profile
echo 'export PYSPARK_DRIVER_PYTHON_OPTS=" --ip=0.0.0.0 --port=8888 --allow-root"' >> /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/bin/python' >> /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/bin/python' >> /opt/install/spark/conf/spark-env.sh
source /etc/profile
cd ~
# Generate the Jupyter config file (overwrites an existing one)
jupyter notebook --generate-config
cd /root/.jupyter/
# Set the Jupyter login password
ipython
#In [1]: from notebook.auth import passwd
#In [2]: passwd()
#Enter password:
#Verify password:
#Out[4]: 'sha1:9a85ae2b62e2:10849310f951734b0e0b1f9615c92f249272b078'   # remember this hash; the config file below needs it
# Edit the jupyter_notebook_config.py config file
echo 'c.NotebookApp.allow_root=True' >> /root/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.ip='*'" >> /root/.jupyter/jupyter_notebook_config.py
echo 'c.NotebookApp.open_browser=False' >> /root/.jupyter/jupyter_notebook_config.py
# Paste the password hash from above here, after the u prefix
echo "c.NotebookApp.password=u'sha1:9a85ae2b62e2:10849310f951734b0e0b1f9615c92f249272b078'" >> /root/.jupyter/jupyter_notebook_config.py
echo 'c.NotebookApp.port=7070' >> /root/.jupyter/jupyter_notebook_config.py
# Start pyspark (start the Spark services first)
pyspark
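After pyspark launches the Jupyter server, open a notebook and confirm the integration with a tiny job; a minimal sanity check, assuming the sc object created automatically by the pyspark driver:
# run in a notebook cell opened from the pyspark-launched Jupyter server
# sc is created automatically by the pyspark driver
print(sc.version)                        # prints the Spark version
print(sc.parallelize(range(100)).sum())  # runs a small job; should print 4950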
II. Introduction to PySpark
PySpark use cases
- Prototype development for big-data processing or machine learning
Validating algorithms
Execution efficiency may not be high
Fast development is required
- PySpark architecture
Overview of the PySpark package
- PySpark
Core Classes:
pyspark.SparkContext
pyspark.RDD
pyspark.sql.SQLContext
pyspark.sql.DataFrame
- pyspark.streaming
pyspark.streaming.StreamingContext
pyspark.streaming.DStream
- pyspark.ml
- pyspark.mllib
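For orientation, a minimal (illustrative) set of imports, one or two names from each of the sub-packages listed above:
from pyspark import SparkContext, RDD                    # core classes
from pyspark.sql import SQLContext, DataFrame            # Spark SQL
from pyspark.streaming import StreamingContext, DStream  # streaming
from pyspark.ml.feature import VectorAssembler           # DataFrame-based ML (pyspark.ml)
from pyspark.mllib.stat import Statistics                # RDD-based ML (pyspark.mllib)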
Processing data with PySpark
- Import the package
from pyspark import SparkContext
- Get the SparkContext object
SparkContext.getOrCreate()
Creating RDDs
- makeRDD() is not supported
- parallelize(), textFile(), and wholeTextFiles() are supported (see the sketch below)
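A minimal sketch of the three supported constructors (the file paths are placeholders, not files from this article):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd1 = sc.parallelize([1, 2, 3, 4, 5])               # from an in-memory collection
rdd2 = sc.textFile("file:///root/example/data.txt")  # one element per line
rdd3 = sc.wholeTextFiles("file:///root/example/")    # (filename, file content) pairs
print(rdd1.collect())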
Using anonymous functions in PySpark
- Scala
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x,1))
b.collect
- Python
a=sc.parallelize(("dog","tiger","lion","cat","panther","eagle"))
b=a.map(lambda x:(x,1))
b.collect()
SparkContext.addPyFile
- addFile(path, recursive=False)
Takes a local file
Use SparkFiles.get() to obtain the file's absolute path (a sketch follows the addPyFile example below)
- addPyFile(path)
Loads an existing Python file
- Loading an existing file and calling its functions:
#sci.py
def sqrt(num):
    return num * num

def circle_area(r):
    return 3.14 * sqrt(r)

sc.addPyFile("file:///root/sci.py")
from sci import circle_area
sc.parallelize([5, 9, 21]).map(lambda x: circle_area(x)).collect()
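For comparison, addFile() plus SparkFiles.get() works the same way for non-Python files; a minimal sketch using a hypothetical text file:
from pyspark import SparkFiles
sc.addFile("file:///root/lookup.txt")  # hypothetical data file, shipped to every node
path = SparkFiles.get("lookup.txt")    # absolute local path of the shipped copy
print(open(path).read())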
Using Spark SQL in PySpark
- Import the package
from pyspark.sql import SparkSession
- Create a SparkSession object
spark = SparkSession.builder.getOrCreate()
- Load a CSV file (a query sketch follows)
spark.read.format("csv").option("header", "true").load("file:///xxx.csv")
III. Examples
1. Data exploration: summary statistics for the life-expectancy dataset
from pyspark.sql import SparkSession
# create the spark session
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
# load the data
df = spark.read.format("csv").option("delimiter", " ").load("file:///root/example/LifeExpentancy.txt") \
.withColumn("Country", col("_c0")) \
.withColumn("LifeExp", col("_c2").cast(DoubleType())) \
.withColumn("Region", col("_c4")) \
.select(col("Country"), col("LifeExp"), col("Region"))
df.describe("LifeExp").show()
2. Mixing Spark with third-party Python libraries
Use Spark for large-scale ETL
Analyze or visualize the processed data with third-party Python libraries
- Pandas for data analysis
- Pandas DataFrame to Spark DataFrame
spark.createDataFrame(pandas_df)
- Spark DataFrame to Pandas DataFrame
spark_df.toPandas()
- Matplotlib for data visualization
- Scikit-learn for machine learning (a sketch follows the conversion example below)
- Converting between Pandas and Spark DataFrames
# Pandas DataFrame to Spark DataFrame
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
pandas_df = pd.read_csv("./products.csv", header=None, usecols=[1, 3, 5])
print(pandas_df)
# convert to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()
df = spark_df.withColumnRenamed("1", "id").withColumnRenamed("3", "name").withColumnRenamed("5", "remark")
# convert back to Pandas DataFrame
df.toPandas()
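The Scikit-learn step mentioned above follows the same hand-off pattern: bring a reasonably small Spark DataFrame over to Pandas with toPandas() and fit the model locally. A minimal sketch with made-up numbers:
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# hypothetical numeric data, only to illustrate the Spark -> Pandas -> scikit-learn hand-off
sdf = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.1, 3.9, 6.2, 8.1]}))
pdf = sdf.toPandas()
model = LinearRegression().fit(pdf[["x"]], pdf["y"])
print(model.coef_, model.intercept_)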
3. Data exploration with plots in PySpark
- Split the data into buckets and count the values in each bucket
# from previous LifeExpentancy example
rdd = df.select("LifeExp").rdd.map(lambda x: x[0])
# split the data into 10 buckets and get the count of values in each bucket
# histogram() returns (bucket boundaries, counts per bucket)
(bins, countries) = rdd.histogram(10)
print(countries)
print(bins)
import matplotlib.pyplot as plt
import numpy as np
plt.hist(rdd.collect(), 10) # by default the # of bins is 10
plt.title("Life Expectancy Histogram")
plt.xlabel("Life Expectancy")
plt.ylabel("Countries")