Lab 5: Experiencing the Spark Software Stack

Installing and Starting Spark

This lab uses Spark 3.0.0 (the spark-3.0.0-bin-hadoop2.7 build).

After installing Spark according to the tutorial, run

./bin/spark-shell

to enter the interactive shell; the startup screen looks like this:

[screenshot]

Enter the following code to run some simple RDD operations:

val textFile = sc.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/README.md")
textFile.count()    // number of lines in the file

textFile.first()    // first line of the file

textFile.filter(line => line.contains("Spark")).count()    // number of lines containing "Spark"

textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)    // largest number of words in a single line

[screenshot]

1. Spark RDD: WordCount

  1. Remove the punctuation from the text file to obtain a Shakespeare.txt consisting only of words and spaces (a sketch of one possible cleaning script appears at the end of this section).

    [screenshot]

  2. Load Shakespeare.txt and count its lines

val s = sc.textFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/Shakespeare.txt")
s.count    // number of lines in the file

[screenshot]

  3. Set the number of output files and run the counting logic

    val numOutputFiles = 1
    // split lines into words, map each word to (word, 1), and sum the counts;
    // the second argument to reduceByKey sets the number of result partitions,
    // which becomes the number of output files
    val counts = s.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _, numOutputFiles)
  4. Save the results locally

    counts.saveAsTextFile("file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/Shakespeare_output.txt")    // writes a directory with one part file per partition

    [screenshot]

  5. View the results locally

    [screenshots]
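
The cleaning script mentioned in step 1 is not included above; a minimal Python sketch (the raw filename Shakespeare_raw.txt is an assumption, not the original name) could look like this:

    import re

    # Replace every character that is not a letter, digit, or whitespace with a
    # space, then collapse runs of spaces so each line is words and single spaces.
    with open("Shakespeare_raw.txt", "r", encoding="utf-8") as f:
        text = f.read()

    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    lines = (" ".join(line.split()) for line in cleaned.splitlines())

    with open("Shakespeare.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(lines))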

2. Spark SQL

  1. Data:

    The dataset from tmdb-5000-movie-dataset.zip on the course website.

  2. Goal:

    Implement two practical queries on this dataset.

  3. Approach:

    The work splits into a preprocessing step and a query step. After loading the dataset into the Python program, filter out movies whose production company field is empty; for a movie with several companies, split the "production_companies" column on commas so that each company lands in its own row. Then run two queries:

    1. Find movies with an average rating above 6.5

    2. Sum the revenue of those high-rated movies by company and list the 20 companies with the highest revenue

  4. Code:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
    sc = spark.sparkContext

    df = spark.read.csv('file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Movie/tmdb_5000_movies.csv', inferSchema=True, header=True)
    df.printSchema()

    # filter out movies whose production company list is empty
    df_filter = df.filter(df['production_companies'] != '[]')

    # split the company column on commas, one fragment per row
    df_with = df_filter.withColumn('production_companies_tmp', explode(split("production_companies", ",")))
    df_with.select('production_companies_tmp').show(10)

    # movies rated above 6.5 (cast first: vote_average was inferred as string)
    df_where = df_with.where(F.col("vote_average").cast("double") > 6.5)
    df_where.show(5)

    # sum the revenue of the high-rated movies, grouped by title and company,
    # and order by the total to show the top 20 rows
    df_res = (df_where.groupBy('original_title', 'production_companies_tmp')
              .agg({"revenue": "sum"})
              .withColumnRenamed("sum(revenue)", "sum_revenue")
              .orderBy(F.desc('sum_revenue')))
    df_res.show(20)
  5. Results:

    [screenshots]

The output in the shell is as follows:

lz@lz-virtual-machine:/usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Movie$ python3 movie-analysis.py
20/08/11 22:35:58 WARN util.Utils: Your hostname, lz-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.152.128 instead (on interface ens33)
20/08/11 22:35:58 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/11 22:35:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
root                                                                            
 |-- budget: integer (nullable = true)
 |-- genres: string (nullable = true)
 |-- homepage: string (nullable = true)
 |-- id: string (nullable = true)
 |-- keywords: string (nullable = true)
 |-- original_language: string (nullable = true)
 |-- original_title: string (nullable = true)
 |-- overview: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- production_companies: string (nullable = true)
 |-- production_countries: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- revenue: string (nullable = true)
 |-- runtime: string (nullable = true)
 |-- spoken_languages: string (nullable = true)
 |-- status: string (nullable = true)
 |-- tagline: string (nullable = true)
 |-- title: string (nullable = true)
 |-- vote_average: string (nullable = true)
 |-- vote_count: string (nullable = true)

+------------------------+
|production_companies_tmp|
+------------------------+
|    http://www.avatar...|
|          "[{""id"": 270|
|          "[{""id"": 470|
|    http://www.thedar...|
|          "[{""id"": 818|
|          "[{""id"": 851|
|           {""id"": 2343|
|         "[{""id"": 8828|
|          "[{""id"": 616|
|          "[{""id"": 849|
+------------------------+
only showing top 10 rows

+---------+-------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+----------------+--------------------+-----+--------------------+--------------------+------------------------+
|   budget|       genres|            homepage|             id|            keywords|   original_language|      original_title|            overview|          popularity|production_companies|production_countries|        release_date|             revenue|          runtime|    spoken_languages|          status|             tagline|title|        vote_average|          vote_count|production_companies_tmp|
+---------+-------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+----------------+--------------------+-----+--------------------+--------------------+------------------------+
|105000000|"[{""id"": 18| ""name"": ""Dram...| {""id"": 10749| ""name"": ""Roma...|                null|               64682|      "[{""id"": 818| ""name"": ""base...|       {""id"": 1326| ""name"": ""infi...|       {""id"": 1523| ""name"": ""obse...|    {""id"": 3929| ""name"": ""hope""}| {""id"": 209714| ""name"": ""3d""}]"|   en|    The Great Gatsby|An adaptation of ...|           {""id"": 1326|
|185000000|"[{""id"": 28| ""name"": ""Acti...|    {""id"": 12| ""name"": ""Adve...|        {""id"": 878| ""name"": ""Scie...|http://www.startr...|              188927|     "[{""id"": 9663| ""name"": ""sequ...|       {""id"": 9743| ""name"": ""stra...|  {""id"": 158449| ""name"": ""hatr...| {""id"": 161176| ""name"": ""spac...|   en|    Star Trek Beyond|The USS Enterpris...|         "[{""id"": 9663|
|180000000|"[{""id"": 28| ""name"": ""Acti...|    {""id"": 12| ""name"": ""Adve...|http://legendofta...|              258489|      "[{""id"": 409| ""name"": ""afri...|       {""id"": 5650| ""name"": ""fera...|       {""id"": 7347| ""name"": ""tarz...|   {""id"": 10787| ""name"": ""jung...| {""id"": 158130| ""name"": ""anim...|   en|The Legend of Tarzan|Tarzan, having ac...|           {""id"": 5650|
|175000000|"[{""id"": 16| ""name"": ""Anim...| {""id"": 10751| ""name"": ""Fami...|         {""id"": 12| ""name"": ""Adve...|        {""id"": 878| ""name"": ""Scie...|http://www.monste...|               15512|     "[{""id"": 9951| ""name"": ""alie...|   {""id"": 10891| ""name"": ""gian...| {""id"": 179431| ""name"": ""duri...|   en|  Monsters vs Aliens|When Susan Murphy...|    http://www.monste...|
|165000000|"[{""id"": 35| ""name"": ""Come...|    {""id"": 12| ""name"": ""Adve...|         {""id"": 14| ""name"": ""Fant...|         {""id"": 16| ""name"": ""Anim...|      {""id"": 10751| ""name"": ""Fami...|http://www.shrekf...|               10192|"[{""id"": 189111| ""name"": ""ogre""}| {""id"": 209714| ""name"": ""3d""}]"|   en| Shrek Forever After|A bored and domes...|          {""id"": 10751|
+---------+-------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+----------------+--------------------+-----+--------------------+--------------------+------------------------+
only showing top 5 rows


+--------------------+------------------------+-----------+                     
|      original_title|production_companies_tmp|sum_revenue|
+--------------------+------------------------+-----------+
| ""name"": ""Anim...|             {""id"": 14|   417859.0|
| ""name"": ""Dram...|             {""id"": 53|   316152.0|
| ""name"": ""Acti...|             {""id"": 35|   173931.0|
| ""name"": ""Roma...|             {""id"": 53|    51955.0|
| ""name"": ""Fant...|           {""id"": 9648|    45649.0|
| ""name"": ""Thri...|          {""id"": 10749|    42586.0|
| ""name"": ""Come...|          {""id"": 10769|    33106.0|
| ""name"": ""Thri...|           {""id"": 9648|    26268.0|
| ""name"": ""Come...|          {""id"": 10402|    20391.0|
| ""name"": ""Adve...|             {""id"": 35|    19585.0|
| ""name"": ""Myst...|          {""id"": 10749|    14283.0|
| ""name"": ""Fant...|          {""id"": 10751|    10192.0|
| ""name"": ""Thri...|             {""id"": 14|     9707.0|
| ""name"": ""Adve...|             {""id"": 18|     9588.0|
| ""name"": ""Come...|           {""id"": 9648|     7364.0|
| ""name"": ""Myst...|             {""id"": 53|     2779.0|
|               21301|                      en|   6.501815|
|               96399|                      en|   5.759545|
|               20761|                      en|   5.566203|
|               25461|                      en|   3.643662|
+--------------------+------------------------+-----------+
only showing top 20 rows
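
The fragments above show that the queries ran on garbled rows: production_companies holds a quoted JSON array, and both the default CSV parsing and the comma split cut straight through it. A more robust variant (a sketch reusing the spark session from the script above, not the code that produced these results) would let the CSV reader honor the embedded quotes and parse the column as JSON before exploding:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

    path = 'file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Movie/tmdb_5000_movies.csv'

    # quote/escape/multiLine keep commas and newlines inside quoted JSON fields intact
    df2 = spark.read.csv(path, header=True, quote='"', escape='"', multiLine=True)

    # parse the JSON array of {"id": ..., "name": ...} objects, one company per row
    company_schema = ArrayType(StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ]))
    companies = (df2
                 .withColumn("company", F.explode(F.from_json("production_companies", company_schema)))
                 .select(F.col("company.name").alias("company_name"),
                         F.col("revenue").cast("double").alias("revenue"),
                         F.col("vote_average").cast("double").alias("vote_average")))

    # top 20 companies by total revenue of movies rated above 6.5
    (companies.where(F.col("vote_average") > 6.5)
              .groupBy("company_name")
              .agg(F.sum("revenue").alias("sum_revenue"))
              .orderBy(F.desc("sum_revenue"))
              .show(20))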

3. Spark MLlib: Titanic

  1. Data:

    The training and test sets from TitanicTrainTest.zip on the course website.

  2. Goal:

    Train a classification model on the training data from TitanicTrainTest.zip and report its accuracy on the test set.

  3. Approach:

    Preprocess the data first, for example filling in missing values with the column median; then train a model from pyspark's mllib on the training data and measure its accuracy on the test set.

  4. data_preload.py

    First preload the data from "trainwithlabels.csv" and "testwithlabels.csv" and do some initial cleaning: fill missing Age values with the median and convert the text labels into numeric codes to simplify them. The results are saved to "processtrainwithlabels.csv" and "processtestwithlabels.csv".

    import pandas as pd

    def fixing(path, doc):
        try:
            titanic = pd.read_csv(path + doc)
        except FileNotFoundError:
            print(f"can't find {doc} in {path}! Please check your path.")
            raise
        # fillna() fills in missing values; median() computes the column median
        titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
        # encode Sex as 0 (male) / 1 (female)
        titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
        titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
        # fill missing embarkation ports with 'S' and encode S/C/Q as 0/1/2
        titanic['Embarked'] = titanic['Embarked'].fillna('S')
        titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
        titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
        titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2
        # drop the columns that are not used as features
        titanic = titanic.drop(['Name', 'Ticket', 'Cabin'], axis=1)
        titanic.to_csv("./" + "process" + doc, index=False)

    if __name__ == '__main__':
        fixing("./", "trainwithlabels.csv")
        fixing("./", "testwithlabels.csv")
        print("finished successfully!")
  5. mllib_data_process.py

    Massage the data into the format MLlib expects for training input (LIBSVM-style text). The inputs are "processtrainwithlabels.csv" and "processtestwithlabels.csv"; the outputs go to "train_mllib.data" and "test_mllib.data".

    import numpy as np
    import pandas as pd


    def feature_normalize(data):
        # z-score normalization: zero mean and unit variance per column
        mu = np.mean(data, axis=0)
        std = np.std(data, axis=0)
        return (data - mu) / std


    def save_mllib(path, name, kind):
        file = pd.read_csv(path + name)
        label = file['Survived'].values
        temp_data = file[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values
        temp_data = feature_normalize(temp_data)
        # write one LIBSVM-style line per row: <label> 1:<f1> 2:<f2> ... 7:<f7>
        output = ''
        for i in range(len(label)):
            output += str(label[i]) + " 1:"
            for j in range(7):
                output += str(temp_data[i][j])
                if j == 6:
                    output += '\n'
                else:
                    output += " " + str(j + 2) + ":"
        if kind == "train":
            newfile = open(path + "train_mllib.data", 'w')
        else:
            newfile = open(path + "test_mllib.data", 'w')
        newfile.write(output)
        newfile.close()

    if __name__ == "__main__":
        save_mllib("./", "processtrainwithlabels.csv", "train")
        save_mllib("./", "processtestwithlabels.csv", "test")
        
  6. mllib.py

    Train the model: call SVMWithSGD to train on "train_mllib.data", then apply the model to the test set "test_mllib.data", compare the predictions against the true labels, count the correct ones, and compute the prediction accuracy.

    from pyspark.mllib.util import MLUtils
    from pyspark.mllib.classification import SVMWithSGD

    import sys
    sys.path.append("/usr/local/spark/spark-3.0.0-bin-hadoop2.7/python")
    import pyspark
    spark = pyspark.sql.SparkSession.builder.appName("SimpleApp").getOrCreate()
    sc = spark.sparkContext

    # load the LIBSVM-format files as RDDs of LabeledPoint
    train_data = MLUtils.loadLibSVMFile(sc=sc, path='file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Titanic/train_mllib.data')
    test_data = MLUtils.loadLibSVMFile(sc=sc, path='file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Titanic/test_mllib.data')

    # train a linear SVM with SGD on the training set
    model = SVMWithSGD.train(train_data, iterations=100, step=1, miniBatchFraction=1.0, regParam=0.01, regType="l2")

    # predict on the test set and count how many labels match
    prediction = model.predict(test_data.map(lambda x: x.features)).collect()
    true_label = test_data.map(lambda x: x.label).collect()
    correct = 0
    for index in range(len(true_label)):
        if true_label[index] == prediction[index]:
            correct += 1
    print("accuracy: {}".format(correct / len(true_label)))
    print("done")
  7. Shell session:

    cd /usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Titanic
    pip3 install pandas    # pandas was not installed beforehand
    python3 data_preload.py
    python3 mllib_data_process.py
    python3 mllib.py

    [screenshots]

    The final step prints:

    20/08/11 21:20:16 WARN util.Utils: Your hostname, lz-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.152.128 instead (on interface ens33)
    20/08/11 21:20:16 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    20/08/11 21:20:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    accuracy: 0.81  

    The trained model thus reaches an accuracy of 0.81 on the test set.

  8. Some errors encountered, summarized:

    1. Hadoop has to be running first:

      start-dfs.sh
      start-yarn.sh
      mr-jobhistory-daemon.sh start historyserver

      otherwise the job fails with a connection error.

    2. Running mllib.py hit problems uploading files into HDFS; after repeated reformatting and similar attempts failed, the path in the py file was changed from

      path='./train_mllib.data'

      to the absolute local path

      path='file:///usr/local/spark/spark-3.0.0-bin-hadoop2.7/python/pyspark/mllib/Titanic/train_mllib.data'

      This way, running mllib.py no longer errors out insisting on a file in HDFS:

      Input path does not exist: hdfs://localhost:9000/user/lz/train_mllib.data

      but reads it directly from the local filesystem. (With Hadoop running, a bare relative path resolves against the default filesystem, HDFS; the explicit file:// scheme forces the local one.)

4. Reproducing PageRank with GraphX

  1. Data:

    Based on network.jpg from the course website, written out as two text files, vertex.txt and edge.txt. Each line of edge.txt records that vertex i points to vertex j; each edge of the undirected graph was manually rewritten as two directed edges. Illustrative file contents follow the screenshot.

    [screenshot]
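
    GraphLoader.edgeListFile expects one whitespace-separated "srcId dstId" pair per line, and the program below splits each vertex.txt line on a comma, so the two files look roughly like this (the numeric IDs assigned to A, B, ... are illustrative):

      edge.txt (each undirected edge written as two directed lines):

      1 2
      2 1

      vertex.txt (id,name):

      1,A
      2,B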

  2. Program

    The code is as follows (only the commands entered in the Scala shell are shown):

    scala> import org.apache.spark.graphx.GraphLoader
    
    
    scala> val graph = GraphLoader.edgeListFile(sc,"file:///home/lz/Desktop/homework/edge.txt")
    // load the edge data
    
    scala> val ranks = graph.pageRank(0.0001).vertices
    // run PageRank until the ranks change by less than the 0.0001 tolerance
    
    scala> val users = sc.textFile("file:///home/lz/Desktop/homework/vertex.txt").map { line =>
         |   val fields = line.split(",")
         |   (fields(0).toLong, fields(1))
         |   }
    // load the vertex-id-to-name mapping
    
    scala> val ranksByUsername = users.join(ranks).map {
         |   case (id, (username, rank)) => (username, rank)
         | }
    // join the ranks with the vertex names
    
    // print the result
    scala> println(ranksByUsername.collect().mkString("\n"))
    (D,1.4715087621628664)
    (A,1.0192604591304733)
    (F,1.2890402583321614)
    (C,0.7942640133346549)
    (G,1.2890402583321614)
    (I,0.8567906329897478)
    (H,0.9524103919494304)
    (J,0.5141607513033776)
    (E,0.7942640133346549)
    (B,1.0192604591304733)

    The run is shown in the screenshots:

[screenshots]

After the iterations converge, the final ranks are:

(D,1.4715087621628664)
(A,1.0192604591304733)
(F,1.2890402583321614)
(C,0.7942640133346549)
(G,1.2890402583321614)
(I,0.8567906329897478)
(H,0.9524103919494304)
(J,0.5141607513033776)
(E,0.7942640133346549)
(B,1.0192604591304733)
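
For reference, GraphX's pageRank iterates the un-normalized update with its default reset probability of 0.15,

PR(i) = 0.15 + 0.85 * Σ_{j→i} PR(j) / outdeg(j),

until no rank changes by more than the given tolerance (0.0001 here). Under this formulation the ranks average about 1 across the 10 vertices instead of summing to 1, which matches the values above.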