
Experimental Report of Big Data Analytics 

College:

Class:

ID:

Name:

Teachers:

Office of Academic Affairs

March 2024

Experimental task 1: Hadoop and Spark construction and application development

Ⅰ Experimental purpose and requirements

(1) Students are required to be able to build Hadoop and Spark environments.

(2) Students are required to be able to start the Spark service process correctly.

(3) Students are required to be able to implement Spark applications and run them correctly.

Ⅱ Experimental environment and software

1. VMware Workstation Pro 16

2. ubuntu-14.04-server-amd64

3. jdk1.8

4. scala2.12.6

5. hadoop2.7.7

6. spark2.4.4-bin-hadoop2.7 

Ⅲ Experimental content

Install the Hadoop and Spark environment on a virtual machine or Linux system, and start the daemon processes. Use Spark to implement and run the WordCount program.

Ⅳ Test case

Data:

Figure 1-1  input.txt

Code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Suppress Spark's verbose INFO logging.
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the local input file as an RDD of lines.
    val input = sc.textFile("file:/home/liqing/桌面/input.txt")
    // Split each line into words, map each word to (word, 1), and sum the counts per word.
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    // Collect the counts to the driver and print them.
    val result = wordCounts.collect()
    result.foreach(println)
    sc.stop()
  }
}

Ⅴ Test results

Figure 1-2  visit 192.168.37.146:50070

Figure 1-3  visit 192.168.37.146:8080

Figure 1-4  Result

WordCount is a counting task rather than a classification task, so it cannot be evaluated using Precision, Recall, and F1 score.
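As a simple extension, the word counts could also be sorted by frequency and saved to a file rather than only printed to the console. A minimal sketch, assuming the wordCounts RDD from the code above; the output directory is a hypothetical example:

    // Sort the (word, count) pairs by count in descending order.
    val sorted = wordCounts.sortBy(_._2, ascending = false)
    // Save the sorted counts to a hypothetical local output directory.
    sorted.saveAsTextFile("file:/home/liqing/wordcount_output")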


Experimental task 2: Spark MLlib implements linear regression algorithm

Ⅰ Experimental purpose and requirements

(1) Students are required to accurately understand the basic principles of the linear regression analysis algorithm;

(2) Students are required to be able to implement and run the basic linear regression algorithm using MLlib;

(3) Students are required to run the linear regression algorithm to obtain the fitting curve and analyze the fitting effect.

Ⅱ Experimental environment and software

1. VMware Workstation Pro 16

2. ubuntu-14.04-server-amd64

3. jdk1.8

4. scala2.12.6

5. hadoop2.7.7

6. spark2.4.4-bin-hadoop2.7

Ⅲ Experimental content

Implement the linear regression algorithm using MLlib under Spark, fit the input data set to obtain the desired regression formula, and validate the fitted curve.
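For reference, a standard formulation of the fitted model and of the mean squared error (MSE) used below to validate the fit (LinearRegressionWithSGD does not fit an intercept term by default):

\hat{y} = \mathbf{w}^{\top}\mathbf{x}, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2

where y_i is the observed label and \hat{y}_i is the prediction for the i-th sample.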

Ⅳ Test case

Code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object LinearRegressionExample {
  val DATA_PATH = "/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/ridge-data/lpsa.data"

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("LinearRegressionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Each line of lpsa.data has the form "label,feature1 feature2 ...".
    val data = sc.textFile(DATA_PATH)
    val parsedData = data.map(line => {
      val parts = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    })
    // Train a linear regression model with stochastic gradient descent.
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
    // Pair each true label with the model's prediction.
    val valuesAndPreds = parsedData.map(point => {
      val prediction = model.predict(point.features)
      (point.label, prediction)
    })
    // Mean squared error over the training data.
    val MSE = valuesAndPreds.map {
      case (v, p) => math.pow(v - p, 2)
    }.reduce(_ + _) / valuesAndPreds.count()
    println(MSE)
    sc.stop()
  }
}

Data:

Figure 2-1   lpsa.data

Ⅴ Test results

Figure 2-2  Result

Using MSE for evaluation: MSE = 6.207597210613578

Reason: The relatively large MSE may be due to a non-linear relationship between the features of the input dataset and the target variable, to noise or outliers in the dataset, or to improper settings of numIterations and the learning rate.
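If the last cause is suspected, the SGD step size and iteration count can be varied and the resulting MSE compared. A minimal sketch, assuming the parsedData RDD from the code above; the helper trainAndEvaluate and the candidate step sizes are illustrative choices, not values taken from the experiment:

    // Hypothetical helper: train with a given step size and report the training MSE.
    def trainAndEvaluate(stepSize: Double): Double = {
      val model = LinearRegressionWithSGD.train(parsedData, 100, stepSize)
      parsedData.map { point =>
        val p = model.predict(point.features)
        math.pow(point.label - p, 2)
      }.mean()
    }
    // Compare a few candidate step sizes.
    Seq(0.01, 0.1, 1.0).foreach { s =>
      println(s"stepSize = $s, MSE = ${trainAndEvaluate(s)}")
    }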

Experimental task 3: Spark MLlib implement support vector machine algorithm

Ⅰ Experimental purpose and requirements

(1) Students are required to understand the basic principles of the classification algorithm;

(2) Students are required to understand the classification principle of the SVM algorithm;

(3) Students are required to use MLlib to implement the SVM algorithm and classify the data.

Ⅱ Experimental environment and software

1. VMware Workstation Pro 16

2. ubuntu-14.04-server-amd64

3. jdk1.8

4. scala2.12.6

5. hadoop2.7.7

6. spark2.4.4-bin-hadoop2.7

Ⅲ Experimental content

The support vector machine (SVM) classification algorithm is implemented under Spark using MLlib, and the relevant data is classified with the trained SVM model.

Ⅳ Test case

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.optimization.L1Updater

object SVMWithSGDExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("SVMWithSGDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")
    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1).cache()
    // Run the training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)
    // Predict class labels (0.0 or 1.0) on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    // Accuracy: fraction of test points whose predicted label matches the true label.
    val accuracy = 1.0 * scoreAndLabels.filter(x => x._1 == x._2).count() / test.count()
    // True positives, false positives, and false negatives for the positive class (label 1.0).
    val tp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 1.0).count().toDouble
    val fp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 0.0).count().toDouble
    val fn = scoreAndLabels.filter(x => x._1 == 0.0 && x._2 == 1.0).count().toDouble
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    val f1Score = 2 * (precision * recall) / (precision + recall)
    println(s"Accuracy: $accuracy")
    println(s"Precision: $precision")
    println(s"Recall: $recall")
    println(s"F1-score: $f1Score")
    sc.stop()
  }
}
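The L1Updater import above is not used in this version of the code. If L1-regularized training were preferred over the default squared-L2 regularization, the optimizer could be configured directly. A minimal sketch following the pattern in the Spark MLlib documentation, assuming the training RDD defined above; the iteration count and regularization parameter are illustrative values:

    // Configure SVMWithSGD to use L1 regularization instead of the default updater.
    val svmAlg = new SVMWithSGD()
    svmAlg.optimizer
      .setNumIterations(200)
      .setRegParam(0.1)
      .setUpdater(new L1Updater)
    val modelL1 = svmAlg.run(training)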

Data:

Figure 3-1  sample_libsvm_data.txt

Ⅴ Test results

Figure 3-2  Result

Accuracy = 0.976744180465116

Precision = 1.0

Recall = 0.9565217391304348

F1 score = 0.9777777777777777777

Reason:

The accuracy, precision, recall, and F1 score are all very high, and no false-positive predictions were made (precision = 1.0). This may be because the dataset is imbalanced, with the number of positive-class samples far exceeding the number of negative-class samples, which makes the positive class easy to predict.
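Because predict() returns labels already thresholded at zero, the metrics above depend on the default decision threshold. The BinaryClassificationMetrics class imported in the test code (but not used there) can evaluate the model independently of the threshold, for example through the area under the ROC curve. A minimal sketch, assuming the model and test values from the code above:

    // Clear the default threshold so predict() returns raw margins instead of 0/1 labels.
    model.clearThreshold()
    val rawScoreAndLabels = test.map { point =>
      (model.predict(point.features), point.label)
    }
    val metrics = new BinaryClassificationMetrics(rawScoreAndLabels)
    println(s"Area under ROC = ${metrics.areaUnderROC()}")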

Experimental task 4: Spark MLlib implement K-means algorithm

Ⅰ Experimental purpose and requirements

(1) Students are required to be able to understand the basic principles of clustering algorithms.

(2) Students are required to be able to understand the principles and process of K-means.

(3) Students are required to be able to implement the K-means algorithm and cluster data using MLlib.

Ⅱ Experimental environment and software

1. VMware Workstation Pro 16

2. ubuntu-14.04-server-amd64

3. jdk1.8

4. scala2.12.6

5. hadoop2.7.7

6. spark2.4.4-bin-hadoop2.7

Ⅲ Experimental content

The K-means clustering algorithm is implemented under Spark using MLlib, and the data is clustered with it.
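For reference, K-means partitions the data into k clusters by minimizing the within-cluster sum of squared distances:

\min_{C_1,\ldots,C_k} \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \lVert \mathbf{x}_i - \boldsymbol{\mu}_j \rVert^2

where \boldsymbol{\mu}_j is the centroid of cluster C_j. The Silhouette score reported in the test results measures how close each sample is to its own cluster compared with the nearest other cluster.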

Ⅳ Test case

Code:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object KMeansExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder()
      .appName("KMeansExample")
      .master("local[*]")
      .getOrCreate()
    // Loads data.
    val dataset = spark.read.format("libsvm").load("file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")
    // Trains a k-means model.
    val kmeans = new KMeans()
      .setK(2)
      .setSeed(1L)
    val model = kmeans.fit(dataset)
    // Make predictions.
    val predictions = model.transform(dataset)
    // Evaluate clustering by computing the Silhouette score.
    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")
    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)
    spark.stop()
  }
}

Data:

Figure 4-1 sample_kmeans_data.txt

Ⅴ Test results

 

Figure 4-2 Result

Silhouette with squared euclidean distance = 0.9997530305375207

Reason:

The clustering effect is very good: each sample is close to the other samples in its own cluster and relatively far from the samples in other clusters. This is likely because the dataset is well suited to clustering, the differences between clusters are large while the differences within clusters are small, and the chosen number and size of clusters are appropriate.
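To check that the chosen number of clusters is appropriate, the same pipeline can be rerun for several values of k and the silhouette scores compared. A minimal sketch, assuming the dataset DataFrame and the evaluator from the code above; the candidate range 2 to 5 is an illustrative choice:

    // Rerun K-means for several candidate values of k and compare silhouette scores.
    (2 to 5).foreach { k =>
      val model = new KMeans().setK(k).setSeed(1L).fit(dataset)
      val silhouette = evaluator.evaluate(model.transform(dataset))
      println(s"k = $k, silhouette = $silhouette")
    }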

Cover Designer: Li Jia

Tel: +86-335-8057068

Website: http://jwc.ysu.edu.cn

E-mail: xsyj@ysu.edu.cn

No. 438 West Hebei Avenue,

Qinhuangdao, Hebei 066004, P. R. China
