Experimental Report of Big Data Analytics
College:
Class:
ID:
Name:
Teachers:
Office of Academic Affairs
March 2024
Experimental task 1: Hadoop and Spark environment setup and application development
Ⅰ Experimental purpose and requirements
(1)Students are required to be able to build Hadoop and Spark environments.
(2)Students are required to be able to start the Spark service process correctly.
(3)Students are required to be able to implement Spark applications and run them correctly.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
Install the Hadoop and Spark environments on a virtual machine or Linux system, and start the daemon processes. Use Spark to implement and run the WordCount program.
Ⅳ Test case
Data:
Figure 1-1 input.txt
Code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Silence Spark's INFO logging so only the results are printed.
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the local input file as an RDD of lines.
    val input = sc.textFile("file:/home/liqing/桌面/input.txt")
    // Split each line into words, pair each word with 1, then sum the counts per word.
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    val result = wordCounts.collect()
    result.foreach(println)
    sc.stop()
  }
}
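To see the counting pipeline in isolation, here is a minimal sketch of the same flatMap/map/reduce idea on a plain Scala collection (sample lines invented for illustration; it runs in a Scala REPL without Spark):

// Hypothetical input lines, standing in for input.txt.
val lines = Seq("hello spark", "hello hadoop")
val counts = lines
  .flatMap(_.split(" "))                              // split every line into words
  .groupBy(identity)                                  // group equal words together
  .map { case (word, group) => (word, group.size) }   // count each group
counts.foreach(println)                               // e.g. (hello,2), (spark,1), (hadoop,1)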
Ⅴ Test results
Figure 1-2 Visiting 192.168.37.146:50070 (HDFS NameNode web UI)
Figure 1-3 Visiting 192.168.37.146:8080 (Spark Master web UI)
Figure 1-4 Result
WordCount is a counting task rather than a classifier, so it cannot be evaluated using Precision, Recall, or F1 score.
Experimental task 2: Spark MLlib implements the linear regression algorithm
Ⅰ Experimental purpose and requirements
(1)Students are required to accurately understand the basic principles of the linear regression analysis algorithm;
(2)Students are required to be able to implement and run the basic linear regression algorithm using MLlib;
(3)Students are required to run the linear regression algorithm to obtain the fitting curve and analyze the fitting effect.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
Implement the linear regression algorithm using MLlib under Spark, fit the input data set to obtain the desired regression formula, and validate the fitted curve.
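For reference, the fitting effect is measured by the mean squared error over the n data points, which is exactly what the code below computes:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where y_i is the true label of point i and \hat{y}_i is the model's prediction.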
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object LinearRegressionExample {
  val DATA_PATH = "/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/ridge-data/lpsa.data"

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("LinearRegressionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Parse each line ("label,feature1 feature2 ...") into a LabeledPoint;
    // cache because SGD training iterates over the data many times.
    val data = sc.textFile(DATA_PATH)
    val parsedData = data.map { line =>
      val parts = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    }.cache()
    // Train a linear regression model with stochastic gradient descent.
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
    // Pair each true label with the model's prediction.
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    // Mean squared error over the training set.
    val MSE = valuesAndPreds.map {
      case (v, p) => math.pow(v - p, 2)
    }.reduce(_ + _) / valuesAndPreds.count()
    println(MSE)
    sc.stop()
  }
}
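As a side note, each record of lpsa.data has the form "label,feature1 feature2 ...", so the parsing step above can be illustrated on one hypothetical record (values invented for illustration):

val line = "-0.43,-1.64 -2.01 -1.86"                  // hypothetical record in lpsa.data's format
val parts = line.split(",")                           // parts(0) = label, parts(1) = features
val label = parts(0).toDouble                         // -0.43
val features = parts(1).split(" ").map(_.toDouble)    // Array(-1.64, -2.01, -1.86)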
Data:
Figure 2-1 lpsa.data
Ⅴ Test results
Figure 2-2 Result
Using MSE for evaluation: MSE = 6.207597210613578
Reason: The relatively large MSE may be due to a non-linear relationship between the features of the input dataset and the target variable, noise or outliers in the dataset, or improper settings of numIterations and the learning rate.
Experimental task 3: Spark MLlib implements the support vector machine algorithm
Ⅰ Experimental purpose and requirements
(1)Students are required to understand the basic principles of the classification algorithm;
(2)Students are required to understand the classification principle of the SVM algorithm;
(3)Students are required to use MLlib to implement the SVM algorithm and to classify the data.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
The support vector machine (SVM) classification algorithm is implemented under Spark using MLlib, and the relevant data is classified with it.
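As background, MLlib's linear SVM is trained by minimizing the hinge loss (with labels internally mapped to y ∈ {−1, +1}), plus an L2 regularization term by default:

L(w; x, y) = \max(0,\ 1 - y\, w^{T} x)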
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object SVMWithSGDExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("SVMWithSGDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")
    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1).cache()
    // Run training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)
    // Predict class labels (0.0/1.0) on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    // Accuracy, precision, recall, and F1 from the confusion counts.
    val accuracy = 1.0 * scoreAndLabels.filter(x => x._1 == x._2).count() / test.count()
    val tp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 1.0).count().toDouble
    val fp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 0.0).count().toDouble
    val fn = scoreAndLabels.filter(x => x._1 == 0.0 && x._2 == 1.0).count().toDouble
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    val f1Score = 2 * (precision * recall) / (precision + recall)
    println(s"Accuracy: $accuracy")
    println(s"Precision: $precision")
    println(s"Recall: $recall")
    println(s"F1-score: $f1Score")
    sc.stop()
  }
}
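One detail worth noting: model.predict returns 0.0/1.0 class labels here because SVMModel applies a default decision threshold of 0.0 to the raw margin, which is why comparing predictions directly with labels works. If raw margins were needed instead (e.g., for BinaryClassificationMetrics), the threshold could be cleared first, roughly as follows:

model.clearThreshold()   // predict now returns the raw margin instead of a class label
val rawScores = test.map(point => (model.predict(point.features), point.label))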
Data:
Figure 3-1 sample_libsvm_data.txt
Ⅴ Test results
Figure 3-2 Result
Accuracy = 0.976744180465116
Precision = 1.0
Recall = 0.9565217391304348
F1 score = 0.9777777777777777
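The printed F1 score is consistent with the printed precision and recall:

F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2 \times 1.0 \times 0.9565217391304348}{1.0 + 0.9565217391304348} \approx 0.9778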
Reason:
Accuracy, precision, recall, and F1 score are all very high, and precision is exactly 1.0 because no false-positive predictions were made. This may be due to an imbalanced dataset in which the positive class samples far outnumber the negative class samples.
Experimental task 4: Spark MLlib implements the K-means algorithm
Ⅰ Experimental purpose and requirements
(1)Students are required to be able to understand the basic principles of clustering algorithms.
(2)Students are required to be able to understand the principles and process of K-means.
(3)Students are required to be able to implement the K-means algorithm and cluster data using MLlib.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
The K-means clustering algorithm is implemented under Spark using MLlib, and the data is clustered using the K-means algorithm.
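As background, K-means with k clusters chooses centers \mu_1, \ldots, \mu_k that minimize the within-cluster sum of squared distances:

J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2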
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession

object KMeansExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder()
      .appName("KMeansExample")
      .master("local[*]")
      .getOrCreate()
    // Load the data.
    val dataset = spark.read.format("libsvm").load("file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")
    // Train a k-means model with k = 2 and a fixed seed for reproducibility.
    val kmeans = new KMeans()
      .setK(2)
      .setSeed(1L)
    val model = kmeans.fit(dataset)
    // Make predictions (adds a "prediction" column with the cluster index).
    val predictions = model.transform(dataset)
    // Evaluate the clustering by computing the Silhouette score.
    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")
    // Show the cluster centers.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)
    spark.stop()
  }
}
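A common extension is to compare several values of k by their silhouette scores. A minimal sketch, assuming the same dataset and evaluator as in the code above (the loop bounds are arbitrary):

for (k <- 2 to 5) {
  val m = new KMeans().setK(k).setSeed(1L).fit(dataset)   // retrain with a different k
  val s = evaluator.evaluate(m.transform(dataset))        // silhouette for this k
  println(s"k = $k, silhouette = $s")
}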
Data:
Figure 4-1 sample_kmeans_data.txt
Ⅴ Test results
Figure 4-2 Result
Silhouette with squared euclidean distance = 0.9997530305375207
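For reference, the classical silhouette coefficient of a sample i is

s(i) = \frac{b(i) - a(i)}{\max(a(i),\ b(i))}

where a(i) is the mean distance from i to the other samples in its own cluster and b(i) is the mean distance to the samples of the nearest other cluster. Spark's ClusteringEvaluator reports the mean of a squared-euclidean variant of this score, so values close to 1 indicate well-separated clusters.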
Reason:
The clustering effect is very good: each sample is close to the other samples in its own cluster and relatively far from samples in other clusters. This may be because the dataset itself separates well, the between-cluster differences are large while the within-cluster differences are small, and the chosen number of clusters (k = 2) suits the data.