Experimental Report of Big Data Analytics
College:
Class:
ID:
Name:
Teachers:
Office of Academic Affairs
March 2024
Experimental task 1: Hadoop and Spark environment setup and application development
Ⅰ Experimental purpose and requirements
(1)Students are required to be able to build Hadoop and Spark environments.
(2)Students are required to be able to start the Spark service process correctly.
(3)Students are required to be able to implement Spark applications and run them correctly.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
Install the Hadoop and Spark environments on a virtual machine or Linux system, and start the daemon processes. Use Spark to implement and run the WordCount program.
Ⅳ Test case
Data:
Figure 1-1 input.txt
Code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Silence Spark's INFO logging so only the results are printed.
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the local input file as an RDD of lines.
    val input = sc.textFile("file:/home/liqing/桌面/input.txt")
    // Split each line into words, pair each word with 1, then sum the counts per word.
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    val result = wordCounts.collect()
    result.foreach(println)
    sc.stop()
  }
}
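To see the counting pipeline in isolation, here is a minimal sketch of the same flatMap/map/reduce idea on a plain Scala collection (sample lines invented for illustration; it runs in a Scala REPL without Spark):

// Hypothetical input lines, standing in for input.txt.
val lines = Seq("hello spark", "hello hadoop")
val counts = lines
  .flatMap(_.split(" "))                              // split every line into words
  .groupBy(identity)                                  // group equal words together
  .map { case (word, group) => (word, group.size) }   // count each group
counts.foreach(println)                               // e.g. (hello,2), (spark,1), (hadoop,1)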
Ⅴ Test results
Figure 1-2 Visiting 192.168.37.146:50070 (HDFS NameNode web UI)
Figure 1-3 Visiting 192.168.37.146:8080 (Spark Master web UI)
Figure 1-4 Result
WordCount is a counting task rather than a classifier, so it cannot be evaluated using Precision, Recall, or F1 score.
Experimental task 2: Spark MLlib implements the linear regression algorithm
Ⅰ Experimental purpose and requirements
(1)Students are required to accurately understand the basic principles of the linear regression analysis algorithm;
(2)Students are required to be able to implement and run the basic linear regression algorithm using MLlib;
(3)Students are required to run the linear regression algorithm to obtain the fitting curve and analyze the fitting effect.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
Implement the linear regression algorithm using MLlib under Spark, fit the input data set to obtain the desired regression formula, and validate the fitted curve.
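For reference, the fitting effect is measured by the mean squared error over the n data points, which is exactly what the code below computes:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where y_i is the true label of point i and \hat{y}_i is the model's prediction.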
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object LinearRegressionExample {
  val DATA_PATH = "/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/ridge-data/lpsa.data"

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("LinearRegressionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Parse each line ("label,feature1 feature2 ...") into a LabeledPoint;
    // cache because SGD training iterates over the data many times.
    val data = sc.textFile(DATA_PATH)
    val parsedData = data.map { line =>
      val parts = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    }.cache()
    // Train a linear regression model with stochastic gradient descent.
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)
    // Pair each true label with the model's prediction.
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    // Mean squared error over the training set.
    val MSE = valuesAndPreds.map {
      case (v, p) => math.pow(v - p, 2)
    }.reduce(_ + _) / valuesAndPreds.count()
    println(MSE)
    sc.stop()
  }
}
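As a side note, each record of lpsa.data has the form "label,feature1 feature2 ...", so the parsing step above can be illustrated on one hypothetical record (values invented for illustration):

val line = "-0.43,-1.64 -2.01 -1.86"                  // hypothetical record in lpsa.data's format
val parts = line.split(",")                           // parts(0) = label, parts(1) = features
val label = parts(0).toDouble                         // -0.43
val features = parts(1).split(" ").map(_.toDouble)    // Array(-1.64, -2.01, -1.86)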
Data:
Figure 2-1 lpsa.data
Ⅴ Test results
Figure 2-2 Result
Using MSE for evaluation: MSE = 6.207597210613578
Reason: The relatively large MSE may be due to a non-linear relationship between the features of the input dataset and the target variable, noise or outliers in the dataset, or improper settings of numIterations and the learning rate.
Experimental task 3: Spark MLlib implements the support vector machine algorithm
Ⅰ Experimental purpose and requirements
(1)Students are required to understand the basic principles of the classification algorithm;
(2)Students are required to understand the classification principle of the SVM algorithm;
(3)Students are required to use MLlib to implement the SVM algorithm and to classify the data.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
The support vector machine (SVM) classification algorithm is implemented under Spark using MLlib, and the relevant data is classified with it.
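As background, MLlib's linear SVM is trained by minimizing the hinge loss (with labels internally mapped to y ∈ {−1, +1}), plus an L2 regularization term by default:

L(w; x, y) = \max(0,\ 1 - y\, w^{T} x)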
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object SVMWithSGDExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("SVMWithSGDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")
    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1).cache()
    // Run training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)
    // Predict class labels (0.0/1.0) on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    // Accuracy, precision, recall, and F1 from the confusion counts.
    val accuracy = 1.0 * scoreAndLabels.filter(x => x._1 == x._2).count() / test.count()
    val tp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 1.0).count().toDouble
    val fp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 0.0).count().toDouble
    val fn = scoreAndLabels.filter(x => x._1 == 0.0 && x._2 == 1.0).count().toDouble
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    val f1Score = 2 * (precision * recall) / (precision + recall)
    println(s"Accuracy: $accuracy")
    println(s"Precision: $precision")
    println(s"Recall: $recall")
    println(s"F1-score: $f1Score")
    sc.stop()
  }
}
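One detail worth noting: model.predict returns 0.0/1.0 class labels here because SVMModel applies a default decision threshold of 0.0 to the raw margin, which is why comparing predictions directly with labels works. If raw margins were needed instead (e.g., for BinaryClassificationMetrics), the threshold could be cleared first, roughly as follows:

model.clearThreshold()   // predict now returns the raw margin instead of a class label
val rawScores = test.map(point => (model.predict(point.features), point.label))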
Data:
Figure 3-1 sample_libsvm_data.txt
Ⅴ Test results
Figure 3-2 Result
Accuracy = 0.976744180465116
Precision = 1.0
Recall = 0.9565217391304348
F1 score = 0.9777777777777777
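The printed F1 score is consistent with the printed precision and recall:

F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2 \times 1.0 \times 0.9565217391304348}{1.0 + 0.9565217391304348} \approx 0.9778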
Reason:
Accuracy, precision, recall, and F1 score are all very high, and precision is exactly 1.0 because no false-positive predictions were made. This may be due to an imbalanced dataset in which the positive class samples far outnumber the negative class samples.
Experimental task 4: Spark MLlib implements the K-means algorithm
Ⅰ Experimental purpose and requirements
(1)Students are required to be able to understand the basic principles of clustering algorithms.
(2)Students are required to be able to understand the principles and process of K-means.
(3)Students are required to be able to implement the K-means algorithm and cluster data using MLlib.
Ⅱ Experimental environment and software
1. VMware Workstation Pro 16
2. ubuntu-14.04-server-amd64
3. jdk1.8
4. scala2.12.6
5. hadoop2.7.7
6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content
The K-means clustering algorithm is implemented under Spark using MLlib, and the data is clustered using the K-means algorithm.
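As background, K-means with k clusters chooses centers \mu_1, \ldots, \mu_k that minimize the within-cluster sum of squared distances:

J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2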
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession

object KMeansExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder()
      .appName("KMeansExample")
      .master("local[*]")
      .getOrCreate()
    // Load the data.
    val dataset = spark.read.format("libsvm").load("file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")
    // Train a k-means model with k = 2 and a fixed seed for reproducibility.
    val kmeans = new KMeans()
      .setK(2)
      .setSeed(1L)
    val model = kmeans.fit(dataset)
    // Make predictions (adds a "prediction" column with the cluster index).
    val predictions = model.transform(dataset)
    // Evaluate the clustering by computing the Silhouette score.
    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")
    // Show the cluster centers.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)
    spark.stop()
  }
}
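A common extension is to compare several values of k by their silhouette scores. A minimal sketch, assuming the same dataset and evaluator as in the code above (the loop bounds are arbitrary):

for (k <- 2 to 5) {
  val m = new KMeans().setK(k).setSeed(1L).fit(dataset)   // retrain with a different k
  val s = evaluator.evaluate(m.transform(dataset))        // silhouette for this k
  println(s"k = $k, silhouette = $s")
}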
Data:
Figure 4-1 sample_kmeans_data.txt
Ⅴ Test results
Figure 4-2 Result
Silhouette with squared euclidean distance = 0.9997530305375207
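For reference, the classical silhouette coefficient of a sample i is

s(i) = \frac{b(i) - a(i)}{\max(a(i),\ b(i))}

where a(i) is the mean distance from i to the other samples in its own cluster and b(i) is the mean distance to the samples of the nearest other cluster. Spark's ClusteringEvaluator reports the mean of a squared-euclidean variant of this score, so values close to 1 indicate well-separated clusters.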
Reason:
The clustering effect is very good: each sample is close to the other samples in its own cluster and relatively far from samples in other clusters. This may be because the dataset itself separates well, the between-cluster differences are large while the within-cluster differences are small, and the chosen number of clusters (k = 2) suits the data.