Big Data Experiments

The full experimental report can be downloaded directly from: https://download.csdn.net/download/qq_63015047/89374537

Experimental Report of Big Data Analytics

College:
Class:
ID:
Name:
Teachers:

Office of Academic Affairs
March 2024

Experimental task 1: Hadoop and Spark construction and application development

Ⅰ Experimental purpose and requirements

(1)Students are required to be able to build Hadoop and Spark environments.
(2)Students are required to be able to start the Spark service process correctly.
(3)Students are required to be able to implement Spark applications and run them correctly.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Install the Hadoop and Spark environment on a virtual machine or a Linux system, and start the daemon processes. Use Spark to implement and run the WordCount program.
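Once the daemons are started, the HDFS web UI (port 50070) and the Spark web UI (port 8080) should be reachable, as shown in the result figures below, and the environment can be checked quickly from spark-shell. The following is a minimal sketch; the HDFS URI and file path are assumptions that depend on the local Hadoop configuration:

Code (sketch):
// Run inside spark-shell, where the SparkContext is already available as `sc`.
println(sc.version)   // expected to print 2.4.4
println(sc.master)
// Reading a file through HDFS confirms that the NameNode is reachable
// (the URI and path below are assumptions; adjust them to the local setup).
val probe = sc.textFile("hdfs://localhost:9000/user/hadoop/input.txt")
println(probe.count())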

Ⅳ Test case
Data:

Figure 1-1 input.txt
Code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Silence Spark's internal logging so that only the results are printed.
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the local input file, split each line into words, and count each word.
    val input = sc.textFile("file:/home/liqing/桌面/input.txt")
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    val result = wordCounts.collect()
    result.foreach(println)
    sc.stop()
  }
}
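To run the program outside spark-shell, it can be compiled and packaged with sbt and then submitted with spark-submit. A minimal build.sbt along these lines should match the versions listed in the experimental environment (the project name is an assumption; if the installed Spark distribution was built against Scala 2.11 rather than 2.12, scalaVersion must be changed accordingly):

Code (sketch):
name := "WordCount"

version := "1.0"

// Matches the Scala and Spark versions listed in the experimental environment.
scalaVersion := "2.12.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4" % "provided"

The packaged jar can then be submitted with spark-submit --class WordCount.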
Ⅴ Test results

Figure 1-2 visit 192.168.37.146:50070

Figure 1-3 visit 192.168.37.146:8080

Figure 1-4 Result

WordCount is a counting task rather than a classification task, so it cannot be evaluated using Precision, Recall, or F1 score.

Experimental task 2: Spark MLlib implements linear regression algorithm

Ⅰ Experimental purpose and requirements
(1)Students are required to accurately understand the basic principles of the linear regression analysis algorithm;
(2)Students are required to be able to implement and run the basic linear regression algorithm using the MLlib;
(3)Students are required to run the linear regression algorithm to obtain the fitting curve and analyze the fitting effect.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Implement the linear regression algorithm using MLlib under Spark, fit the input data set to obtain the required regression formula, and validate the fitted curve.
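For reference, the model fitted by LinearRegressionWithSGD below is a linear function of the features (with no intercept term by default), and the fit is validated with the mean squared error (MSE):

\hat{y} = \mathbf{w}^{\top}\mathbf{x}, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2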
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object LinearRegressionExample {
  val DATA_PATH = "/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/ridge-data/lpsa.data"

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("LinearRegressionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each line of lpsa.data has the form "label,feature1 feature2 ...".
    val data = sc.textFile(DATA_PATH)
    val parsedData = data.map { line =>
      val parts = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    }

    // Train a linear regression model with stochastic gradient descent.
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate the fit on the training data with the mean squared error.
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count()
    println(MSE)

    sc.stop()
  }
}

Data:

Figure 2-1 lpsa.data
Ⅴ Test results

Figure 2-2 Result

Using MSE for evaluation, MSE = 6.207597210613578.
Reason: The relatively large error may be due to a non-linear relationship between the features of the input dataset and the target variable, to noise or outliers in the dataset, or to improper settings of numIterations and the learning rate.
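One way to probe the learning-rate explanation is to standardize the features and pass an explicit step size to the trainer. The following is a minimal sketch that reuses the parsedData RDD from the code above; the step size 0.1 and 200 iterations are assumed values for experimentation, not results reported here:

Code (sketch):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Standardize the features so that SGD converges more smoothly.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedData.map(_.features))
val scaledData = parsedData.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

// Retrain with an explicit step size (learning rate) and more iterations, then recompute the MSE as above.
val tunedModel = LinearRegressionWithSGD.train(scaledData, 200, 0.1)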

Experimental task 3: Spark MLlib implement support vector machine algorithm

Ⅰ Experimental purpose and requirements
(1)Students are required to understand the basic principles of the classification algorithm;
(2)Students are required to understand the classification principle of the SVM algorithm;
(3)Students are required to use the Mllib to implement the SVM algorithm and to classify the data.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Implement the support vector machine (SVM) classification algorithm under Spark using MLlib, and use it to classify the relevant data.
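For reference, SVMWithSGD fits a linear decision function by minimizing the hinge loss (with L2 regularization by default), and the classifier below is evaluated with precision, recall, and the F1 score:

f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^{\top}\mathbf{x} + b\right), \qquad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R}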
Ⅳ Test case
Code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.optimization.L1Updater

object SVMWithSGDExample {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("SVMWithSGDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1).cache()

    // Run the training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Predict class labels (0.0 or 1.0) on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val accuracy = 1.0 * scoreAndLabels.filter(x => x._1 == x._2).count() / test.count()

    // Confusion-matrix counts for the positive class (label 1.0).
    val tp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 1.0).count().toDouble
    val fp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 0.0).count().toDouble
    val fn = scoreAndLabels.filter(x => x._1 == 0.0 && x._2 == 1.0).count().toDouble
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    val f1Score = 2 * (precision * recall) / (precision + recall)
    println(s"Accuracy: $accuracy")
    println(s"Precision: $precision")
    println(s"Recall: $recall")
    println(s"F1-score: $f1Score")
    sc.stop()
  }
}
Data:

Figure 3-1 sample_libsvm_data.txt

Ⅴ Test results

Figure 3-2 Result
Accuracy = 0.976744180465116
Precision = 1.0
Recall = 0.9565217391304348
F1 score = 0.9777777777777777
Reason:
The accuracy, precision, recall, and F1 score are all very high, and no false-positive predictions were made at all. This may be because the dataset is imbalanced, with the number of positive samples far exceeding the number of negative samples.
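To check the imbalance explanation, the class counts and the area under the ROC curve can be inspected directly. The following is a minimal sketch that reuses the test and scoreAndLabels RDDs from the code above (BinaryClassificationMetrics is already imported there):

Code (sketch):
// Count how many test samples belong to each class to check for imbalance.
val classCounts = test.map(_.label).countByValue()
println(s"Class counts in the test set: $classCounts")

// Area under the ROC curve computed from the (prediction, label) pairs above.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")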

Experimental task 4: Spark MLlib implement K-means algorithm

Ⅰ Experimental purpose and requirements
(1)Students are required to be able to understand the basic principles of cluster algorithm.
(2)Students are required to be able to understand the principles and process of K-means.
(3)Students are required to be able to implement K-means algorithm and clustering data using Mllib.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Implement the K-means clustering algorithm under Spark using MLlib, and cluster the data with it.
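For reference, K-means partitions the samples into k clusters by minimizing the within-cluster sum of squared distances, and the clustering below is evaluated with the silhouette coefficient:

\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \left\| \mathbf{x} - \boldsymbol{\mu}_j \right\|^2, \qquad s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

where \mu_j is the centroid of cluster C_j, a(i) is the mean distance from sample i to the other samples in its own cluster, and b(i) is the mean distance to the samples in the nearest other cluster.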
Ⅳ Test case
Code:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object KMeansExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder()
      .appName("KMeansExample")
      .master("local[*]")
      .getOrCreate()

    // Loads data.
    val dataset = spark.read.format("libsvm").load("file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")

    // Trains a k-means model.
    val kmeans = new KMeans()
      .setK(2)
      .setSeed(1L)
    val model = kmeans.fit(dataset)

    // Make predictions.
    val predictions = model.transform(dataset)

    // Evaluate clustering by computing the Silhouette score.
    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")

    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}

Data:

Figure 4-1 sample_kmeans_data.txt

Ⅴ Test results

Figure 4-2 Result
Silhouette with squared euclidean distance = 0.9997530305375207
Reason:
The clustering effect is very good: each sample is close to the other samples in its own cluster and relatively far from the samples in other clusters. This is likely because the dataset separates well into clusters, the differences between clusters are large while the differences within clusters are small, and the number of clusters (k = 2) is appropriate.
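To confirm that k = 2 is a reasonable choice, the silhouette score can be compared for several values of k. The following is a minimal sketch that reuses the dataset loaded in the code above; the range 2 to 4 is an arbitrary choice for illustration:

Code (sketch):
// Compare silhouette scores for several cluster counts.
for (k <- 2 to 4) {
  val m = new KMeans().setK(k).setSeed(1L).fit(dataset)
  val s = new ClusteringEvaluator().evaluate(m.transform(dataset))
  println(s"k = $k, silhouette = $s")
}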

Cover Designer: Li Jia

Tel: +86-335-8057068
Website: http://jwc.ysu.edu.cn
E-mail: xsyj@ysu.edu.cn
No. 438 West Hebei Avenue,
Qinhuangdao, Hebei 066004, P. R. China
