Big Data Experiments

The full experimental report can be downloaded directly from: https://download.csdn.net/download/qq_63015047/89374537

Experimental Report of Big Data Analytics

College:
Class:
ID:
Name:
Teachers:

Office of Academic Affairs
March 2024

Experimental task 1: Hadoop and Spark construction and application development

Ⅰ Experimental purpose and requirements

(1)Students are required to be able to build Hadoop and Spark environments.
(2)Students are required to be able to start the Spark service process correctly.
(3)Students are required to be able to implement Spark applications and run them correctly.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Install the Hadoop and Spark environment on a virtual machine or a Linux system, and start the daemon processes. Use Spark to implement and run the WordCount program.
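Once the daemons are started, the HDFS web UI (port 50070) and the Spark web UI (port 8080) should be reachable, as shown in the result figures below, and the environment can be checked quickly from spark-shell. The following is a minimal sketch; the HDFS URI and file path are assumptions that depend on the local Hadoop configuration:

Code (sketch):
// Run inside spark-shell, where the SparkContext is already available as `sc`.
println(sc.version)   // expected to print 2.4.4
println(sc.master)
// Reading a file through HDFS confirms that the NameNode is reachable
// (the URI and path below are assumptions; adjust them to the local setup).
val probe = sc.textFile("hdfs://localhost:9000/user/hadoop/input.txt")
println(probe.count())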

Ⅳ Test case
Data:

Figure 1-1 input.txt
Code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{Level, Logger}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Silence Spark's internal logging so that only the results are printed.
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the local input file, split each line into words, and count each word.
    val input = sc.textFile("file:/home/liqing/桌面/input.txt")
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    val result = wordCounts.collect()
    result.foreach(println)
    sc.stop()
  }
}
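To run the program outside spark-shell, it can be compiled and packaged with sbt and then submitted with spark-submit. A minimal build.sbt along these lines should match the versions listed in the experimental environment (the project name is an assumption; if the installed Spark distribution was built against Scala 2.11 rather than 2.12, scalaVersion must be changed accordingly):

Code (sketch):
name := "WordCount"

version := "1.0"

// Matches the Scala and Spark versions listed in the experimental environment.
scalaVersion := "2.12.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4" % "provided"

The packaged jar can then be submitted with spark-submit --class WordCount.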
Ⅴ Test results

Figure 1-2 visit 192.168.37.146:50070

Figure 1-3 visit 192.168.37.146:8080

Figure 1-4 Result

WordCount is a counting task rather than a classification task, so it cannot be evaluated using Precision, Recall, or F1 score.

Experimental task 2: Spark MLlib implements linear regression algorithm

Ⅰ Experimental purpose and requirements
(1)Students are required to accurately understand the basic principles of the linear regression analysis algorithm;
(2)Students are required to be able to implement and run the basic linear regression algorithm using the MLlib;
(3)Students are required to run the linear regression algorithm to obtain the fitting curve and analyze the fitting effect.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Implement the linear regression algorithm using MLlib under Spark, fit the input data set to obtain the required regression formula, and validate the fitted curve.
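For reference, the model fitted by LinearRegressionWithSGD below is a linear function of the features (with no intercept term by default), and the fit is validated with the mean squared error (MSE):

\hat{y} = \mathbf{w}^{\top}\mathbf{x}, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2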
Ⅳ Test case
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object LinearRegressionExample {
  val DATA_PATH = "/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/ridge-data/lpsa.data"

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("LinearRegressionExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each line of lpsa.data has the form "label,feature1 feature2 ...".
    val data = sc.textFile(DATA_PATH)
    val parsedData = data.map { line =>
      val parts = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    }

    // Train a linear regression model with stochastic gradient descent.
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate the fit on the training data with the mean squared error.
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count()
    println(MSE)

    sc.stop()
  }
}

Data:

Figure 2-1 lpsa.data
Ⅴ Test results

Figure 2-2 Result

Using MSE for evaluation, MSE = 6.207597210613578.
Reason: The relatively large error may be due to a non-linear relationship between the features of the input dataset and the target variable, to noise or outliers in the dataset, or to improper settings of numIterations and the learning rate.
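One way to probe the learning-rate explanation is to standardize the features and pass an explicit step size to the trainer. The following is a minimal sketch that reuses the parsedData RDD from the code above; the step size 0.1 and 200 iterations are assumed values for experimentation, not results reported here:

Code (sketch):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Standardize the features so that SGD converges more smoothly.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedData.map(_.features))
val scaledData = parsedData.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

// Retrain with an explicit step size (learning rate) and more iterations, then recompute the MSE as above.
val tunedModel = LinearRegressionWithSGD.train(scaledData, 200, 0.1)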

Experimental task 3: Spark MLlib implement support vector machine algorithm

Ⅰ Experimental purpose and requirements
(1)Students are required to understand the basic principles of the classification algorithm;
(2)Students are required to understand the classification principle of the SVM algorithm;
(3)Students are required to use the Mllib to implement the SVM algorithm and to classify the data.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Implement the support vector machine (SVM) classification algorithm under Spark using MLlib, and use it to classify the relevant data.
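For reference, SVMWithSGD fits a linear decision function by minimizing the hinge loss (with L2 regularization by default), and the classifier below is evaluated with precision, recall, and the F1 score:

f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^{\top}\mathbf{x} + b\right), \qquad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R}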
Ⅳ Test case
Code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.optimization.L1Updater

object SVMWithSGDExample {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("SVMWithSGDExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1).cache()

    // Run the training algorithm to build the model.
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Predict class labels (0.0 or 1.0) on the test set.
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val accuracy = 1.0 * scoreAndLabels.filter(x => x._1 == x._2).count() / test.count()

    // Confusion-matrix counts for the positive class (label 1.0).
    val tp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 1.0).count().toDouble
    val fp = scoreAndLabels.filter(x => x._1 == 1.0 && x._2 == 0.0).count().toDouble
    val fn = scoreAndLabels.filter(x => x._1 == 0.0 && x._2 == 1.0).count().toDouble
    val precision = tp / (tp + fp)
    val recall = tp / (tp + fn)
    val f1Score = 2 * (precision * recall) / (precision + recall)
    println(s"Accuracy: $accuracy")
    println(s"Precision: $precision")
    println(s"Recall: $recall")
    println(s"F1-score: $f1Score")
    sc.stop()
  }
}
Data:

Figure 3-1 sample_libsvm_data.txt

Ⅴ Test results

Figure 3-2 Result
Accuracy = 0.976744180465116
Precision = 1.0
Recall = 0.9565217391304348
F1 score = 0.9777777777777777
Reason:
The accuracy, precision, recall, and F1 score are all very high, and no false-positive predictions were made at all. This may be because the dataset is imbalanced, with the number of positive samples far exceeding the number of negative samples.
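To check the imbalance explanation, the class counts and the area under the ROC curve can be inspected directly. The following is a minimal sketch that reuses the test and scoreAndLabels RDDs from the code above (BinaryClassificationMetrics is already imported there):

Code (sketch):
// Count how many test samples belong to each class to check for imbalance.
val classCounts = test.map(_.label).countByValue()
println(s"Class counts in the test set: $classCounts")

// Area under the ROC curve computed from the (prediction, label) pairs above.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")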

Experimental task 4: Spark MLlib implement K-means algorithm

Ⅰ Experimental purpose and requirements
(1)Students are required to be able to understand the basic principles of cluster algorithm.
(2)Students are required to be able to understand the principles and process of K-means.
(3)Students are required to be able to implement K-means algorithm and clustering data using Mllib.
Ⅱ Experimental environment and software

  1. VMware Workstation Pro 16
  2. ubuntu-14.04-server-amd64
  3. jdk1.8
  4. scala2.12.6
  5. hadoop2.7.7
  6. spark2.4.4-bin-hadoop2.7
Ⅲ Experimental content

Implement the K-means clustering algorithm under Spark using MLlib, and cluster the data with it.
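For reference, K-means partitions the samples into k clusters by minimizing the within-cluster sum of squared distances, and the clustering below is evaluated with the silhouette coefficient:

\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \left\| \mathbf{x} - \boldsymbol{\mu}_j \right\|^2, \qquad s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

where \mu_j is the centroid of cluster C_j, a(i) is the mean distance from sample i to the other samples in its own cluster, and b(i) is the mean distance to the samples in the nearest other cluster.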
Ⅳ Test case
Code:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object KMeansExample {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder()
      .appName("KMeansExample")
      .master("local[*]")
      .getOrCreate()

    // Loads data.
    val dataset = spark.read.format("libsvm").load("file:/usr/local/spark/spark-2.4.4-bin-hadoop2.7/data/mllib/sample_kmeans_data.txt")

    // Trains a k-means model.
    val kmeans = new KMeans()
      .setK(2)
      .setSeed(1L)
    val model = kmeans.fit(dataset)

    // Make predictions.
    val predictions = model.transform(dataset)

    // Evaluate clustering by computing the Silhouette score.
    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)
    println(s"Silhouette with squared euclidean distance = $silhouette")

    // Shows the result.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}

Data:

Figure 4-1 sample_kmeans_data.txt

Ⅴ Test results

Figure 4-2 Result
Silhouette with squared euclidean distance = 0.9997530305375207
Reason:
The clustering effect is very good: each sample is close to the other samples in its own cluster and relatively far from the samples in other clusters. This is likely because the dataset separates well into clusters, the differences between clusters are large while the differences within clusters are small, and the number of clusters (k = 2) is appropriate.
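To confirm that k = 2 is a reasonable choice, the silhouette score can be compared for several values of k. The following is a minimal sketch that reuses the dataset loaded in the code above; the range 2 to 4 is an arbitrary choice for illustration:

Code (sketch):
// Compare silhouette scores for several cluster counts.
for (k <- 2 to 4) {
  val m = new KMeans().setK(k).setSeed(1L).fit(dataset)
  val s = new ClusteringEvaluator().evaluate(m.transform(dataset))
  println(s"k = $k, silhouette = $s")
}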

Cover Designer: Li Jia

Tel: +86-335-8057068
Website: http://jwc.ysu.edu.cn
E-mail: xsyj@ysu.edu.cn
No. 438 West Hebei Avenue,
Qinhuangdao, Hebei 066004, P. R. China
