News Clustering and News Classification with Hadoop + Spark (Yanshan University Big Data Level-3 Project)

Since the upload to CSDN is somewhat disorganized, you can visit my homepage to view and download the corresponding resources:

[Free] News clustering + news classification (hadoop+spark+scala) resources - CSDN Library

Abstract

This project uses the Bayesian classification algorithm and the K-Means algorithm to classify and cluster news. We work with text data, including news headlines and content, and apply feature engineering and machine learning algorithms to automate news classification and clustering. Comparing the experimental results, we find that the algorithms have their own advantages and applicable scenarios: the Bayesian classification algorithm suits simple scenarios and small-scale data sets, while the K-Means algorithm suits complex, large-scale data sets where high accuracy is required. The results of this project have theoretical and practical significance for automated news classification and clustering.

Keywords: Big Data Analytics; Bayesian classification algorithm; K-Means algorithm

Contents

Abstract

1 Preface

1.1 Research background

1.2 Purpose and significance of research

1.3 Research status in related fields

2  The principle and method of realization

2.1 The process of machine learning

2.1.1 Concept of Machine Learning

2.1.2 Three processes of machine learning

2.2 Machine learning algorithms

2.2.1 K-Means Algorithm

2.2.2 Naive Bayes Algorithm

3 General Design

3.1 Introduction to development tools

3.1.1 VMware Workstation

3.1.2 Hadoop

3.1.3 Spark

3.2 Feature Overview

3.2.1 Data Processing

3.2.2 Dataset Division

3.2.3 Model training

4 Detailed Design

4.1 Data Processing

4.2 Concrete implementation of news clustering

4.3 Concrete implementation of news classification

5 Problems encountered and solutions

6 Conclusion


1 Preface

1.1 Research background

With the rapid development of the Internet and the explosive growth of information, the amount of news that people encounter in daily life has increased dramatically. In the face of massive news data, classifying and clustering it efficiently has become an important challenge. Traditional manual classification requires a great deal of labor and time and is inefficient on large-scale data. Automating news classification and clustering with machine learning algorithms is therefore an attractive solution.

As a classical probabilistic classification method, the Bayes classification algorithm works well in text classification. Its basic principle is to classify text features by computing conditional probabilities according to Bayes' theorem; it is simple to understand and fast to compute.
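Concretely, for a class c and observed features x, Bayes' theorem gives the posterior probability used for classification:

    P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}

The classifier assigns the class with the largest posterior.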

The K-Means algorithm also performs well here. With the development of the Internet, people can easily obtain a large amount of news, which also leads to information overload. To help users manage and browse this information more effectively, automated news clustering methods are needed.

Against this background, this project studies the performance of the Bayesian classification algorithm and the K-Means algorithm in news classification and clustering tasks, providing theoretical and practical support for automated news classification and clustering.

1.2 Purpose and significance of research

Purpose:

1. Explore the performance of different algorithms: The primary goal of this project is to explore how the Bayesian classification algorithm and the K-Means algorithm perform in news classification and clustering tasks, and to find out which performs better under which conditions.

2. Improve the efficiency of news classification: Through automated news classification and clustering, the efficiency of news management and retrieval can be greatly improved. The project aims to use these machine learning algorithms to provide an automated solution to quickly classify large amounts of news data.

3. Enhance the ability of news analysis: Studying the application of these algorithms to news data can provide technical support for topic analysis and hot-spot tracking of news content, leading to a better understanding of news trends and public concerns.

Significance:

1. Support the automation and intelligence of the news media industry: Automated news classification and clustering will help the news media industry organize and manage content more effectively, improve work efficiency and reduce labor costs.

2. Promote the improvement of user experience: When news content can be accurately classified and organized, users can find news of interest more easily and enhance their reading experience. This has important commercial significance for news websites and news applications.

3. Provide research data and cases: The research results of the project can provide valuable data and cases for academic research, and help other researchers to further explore the application of machine learning in the field of journalism and media.

4. Facilitate access to and dissemination of information: Through automated news classification, access to and dissemination of information will become faster and more accurate. This has a positive impact on the flow of information and decision-making in social, economic, political and other aspects.

1.3 Research status in related fields

1. Bayesian classification algorithm:

The Bayesian classification algorithm has a long history in text classification and has shown good results in practice. Researchers continue to improve and optimize it for different application scenarios; some work focuses on improving performance, for example raising classification accuracy and generalization ability through feature selection, model tuning, and similar methods. In the news domain, Bayesian classifiers are often used for tasks such as topic classification and sentiment analysis, providing strong support for the automatic processing of news content.

2. K-Means algorithm:

The K-Means algorithm is a common clustering algorithm used to divide a data set into K different groups or clusters. In news clustering, K-Means is widely used to classify and organize large numbers of news documents so that users can browse and obtain the information they need more easily. News clustering not only helps users organize and browse news, but can also be used to mine latent patterns and trends in news data, which matters for news editing, market analysis, and public-opinion monitoring.

2  The principle and method of realization

2.1 The process of machine learning

2.1.1 Concept of Machine Learning

Machine learning has many definitions, but the core idea is to use algorithms and statistics to allow computers to learn from data and improve their performance automatically. Specifically, machine learning algorithms build a mathematical model based on sample data (called "training data") in order to make predictions or decisions without being explicitly programmed to perform the task.

It focuses on how to simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent.

2.1.2 Three processes of machine learning

1 Data preparation

This is the first and crucial step in machine learning. In this phase, a dataset relevant to the problem needs to be collected and preprocessed. Data preprocessing may include data cleaning (such as removing missing values, outliers, duplicate values, etc.), data transformation (such as text vectorization, normalization, standardization, etc.), feature extraction or selection (selecting features that are helpful for model prediction), etc.

The goal of data preparation is to provide a model with clean, complete, and consistent data in order to improve the accuracy and reliability of the model.

2 Model training

In this phase, the prepared data is used to build and train the machine learning model. The goal of model training is to be able to predict and classify new unknown data by learning patterns and regularities in the data.

The training process involves selecting an appropriate algorithm and model architecture and iteratively optimizing the parameters to improve the performance of the model. During training, the model continuously adjusts its internal parameters based on the input data to minimize prediction error.

Common machine learning algorithms include decision trees, support vector machines, neural networks, etc.

3 Model evaluation and deployment

After the model is trained, the performance of the model needs to be evaluated using a portion of the data that was not used in the model training (the test set). The evaluation metrics can be accuracy, precision, recall, F1 score, AUC-ROC, etc.
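In terms of true/false positives and negatives (TP, FP, TN, FN), the first four metrics are defined as:

    \mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad
    \mathrm{precision} = \frac{TP}{TP + FP}, \quad
    \mathrm{recall} = \frac{TP}{TP + FN}, \quad
    F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

AUC-ROC instead measures how well the model ranks positive examples above negative ones across all decision thresholds.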

By evaluating the results, the model can be judged good or bad, and further optimization can be carried out. If the model performs poorly, it may be necessary to adjust the model parameters, try different algorithms, or improve the data preprocessing steps.

Once the model performance meets the requirements, it can be deployed in real applications for processing new unknown data and producing prediction results.

2.2 Machine learning algorithms

2.2.1 K-Means Algorithm

The K-Means algorithm is a widely used unsupervised learning algorithm for partitioning data points into K clusters. It iteratively assigns data points to the nearest cluster center and updates the centers until a stopping criterion is met (e.g., the cluster centers no longer change significantly or the maximum number of iterations is reached).

The key idea of K-Means is to partition data points by distance (usually Euclidean distance) so that the sum of squared distances between each data point and its cluster center is minimized. This makes points within the same cluster as similar as possible, and points in different clusters as dissimilar as possible.
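Formally, with clusters C_1, ..., C_K and centroids \mu_1, ..., \mu_K, K-Means minimizes the within-cluster sum of squares

    J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

which is exactly the cost that the computeCost call in section 4.2 reports.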

The K-Means algorithm has been widely used in many fields because of its simplicity, computational efficiency, and ease of implementation.

2.2.2 Naive Bayes Algorithm

Naive Bayes is a classification algorithm based on probability theory. It rests on Bayes' theorem and classifies samples by computing the probability that they belong to each class. Its core assumption is conditional independence between features: the occurrence of one feature does not depend on the occurrence of the others.

The naive Bayes classifier originates from classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency. It needs few estimated parameters, is not sensitive to missing data, and is relatively simple. In theory the naive Bayes model achieves the smallest error rate among classification methods, but in practice the independence assumption rarely holds, since features are usually not completely independent, and this can affect the classification results.

The naive Bayes algorithm is widely used in text classification, spam identification, sentiment analysis, and other fields. For text data it usually assumes that word frequencies in the text are mutually independent. Under this assumption, the algorithm computes the probability of the text belonging to each class and selects the class with the largest probability as the classification result.

3 General Design 

3.1 Introduction to development tools

3.1.1 VMware Workstation 

VMware Workstation Pro is a powerful desktop virtualization software, which allows users to run different operating systems on a single desktop at the same time, providing the best solution for developing, testing and deploying new applications.

Multi-operating system support: VMware Workstation Pro supports running Windows, Linux, and BSD virtual machines on a Windows or Linux desktop, allowing users to easily work in different operating system environments.

Snapshots and clones: Users can create snapshots of a virtual machine to save the state of the virtual machine at a point in time. This makes it easy to roll back to a previous state when experimenting, testing, or developing. At the same time, users can also clone virtual machines to quickly create similar virtual machines.

Advanced network configuration: VMware Workstation provides flexible network configuration options, so users can simulate different network environments for network testing and development.

Virtualized hardware support: the software virtualizes hardware and supports hardware virtualization extensions such as Intel VT-x and AMD-V.

Shared virtual machines: Users can package and share virtual machines with other users who use VMware Workstation to facilitate the distribution and deployment of virtual environments.

Application Scenarios: VMware Workstation Pro gives IT professionals the ability to design, test, and demonstrate software solutions for virtually any device, platform, or cloud environment.

3.1.2 Hadoop

Hadoop provides HDFS (Hadoop Distributed File System) as a distributed file system and the MapReduce programming framework to process the data stored in HDFS. Hadoop offers high reliability, scalability, efficiency, and fault tolerance, and can process petabytes of data. It lets users develop distributed programs without knowing the low-level details of distribution, making full use of the cluster for high-speed computing and storage.

3.1.3 Spark

Spark is a big data processing framework designed to process large amounts of data quickly. Based on in-memory computing, it can significantly improve data-processing performance. Spark provides a rich set of APIs and libraries that make it easy to build complex data processing and analysis tasks.

At the core of Spark are Resilient Distributed Datasets (RDDs), a distributed data abstraction that handles fault tolerance and parallel computation automatically.

In addition, Spark provides a variety of optimization strategies, such as RDD partitioning, caching, and broadcasting, to further improve data processing efficiency and performance.
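As an illustrative sketch of these strategies (a minimal example, not the project's code; the path and stop-word list are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddOptimizationSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-opts").setMaster("local[*]"))

        // Partitioning: ask for a minimum number of partitions when reading.
        val lines = sc.textFile("data/news", minPartitions = 8)

        // Caching: keep the tokenized RDD in memory because it is reused below.
        val words = lines.flatMap(_.split("\\s+")).cache()

        // Broadcasting: ship the read-only stop-word set to each executor once.
        val stopWords = sc.broadcast(Set("的", "是", "了"))
        val kept = words.filter(w => !stopWords.value.contains(w))

        println("total words: " + words.count())
        println("kept words:  " + kept.count())
        sc.stop()
      }
    }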

3.2 Feature Overview

3.2.1 Data Processing

Traverse the dataset directory with the operating system's file APIs, open all files in UTF-8 mode, store their contents in a string array, and then load the data into a DataFrame. Data preprocessing consists of three steps: text filtering, word segmentation, and stop-word removal:

1. Use regular expressions to filter out punctuation marks and other noise.

2. Use the Ansj word segmentation tool to split continuous Chinese text into independent words.

3. Use TF-IDF to extract and select features from the data: a word's importance in the corpus is computed from how often it appears in a text and how many documents it appears in overall (see the formula below).
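For a term t and a document d in a corpus of N documents, the weighting is

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N + 1}{\mathrm{df}(t) + 1}

where tf(t, d) counts occurrences of t in d and df(t) is the number of documents containing t; the smoothed logarithm shown here is the form Spark's IDF estimator computes.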

3.2.2 Dataset Division 

To evaluate the performance of the trained model, the dataset is divided into two parts: a training set used to train the model and a test set used to evaluate it.

3.2.3 Model training

1  K-Means

The K-Means algorithm divides the dataset into k clusters so that each data point belongs to its nearest cluster center, and each center is the average of all data points assigned to it. The algorithm is based on iterative optimization: each iteration updates the cluster centers until a convergence condition is reached.

 Figure 3-1  K-Means process
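To make the two iterative steps concrete, here is a minimal in-memory sketch of one K-Means pass written from scratch (illustration only; the project itself uses Spark MLlib's KMeans, shown in section 4.2):

    // One K-Means iteration: assign points to the nearest centroid, then recompute centroids.
    def squaredDist(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    def kMeansStep(points: Seq[Array[Double]],
                   centroids: Seq[Array[Double]]): Seq[Array[Double]] = {
      // Assignment step: group each point under the index of its nearest centroid.
      val assigned = points.groupBy(p => centroids.indices.minBy(i => squaredDist(p, centroids(i))))
      // Update step: each new centroid is the mean of its assigned points;
      // a centroid that attracted no points is left unchanged.
      centroids.indices.map { i =>
        assigned.get(i) match {
          case Some(ps) =>
            val sums = ps.foldLeft(Array.fill(ps.head.length)(0.0)) { (acc, p) =>
              acc.indices.foreach(j => acc(j) += p(j)); acc
            }
            sums.map(_ / ps.size)
          case None => centroids(i)
        }
      }
    }
    // Repeat kMeansStep until the centroids stop moving (or a maximum iteration count is reached).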

2  Naive Bayes

1. Compute the prior for each class (the probability of each class occurring) from the sample data and labels.

2. For each feature, compute the conditional probability of the feature given each class.

3. Use Bayes' theorem to calculate the posterior probability (the probability of each class given the observed feature values).

4. Select the class with the maximum posterior probability as the classification result of the sample.
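Under the conditional-independence assumption, these four steps reduce to

    \hat{y} = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

The evidence P(x) is the same for every class, so it can be dropped from the comparison.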

  Figure 3-2  Naive Bayes algorithm 


4 Detailed Design

4.1 Data Processing

We process the data based on the dataset provided by the project. Data processing consists of the following steps:

Read files: traverse the dataset directory and read each file (a sketch follows this list).

Word segmentation: use the Ansj tool to split continuous Chinese text into independent words.

Feature extraction: use TF-IDF to extract features.
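The code later in this section covers word segmentation and TF-IDF; for the first step, a minimal sketch of traversing a directory and reading every file as UTF-8 could look like this (the data/news path is an assumption, not the project's actual layout):

    import java.io.File
    import scala.io.Source

    // Recursively collect every file under the dataset root and read it as UTF-8 text.
    def readAllFiles(root: File): Seq[String] = {
      val entries = Option(root.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
      val (dirs, files) = entries.partition(_.isDirectory)
      val texts = files.map { f =>
        val src = Source.fromFile(f, "UTF-8")
        try src.mkString finally src.close()
      }
      texts ++ dirs.flatMap(readAllFiles)
    }

    val documents: Seq[String] = readAllFiles(new File("data/news"))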

Flow chart:

Figure 4-1  Data processing process

Results:

Figure 4-2  Results of data processing

Code:

Use the Ansj word segmentation tool to tokenize the input lines of text and filter out stop words and words with certain parts of speech.

  def tokenizer2(line: String): Seq[String] = {
    AnsjSegment(line)
      .split(",")
      .filter(_ != null)
      .toSeq
  }

  def AnsjSegment(line: String): String = {
    val StopWords = List("的", "是", "了")
    val KeepNatures = List("n", "v", "a", "m", "t") // keep nouns, verbs, adjectives, numerals, time words; drop r (pronouns) and w (punctuation)
    val words = ToAnalysis.parse(line)
    val word = ArrayBuffer[String]()
    for (i <- Range(0, words.size())) {
      // keep a term only if its part of speech is wanted, it is at least two characters long, and it is not a stop word
      if (KeepNatures.contains(words.get(i).getNatureStr.substring(0, 1)) && words.get(i).getName.length() >= 2 && !StopWords.contains(words.get(i).getName))
        word += words.get(i).getName
    }
    word.mkString(",")
  }

The text is first converted to a word-frequency vector using HashingTF, the frequencies are then rescaled using IDF, and finally each row is converted into a (vector, label) tuple for use in the machine learning algorithms.

   

    val hashingTF = new HashingTF().setInputCol("sentence").setOutputCol("rawFeatures").setNumFeatures(1000)
    val featuredData = hashingTF.transform(dataFrame)
    featuredData.show(5)
    println("*****featured*****DataCount*****=" + featuredData.count())

    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featuredData)
    val rescaledData = idfModel.transform(featuredData)
    rescaledData.show(5)

    // convert each row to an (mllib Vector, label) pair for the RDD-based algorithms below
    val value = rescaledData.select("features", "label").rdd.map(row => {
      val vector = fromML(row.get(0).asInstanceOf[Vector])
      val label = row.get(1)
      (vector, label)
    })

4.2 Concrete implementation of news clustering

We use the K-Means algorithm to implement news clustering. First, a model is trained with the KMeans algorithm and saved.

Code:

val clusters: KMeansModel = KMeans.train(value.map(_._1), 9, 30, 3) // 9 clusters, 30 max iterations, 3 runs
clusters.save(sc, "/target/org/apache/spark/KMeansExample/KMeansModel")

Output the number of clusters and centroid information:

    println("Cluster Number:" + clusters.clusterCenters.length)

    println("Cluster Centers Information Overview:")

    var clusterIndex: Int = 0

    clusters.clusterCenters.foreach(x => {

      println("聚类质心点向量:" + clusterIndex + ":")

      println(x)

      clusterIndex += 1

    })

The cost of the K-Means clustering (the sum of squared distances from all points to their nearest centroids, i.e., the objective J above) is calculated with the computeCost method and can be used to evaluate the model.

Code:

    val kMeansCost = clusters.computeCost(value.map(_._1))
    println("K-Means Cost: " + kMeansCost)

The KMeans model is used to predict each data point and save the clustering results.

Code:

    value.map(x => {
      clusters.predict(x._1) + ":" + x._2.toString // one "predictedCluster:label" line per document
    }).saveAsTextFile("file:~/桌面/kmeansresult")

Results:

Figure 4-3  Output a vector of centroids

Figure 4-4  Output kmeans cost

Figure 4-5  Predicted results

4.3 Concrete implementation of news classification 

We use the Naive Bayes algorithm for news classification.

First, the dataset was randomly divided into a training set (70%) and a test set (30%), and a fixed seed value of 1234L was used to ensure consistent results for each run.

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

Create a NaiveBayes model and train it with training data.

val model = new NaiveBayes()

.fit(trainingData)
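Spark's ml NaiveBayes defaults to the multinomial model with Laplace smoothing of 1.0; the same call with those defaults written out explicitly (a sketch) is:

    import org.apache.spark.ml.classification.NaiveBayes

    val model = new NaiveBayes()
      .setModelType("multinomial") // the default model type, suited to word-count features
      .setSmoothing(1.0)           // the default additive (Laplace) smoothing
      .fit(trainingData)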

The trained model is used to make predictions on the test data and display the prediction results for the first 2500 rows.

val predictions = model.transform(testData)

predictions.show(2500)

The accuracy of the test set is calculated using the MulticlassClassificationEvaluator evaluator and the results are output.

val evaluator = new MulticlassClassificationEvaluator()

  .setLabelCol("label")

  .setPredictionCol("prediction")

  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)

println(s"Test set accuracy = $accuracy")

Results:

Figure 4-6  Output the prediction results

Figure 4-7  Output accuracy

5 Problems encountered and solutions

1. Use of the Ansj tokenizer: using Ansj requires some configuration and understanding, and calling its API correctly for tokenization can be a challenge.

Solution: read Ansj's official documentation and sample code to understand how to use the tokenizer correctly for word segmentation.

2. Stop words and part-of-speech filtering: the code filters the segmentation results using stop words and parts of speech, which involves basic natural language processing concepts; these must be understood to implement the filtering logic correctly.

Solution: understand stop-word and part-of-speech filtering, and adjust the stop-word list and the filtering logic according to actual needs so that only the desired keywords are retained.

3. Feature dimension setting: when using HashingTF, the number of features is set to 1000; this must be chosen based on the dataset's actual vocabulary size and memory limits, and may require debugging and optimization.

Solution: considering the actual situation and performance requirements, set the number of features according to the size and dimensionality of the dataset. It can be tuned through methods such as cross-validation (a sketch follows).
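As one way to carry out that tuning (a sketch that assumes the HashingTF, IDF, and NaiveBayes stages are assembled into an ml Pipeline; the grid values are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, IDF}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val hashingTF = new HashingTF().setInputCol("sentence").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val nb = new NaiveBayes()
    val pipeline = new Pipeline().setStages(Array(hashingTF, idf, nb))

    // Try several feature-vector sizes; 3-fold CV keeps the one with the best accuracy.
    val grid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(500, 1000, 2000))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val cvModel = cv.fit(trainingData) // trainingData: DataFrame with "sentence" and "label" columns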

6 Conclusion

This paper describes the basic principles and implementation of data preprocessing, Chinese word segmentation, feature extraction, word vectorization, data standardization, dataset partitioning, model training, and model evaluation. We collected news data and preprocessed it, including text cleaning, word segmentation, and stop-word removal, to facilitate subsequent algorithmic processing. Next, we used TF-IDF and similar methods to extract features from the text as algorithm input. We then tried algorithms such as the Bayesian classification algorithm and the K-Means algorithm to choose the most suitable model for our task, trained the selected models, and tuned parameters to improve classification performance. Finally, we evaluated the trained models and used cross-validation and other methods to verify their generalization ability.

Through this project, we gained a deep understanding of the principles and application scenarios of machine learning algorithms such as the Bayesian classification algorithm and the K-Means algorithm. During implementation we encountered various challenges, such as poor data quality and unsatisfactory model performance; by solving these problems we improved our problem-solving and practical abilities.

