Dimensionality Reduction: Can PCA Improve the Performance of a Classification Model?

weixin_26752075
Original article: https://towardsdatascience.com/dimensionality-reduction-can-pca-improve-the-performance-of-a-classification-model-d4e34194c544


What is PCA?

Principal Component Analysis (PCA) is a common feature extraction technique in data science that uses matrix factorization to project data into a lower-dimensional space.
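To make the "matrix factorization" idea concrete, here is a minimal sketch of PCA computed through a singular value decomposition in NumPy. The helper name `pca_svd` and the random demo data are assumptions made for illustration; they are not part of the original article, which may well rely on a library implementation instead.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Center each feature so the components describe variance around the mean
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(S) @ Vt; the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance along each axis, then the fraction of total variance kept
    explained_variance = S**2 / (X.shape[0] - 1)
    explained_ratio = explained_variance[:n_components] / explained_variance.sum()
    # Coordinates of every sample in the reduced space
    return X_centered @ components.T, components, explained_ratio

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # synthetic stand-in data (assumption)
Z, components, ratio = pca_svd(X_demo, n_components=2)
print(Z.shape)  # (100, 2)
```

The returned `ratio` indicates how much of the total variance each retained component carries, which is the usual guide for choosing how many components to keep.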

Real-world datasets often contain too many features. The more features there are, the harder the data is to visualize and work with. Often many of the features are correlated, and therefore redundant. This is where feature extraction comes into play.
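The redundancy of correlated features can be illustrated with a small sketch. It builds six synthetic features that are noisy linear mixes of only two underlying signals, then checks how much variance the first few principal components capture using scikit-learn's `PCA`. The data, the mixing weights, and the variable names are all assumptions chosen for the illustration, not values from the article.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=(500, 2))  # two independent underlying signals
# Hypothetical mixing weights: each of the six columns blends the two signals
mixing = np.array([[1.0, 0.8, 0.5, 0.0, 0.3, -0.6],
                   [0.0, 0.4, -0.7, 1.0, 0.9, 0.5]])
noise = 0.05 * rng.normal(size=(500, 6))
X_corr = base @ mixing + noise  # six strongly correlated features

# Standardize, fit PCA with all components, and accumulate the variance ratios
X_scaled = StandardScaler().fit_transform(X_corr)
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative, 3))
```

In this constructed case the first two components account for nearly all of the variance, so the remaining four dimensions add little information. Real data is rarely this clean, but the same cumulative curve is a common way to judge how far the dimensionality can be reduced.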

About the Data:

The dataset used in this article is the Ionosphere dataset from the UCI Machine Learning Repository. It is a binary classification problem with 351 observations and 34 features.

Preparing the Dataset:

Importing the necessary libraries and reading the dataset
Preprocessing the dataset
Standardization

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


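# Load the Ionosphere data; the file has no header row
data = pd.read_csv("ionosphere.csv", header=None)

# All columns except the last are features; the last column is the class label
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Encode the label: 'g' (good) -> 1, anything else ('b', bad) -> 0
y = np.array([1 if x == 'g' else 0 for x in y])

# Hold out 30% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the scaler on the training split only, then apply it to both splits
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)

# Keep the labels as 1-D integer arrays
y_train = y_train.astype('int')
y_test = y_test.astype('int')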
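As a sanity check on the standardization step, the sketch below repeats the split-then-scale pattern on synthetic data with the same shape as the Ionosphere set (351 x 34). The random data and the fixed `random_state` are assumptions added so the example is self-contained and repeatable; the article's own code does not set a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(351, 34))  # synthetic stand-in
y_demo = rng.integers(0, 2, size=351)

# random_state is an assumption, added only for repeatability
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler learns mean and standard deviation from the training split only
scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)
X_te_std = scaler.transform(X_te)

# Training features end up with mean ~0 and standard deviation ~1
print(np.abs(X_tr_std.mean(axis=0)).max(), X_tr_std.std(axis=0).round(6)[:3])
# Test features are only approximately standardized, which is expected
print(np.abs(X_te_std.mean(axis=0)).max())
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing, so the test score remains an honest estimate.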

Logistic Regression ML model using all 34 features:

The training data has 34 features.

After preprocessing, a Logistic Regression model is fit to the training data for binary classification
Fine-tuning the Logistic Regression model to find the best parameters
Computing training and test accuracy and F1 score
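The steps above can be sketched roughly as follows. The article's own tuning code is not shown here, so this is only one plausible approach: a cross-validated grid search over the regularization parameter `C` with F1 as the selection metric. The synthetic dataset from `make_classification`, the grid of `C` values, the 5-fold setting, and `max_iter=1000` are all assumptions, so the printed scores will not match the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Ionosphere data (351 x 34)
X_syn, y_syn = make_classification(
    n_samples=351, n_features=34, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)
X_te_std = scaler.transform(X_te)

# Hypothetical search grid for the inverse regularization strength C
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": C_values},
    scoring="f1",
    cv=5,
)
grid.fit(X_tr_std, y_tr)

# Mean cross-validated F1 per C: the data behind a "C vs F1" plot
for C, score in zip(C_values, grid.cv_results_["mean_test_score"]):
    print(f"C={C:g}  mean CV F1={score:.3f}")

# Report accuracy and F1 of the best model on both splits
best_model = grid.best_estimator_
scores = {}
for name, X_part, y_part in [("train", X_tr_std, y_tr), ("test", X_te_std, y_te)]:
    pred = best_model.predict(X_part)
    scores[name] = (accuracy_score(y_part, pred), f1_score(y_part, pred))
    print(name, "accuracy=%.3f f1=%.3f" % scores[name])
```

Selecting `C` by cross-validation on the training split, rather than by the test score, keeps the test set untouched until the final evaluation.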

Figure (image by author): Plot of C vs. F1 score for the logistic regression model on the 34-feature dataset

Training
