Performing PCA after Reading Files

Latest recommended article published on 2024-09-11 18:01:31

清婉若君

Tags: python

Article link: https://blog.csdn.net/qingwanruojun/article/details/138294887

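This article describes how to use principal component analysis (PCA) in Python to process large datasets, covering data reading, the normalization step, and the details of running PCA with the sklearn library.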

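When a large number of measured variables has to be reduced to a few principal components, principal component analysis (PCA) is the tool to use. The first step is to get the data into Python. The script below reads each comma-separated text file line by line and saves its contents as an Excel workbook (the comments in the code explain each step):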

import openpyxl

# Define a list of filenames (base names of the .txt files, without the extension)
filenames = []

# Define transportation modes corresponding to each number
excel_file_names = {1: 'walk',
                    2: 'car',
                    3: 'run',
                    4: 'scooter',
                    5: 'bike',
                    6: 'tramway',
                    7: 'bus',
                    8: 'train'}

# Counter for assigning transportation modes
cnt = 1

# Iterate over the list of filenames
for name in filenames:
    # Create a new Excel workbook
    wb = openpyxl.Workbook()

    # Reuse the default worksheet and name it 'data'
    # (create_sheet would leave an extra, empty default sheet in the file)
    ws = wb.active
    ws.title = 'data'

    # Open the text file
    with open(f'E:/python_analyse_data/multimodal_transport_analytics/Collecty_data/{name}.txt', 'r') as data_file:
        # Iterate over the data file line by line
        for contents in data_file:
            # Split each line by comma and convert it into a list
            xlsx_contents = contents.strip().split(',')
            print(xlsx_contents)

            # Append the data to the Excel worksheet
            ws.append(xlsx_contents)

    # Get the corresponding Excel filename based on the counter
    excel_file_name = excel_file_names[cnt]

    # Save the workbook as an Excel file
    wb.save(f'data_to_xlsx{excel_file_name}.xlsx')

    # Increment the counter for the next transportation mode
    cnt += 1
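
Note that `split(',')` returns strings, so the script above writes every field to the worksheet as text, while PCA needs a numeric matrix. As a minimal sketch of the conversion, assuming every non-empty line contains only numbers, the lines can be parsed into rows of floats. The helper name `parse_numeric_rows` and the sample data are hypothetical and used only for illustration:

```python
import io


def parse_numeric_rows(lines):
    """Convert comma-separated text lines into rows of floats.

    Hypothetical helper: assumes every non-empty line holds only numbers.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([float(value) for value in line.split(',')])
    return rows


# Made-up sample standing in for the contents of one .txt file
sample = io.StringIO("1.0,2.0,3.0\n4.0,5.5,6.0\n\n7,8,9\n")
rows = parse_numeric_rows(sample)
print(rows)
```

The same function accepts an open file object in place of the `io.StringIO` sample, and the resulting rows can be passed to `ws.append` or collected into the matrix used for PCA.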

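Next we apply principal component analysis to reduce the data to a few principal components (three in our problem). We first normalize the data with min-max scaling and then run PCA with the PCA class from scikit-learn, which is a third-party library rather than part of Python itself. The code is as follows (the comments in the code explain each step):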

import numpy as np

# Sample matrix: replace with your data (one row per sample, one column per feature)
M = []

# Find the minimum and maximum values for each column
min_vals = np.min(M, axis=0)  # Find the minimum value for each column
max_vals = np.max(M, axis=0)  # Find the maximum value for each column

# Normalization
normalized_matrix = (M - min_vals) / (max_vals - min_vals)

from sklearn.decomposition import PCA

# Run PCA on the normalized matrix computed above
data = normalized_matrix

# Perform Principal Component Analysis (PCA)
pca = PCA()
pca.fit(data)

# Get the principal axes (each row is one principal component; often loosely called loadings)
coeff = pca.components_

# Get the projected data matrix
score = pca.transform(data)

# Get the eigenvalues of the principal components
latent = pca.explained_variance_

# Get the explained variance ratio for each principal component
explained = pca.explained_variance_ratio_

# Output the explained variance ratio for the first few principal components
num_components = 3  # Assume output for the first three principal components
print('Explained variance ratio of the first', num_components, 'components:')
print(explained[:num_components])
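
To make the two steps concrete, here is a minimal end-to-end sketch on synthetic data using only NumPy. It is not the scikit-learn implementation: it computes PCA through a singular value decomposition of the mean-centered data, which yields the same principal axes, scores, and variances up to the sign of each component. The helper names `min_max_normalize` and `pca_svd` and the random test matrix are assumptions made for illustration. The normalization helper also guards against constant columns, which would otherwise cause a division by zero.

```python
import numpy as np


def min_max_normalize(X):
    """Scale each column to [0, 1]; constant columns are mapped to 0."""
    X = np.asarray(X, dtype=float)
    min_vals = X.min(axis=0)
    ranges = X.max(axis=0) - min_vals
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant columns
    return (X - min_vals) / ranges


def pca_svd(X, num_components):
    """PCA via SVD of the mean-centered data (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variances = S ** 2 / (X.shape[0] - 1)  # eigenvalues of the covariance matrix
    ratios = variances / variances.sum()   # explained variance ratio
    components = Vt[:num_components]
    scores = X_centered @ components.T     # data projected onto the components
    return components, scores, variances[:num_components], ratios[:num_components]


# Synthetic data: 50 samples, 5 features on very different scales
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 5)) * np.array([1.0, 10.0, 100.0, 0.5, 2.0])

data = min_max_normalize(raw)
components, scores, variances, ratios = pca_svd(data, num_components=3)
print(scores.shape)
print(ratios)
```

Here `components`, `scores`, `variances`, and `ratios` play the roles of `coeff`, `score`, `latent`, and `explained` in the code above, restricted to the first three components.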

That is how to extract data from large files and run PCA on it. I hope this post is helpful!