Hard is the way, hard is the way; so many forking roads, where am I now?

Article link: https://blog.csdn.net/m0_66825091/article/details/143952420

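This was my first hands-on Python task. I spent four hours of an evening taking a huge detour around something that should have been done in one step, and I still ended up with a Warning that I could not get rid of.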

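The task was a major assignment for a university course: predict material properties using Forward Selection. The course pointed us to a tutorial on GitHub, chris-santiago/steps ("A SciKit-Learn style feature selector using best subsets and stepwise regression"), whose example reads: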

import pandas as pd
from steps.forward import ForwardSelector

selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
X.loc[:, selector.best_support_]

1. The correct answer

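The code is just three lines. In the model, X and y should be the training set train_values and the target values train_labels respectively. In the course notebook train_values is a NumPy array, not a DataFrame, so the correct code is: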

from steps.forward import ForwardSelector

selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(train_values, train_labels)
train_values[:, selector.best_support_]

As you can see, the only difference is that .loc has been removed.
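To see why dropping .loc is enough, here is a minimal sketch with made-up data. The mask named support is a hypothetical stand-in for selector.best_support_, which I am assuming is a boolean array with one entry per feature; the steps library itself is not used here.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 samples, 3 features.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0],
                   [10.0, 11.0, 12.0]])

# Hypothetical stand-in for selector.best_support_ (assumed to be a boolean mask).
support = np.array([True, False, True])

# NumPy array: index the columns directly with the mask, no .loc involved.
selected_array = values[:, support]

# DataFrame: the same mask goes through .loc.
frame = pd.DataFrame(values, columns=["a", "b", "c"])
selected_frame = frame.loc[:, support]

print(selected_array.shape)          # (4, 2)
print(list(selected_frame.columns))  # ['a', 'c']
```

Both routes keep the same two columns; only the indexing syntax depends on the container type.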

2. The wrong way

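In fact, ForwardSelector itself accepts either a NumPy array or a DataFrame as input. But I had only learned how to use DataFrames the day before, and I "sharply" spotted df in the dataset getting part earlier in the course material: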

# Pandas Dataframe
all_labels = df['density_of_solid'].tolist()
df = df.drop(['density_of_solid'], axis=1)

df.head(n=10) # With this line you can see the first ten entries of our database

So I took it for granted that the input to this library was a DataFrame. Naturally, running the code produced this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-32-dbad8d7d0bd2> in <module>
      2 selector = ForwardSelector(normalize=True, metric='aic')
      3 selector.fit(train_values, train_labels)
----> 4 train_values.loc[:, selector.best_support_]

AttributeError: 'numpy.ndarray' object has no attribute 'loc'
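A quick look at the type would have pointed to the cause straight away. The helper select_columns below is hypothetical (it is not part of steps or of the course code); it sketches one way to pick columns whether the input is an array or a DataFrame, using made-up data.

```python
import numpy as np
import pandas as pd


def select_columns(data, mask):
    """Return the columns of `data` where `mask` is True.

    Hypothetical helper: handles both pandas DataFrames and NumPy arrays.
    """
    mask = np.asarray(mask, dtype=bool)
    if isinstance(data, pd.DataFrame):
        return data.loc[:, mask]
    return np.asarray(data)[:, mask]


train_values = np.arange(12, dtype=float).reshape(4, 3)  # made-up data
mask = [True, False, True]                                # made-up mask

print(type(train_values).__name__)   # ndarray
print(hasattr(train_values, "loc"))  # False, which is why .loc raises AttributeError

from_array = select_columns(train_values, mask)
from_frame = select_columns(pd.DataFrame(train_values), mask)
print(from_array.shape, from_frame.shape)  # (4, 2) (4, 2)
```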

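I asked GPT and learned that code like this usually works with NumPy arrays, so I went back to the beginning and read the code line by line. Fortunately, the course code is annotated: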

def plot(y_train = np.empty(0), y_test = np.empty(0), predictions_train = np.empty(0), predictions_test = np.empty(0)):
    
    # The reshape functions in the next two lines, turns each of the
    # vertical NumPy array [[x]
    #                       [y]
    #                       [z]]
    # into python lists [ x, y, z]
    
    # This step is required to create plots with plotly like we did in the previous tutorial

    y_train = y_train.reshape(1,-1).tolist()[0]
    y_test = y_test.reshape(1,-1).tolist()[0]    
    predictions_train = predictions_train.reshape(1,-1).tolist()[0]
    predictions_test = predictions_test.reshape(1,-1).tolist()[0]
    k = np.arange(-50,21000).reshape(1,-1).tolist()[0]
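The reshape(1, -1).tolist()[0] idiom in the course code flattens a column vector into a plain Python list. Here is a small sketch with made-up numbers, shown next to ravel().tolist(), which as far as I can tell gives the same result in this case:

```python
import numpy as np

# Made-up column vector with shape (3, 1), like the arrays passed to plot().
column = np.array([[1.5], [2.5], [3.5]])

# Course idiom: reshape to a single row, convert to nested lists, take the row.
as_row = column.reshape(1, -1)  # shape (1, 3)
flat = as_row.tolist()[0]

# Alternative: flatten first, then convert.
flat_alt = column.ravel().tolist()

print(flat)              # [1.5, 2.5, 3.5]
print(flat == flat_alt)  # True
```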

So I used GPT to turn the NumPy arrays back into DataFrames:

# Convert the NumPy arrays back to DataFrames
train_values = pd.DataFrame(train_values)
train_labels = pd.DataFrame(train_labels)
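For reference, this is roughly what that wrapping does, sketched on made-up arrays: the values are unchanged, the columns get default integer labels, and .loc becomes available.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the course arrays.
train_values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
train_labels = np.array([10.0, 20.0, 30.0])

values_df = pd.DataFrame(train_values)
labels_df = pd.DataFrame(train_labels)

print(list(values_df.columns))    # [0, 1]  (default integer column labels)
print(labels_df.shape)            # (3, 1)  (a 1-D array becomes a single column)
print(hasattr(values_df, "loc"))  # True
```

The original feature names are not recovered this way; they would have to be passed in through the columns argument.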

Finally, in order to reuse the plotting code, I converted everything back to NumPy arrays:

#We will rewrite the arrays with the patches we made on the dataset by turning the dataframe back into a list of lists

all_values = [list(df.iloc[x]) for x in range(len(all_values))]

# SETS

# List of lists are turned into Numpy arrays to facilitate calculations in steps to follow.
all_values = np.array(all_values, dtype = float) 
print("Shape of Values:", all_values.shape)
all_labels = np.array(all_labels, dtype = float)
print("Shape of Labels:", all_labels.shape)
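The list-of-lists route above can be compared with pandas' own DataFrame.to_numpy. The sketch below uses a made-up DataFrame (not the course dataset) and checks that the two routes agree:

```python
import numpy as np
import pandas as pd

# Made-up DataFrame standing in for the course's df.
df = pd.DataFrame({"f1": [1, 2, 3], "f2": [0.5, 1.5, 2.5]})

# Course-style route: one Python list per row, then a float array.
rows = [list(df.iloc[x]) for x in range(len(df))]
via_lists = np.array(rows, dtype=float)

# Direct route: let pandas build the array.
direct = df.to_numpy(dtype=float)

print(via_lists.shape)                    # (3, 2)
print(np.array_equal(via_lists, direct))  # True
```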

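All in all, I took a very, very long detour. I was only one step away, but at the fork in the road I chose the other direction. In the end it comes down to a weak foundation. Thinking back, if I had not just learned DataFrames, I would not even have made this mistake.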