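This was my first hands-on Python task. I spent a whole evening, about four hours, taking a huge detour around something that should have been done in one step, and in the end I was still left with a Warning I couldn't get rid of...
It is the major assignment for a university course: use Forward Selection to predict material properties. The course pointed us to a GitHub tutorial, "GitHub - chris-santiago/steps: A SciKit-Learn style feature selector using best subsets and stepwise regression.", whose example reads: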
import pandas as pd
from steps.forward import ForwardSelector
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(X, y)
X.loc[:, selector.best_support_]
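1. The correct answer
The code only takes three lines plus an import. In the tutorial's example, X and y should be the training set train_values and the training targets train_labels, respectively. The correct code, however, should be rewritten as: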
from steps.forward import ForwardSelector
selector = ForwardSelector(normalize=True, metric='aic')
selector.fit(train_values, train_labels)
train_values[:, selector.best_support_]
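Apart from the variable names, the only difference is that .loc has been removed: train_values is a NumPy array, not a DataFrame, so it is indexed directly.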
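To see why plain bracket indexing is enough on a NumPy array, here is a minimal sketch. The small array and the mask are made-up stand-ins for train_values and selector.best_support_, and I am assuming best_support_ behaves like a boolean mask with one entry per column, which the tutorial's usage suggests but which I have not checked against the library source.

```python
import numpy as np

# Made-up stand-in for train_values: 2 samples, 3 features.
train_values = np.array([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]])

# Assumed shape of selector.best_support_: one boolean per feature column.
best_support = np.array([True, False, True])

# A NumPy array has no .loc accessor; that attribute only exists on pandas objects.
has_loc = hasattr(train_values, "loc")

# Boolean masks work directly inside the square brackets.
selected = train_values[:, best_support]

print(has_loc)
print(selected)
```

The first and third columns are kept and the second is dropped, which is the same selection that .loc would perform on a DataFrame.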
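2. What not to do
In fact, ForwardSelector itself accepts either a NumPy array or a DataFrame as input. But I had learned how to use DataFrames only the day before, and "keenly" spotted df in the dataset-getting part earlier in the courseware: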
# Pandas Dataframe
all_labels = df['density_of_solid'].tolist()
df = df.drop(['density_of_solid'], axis=1)
df.head(n=10) # With this line you can see the first ten entries of our database
So I took it for granted that the input to this library had to be a DataFrame. Naturally, running the code produced this error:
AttributeError Traceback (most recent call last)
<ipython-input-32-dbad8d7d0bd2> in <module>
2 selector = ForwardSelector(normalize=True, metric='aic')
3 selector.fit(train_values, train_labels)
----> 4 train_values.loc[:, selector.best_support_]
AttributeError: 'numpy.ndarray' object has no attribute 'loc'
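In hindsight, a quick type check would have told me which indexing style to use. The sketch below is one way to write that check; select_columns is a hypothetical helper of my own, not part of the steps library, and the mask is again an assumed boolean list.

```python
import numpy as np
import pandas as pd


def select_columns(X, mask):
    """Return the columns of X where mask is True (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    if isinstance(X, pd.DataFrame):
        # DataFrames support label/boolean selection through .loc.
        return X.loc[:, mask]
    # Anything else is treated as a NumPy array and indexed directly.
    return np.asarray(X)[:, mask]


mask = [True, False, True]
arr = np.arange(6, dtype=float).reshape(2, 3)
frame = pd.DataFrame(arr, columns=["a", "b", "c"])

from_array = select_columns(arr, mask)
from_frame = select_columns(frame, mask)

print(type(from_array).__name__, from_array.shape)
print(type(from_frame).__name__, list(from_frame.columns))
```

Printing type(train_values) right before the failing line would have given the same hint with even less code.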
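I asked GPT and learned that this kind of code usually works with NumPy arrays, so I went back to the top and read the code line by line. Luckily, the courseware's code is well commented: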
def plot(y_train = np.empty(0), y_test = np.empty(0), predictions_train = np.empty(0), predictions_test = np.empty(0)):
# The reshape functions in the next two lines, turns each of the
# vertical NumPy array [[x]
# [y]
# [z]]
# into python lists [ x, y, z]
# This step is required to create plots with plotly like we did in the previous tutorial
y_train = y_train.reshape(1,-1).tolist()[0]
y_test = y_test.reshape(1,-1).tolist()[0]
predictions_train = predictions_train.reshape(1,-1).tolist()[0]
predictions_test = predictions_test.reshape(1,-1).tolist()[0]
k = np.arange(-50,21000).reshape(1,-1).tolist()[0]
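The reshape trick in those comments can be tried on a tiny example. This is only an illustration with made-up numbers; it also shows ravel(), which as far as I can tell gives the same flat list for a column vector.

```python
import numpy as np

# A made-up "vertical" array of shape (3, 1), like the ones the comments describe.
column = np.array([[1.5], [2.5], [3.5]])

# The courseware's approach: reshape to one row, convert to nested list, take row 0.
flat_reshape = column.reshape(1, -1).tolist()[0]

# An alternative that flattens directly.
flat_ravel = column.ravel().tolist()

print(column.shape)
print(flat_reshape)
print(flat_ravel)
```

Either way, the data reaching the plotting code is a plain Python list, which is a strong hint that everything upstream is a NumPy array rather than a DataFrame.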
So I used GPT to turn the NumPy arrays back into DataFrames:
# Convert the NumPy arrays back into DataFrames
train_values = pd.DataFrame(train_values)
train_labels = pd.DataFrame(train_labels)
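For reference, here is a slightly more explicit version of that conversion, as a sketch only. The arrays and the feature names are made up, and I am assuming the labels are stored as a column vector, as the plotting comments suggest. Giving the columns names and storing the labels as a Series are my own choices, not something the courseware or the library requires.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the real training data.
train_values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
train_labels = np.array([[0.1], [0.2], [0.3]])  # assumed column-vector shape

# Hypothetical column names; without them pandas numbers the columns 0, 1, ...
feature_names = ["feature_0", "feature_1"]
X_frame = pd.DataFrame(train_values, columns=feature_names)

# A 1-D Series is a common way to hold a single target column.
y_series = pd.Series(train_labels.ravel(), name="density_of_solid")

# Going back to NumPy later is a single call.
X_back = X_frame.to_numpy()

print(X_frame.shape, y_series.shape)
```

The round trip leaves the numbers unchanged, which is why converting at all was unnecessary in my case.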
Finally, in order to reuse the plotting code, I converted everything back into NumPy arrays:
#We will rewrite the arrays with the patches we made on the dataset by turning the dataframe back into a list of lists
all_values = [list(df.iloc[x]) for x in range(len(all_values))]
# SETS
# List of lists are turned into Numpy arrays to facilitate calculations in steps to follow.
all_values = np.array(all_values, dtype = float)
print("Shape of Values:", all_values.shape)
all_labels = np.array(all_labels, dtype = float)
print("Shape of Labels:", all_labels.shape)
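The row-by-row list comprehension works, but pandas can produce the same array in one call. The sketch below compares the two on a made-up DataFrame; it assumes every column is numeric, as it must be for dtype=float to succeed.

```python
import numpy as np
import pandas as pd

# Made-up numeric DataFrame standing in for the course dataset.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# The courseware's style: build a list of lists, then convert.
rows = [list(df.iloc[i]) for i in range(len(df))]
via_lists = np.array(rows, dtype=float)

# Direct conversion with DataFrame.to_numpy.
direct = df.to_numpy(dtype=float)

print("Shape of Values:", direct.shape)
print(np.array_equal(via_lists, direct))
```

Both routes give the same array here, so the shorter one is easier to read and harder to get wrong.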
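All in all, I took a very, very long detour. I was only one step away from the answer, but chose the other direction at the fork in the road. When it comes down to it, my fundamentals just weren't solid enough. Thinking about it afterwards, if I had never learned DataFrame, I wouldn't even have made this mistake.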