Artificial Intelligence: Assignment 1
1 Problem Setup
The performance metric is the root mean squared error (RMSE).
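As a reminder of what this metric computes, here is a minimal sketch of RMSE in plain Python. The function name rmse and the sample numbers are made up for illustration and are not taken from the housing dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical house values and predictions, for illustration only.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 350000.0]
example_rmse = rmse(actual, predicted)
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, the single 30,000 miss weighs more heavily on the result than the two 10,000 misses combined.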
2 Getting the Data
(1) The code for loading the dataset is:
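import os
import pandas as pd

# HOUSING_PATH (the directory that contains housing.csv) is assumed to be defined earlier.
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()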
(2) The dataset has 20,640 rows in total. The type of each attribute is shown in the figure below. The total_bedrooms attribute has missing values.
(3) ocean_proximity takes 5 distinct values:
(4) The following call shows the mean, maximum, and minimum of every numeric attribute in the dataset:
housing.describe()
The result is as follows:
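(5) The following code plots a histogram of every numeric attribute in the dataset:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
# save_fig is assumed to be a helper defined elsewhere in the notebook
save_fig("attribute_histogram_plots")
plt.show()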
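How to tell that an attribute's values were capped: in most numeric distributions the number of samples thins out toward the upper end of the range, so very few samples should sit near the upper boundary. If the bar at the maximum value is unusually tall instead, the attribute was capped at that value. Two attributes are capped in this way: housing_median_age and median_house_value.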
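The visual check described above can also be approximated numerically. The sketch below flags an attribute as capped when an unusually large share of samples sits exactly at the maximum. The function name looks_capped, the 2% threshold, and the sample values are all assumptions chosen for illustration.

```python
from collections import Counter

def looks_capped(values, threshold=0.02):
    """Return True if the share of samples equal to the maximum exceeds threshold."""
    counts = Counter(values)
    top = max(counts)  # largest value present in the data
    return counts[top] / len(values) > threshold

# Hypothetical data: three of ten samples pile up at the maximum of 52.
ages_capped = [10, 15, 20, 25, 30, 35, 40, 52, 52, 52]
# Hypothetical data: every value occurs once, so nothing piles up at the top.
ages_uncapped = list(range(1, 101))

print(looks_capped(ages_capped), looks_capped(ages_uncapped))
```

A suitable threshold depends on the dataset size and on how finely the attribute is measured, so the histogram remains the more reliable check.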
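A "heavy-tailed" histogram is one that extends much farther to the right of the median than to the left. Because models generally give better results when the data are roughly bell-shaped, such attributes need to be transformed.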
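(6) There are generally two approaches: random sampling and stratified sampling. Stratified sampling is the better choice here. The dataset is large and its income categories differ in size, so a purely random split may leave the high-income and low-income groups under-represented and the middle-income group over-represented. This sampling bias would harm the later predictions.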
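The idea behind stratified sampling can be sketched in plain Python: split each category separately, so that the test set keeps the category proportions of the full dataset. The function stratified_split and the income categories below are hypothetical. scikit-learn provides StratifiedShuffleSplit for real use.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio, seed=42):
    """Return (train_idx, test_idx), taking test_ratio of each stratum for the test set."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(labels):
        by_stratum[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_stratum):
        indices = by_stratum[label]
        rng.shuffle(indices)  # random choice within the stratum
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical income categories: 10% low, 80% middle, 10% high.
income_cat = ["low"] * 10 + ["mid"] * 80 + ["high"] * 10
train_idx, test_idx = stratified_split(income_cat, test_ratio=0.2)
test_counts = {c: sum(income_cat[i] == c for i in test_idx)
               for c in ("low", "mid", "high")}
print(test_counts)
```

With a purely random 20% split, the number of "low" or "high" samples in the test set would vary from run to run. Here each category contributes exactly its share.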
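3 Exploring the Data
(1) The test set should not be explored. The test set must stay independent of the training set: the model is trained to predict future, unseen data, not already-known data such as the test set. If the test set is examined and used to guide decisions, the trained model may score well on it, but that score says nothing about predictions on future data and leads to a wrong judgment of the model's quality.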
(2) The corr() method computes the standard correlation coefficient (Pearson correlation coefficient) between every pair of attributes: corr_matrix = housing.corr(). Alternatively, the Pandas scatter_matrix function can be used:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")
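To make the coefficient returned by corr() concrete, the following sketch computes Pearson's r by hand. The function pearson_r and the toy numbers are illustrative assumptions, not values from the housing data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: one perfectly increasing and one perfectly decreasing relationship.
income = [1.0, 2.0, 3.0, 4.0, 5.0]
value_up = [10.0, 20.0, 30.0, 40.0, 50.0]
value_down = [50.0, 40.0, 30.0, 20.0, 10.0]

print(pearson_r(income, value_up), pearson_r(income, value_down))
```

The coefficient ranges from -1 to 1 and only measures linear relationships, which is why the scatter matrix is a useful complement.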
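(3) After the three attributes below are added, the correlation coefficients between each attribute and the median house value are as shown in the figure: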
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
The four attributes most strongly correlated with the median house value (by absolute value of the coefficient) are median_income, bedrooms_per_room, rooms_per_household, and latitude.
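4 Preparing the Data
(1) The test set must go through the same processing, because the training set and the test set need to have the same data distribution.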
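(2) There are three ways to handle missing values: drop the corresponding rows (samples) with the DataFrame dropna() method; drop the whole attribute with the DataFrame drop() method; or fill in a value (0, the mean, the median, and so on) with the DataFrame fillna() method. Here the missing values are filled with the median.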
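median = housing["total_bedrooms"].median()
# option 3: assign the result back rather than calling fillna(inplace=True) on a column selection
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)
sample_incomplete_rows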
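One detail worth spelling out is that the median is computed on the training data only and then reused for the test data. A minimal sketch, assuming missing values are represented as None; the helper names and sample numbers are hypothetical:

```python
import statistics

def fit_median(column):
    """Median of the non-missing entries of a column."""
    return statistics.median(v for v in column if v is not None)

def fill_missing(column, fill_value):
    """Return a copy of the column with None replaced by fill_value."""
    return [fill_value if v is None else v for v in column]

# Hypothetical total_bedrooms values for a training and a test set.
train_bedrooms = [300.0, None, 500.0, 400.0, None, 1000.0]
test_bedrooms = [None, 250.0]

median = fit_median(train_bedrooms)                # learned from the training set only
train_filled = fill_missing(train_bedrooms, median)
test_filled = fill_missing(test_bedrooms, median)  # same value reused on the test set
print(median, test_filled)
```

Reusing the training median keeps the test set from influencing any preprocessing parameter.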
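(3) A text attribute has no median, and many of the later data-processing algorithms work on numeric attributes, so ocean_proximity needs to be converted to a numeric attribute.
(4) There are two approaches: the LabelEncoder transformer and the OneHotEncoder encoder. OneHotEncoder is used here; it converts the integer category values into one-hot vectors.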
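What a one-hot vector looks like can be shown with a few lines of plain Python. This is only a sketch of the encoding itself, not of the scikit-learn classes. The function one_hot_encode is hypothetical, and the four input strings are sample category labels used for illustration.

```python
def one_hot_encode(values):
    """Return (categories, vectors), where each vector has a single 1 at its category's index."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors

# Sample category labels, for illustration only.
proximity = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]
categories, encoded = one_hot_encode(proximity)
print(categories)
print(encoded)
```

Unlike plain integer codes, one-hot vectors do not suggest that two categories are "close" just because their integer codes are adjacent.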
(5) The CategoricalEncoder class is used to convert the label column (this class comes from a scikit-learn development version; released versions provide OneHotEncoder and OrdinalEncoder instead). The additional attributes are added with the custom transformer below:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# column indices of the attributes used below
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household", "population_per_household",
                                     "bedrooms_per_room"])
housing_extra_attribs.head()
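(6) Yes, it was used. Feature scaling is done by min-max normalization or by standardization. Min-max normalization subtracts the minimum and then divides by the difference between the maximum and the minimum. Standardization first subtracts the mean (so standardized values always have a mean of 0) and then divides by the standard deviation, so that the resulting distribution has unit variance.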
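The two scaling formulas can be checked on a small example. The sketch below uses only the standard library. The function names and the sample values are illustrative assumptions, and the input is assumed not to be constant (otherwise both formulas divide by zero).

```python
import statistics

def min_max_scale(values):
    """(x - min) / (max - min): maps the values onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """(x - mean) / standard deviation: zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical attribute values.
rooms = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(rooms)
standardized = standardize(rooms)
print(scaled)
print(standardized)
```

Standardization does not bound the values to a fixed range, but it is much less affected by a single extreme outlier than min-max normalization.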
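(7) The pipeline code is as follows:
# Written against the older scikit-learn API used in this assignment; Imputer was later
# replaced by sklearn.impute.SimpleImputer. DataFrameSelector (a custom transformer) and
# housing_num (the numeric-only DataFrame) are assumed to be defined earlier.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, LabelBinarizer, StandardScaler

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])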
5 Exploring Models
(1) The code for training the SVM is as follows:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
The RMSE on the training set is:
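(2) The code for 10-fold cross-validation is as follows:
from sklearn.model_selection import cross_val_score
svm_scores = cross_val_score(svm_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
svm_rmse_scores = np.sqrt(-svm_scores)
# display_scores is assumed to be a helper defined elsewhere in the notebook
display_scores(svm_rmse_scores)
The results are as follows: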
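For reference, the way k-fold cross-validation partitions the data can be sketched in plain Python. Each sample is used for validation exactly once and for training in the other folds. The function k_fold_indices and the sample count of 25 are assumptions for illustration. This is a sketch of the idea, not of scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
for train_idx, val_idx in folds[:2]:
    print(len(train_idx), val_idx)
```

The scoring string "neg_mean_squared_error" returns negated MSE values because scikit-learn treats scores as "greater is better", which is why the sign is flipped before taking the square root.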
6 Fine-Tuning the Model
(1) a) Grid search code for selecting the best hyperparameters:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]
svm_reg = SVR(kernel="linear")
# train across 5 folds: (8 + 7*6) = 50 combinations, a total of 50*5 = 250 rounds of training
grid_search = GridSearchCV(svm_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', verbose=2, n_jobs=-1,
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
The best hyperparameters are:
With the best hyperparameters, the RMSE on the validation set is:
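To see how much work the grid search above involves, the grid can be expanded by hand. The sketch below copies the two sub-grids from the assignment code and enumerates every combination with itertools.product. The helper expand_grid is a hypothetical name, not a scikit-learn function.

```python
from itertools import product

param_grid = [
    {"kernel": ["linear"], "C": [10., 30., 100., 300., 1000., 3000., 10000., 30000.]},
    {"kernel": ["rbf"], "C": [1., 3., 10., 30., 100., 300., 1000.],
     "gamma": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]

def expand_grid(grid):
    """List every hyperparameter combination contained in a list of sub-grids."""
    combos = []
    for sub_grid in grid:
        keys = list(sub_grid)
        for values in product(*(sub_grid[k] for k in keys)):
            combos.append(dict(zip(keys, values)))
    return combos

combos = expand_grid(param_grid)
n_fits = len(combos) * 5  # each combination is trained once per fold (cv=5)
print(len(combos), n_fits)
```

The linear sub-grid contributes 8 combinations and the RBF sub-grid 7 x 6 = 42, so the cost grows multiplicatively with every hyperparameter added to a sub-grid.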
(1) b) Random search code for finding the hyperparameters:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

param_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(20, 200000),
    'gamma': expon(scale=1.0),
}

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, n_jobs=4, random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
rnd_search.best_params_
The best hyperparameters are: {'C': 157055.10989448498, 'gamma': 0.26497040005002437, 'kernel': 'rbf'}
The RMSE in this case is: 54767.99053704408
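The two distributions used in the random search can be imitated with the standard library to show what kind of candidates are drawn: reciprocal(20, 200000) is a log-uniform distribution, and expon(scale=1.0) is an exponential distribution with mean 1. The sketch below only illustrates the sampling step. The helper names are hypothetical, and the drawn values will not match those drawn by scipy.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_candidate(rng):
    """Draw one hyperparameter combination, mirroring param_distribs."""
    return {
        "kernel": rng.choice(["linear", "rbf"]),
        "C": sample_log_uniform(rng, 20, 200000),
        "gamma": rng.expovariate(1.0),  # exponential distribution with mean 1.0
    }

rng = random.Random(42)
candidates = [sample_candidate(rng) for _ in range(50)]  # n_iter=50
print(candidates[0])
```

A log-uniform distribution spreads the candidates evenly across orders of magnitude, which suits a hyperparameter such as C whose useful scale is not known in advance.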
(2) The code is:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
The resulting RMSE is: 47730.22690385927