Analyzing and Predicting Boston House Prices with MATLAB


HHXC浩瀚星辰


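Column: MATLAB实战笔记 (MATLAB Practical Notes); Tags: deep learning, neural networks, linear programming, CART (classification and regression trees)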

Article link: https://blog.csdn.net/qq_42593798/article/details/116407087


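1. Data Analysis
2. BP Neural Network Prediction
3. Linear Regression Prediction
4. House Price Classification
Note: link to the detailed code and the original analysis
The Boston housing data set comes from the StatLib library maintained at Carnegie Mellon University. This article applies several regression techniques to the data and then, after recoding the dependent variable, several classification techniques. The data set has 506 instances, 13 predictor variables, and one dependent variable, the house price. The attribute information is as follows:
[Figure: table of attribute information]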

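1. Data Analysis

First, import the data, plot every variable against the house price, and draw the correlation coefficient matrix. Here corrcoef() computes the correlation coefficients, plotmatrix() draws the scatter plots of all features and the dependent variable at once, and imagesc() visualizes the correlation matrix. The code is as follows: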

%importing boston data
filename = 'boston.txt';
delimiterIn = ' ';
BO = importdata(filename,delimiterIn);
%standardizing the data
%BO=BO./std(BO,0,1);
%dividing boston data into training and test
%choosing 450 instance rows without repetition to put in training set
[M,N]= size(BO);
r = randperm(M,450);
training_data= BO(r, :);
%picking data which are not in training set and put in test data, which
%won't be used till the end
nr = setdiff( 1:M , r);
test_data= BO(nr,:);
%separate predictors and dependent variable
trainingx=  training_data(:,1:N-1);
trainingy= training_data(:,N);
testx=  test_data(:,1:N-1);
testy= test_data(:,N);
figure;
plotmatrix(BO);
saveas(gcf,'plotmatrix.png')
figure;
%finding correlations between features
R= corrcoef(training_data(:,:));
imagesc(R);
caxis([-1,1]);
colorbar
saveas(gcf,'correlationimage.png')

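The resulting plots are shown below:
[Figure: scatter-plot matrix of all variables and heat map of the correlation coefficients]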

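The scatter plots and correlation coefficients show a positive linear relationship between the dependent variable and RM, the average number of rooms. This is expected: houses with more rooms tend to be more expensive. There is also a strong negative linear relationship between the dependent variable and LSTAT, the percentage of lower-status population, which is not surprising either: lower-income residents tend to live in cheaper neighborhoods.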
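Reading the strongest relationships off a heat map can be error-prone, so it may help to rank the features by their correlation with the target. The following is a minimal sketch on synthetic data, not the article's code: the matrix `data`, the feature names `x1` to `x3`, and the coefficients used to build `y` are all made up for illustration. With the real data one would pass `BO` instead.

```matlab
% Synthetic stand-in for BO (assumption): 3 features plus a target in the last column.
rng(0);                                  % fix the seed so the run is repeatable
n  = 200;
x1 = randn(n,1);                         % positively related to the target
x2 = randn(n,1);                         % negatively related to the target
x3 = randn(n,1);                         % unrelated noise feature
y  = 3*x1 - 2*x2 + 0.5*randn(n,1);
data  = [x1 x2 x3 y];
names = {'x1','x2','x3'};                % hypothetical feature names

R = corrcoef(data);                      % full correlation matrix
rTarget = R(1:end-1, end);               % correlation of each feature with the target
[~, order] = sort(abs(rTarget), 'descend');

% Print the features from the strongest to the weakest linear relationship.
for k = order'
    fprintf('%-4s r = %+.3f\n', names{k}, rTarget(k));
end
```

Sorting by absolute value keeps strong negative relationships, such as the one reported for LSTAT, near the top of the list.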

2. BP Neural Network Prediction

%%Import Data 
formatSpec = '%8f%7f%8f%3f%8f%8f%7f%8f%4f%7f%7f%7f%7f%f%[^\n\r]'; 
fileID = fopen(filename,'r'); 
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'ReturnOnError', false); 
fclose(fileID); 
housing = table(dataArray{1:end-1}, 'VariableNames', {'VarName1','VarName2','VarName3','VarName4','VarName5','VarName6','VarName7','VarName8','VarName9',... 
'VarName10','VarName11','VarName12','VarName13','VarName14'}); 
%Delete the file and clear temporary variables 
clearvars filename formatSpec fileID dataArray ans; 
%%delete housing.txt 
 
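%%Read into a Table 
%attribute names of the Boston housing data; the last one (MEDV) is the target
inputNames = {'CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT'};
outputNames = {'MEDV'};
housingAttributes = [inputNames,outputNames];
housing.Properties.VariableNames = housingAttributes; 
X = housing{:,inputNames}; 
y = housing{:,outputNames};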

len = length(y);
index = randperm(len);%random permutation of 1..len

%%Build the training and test sets
%training set: first 70%
p_train = X(index(1:round(len*0.7)),:);%training inputs
t_train = y(index(1:round(len*0.7)),:);%training targets
%test set: remaining 30%
p_test = X(index(round(len*0.7)+1:end),:);%test inputs
t_test =y(index(round(len*0.7)+1:end),:);%test targets

%%Normalization
%normalize the inputs (mapminmax works row-wise, hence the transpose)
[pn_train,ps1] = mapminmax(p_train');
pn_test = mapminmax('apply',p_test',ps1);
%normalize the targets
[tn_train,ps2] = mapminmax(t_train');
%tn_test = mapminmax('apply',t_test',ps2);

%%Neural network
%create and train
net = feedforwardnet(5,'trainlm');%one hidden layer with 5 neurons, Levenberg-Marquardt training
net.trainParam.epochs = 5000;%maximum number of training epochs
net.trainParam.goal=0.0000001;%performance goal (target error)
[net,tr]=train(net,pn_train,tn_train);%train the network
%simulation
b=sim(net,pn_test);%network outputs for the normalized test inputs

%%Undo the normalization
predict_prices = mapminmax('reverse',b,ps2);

%%Result analysis
t_test = t_test';
err_prices = t_test-predict_prices;
[mean(err_prices) std(err_prices)]
figure(1);
plot(t_test);
hold on;
plot(predict_prices,'r');
xlim([1 length(t_test)]);
hold off;
legend({'Actual','Predicted'})
xlabel('Test data point');
ylabel('Median house price');

Model structure used:
[Figure: structure of the neural network]

The results are as follows:
[Figure: actual vs. predicted median house prices]
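The normalization step in the code above relies on mapminmax(), which by default rescales each row to the range [-1, 1], and the test set reuses the minimum and maximum learned from the training set. The sketch below reproduces that mapping by hand on toy numbers so the arithmetic is visible. The matrices are invented for illustration, the code assumes implicit expansion (MATLAB R2016b or later), and it does not handle a row whose minimum equals its maximum.

```matlab
% Rows are features and columns are samples, matching mapminmax's layout.
p_train = [1 2 3 4 5; 10 20 30 40 50];   % toy training inputs (2 features, 5 samples)
p_test  = [2.5 6; 15 30];                % toy test inputs; 6 lies outside the training range

xmin = min(p_train, [], 2);              % per-feature minimum, from training data only
xmax = max(p_train, [], 2);              % per-feature maximum, from training data only

% Forward mapping to [-1, 1]: y = 2*(x - xmin)/(xmax - xmin) - 1
pn_train = 2*(p_train - xmin)./(xmax - xmin) - 1;
pn_test  = 2*(p_test  - xmin)./(xmax - xmin) - 1;   % reuse the training min/max

% Reverse mapping, the hand-written counterpart of mapminmax('reverse', ...)
p_back = (pn_test + 1).*(xmax - xmin)/2 + xmin;

disp(pn_train);
disp(pn_test);
```

Because the test value 6 exceeds the training maximum of 5, it maps to 1.5, outside [-1, 1]. This is expected whenever the scaling is learned on the training set alone.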

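3. Linear Regression Prediction

Linear regression is a model that assumes each feature affects the dependent variable linearly; it is fitted by least squares. In MATLAB, fitlm() fits the model and feval() computes the predictions. The code is as follows: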


%linear model on all 13 predictors
lm = fitlm(trainingx,trainingy,'linear')
testyhatL= feval(lm,testx);
%test mean squared error
testerrorL= mean((testy-testyhatL).^2);
%refit without INDUS (column 3) and AGE (column 7)
keep = setdiff(1:N-1,[3 7]);
trainingxnew= trainingx(:,keep);
testxnew= testx(:,keep);
lmnew = fitlm(trainingxnew,trainingy,'linear');
testyhatLnew= feval(lmnew,testxnew);
testerrorLnew= mean((testy-testyhatLnew).^2);

%quadratic model: adds squared terms and pairwise products of the predictors
lmQ = fitlm(trainingx,trainingy,'quadratic')
testyhatLQ= feval(lmQ,testx);
testerrorLQ= mean((testy-testyhatLQ).^2);

figure
plot(testy)
hold on
plot(testyhatLQ)
hold off
saveas(gcf,'testy and y hat of LQ.png')

The predicted and actual values are compared below:
[Figure: actual vs. predicted test values for the quadratic model]
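The test error of the linear regression is 24.13, which is higher than that of the regression tree. R-squared is 0.736, meaning the model explains about 73.6% of the variability of the response around its mean; this is acceptable, but not as good as the tree on this data set. I also checked the p-values of the INDUS and AGE features: they are very high, 0.76 and 0.68, which suggests these features carry little information and can be dropped. I therefore fitted a second linear model without them (lmnew in the code). Its test error is 28.57, not far from the previous one.

I also fitted a quadratic regression, which additionally uses the pairwise products and the squares of the variables as predictors. It is still a linear regression, but in a higher-dimensional feature space; projected back onto the original variables, the fitted surface is quadratic rather than linear. The test error drops to 13.47, much better than simple linear regression and as good as the tree. Its R-squared is also very high, 0.929.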
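To make the "linear regression in a larger space" idea concrete, here is a small sketch that builds the quadratic design matrix by hand and solves the least-squares problem with the backslash operator instead of fitlm(). The data, the coefficients used to generate it, and the helper names `linD` and `quadD` are assumptions made for illustration; this is not the author's code and does not reproduce the numbers reported above.

```matlab
% Synthetic data (assumption): two predictors and a response with curvature.
rng(1);
n = 120;
X = rand(n,2)*4 - 2;                                  % predictors on [-2, 2]
y = 1 + 2*X(:,1) - X(:,2) + 1.5*X(:,1).*X(:,2) ...
      + 0.8*X(:,1).^2 + 0.1*randn(n,1);

idx = randperm(n);
tr  = idx(1:90);                                      % training rows
te  = idx(91:end);                                    % test rows

% Design matrices: intercept + linear terms, and the quadratic expansion
% (intercept, linear terms, pairwise product, squares).
linD  = @(X) [ones(size(X,1),1), X];
quadD = @(X) [ones(size(X,1),1), X, X(:,1).*X(:,2), X.^2];

bL = linD(X(tr,:))  \ y(tr);                          % least-squares coefficients, linear
bQ = quadD(X(tr,:)) \ y(tr);                          % least-squares coefficients, quadratic

mseL = mean((y(te) - linD(X(te,:))*bL).^2);           % test mean squared error, linear
mseQ = mean((y(te) - quadD(X(te,:))*bQ).^2);          % test mean squared error, quadratic
fprintf('test MSE linear = %.3f, quadratic = %.3f\n', mseL, mseQ);

% Nominal number of terms in a full quadratic model with p predictors:
% 1 intercept + p linear + p*(p-1)/2 pairwise products + p squares.
p = 13;
nTerms = 1 + p + p*(p-1)/2 + p;
fprintf('quadratic terms for p = %d: %d\n', p, nTerms);
```

With 13 predictors the quadratic model nominally has 105 terms, fitted here on 450 training rows, so a high training R-squared should be read together with the test error.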

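4. House Price Classification

Now the dependent variable is recoded so that classification techniques can be applied. The y variable is the median house value; houses valued below 15 are labeled 1 (cheap), those between 15 and 30 are labeled 2 (medium), and those above 30 are labeled 3 (expensive). A model is then built from the training data to predict the class of unseen houses.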

%changing y from continuous values to values from the set 1,2,3 (which represents cheap, medium and expensive houses)
BOS=BO;
price= BO(:,N);
BOS(price<15,N)=1;
BOS(price>=15 & price<=30,N)=2;
BOS(price>30,N)=3;
training_dataC= BOS(r, :);
nr = setdiff( 1:M , r);
test_dataC= BOS(nr,:);
trainingxC=  training_dataC(:,1:N-1);
trainingyC= training_dataC(:,N);
testxC=  test_dataC(:,1:N-1);
testyC= test_dataC(:,N);
mdlknn= fitcknn(trainingxC,trainingyC,'NumNeighbors',5,'Standardize',1);
testyChat= predict(mdlknn, testxC);

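%confusion matrix on the test set (rows: true class, columns: predicted class)
C= confusionmat(testyC,testyChat)
%resubstitution loss and 5-fold cross-validation loss for k = 1..10
loss= ones(1,10);
kloss=ones(1,10);
for i= 1:10
    mdlknn= fitcknn(trainingxC,trainingyC,'NumNeighbors',i,'Standardize',1);
    loss(i) = resubLoss(mdlknn);
    CVmdlknn = crossval(mdlknn,'KFold',5);
    kloss(i) = kfoldLoss(CVmdlknn);
end
figure
plot(1:10,loss,'r');
hold on
plot(1:10,kloss);
hold off
legend({'Resubstitution loss','5-fold CV loss'})
xlabel('Number of neighbors k')
saveas(gcf,'loss and kloss for knn different ks.png');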

This yields the confusion matrix and the loss curves:
[Figure: confusion matrix and loss versus k]

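In this technique, for each point we look at its k nearest neighbors and assign the class that is most common among them. The model is built with fitcknn() using k=5, and predict() produces the predictions. A confusion matrix is used for evaluation; its dimension is the number of classes. An integer l at entry (i, j) means that l data points belonging to class i were classified as class j. The sum of the diagonal entries is therefore the number of correctly classified points, and the sum of the off-diagonal entries is the number of misclassified points. For this model and data set there were 7 misclassifications.

But why 5 neighbors, when k can be any positive integer? To look for the best value, I computed the resubstitution error with resubLoss() and the k-fold cross-validation error with kfoldLoss() for every k from 1 to 10. As expected, the resubstitution error is overly optimistic. As the figure shows, the differences between the values of k are not pronounced on this data, so k=5 can be kept.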
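The neighbor vote and the way a confusion matrix is read can be illustrated without any toolbox. The sketch below is a simplified stand-in for fitcknn() and confusionmat(), not their implementation: it uses plain Euclidean distance with no standardization, breaks voting ties by taking the smallest label (the behavior of mode()), and runs on a toy data set invented for the example.

```matlab
% Toy training set (assumption): two well-separated clusters with labels 1 and 2.
Xtr = [0 0; 0 1; 1 0; 1 1; 5 5; 5 6; 6 5; 6 6];
ytr = [1; 1; 1; 1; 2; 2; 2; 2];
Xte = [0.5 0.5; 5.5 5.5; 2 2];
yte = [1; 2; 2];                 % the last point is labeled 2 but sits next to cluster 1
k   = 3;

yhat = zeros(size(Xte,1),1);
for i = 1:size(Xte,1)
    d = sqrt(sum((Xtr - Xte(i,:)).^2, 2));   % Euclidean distance to every training point
    [~, nearest] = sort(d);                  % indices from nearest to farthest
    yhat(i) = mode(ytr(nearest(1:k)));       % majority vote among the k nearest labels
end

% Confusion matrix: entry (i,j) counts points of true class i predicted as class j.
nClass   = 2;
C        = accumarray([yte yhat], 1, [nClass nClass]);
nCorrect = trace(C);                         % diagonal = correctly classified
nWrong   = sum(C(:)) - nCorrect;             % off-diagonal = misclassified
disp(C);
fprintf('correct = %d, misclassified = %d\n', nCorrect, nWrong);
```

The third test point is misclassified because all three of its nearest neighbors belong to class 1, so it shows up as an off-diagonal entry in row 2, column 1.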