Lesson 5: Regression Analysis

Contents

I. Linear Regression

1. Theory

2. Code

II. Nonlinear Regression

1. Theory

2. Code

III. Nonparametric Methods

3.1 Theory

3.2 Decision Trees

3.3 Code

IV. Using the App


 

I. Linear Regression

1. Theory

2. Code

% Perform the parametric linear regression 
clear, close all
clc

%% load Advertising data: 
% Sales (in thousands of units) for a particular product as a function of advertising budgets 
% (in thousands of dollars) for TV, radio, and newspaper media.
[num,txt,raw]=xlsread('Advertising.csv');
Sales=num(:,5);
TV=num(:,2);
Radio=num(:,3);
Newspaper=num(:,4);

%% Perform univariate linear regression on TV advertising budget
%%% Regress "Sales" on "TV"
mdl_tv = fitlm(TV,Sales)
% mdl_tv.Coefficients    
% mdl_tv.RMSE
% mdl_tv.Rsquared.Ordinary

figure,
scatter(TV,Sales), hold on
plot([min(TV), max(TV)],[predict(mdl_tv,min(TV)), predict(mdl_tv,max(TV))],'k-','linewidth',2)
xlabel('TV budgets'), ylabel('Sales of product')
title(['Sales=' num2str(mdl_tv.Coefficients.Estimate(1)) '+' ...
          num2str(mdl_tv.Coefficients.Estimate(2)) '*TV',...
          ', R^2=' num2str(mdl_tv.Rsquared.Ordinary)])

%% Perform multiple linear regression on all advertising budgets
mdl_multiple = fitlm([TV,Radio,Newspaper],Sales)

%%% check the correlations between different types of budgets
[rho,p]=corr([TV,Radio,Newspaper]);



The results are as follows: the command window prints the fitlm model summaries, and the figure shows the Sales-vs-TV scatter plot with the fitted regression line and the R^2 value in the title.

scatter(TV,Sales) draws the scatter plot of Sales against TV.

mdl_tv = fitlm(TV,Sales) fits a univariate linear regression of Sales on TV.

plot([min(TV), max(TV)],[predict(mdl_tv,min(TV)), predict(mdl_tv,max(TV))],'k-','linewidth',2)

[min(TV), max(TV)] gives the x-range over which the fitted line is drawn.

The predict function takes a fitted model and new predictor values, and returns the model's predicted response.

mdl_multiple = fitlm([TV,Radio,Newspaper],Sales) fits a multiple linear regression on all three budgets.

[rho,p]=corr([TV,Radio,Newspaper]) measures the pairwise correlations between the three budget variables.

rho is the matrix of pairwise correlation coefficients between the variables; the larger its absolute value, the stronger the correlation. p is the matrix of corresponding p-values; the smaller the p-value, the more significant the correlation.
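A minimal sketch (assumed helper lines, not part of the lesson code) for labeling the rho matrix so the pairwise correlations are easier to read:

names = {'TV','Radio','Newspaper'};
rho_table = array2table(rho, 'VariableNames', names, 'RowNames', names)  % labeled correlation matrix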

 

R^2 measures goodness of fit; the closer to 1, the better. F: the larger, the better (stronger evidence against the constant-only model). pValue measures significance; the smaller, the better.
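These statistics can also be read out of the fitted model programmatically; a minimal sketch (assumed helper lines, not part of the lesson code):

R2 = mdl_tv.Rsquared.Ordinary;        % goodness of fit, closer to 1 is better
p_coef = mdl_tv.Coefficients.pValue;  % per-coefficient significance, smaller is better
[pF, F] = coefTest(mdl_tv);           % F-test of the model against the constant-only model
fprintf('R^2 = %.3f, F = %.1f, p = %.3g\n', R2, F, pF);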

 

II. Nonlinear Regression

1. Theory

2. Code

% Perform the parametric polynomial regression 
clear, close all

%% load Auto data: 
[num,txt,raw]=xlsread('Auto.csv');
mpg=num(:,1); % gas mileage in miles per gallon
horsepower=num(:,4); 

excludeind=find(isnan(horsepower));  % identify the indices with missing horsepower data
mpg(excludeind,:)=[]; % remove the observations with missing horsepower data
horsepower(excludeind,:)=[]; % remove the observations with missing horsepower data

%% perform polynomial regression
p_d1=polyfit(horsepower,mpg,1); % linear
RSS=sum((polyval(p_d1,horsepower)-mpg).^2); % residual sum of square
TSS=sum((mpg-mean(mpg)).^2); % total sum of square
R_square_d1=1-RSS/TSS;

p_d2=polyfit(horsepower,mpg,2); % Degree 2
RSS=sum((polyval(p_d2,horsepower)-mpg).^2); % residual sum of square
TSS=sum((mpg-mean(mpg)).^2); % total sum of square
R_square_d2=1-RSS/TSS;

p_d5=polyfit(horsepower,mpg,5); % Degree 5
RSS=sum((polyval(p_d5,horsepower)-mpg).^2); % residual sum of square
TSS=sum((mpg-mean(mpg)).^2); % total sum of square
R_square_d5=1-RSS/TSS;

figure,
scatter(horsepower,mpg), hold on
x=sort(horsepower);
plot(x,polyval(p_d1,x),'k-','linewidth',2)
plot(x,polyval(p_d2,x),'r-','linewidth',2)
plot(x,polyval(p_d5,x),'g-','linewidth',2)
xlabel('horsepower'), ylabel('mpg')
legend('Observations',...
             ['Linear, R^2=' num2str(R_square_d1)],...
             ['Degree 2, R^2=' num2str(R_square_d2)],...
             ['Degree 5, R^2=' num2str(R_square_d5)])

p_d2=polyfit(horsepower,mpg,2);

horsepower is the input (x-axis), mpg is the response (y-axis), and 2 is the polynomial degree, so the fit contains terms up to the quadratic.

polyval(p_d2,horsepower) evaluates the fitted polynomial at the given horsepower values, i.e. it is the prediction function.
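For example, a minimal sketch (the horsepower value 150 is hypothetical, chosen only for illustration):

mpg_d1 = polyval(p_d1, 150);   % prediction from the linear fit
mpg_d2 = polyval(p_d2, 150);   % prediction from the degree-2 fit
mpg_d5 = polyval(p_d5, 150);   % prediction from the degree-5 fit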

III. Nonparametric Methods

3.1 Theory

3.2 Decision Trees

3.3 Code

Prepare the dataset and preprocess the data.

Convert the raw cell array to a table, skipping the first (header) row:

data=cell2table(raw(2:end,:))

Assign the variable names:

for i=1:size(data,2)
    data.Properties.VariableNames{i}=raw{1,i}; % assign the VariableNames   
end
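As an aside, a one-step alternative (a sketch, assuming Hitters.csv has a header row so the column names can be read directly) is:

data = readtable('Hitters.csv');   % builds the table and variable names in one call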

Remove unusable rows (players with missing Salary):

excludeind=find(isnan(data.Salary));  % identify the indices with missing Salary data
data(excludeind,:)=[]; % remove the players with missing Salary data

Apply a log transform (MATLAB's log is the natural logarithm, base e) so that Salary has a more typical bell-shaped distribution:

data.Salary=log(data.Salary); % log-transform to have a typical bell-shape distribution

Split the dataset into training (70%) and test (30%) sets, as shown in the snippet below:
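This corresponds to the following lines in the full listing (rng only fixes the random seed and is optional):

rng(0,'twister')                               % for reproducibility (optional)
C = cvpartition(size(data,1),'holdout',0.30);  % randomly hold out 30% of players for testing
dataTrain = data(C.training,:);
dataTest = data(C.test,:);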

Build the model

Step 1:

Select six predictors and use Salary as the response; tune the tree with 5-fold cross-validation; the view function draws the tree diagram, as shown below:
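This step corresponds to the following call in the full listing (the tree's hyperparameters are optimized with 5-fold cross-validation):

predictors={'Years','Hits','RBI','PutOuts','Walks','Runs'};
tree_6v = fitrtree(dataTrain,'Salary','PredictorNames',predictors,...
    'OptimizeHyperparameters','all',...
    'HyperparameterOptimizationOptions',...
    struct('AcquisitionFunctionName','expected-improvement-plus','kfold',5));
view(tree_6v,'Mode','graph')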

Step 2: Prune the tree

Change prunelevel or prunealpha;

the tree can also be pruned interactively in the tree viewer window:
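The corresponding lines in the full listing are:

prunelevel=6;
tree_prune = prune(tree_6v,'level',prunelevel);

% prunealpha=0.03;
% tree_prune = prune(tree_6v,'alpha',prunealpha);

view(tree_prune,'Mode','graph')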

Check how important each predictor is:

imp = predictorImportance(tree_6v);
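A small sketch (assumed helper lines, not in the lesson code) for ranking the predictors by their importance scores; the top-ranked variables are the ones kept in Step 3:

[~, idx] = sort(imp, 'descend');       % sort importance scores from high to low
ranked = tree_6v.PredictorNames(idx)   % predictor names, most important first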

Step 3: Rerun the model using only the two key predictors, as in the call below:
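This corresponds to the following call in the full listing:

predictors={'Years','Hits'};  % only use the two key variables/features
tree_2v = fitrtree(dataTrain,'Salary','PredictorNames',predictors,...
    'OptimizeHyperparameters','all',...
    'HyperparameterOptimizationOptions',...
    struct('AcquisitionFunctionName','expected-improvement-plus','kfold',5));
view(tree_2v,'Mode','graph')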

The complete code:

% Perform the regression tree using "fitrtree" function
clear, close all

%% load Hitters data: Major League Baseball Data from the 1986 and 1987 seasons.
[num,txt,raw]=xlsread('Hitters.csv');

% construct Table array
data=cell2table(raw(2:end,:));
for i=1:size(data,2)
    data.Properties.VariableNames{i}=raw{1,i}; % assign the VariableNames   
end
excludeind=find(isnan(data.Salary));  % identify the indices with missing Salary data
data(excludeind,:)=[]; % remove the players with missing Salary data

data.Salary=log(data.Salary); % log-transform to have a typical bell-shape distribution

%% Separate data into training (70%) and test (30%) datasets
rng(0,'twister')  % For reproducibility; this line only fixes the random seed and can be removed

C = cvpartition(size(data,1),'holdout',0.30); % randomly hold out 30% of subjects for test
dataTrain = data(C.training,:);
dataTest = data(C.test,:);

%% [Method 1] Construct a regression tree using six variables/features
% Construct a regression tree using the training dataset
predictors={'Years','Hits','RBI','PutOuts','Walks','Runs'};
tree_6v = fitrtree(dataTrain,'Salary','PredictorNames',predictors,...
    'OptimizeHyperparameters','all',...
    'HyperparameterOptimizationOptions',...
    struct('AcquisitionFunctionName','expected-improvement-plus','kfold',5));
view(tree_6v,'Mode','graph')

% Test the tree model on the test dataset
Salary_predict=predict(tree_6v,dataTest);
MSE=loss(tree_6v,dataTest);  % mean squared error, MSE=mean((Salary_predict-dataTest.Salary).^2);

RSS=sum((Salary_predict-dataTest.Salary).^2); % residual sum of square
TSS=sum((dataTest.Salary-mean(dataTest.Salary)).^2); % total sum of square
R_square=1-RSS/TSS;

figure,plot(1:size(dataTest,1),dataTest.Salary,'r.-',1:1:size(dataTest,1),Salary_predict,'b.-')
title(['[Method 1] MSE = ' num2str(MSE) ', R^2 = ' num2str(R_square)])
legend('True salaries','Predicted salaries')
xlabel('Players'),ylabel('Salaries')

%% [Method 2] Construct a regression tree using six variables/features with pruning
prunelevel=6;
tree_prune = prune(tree_6v,'level',prunelevel);

% prunealpha=0.03;
% tree_prune = prune(tree_6v,'alpha',prunealpha);

view(tree_prune,'Mode','graph')

% Test the tree model on the test dataset
Salary_predict=predict(tree_prune,dataTest);
MSE=loss(tree_prune,dataTest);  % mean squared error, MSE=mean((Salary_predict-dataTest.Salary).^2);

RSS=sum((Salary_predict-dataTest.Salary).^2); % residual sum of square
TSS=sum((dataTest.Salary-mean(dataTest.Salary)).^2); % total sum of square
R_square=1-RSS/TSS;

figure,plot(1:size(dataTest,1),dataTest.Salary,'r.-',1:1:size(dataTest,1),Salary_predict,'b.-')
title(['[Method 2] MSE = ' num2str(MSE) ', R^2 = ' num2str(R_square)])
legend('True salaries','Predicted salaries')
xlabel('Players'),ylabel('Salaries')

%% Variable/Feature selection based on the importance scores
imp = predictorImportance(tree_6v);

figure;
bar(imp);
title('Predictor Importance Estimates');
ylabel('Estimates');
xlabel('Predictors');
h = gca;
h.XTickLabel = tree_6v.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

%% [Method 3] Construct a regression tree using key variables/features
% Construct a regression tree using the training dataset
predictors={'Years','Hits'};  % only use the two key variables/features 
tree_2v = fitrtree(dataTrain,'Salary','PredictorNames',predictors,...
    'OptimizeHyperparameters','all',...
    'HyperparameterOptimizationOptions',...
    struct('AcquisitionFunctionName','expected-improvement-plus','kfold',5));
view(tree_2v,'Mode','graph')

% Test the tree model on the test dataset
Salary_predict=predict(tree_2v,dataTest);
MSE=loss(tree_2v,dataTest);  % mean squared error, MSE=mean((Salary_predict-dataTest.Salary).^2);

RSS=sum((Salary_predict-dataTest.Salary).^2); % residual sum of square
TSS=sum((dataTest.Salary-mean(dataTest.Salary)).^2); % total sum of square
R_square=1-RSS/TSS;

figure,plot(1:size(dataTest,1),dataTest.Salary,'r.-',1:1:size(dataTest,1),Salary_predict,'b.-')
title(['[Method 3] MSE = ' num2str(MSE) ', R^2 = ' num2str(R_square)])
legend('True salaries','Predicted salaries')
xlabel('Players'),ylabel('Salaries')
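As a side note, the MSE/R^2 evaluation block above is repeated verbatim for each of the three trees. A possible refactor, shown only as a sketch (not part of the original lesson; MATLAB scripts allow local functions at the end of the file since R2016b), is to wrap it in a helper:

function [MSE, R_square] = evaluateTree(tree, dataTest)
% Evaluate a fitted regression tree on the held-out test set
Salary_predict = predict(tree, dataTest);
MSE = loss(tree, dataTest);                                % mean squared error
RSS = sum((Salary_predict - dataTest.Salary).^2);          % residual sum of squares
TSS = sum((dataTest.Salary - mean(dataTest.Salary)).^2);   % total sum of squares
R_square = 1 - RSS/TSS;
end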

IV. Using the App

The app (MATLAB's Regression Learner) is well suited for getting familiar with the basic models, but because the models come pre-packaged you cannot tune their parameters in detail. If you really want to run a model and inspect or adjust all of its parameters, you still need to write the code yourself.

The specific steps for using the app are shown below.
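Assuming MATLAB's Regression Learner app (shipped with Statistics and Machine Learning Toolbox), it can be opened from the command line; the data and response variable are then chosen in the New Session dialog:

regressionLearner   % opens the Regression Learner app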

 

 

After starting a new session, you are taken to the New Session screen, as shown below:

As shown next, you can also be greedy and select All to train every model type.

After clicking on any trained model, you can view its various result plots:
