Machine Learning with MATLAB 1.1 to 2.2

1.1 Course Overview

1.2 Review - Machine Learning Onramp

Two files contain data for a selection of basketball players.

bballPlayers.txt: contains information about each player, such as position, height, and weight.
bballStats.txt: contains player statistics for each year, such as games played, points scored, and rebounds.

In this lesson, you will:

  • Import and format the data stored in the files.
  • Group statistics by player and merge the data sets into a single table.
  • Visualize various player statistics and explore features.
  • Train a classification model to predict a player’s position.
  • Evaluate the model’s performance.
1.2.1 Import Data
positions = ["G","G-F","F-G","F","F-C","C-F","C"]
Task 1

To bring data from a file named dataFile.txt into MATLAB as a table named T, you can use the readtable function.

playerInfo = readtable("bball")
Task 2

When text labels are intended to represent a finite set of possibilities, such as a player’s position, it’s more suitable to store the data as a categorical array.

You can use the categorical function to convert an array of text labels to a categorical array.

playerInfo.pos = categorical(playerInfo.pos)	%convert an array of text to a categorical array.

The function categories returns a list of all possible categories in a categorical array.

categories(playerInfo.pos)	%output: ["F-C-G","F-G-C","G","G-F","F-G","F","F-C","C-F","C"]
Task 3

Sometimes you may want to define the categories for your categorical data yourself, for example, when you are pulling data from multiple sources and not all the categories are represented in a particular set.

You can use a string array of category names as a second input to the categorical function to specify the categories.

playerInfo.pos = categorical(playerInfo.pos,positions)

Use the categories defined in the string array positions to convert the data in playerInfo.pos into a categorical.

categories(playerInfo.pos)	%output:["G","G-F","F-G","F","F-C","C-F","C"]

Labels that fall outside these categories are converted to <undefined>.

Task 4

Any data not specified by a category in positions becomes <undefined>.

To remove rows that contain an undefined or missing value from a table, you can use the rmmissing function.

playerInfo = rmmissing(playerInfo)
Task 5
allStats = readtable("bballStats.txt")
Task 6

You can remove rows or variables from a table by selecting those rows or variables and assigning an empty array to them.

For example, the following command removes every column after the 18th in allStats.

allStats(:,19:end) = []
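
The same assignment pattern deletes rows. A minimal sketch (the row range here is arbitrary, for illustration only):

allStats(1:5,:) = []	% removes the first five rows of allStats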
1.2.3 Group and Merge Data
Task 1

The groupsummary function performs grouped calculations. For example, the following command calculates the standard deviation (std) of the data in data, grouped by data.Label.

stdevData = groupsummary(data,"Label","std")

Create a table named playerStats which calculates the sum of the data in allStats grouped by "playerID".

You may leave off the semicolon to view the output.

playerStats = groupsummary(allStats,"playerID","sum")
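
If you only need grouped statistics for particular columns, groupsummary accepts a fourth input listing the data variables. A sketch, assuming allStats has columns named "points" and "rebounds":

perPlayer = groupsummary(allStats,"playerID","sum",["points","rebounds"])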
Task 2

The table playerStats has variable names similar to those in allStats, but prepended with sum_. It also has an additional variable named GroupCount.

You can access the variable names in a table using the VariableNames property of the Properties of the table.

Remember you can remove a variable from a table by assigning it the empty array [].

table.Properties.VariableNames
table.variable = []

Remove the variable GroupCount from playerStats. Then replace the variable names in playerStats with the variable names in allStats.

playerStats.GroupCount = [];
playerStats.Properties.VariableNames = allStats.Properties.VariableNames
Task 3

You can combine, or join, two tables by matching up rows with the same key variable values. The key variable playerID can be used to join the data from playerInfo and playerStats.

The innerjoin function joins two tables, and includes only observations whose key variable values appear in both tables.

Create a table named data which joins playerInfo and playerStats, and includes only players who appear in both tables.

data = innerjoin(playerInfo,playerStats)
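
By default, innerjoin uses all identically named variables as keys. You can also name the key explicitly with the "Keys" option; a sketch equivalent to the call above:

data = innerjoin(playerInfo,playerStats,"Keys","playerID")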
Further Practice
>> Join Table

This command allow you to operate in a visual window.

1.2.4 Explore Data
Task 1

You can plot some results from the basketball player data to explore relationships between various statistics and player position.

Create a box plot of player height for each position.

boxplot(data.height,data.pos)
ylabel("Height (inches)")

Task 2

It looks as though guards (G) are generally shorter than forwards (F) or centers (C). What other patterns can we find in the data?

You can use gscatter to explore the relationship between two variables, grouped by position.

Plot points against rebounds, grouped by position.

gscatter(data.rebounds,data.points,data.pos)

Task 3

You can see some groupings for different positions. However, the data’s range makes it difficult to compare players, since most players are clustered tightly around the origin.

Some players played much more than others, leading to higher total points and rebounds. One way to account for this difference is to divide points and rebounds by the number of games played.

In the table data, the variable GP contains the number of games played. You can use element-wise division (./) to calculate per game statistics.

Plot points per game against rebounds per game, grouped by player position.

gscatter(data.rebounds./data.GP,data.points./data.GP,data.pos)

Task 4

The data points are no longer clustered around the origin, but the data points for each position are still spread out. Can a different normalization yield more insight?

Another way to account for difference in play time is to divide by the number of minutes played, data.minutes.

Plot points per minute against rebounds per minute, grouped by player position.

gscatter(data.rebounds./data.minutes,data.points./data.minutes,data.pos)
1.2.5 Train a Model and Make Predictions
Task 1

The normalized (per minute) numeric statistics from the basketball player data set have been divided into a training set dataTrain and a testing set dataTest. You will train a classification model using the training set, then make predictions for the testing set.

A k-nearest neighbor (kNN) model classifies an observation as the same class as the nearest known examples. You can fit a kNN model by passing a table of data to the fitcknn function.

The second input is the name of the response variable in the table (that is, the variable you want the model to predict). The output is a variable containing the fitted model.

Fit a kNN model to the data stored in dataTrain. The known classes are the player positions, stored in the variable named "pos". Store the fitted model in a variable called knnmodel.

knnmodel = fitcknn(dataTrain,"pos")
Task 2

The predict function determines the predicted class of new observations.

The inputs are the trained model and a table of new observations with the same predictor variables as were used to train the model. The output is a categorical array containing the predicted class for each new observation.

Predict the positions for the data in dataTest. Store the predictions in a variable called predPos.

predPos = predict(knnmodel,dataTest)
Task 3

How well did the kNN model predict player position?

A commonly used metric to evaluate a model is the misclassification rate (the proportion of incorrect predictions). This metric is also called the model’s loss.

You can use the loss function to calculate the misclassification rate for a data set.

Calculate the misclassification rate for dataTest, and assign the result to the variable mdlLoss.

mdlLoss = loss(knnmodel,dataTest)
output: mdlLoss = 0.63224826

The loss value calculated by loss will differ slightly from the raw misclassification rate computed below, because by default loss weights each observation using the model's observation weights and prior class probabilities rather than simply counting errors:

allwrong = sum(predPos ~= dataTest.pos)
rate = allwrong / numel(predPos)
output: rate = 0.63186813
Task 4

The loss value indicates that over 60% of the positions were predicted incorrectly. Did the model misclassify some positions more than others?

A confusion matrix gives the number of observations from each class that are predicted to be each class. It’s commonly visualized by shading the elements according to their value, with the diagonal elements (the correct classifications) shaded in one color and the other elements (the incorrect classifications) in another color. You can visualize a confusion matrix using the confusionchart function.

confusionchart(ytrue,ypred);

ytrue is a vector of the known classes and ypred is a vector of the predicted classes.

The table dataTest contains the known player positions, which you can compare with the predicted positions, predPos.

Use the confusionchart function to compare predPos to the known labels (stored in dataTest.pos).

confusionchart(dataTest.pos,predPos)
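
If you want the raw counts rather than a chart, the confusionmat function returns the confusion matrix as a numeric array. A sketch using the same variables:

[cm,classOrder] = confusionmat(dataTest.pos,predPos)	% cm(i,j) counts true class i predicted as class j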

1.2.6 Evaluate the Model and Iterate
Task 1

By default, fitcknn fits a kNN model with k = 1. That is, the model uses the class of the single closest “neighbor” to classify a new observation.

The model’s performance may improve if the value of k is increased, that is, if the model uses the most common class among several neighbors instead of just one.

You can change the value of k by setting the "NumNeighbors" property when calling fitcknn.

mdl = fitcknn(table,"ResponseVariable", ...
    "NumNeighbors",7)

Modify the fitcknn function call on line 3. Set the "NumNeighbors" property to 5.

knnmodel = fitcknn(dataTrain,"pos","NumNeighbors",5);
Task 2

Using 5 nearest neighbors reduced the loss, but the model still misclassifies over 50% of the test data set.

Many machine learning methods use the distance between observations as a similarity measure. Smaller distances indicate more similar observations.

In the basketball data set, the statistics have different units and scales, which means some statistics will contribute more than others to the distance calculation. Centering and scaling each statistic makes them contribute more evenly.

By setting the "Standardize" property to true in the fitcknn function, each column of predictor data is normalized to have mean 0 and standard deviation 1, then the model is trained using the standardized data.

Modify line 3 again. Add to the fitcknn function call to also set the "Standardize" property to true.

knnmodel = fitcknn(dataTrain,"pos","NumNeighbors",5,"Standardize",true);
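
For intuition, here is a minimal sketch of what "Standardize",true does to the predictors before training; Xnum is a hypothetical numeric matrix of the predictor columns:

mu = mean(Xnum);	% column means
sigma = std(Xnum);	% column standard deviations
Xstd = (Xnum - mu)./sigma;	% each column now has mean 0 and standard deviation 1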
1.2.7 Course Quick Reference

2.1 Course Example - Grouping Basketball Players

2.2 Low Dimensional Visualization

2.2.3 Multidimensional Scaling

Key terms: dimension; approximate representation; Principal Component Analysis (PCA) and classical multidimensional scaling; orthogonal coordinate system; Euclidean distance; Manhattan distance (city block); eigenvalues.

2.2.4 Classical Multidimensional Scaling
Task 1 Calculate pairwise distances

The matrix X has 4 columns, and therefore would be best visualized using 4 dimensions. In this activity, you will use multidimensional scaling to visualize the data using fewer dimensions, while retaining most of the information it contains.

load data
whos X
  Name      Size      Bytes  Class     Attributes
  X       124x4        3968  double

To perform multidimensional scaling, you must first calculate the pairwise distances between observations. You can use the pdist function to calculate these distances.

distances = pdist(data,"euclidean");	% the second input selects the distance metric; "euclidean" is the default and may be omitted

Calculate the pairwise distances between rows of X, and name the result D.

D = pdist(X)	% uses the default Euclidean distance


The output D is a distance or dissimilarity vector containing the distance between each pair of observations.

D has length 124*(124-1)/2 = 7626.

The input X is a 124-by-4 numeric matrix containing the data. Each of the 124 rows is considered an observation.

The optional second input selects the method of calculating the distance or dissimilarity. Commonly used methods are:

"euclidean" %(default)
"cityblock"
"correlation"

Task 2 Perform multidimensional scaling

The cmdscale function finds a configuration matrix and its corresponding eigenvalues for a set of pairwise distances.

[configMat,eigVal] = cmdscale(distances);

Find the configuration matrix and its eigenvalues for the distance matrix D, and name them Y and e, respectively.

[Y,e] = cmdscale(D)



The name cmdscale stands for classical multidimensional scaling.

The input D is a distance or dissimilarity vector.

The output Y is a 124-by-q matrix of the reconstructed coordinates in q-dimensional space.

q is the minimum number of dimensions needed to achieve the given pairwise distances.

e contains the eigenvalues of the matrix Y*Y'.


You can use the eigenvalues e to determine whether a low-dimensional approximation to the points in Y provides a reasonable representation of the data. If the first p eigenvalues are significantly larger than the rest, the points are well approximated by the first p dimensions (that is, the first p columns of Y).
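
One informal way to quantify this is the cumulative proportion of the eigenvalues. A sketch, clipping the small negative eigenvalues that can arise with non-Euclidean distances:

ePos = max(e,0);	% eigenvalues, with negatives clipped at zero
explainedCMD = 100*cumsum(ePos)./sum(ePos);	% cumulative percentage
explainedCMD(1:3)	% share captured by the first 1, 2, and 3 dimensions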

Task 3 Visualize relative magnitudes of a vector

You can use the pareto function to create a Pareto chart, which visualizes relative magnitudes of a vector in descending order.

pareto(vector)

Create a Pareto chart of the eigenvalues e.

pareto(e)

In this result, the first three elements of e are distinctly larger than the others, so we can retain those dimensions without losing much information from the raw data.


Task 4 Create a scatter plot of the first two columns of a matrix M

From the Pareto chart, you can see that over 90% of the distribution is described with just two variables.

You can use the scatter function to create a scatter plot of the first two columns of a matrix M.

scatter(M(:,1),M(:,2))

Use scatter to create a scatter plot of the first two columns of Y.

scatter(Y(:,1),Y(:,2))
Task 5 Create a three-dimensional scatter plot

From the Pareto chart, notice that 100% of the distribution is described with three variables.

The scatter3 function creates a three-dimensional scatter plot. You can use scatter3 to create a scatter plot of three columns of a matrix M.

scatter3(M(:,1),M(:,2),M(:,3))

Use scatter3 to create a scatter plot of the first three columns of Y.

scatter3(Y(:,1),Y(:,2),Y(:,3))

You can rotate the axes (for example, with the view function) to examine the three-dimensional scatter plot from different angles.

2.2.5 Nonclassical Multidimensional Scaling

When you use the cmdscale function, it determines how many dimensions are returned in the configuration matrix.

To find a configuration matrix with a specified number of dimensions, you can use the mdscale function.

In fact, cmdscale also accepts a second input that specifies the number of dimensions to return.

configMat = mdscale(distances,numDims);

Calculate the pairwise distances between rows of X, and name the result D. Then find the configuration matrix in 2 dimensions of the distances and name it Y.

load data
whos X
%%%%%%%%%%
D = pdist(X)
Y = cmdscale(D,2)
% Y = mdscale(D,2)
scatter(Y(:,1),Y(:,2))


2.2.6 Principal Component Analysis (PCA)

Another commonly used method for dimensionality reduction is principal component analysis (PCA). Use the function pca to perform principal component analysis.

[pcs,scrs,~,~,pexp] = pca(data)

pca takes the raw observations as input, whereas cmdscale and mdscale require a vector of pairwise distances as input.

load data
whos X
[pcs,scrs,~,~,pexp] = pca(X)
pareto(pexp)
scatter(scrs(:,1),scrs(:,2))
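
Because pca centers the data, you can reconstruct an approximation of X from the first few components by reversing the transformation. A sketch using the first two components:

Xapprox = scrs(:,1:2)*pcs(:,1:2)' + mean(X);	% rank-2 approximation of X
relErr = norm(X - Xapprox,"fro")/norm(X,"fro")	% relative reconstruction error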

2.2.10 Basketball Players

The statsNorm variable contains numeric statistics for several basketball players, normalized to have mean 0 and standard deviation 1.

Use classical multidimensional (CMD) scaling to find the reconstructed coordinates and corresponding eigenvalues for the data in statsNorm. Plot the Pareto chart of the eigenvalues.

%This code loads and formats the data.
data = readtable("bball.txt");
data.pos = categorical(data.pos);
%This code extracts and normalizes the columns of interest.
stats = data{:,[5 6 11:end]};	% Extract columns 5, 6, and 11 through the end of 'data' into the numeric matrix 'stats'.
statsNorm = normalize(stats);	% normalize the data.
% Task 1
D = pdist(statsNorm)
[Y,e] = cmdscale(D)
pareto(e)
scatter3(Y(:,1),Y(:,2),Y(:,3))
view(100,50)	% Change the view of the three-dimensional plot.
% Task 2
[~,scores,~,~,explained] = pca(statsNorm)
pareto(explained)
scatter3(scores(:,1),scores(:,2),scores(:,3))
view(100,50)
% Task 3
scatter3(Y(:,1),Y(:,2),-Y(:,3),10,data.pos)	% negate the third coordinate to flip the plot's orientation.
c = colorbar;	% add colorbar to figure.
c.TickLabels = categories(data.pos);
scatter3(scores(:,1),scores(:,2),-scores(:,3),10,data.pos)
c = colorbar;
c.TickLabels = categories(data.pos);

(Figure: the scatter plot of the CMD values)

(Figure: the scatter plot of the PCA values)
