Machine Learning with MATLAB 1.1 to 2.2
1.1 Course Overview
1.2 Review - Machine Learning Onramp
Two files contain data for a selection of basketball players.

| bballPlayers.txt | bballStats.txt |
|---|---|
| This file contains information about each player, such as position, height, and weight. | This file contains player statistics for each year, such as games played, points scored, and rebounds. |
In this lesson, you will:
- Import and format the data stored in the files.
- Group statistics by player and merge the data sets into a single table.
- Visualize various player statistics and explore features.
- Train a classification model to predict a player’s position.
- Evaluate the model’s performance.
1.2.1 Import Data
positions = ["G","G-F","F-G","F","F-C","C-F","C"]
Task 1
To bring data from a file named dataFile.txt into MATLAB as a table named T, you can use the readtable function.
playerInfo = readtable("bballPlayers.txt")
Task 2
When text labels are intended to represent a finite set of possibilities, such as a player’s position, it’s more suitable to store the data as a categorical array.
You can use the categorical function to convert an array of text labels to a categorical array.
playerInfo.pos = categorical(playerInfo.pos) % convert the text labels to a categorical array
The categories function returns a list of all the categories in a categorical array.
categories(playerInfo.pos) %output: ["F-C-G","F-G-C","G","G-F","F-G","F","F-C","C-F","C"]
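Relatedly, the summary function lists each category with its number of occurrences, which helps spot rare or unexpected labels:
summary(playerInfo.pos) % displays a count for each category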
Task 3
Sometimes you may want to define the categories for your categorical data yourself, for example when you are pulling data from multiple sources and not all the categories are represented in a particular set.
You can use a string array of category names as a second input to the categorical function to specify the categories.
playerInfo.pos = categorical(playerInfo.pos,positions)
Use the categories defined in the string array positions to convert the data in playerInfo.pos into a categorical array.
categories(playerInfo.pos) %output:["G","G-F","F-G","F","F-C","C-F","C"]
Labels that do not match any category in positions are converted to <undefined>.
Task 4
Any data not specified by a category in positions becomes <undefined>.
To remove rows that contain an undefined or missing value from a table, you can use the rmmissing function.
playerInfo = rmmissing(playerInfo) % remove rows whose position is <undefined>
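To see how many rows will be dropped, you can count the undefined entries before calling rmmissing (isundefined is a standard function for categorical arrays):
nnz(isundefined(playerInfo.pos)) % number of positions that fell outside the defined categories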
Task 5
allStats = readtable("bballStats.txt")
Task 6
You can remove rows or variables from a table by first selecting those rows or variables and then assigning them an empty array.
For example, the following command removes everything after the 18th column of allStats.
allStats(:,19:end) = []
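An equivalent approach (assuming R2018a or later) is the removevars function, which deletes table variables by position or name:
allStats = removevars(allStats,19:width(allStats)); % same effect as assigning []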
1.2.3 Group and Merge Data
Task 1
The groupsummary function performs grouped calculations. For example, the following command calculates the standard deviation ("std") of the data in data, grouped by data.Label.
stdevData = groupsummary(data,"Label","std")
Create a table named playerStats that calculates the sum of the data in allStats, grouped by "playerID".
You may leave off the semicolon to view the output.
playerStats = groupsummary(allStats,"playerID","sum")
Task 2
The table playerStats has variable names similar to those in allStats, but prepended with sum_. It also has an additional variable named GroupCount.
You can access the variable names in a table using the VariableNames property of the table's Properties.
Remember you can remove a variable from a table by assigning it the empty array [].
table.Properties.VariableNames
table.variable = []
Remove the variable GroupCount from playerStats. Then replace the variable names in playerStats with the variable names in allStats.
playerStats.GroupCount = [];
playerStats.Properties.VariableNames = allStats.Properties.VariableNames
Task 3
You can combine, or join, two tables by matching up rows with the same key variable values. The key variable playerID can be used to join the data from playerInfo and playerStats.
The innerjoin function joins two tables, including only observations whose key variable values appear in both tables.
Create a table named data that joins playerInfo and playerStats, including only players who appear in both tables.
data = innerjoin(playerInfo,playerStats)
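By default, innerjoin matches rows using all identically named variables, which here is playerID. To make the key explicit, you can pass the documented "Keys" option:
data = innerjoin(playerInfo,playerStats,"Keys","playerID")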
Further Practice
>> Join Table
This command opens an interactive window in which you can join tables visually.
1.2.4 Explore Data
Task 1
You can plot some results from the basketball player data to explore relationships between various statistics and player position.
Create a box plot of player height for each position.
boxplot(data.height,data.pos)
ylabel("Height (inches)")
Task 2
It looks as though guards (G) are generally shorter than forwards (F) or centers (C). What other patterns can we find in the data?
You can use gscatter to explore the relationship between two variables, grouped by position.
Plot points against rebounds, grouped by position.
gscatter(data.rebounds,data.points,data.pos)
Task 3
You can see some groupings for different positions. However, the data’s range makes it difficult to compare players, since most players are clustered tightly around the origin.
Some players played much more than others, leading to higher total points and rebounds. One way to account for this difference is to divide points and rebounds by the number of games played.
In the table data, the variable GP contains the number of games played. You can use element-wise division (./) to calculate per-game statistics.
Plot points per game against rebounds per game, grouped by player position.
gscatter(data.rebounds./data.GP,data.points./data.GP,data.pos)
Task 4
The data points are no longer clustered around the origin, but the data points for each position are still spread out. Can a different normalization yield more insight?
Another way to account for differences in play time is to divide by the number of minutes played, data.minutes.
Plot points per minute against rebounds per minute, grouped by player position.
gscatter(data.rebounds./data.minutes,data.points./data.minutes,data.pos)
1.2.5 Train a Model and Make Predictions
Task 1
The normalized (per-minute) numeric statistics from the basketball player data set have been divided into a training set dataTrain and a testing set dataTest. You will train a classification model using the training set, then make predictions for the testing set.
A k-nearest neighbor (kNN) model classifies an observation as the same class as the nearest known examples. You can fit a kNN model by passing a table of data to the fitcknn function. The second input is the name of the response variable in the table (that is, the variable you want the model to predict). The output is a variable containing the fitted model.
Fit a kNN model to the data stored in dataTrain. The known classes are the player positions, stored in the variable named "pos". Store the fitted model in a variable called knnmodel.
knnmodel = fitcknn(dataTrain,"pos")
Task 2
The predict function determines the predicted class of new observations. The inputs are the trained model and a table of observations with the same predictor variables as were used to train the model. The output is a categorical array of the predicted class for each observation in the new data.
Predict the positions for the data in dataTest. Store the predictions in a variable called predPos.
predPos = predict(knnmodel,dataTest)
Task 3
How well did the kNN model predict player position?
A commonly used metric to evaluate a model is the misclassification rate (the proportion of incorrect predictions). This metric is also called the model's loss.
You can use the loss function to calculate the misclassification rate for a data set.
Calculate the misclassification rate for dataTest, and assign the result to the variable mdlLoss.
mdlLoss = loss(knnmodel,dataTest)
output : mdlLoss = 0.63224826
The loss value calculated this way differs slightly from the rate obtained by counting incorrect predictions directly:
allwrong = sum(predPos ~= dataTest.pos)
rate = allwrong / numel(predPos)
output : rate = 0.63186813
The small discrepancy most likely arises because loss applies the model's default loss function and observation weights rather than a plain unweighted error count (see the loss documentation for details).
Task 4
The loss value indicates that over 60% of the positions were predicted incorrectly. Did the model misclassify some positions more than others?
A confusion matrix gives the number of observations from each class that are predicted to be each class. It's commonly visualized by shading the elements according to their value, with the diagonal elements (the correct classifications) shaded in one color and the other elements (the incorrect classifications) in another. You can visualize a confusion matrix using the confusionchart function.
confusionchart(ytrue,ypred);
ytrue is a vector of the known classes and ypred is a vector of the predicted classes.
The table dataTest contains the known player positions, which you can compare with the predicted positions, predPos.
Use the confusionchart function to compare predPos to the known labels (stored in dataTest.pos).
confusionchart(dataTest.pos,predPos)
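If you want the raw counts rather than a chart, the confusionmat function returns the confusion matrix as a numeric array:
cm = confusionmat(dataTest.pos,predPos) % rows are true classes, columns are predicted classes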
1.2.6 Evaluate the Model and Iterate
Task 1
By default, fitcknn fits a kNN model with k = 1. That is, the model uses the class of the single closest "neighbor" to classify a new observation.
The model's performance may improve if the value of k is increased, so that it uses the most common class of several neighbors instead of just one.
You can change the value of k by setting the "NumNeighbors" property when calling fitcknn.
mdl = fitcknn(table,"ResponseVariable", ...
"NumNeighbors",7)
Modify the fitcknn function call on line 3. Set the "NumNeighbors" property to 5.
knnmodel = fitcknn(dataTrain,"pos","NumNeighbors",5);
Task 2
Using 5 nearest neighbors reduced the loss, but the model still misclassifies over 50% of the test data set.
Many machine learning methods use the distance between observations as a similarity measure. Smaller distances indicate more similar observations.
In the basketball data set, the statistics have different units and scales, which means some statistics will contribute more than others to the distance calculation. Centering and scaling each statistic makes them contribute more evenly.
By setting the "Standardize" property to true in the fitcknn function, each column of predictor data is normalized to have mean 0 and standard deviation 1, and the model is then trained using the standardized data.
Modify line 3 again. Add to the fitcknn function call to also set the "Standardize" property to true.
knnmodel = fitcknn(dataTrain,"pos","NumNeighbors",5,"Standardize",true);
1.2.7 Course Quick Reference
2.1 Course Example - Grouping Basketball Players
2.2 Low Dimensional Visualization
2.2.3 Multidimensional Scaling
Key terms:
- dimension
- approximate representation
- Principal Component Analysis (PCA) and classical multidimensional scaling
- orthogonal coordinate system
- Euclidean distance
- Manhattan distance (city block)
- eigenvalues
2.2.4 Classical Multidimensional Scaling
Task 1 Calculate pairwise distances
The matrix X has 4 columns, and therefore would be best visualized using 4 dimensions. In this activity, you will use multidimensional scaling to visualize the data using fewer dimensions, while retaining most of the information it contains.
load data
whos X
| Name | Size | Bytes | Class | Attributes |
|---|---|---|---|---|
| X | 124x4 | 3968 | double | |
To perform multidimensional scaling, you must first calculate the pairwise distances between observations. You can use the pdist function to calculate these distances.
distances = pdist(data,"distance"); % "distance" stands for the metric name, e.g. "euclidean"
Calculate the pairwise distances between rows of X, and name the result D.
D = pdist(X) % the default metric is "euclidean"
output :
The output D is a distance (dissimilarity) row vector containing the distance between each pair of observations. Its length is 124*(124-1)/2 = 7626.
The input X is a 124-by-4 numeric matrix containing the data. Each of the 124 rows is considered an observation.
The optional second input specifies the method of calculating the distance or dissimilarity. Commonly used methods are:
"euclidean" % (default)
"cityblock"
"correlation"
Task 2 Perform multidimensional scaling
The cmdscale function finds a configuration matrix and its corresponding eigenvalues for a set of pairwise distances.
[configMat,eigVal] = cmdscale(distances);
Find the configuration matrix and its eigenvalues for the distances D, and name them Y and e, respectively.
[Y,e] = cmdscale(D)
output :
cmdscale stands for classical multidimensional scaling.
The input D is a distance or dissimilarity vector.
The output Y is a 124-by-q matrix of the reconstructed coordinates in q-dimensional space, where q is the minimum number of dimensions needed to achieve the given pairwise distances.
e contains the eigenvalues of the matrix Y*Y'.
You can use the eigenvalues e to determine whether a low-dimensional approximation of the points in Y provides a reasonable representation of the data. If the first p eigenvalues are significantly larger than the rest, the points are well approximated by the first p dimensions (that is, the first p columns of Y).
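To make "significantly larger" concrete, you can compare the leading eigenvalues to the total. A rough sketch (for non-Euclidean distances some trailing eigenvalues can be negative, so treat the ratio as a heuristic):
p = 3;
sum(e(1:p))/sum(e) % fraction of the total eigenvalue mass captured by the first p dimensions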
Task 3 Visualize the relative magnitudes of a vector
You can use the pareto function to create a Pareto chart, which visualizes the relative magnitudes of a vector in descending order.
pareto(vector)
Create a Pareto chart of the eigenvalues e.
pareto(e)
In this result, the first 3 elements of e are distinctly larger than the others, so we can retain the corresponding dimensions without losing too much of the information in the raw data.
Task 4 Create a scatter plot of the first two columns of a matrix
From the Pareto chart, you can see that over 90% of the distribution is described by just two variables.
You can use the scatter function to create a scatter plot of the first two columns of a matrix M.
scatter(M(:,1),M(:,2))
Use scatter to create a scatter plot of the first two columns of Y.
scatter(Y(:,1),Y(:,2))
Task 5 Create a three-dimensional scatter plot
From the Pareto chart, notice that 100% of the distribution is described with three variables.
The scatter3 function creates a three-dimensional scatter plot. You can use scatter3 to create a scatter plot of three columns of a matrix M.
scatter3(M(:,1),M(:,2),M(:,3))
Use scatter3 to create a scatter plot of the first three columns of Y.
scatter3(Y(:,1),Y(:,2),Y(:,3))
You can rotate the axes to see a different view of the three-dimensional scatter plot.
2.2.5 Nonclassical Multidimensional Scaling
When you use the cmdscale function, it determines how many dimensions are returned in the configuration matrix.
To find a configuration matrix with a specified number of dimensions, you can use the mdscale function. (In fact, cmdscale also accepts a second input that specifies the number of dimensions to return.)
configMat = mdscale(distances,numDims);
Calculate the pairwise distances between rows of X, and name the result D. Then find the configuration matrix of the distances in 2 dimensions and name it Y.
load data
whos X
%%%%%%%%%%
D = pdist(X)
Y = cmdscale(D,2)
% Y = mdscale(D,2)
scatter(Y(:,1),Y(:,2))
2.2.6 Principal Component Analysis (PCA)
Another commonly used method for dimensionality reduction is principal component analysis (PCA). Use the pca function to perform principal component analysis.
[pcs,scrs,~,~,pexp] = pca(data) % pcs: component coefficients, scrs: scores (coordinates), pexp: percent of variance explained
Note that pca takes the raw observations as input, whereas cmdscale and mdscale require pairwise distances as input.
load data
whos X
[pcs,scrs,~,~,pexp] = pca(X)
pareto(pexp)
scatter(scrs(:,1),scrs(:,2))
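Because pca centers the data by default, the scores and coefficients reconstruct the original matrix up to floating-point error (assuming X has full column rank), which you can verify:
reconstructed = scrs*pcs' + mean(X); % scores times transposed coefficients, plus the column means
max(abs(reconstructed - X),[],"all") % should be near zero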
2.2.10 Basketball Players
The statsNorm variable contains numeric statistics for several basketball players, normalized to have mean 0 and standard deviation 1.
Use classical multidimensional (CMD) scaling to find the reconstructed coordinates and corresponding eigenvalues for the data in statsNorm. Plot the Pareto chart of the eigenvalues.
%This code loads and formats the data.
data = readtable("bball.txt");
data.pos = categorical(data.pos);
%This code extracts and normalizes the columns of interest.
stats = data{:,[5 6 11:end]}; % Extract columns 5, 6, and 11 through the end of 'data' into the matrix 'stats'.
statsNorm = normalize(stats); % Normalize each column to mean 0 and standard deviation 1.
% Task 1
D = pdist(statsNorm)
[Y,e] = cmdscale(D)
pareto(e)
scatter3(Y(:,1),Y(:,2),Y(:,3))
view(100,50) % Change the view of the three-dimensional plot.
% Task 2
[~,scores,~,~,explained] = pca(statsNorm)
pareto(explained)
scatter3(scores(:,1),scores(:,2),scores(:,3))
view(100,50)
% Task 3
scatter3(Y(:,1),Y(:,2),-Y(:,3),10,data.pos) % Negate the third coordinate so the CMD plot's orientation matches the PCA plot.
c = colorbar; % add colorbar to figure.
c.TickLabels = categories(data.pos);
scatter3(scores(:,1),scores(:,2),-scores(:,3),10,data.pos)
c = colorbar;
c.TickLabels = categories(data.pos);
The scatter plot of the CMD values.
The scatter plot of the PCA values.