The function knnimpute replaces NaNs in the input data with the corresponding value from the nearest-neighbor column. Consider the following matrix.
A = [1 2 5;4 5 7;NaN -1 8;7 6 0]
A = 4×3
1 2 5
4 5 7
NaN -1 8
7 6 0
A(3,1) is NaN. Because column 2 is the closest column to column 1 in Euclidean distance, knnimpute replaces the (3,1) entry with the corresponding entry from column 2, which is -1.
results = knnimpute(A)
results = 4×3
1 2 5
4 5 7
-1 -1 8
7 6 0
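To see why column 2 counts as the nearest neighbor of column 1, you can compute the column distances by hand. The following sketch uses only base MATLAB and assumes that distances are measured over the rows that contain no NaN values (rows 1, 2, and 4 here). The variable names are illustrative and are not part of knnimpute.

```matlab
A = [1 2 5;4 5 7;NaN -1 8;7 6 0];

% Assumption: only rows without NaN values enter the distance computation.
completeRows = ~any(isnan(A),2);
C = A(completeRows,:);

% Euclidean distance from column 1 to each of the other columns.
d12 = norm(C(:,1) - C(:,2));   % sqrt(1 + 1 + 1)   = 1.7321
d13 = norm(C(:,1) - C(:,3));   % sqrt(16 + 9 + 49) = 8.6023

% Column 2 is closer, so it supplies the replacement for A(3,1).
replacement = A(3,2);
```

Under this assumption, d12 is smaller than d13, which is consistent with knnimpute filling A(3,1) with -1.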
The data must have at least one row without any NaN values for knnimpute to work. If all rows have NaN values, you can add a row where every observation (column) has identical values and call knnimpute on the updated matrix to replace the NaN values with the average of all column values for a given row.
B = [NaN 2 1; 3 NaN 1; 1 8 NaN]
B = 3×3
NaN 2 1
3 NaN 1
1 8 NaN
B(4,:) = ones(1,3)
B = 4×3
NaN 2 1
3 NaN 1
1 8 NaN
1 1 1
imputed = knnimpute(B)
imputed = 4×3
1.5000 2.0000 1.0000
3.0000 2.0000 1.0000
1.0000 8.0000 4.5000
1.0000 1.0000 1.0000
You can then remove the added row.
imputed(4,:) = []
imputed = 3×3
1.5000 2.0000 1.0000
3.0000 2.0000 1.0000
1.0000 8.0000 4.5000
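As a check on the result above, the following sketch computes the mean of the non-NaN entries in each row of the original matrix using base MATLAB only. It does not call knnimpute; it only illustrates that the imputed values 1.5, 2, and 4.5 equal the row averages. The name rowMeans is illustrative.

```matlab
B = [NaN 2 1; 3 NaN 1; 1 8 NaN];

% Mean of each row, ignoring NaN values.
rowMeans = mean(B,2,'omitnan');

% Fill each NaN with the mean of its own row.
filled = B;
[r,~] = find(isnan(B));
filled(isnan(B)) = rowMeans(r);
```

Here rowMeans is [1.5; 2; 4.5], and filled matches the imputed matrix shown above.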
Load a sample biological data set and impute missing values in yeastvalues, where each row represents a gene and each column represents an experimental condition or observation.
load yeastdata
Remove data for empty spots where gene labels are set to 'EMPTY'.
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
knnimpute uses the next nearest column if the corresponding value from the nearest-neighbor column is also NaN. However, if every value in a row is NaN, the function generates a warning for that row and keeps the row in the returned output instead of deleting it. The sample data contains some rows with all NaNs. Remove those rows to avoid the warnings.
yeastvalues(~any(~isnan(yeastvalues),2),:) = [];
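The expression ~any(~isnan(X),2) selects the rows in which every entry is NaN, which is the same as all(isnan(X),2). A minimal sketch on a small hypothetical matrix X (not part of the yeast data) shows the equivalence:

```matlab
% Hypothetical data: row 2 contains only NaN values.
X = [1 NaN 3; NaN NaN NaN; 4 5 6];

allNaN1 = ~any(~isnan(X),2);   % form used in this example
allNaN2 = all(isnan(X),2);     % equivalent form

sameRows = isequal(allNaN1,allNaN2);

% Remove the all-NaN rows; rows with only some NaN values are kept.
X(allNaN2,:) = [];
```

After the deletion, X still contains the first row with its single NaN, which knnimpute can then fill.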
Impute missing values.
imputedData1 = knnimpute(yeastvalues);
Check whether any NaN values remain after imputing the data.
sum(any(isnan(imputedData1),2))
ans = 0
Use a 5-nearest-neighbor search so that each missing value is imputed from the five nearest columns.
imputedData2 = knnimpute(yeastvalues,5);
Change the distance metric to the Minkowski distance.
imputedData3 = knnimpute(yeastvalues,5,'Distance','minkowski');
You can also specify a parameter for the distance metric. For instance, specify a different exponent (such as 5) for the Minkowski distance.
imputedData4 = knnimpute(yeastvalues,5,'Distance','minkowski','DistArgs',5);
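To illustrate what the exponent changes, the following sketch evaluates the standard Minkowski distance, (sum of |x - y|^p)^(1/p), for two hypothetical vectors in base MATLAB. The function handle minkowskiDist and the vectors x and y are illustrative and are not part of knnimpute.

```matlab
% Standard Minkowski distance with exponent p (illustrative helper).
minkowskiDist = @(x,y,p) sum(abs(x - y).^p)^(1/p);

% Hypothetical column vectors.
x = [1; 4; 7];
y = [3; 5; 2];

d2 = minkowskiDist(x,y,2);   % exponent 2 is the Euclidean distance
d5 = minkowskiDist(x,y,5);   % larger exponents emphasize the largest difference

% Cross-check against the built-in vector p-norm.
check2 = norm(x - y,2);
check5 = norm(x - y,5);
```

With an exponent of 2, the result equals the Euclidean distance. As the exponent grows, the distance approaches the largest absolute coordinate difference, so which column is nearest can change.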