MATLAB knnimpute function: Impute missing data using the nearest-neighbor method

The function knnimpute replaces NaNs in the input data with the corresponding value from the nearest-neighbor column. Consider the following matrix.

A = [1 2 5;4 5 7;NaN -1 8;7 6 0]

A = 4×3

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0

A(3,1) is NaN. Because column 2 is the closest column to column 1 by Euclidean distance, knnimpute replaces the (3,1) entry with the corresponding entry from column 2, which is -1.

results = knnimpute(A)

results = 4×3

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0
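
The claim that column 2 is the closest column to column 1 can be checked by hand. The following is a minimal sketch, not the internal implementation of knnimpute: it assumes the distance is evaluated only over the rows that contain no NaN, and the variable names (completeRows, C, d12, d13, filled) are illustrative.

```matlab
% Sketch: compare column 1 of A with columns 2 and 3 by Euclidean distance.
% Assumption: only rows without any NaN take part in the distance.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

completeRows = ~any(isnan(A), 2);        % rows 1, 2, and 4
C = A(completeRows, :);

d12 = sqrt(sum((C(:,1) - C(:,2)).^2));   % sqrt(3),  about 1.73
d13 = sqrt(sum((C(:,1) - C(:,3)).^2));   % sqrt(74), about 8.60

% Column 2 is nearer, so its entry in row 3 fills the missing value.
filled = A;
if d12 < d13
    filled(3,1) = A(3,2);
else
    filled(3,1) = A(3,3);
end
```

Under these assumptions, filled(3,1) is -1, which agrees with the output of knnimpute(A) shown above.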

The data must have at least one row without any NaN values for knnimpute to work. If every row contains a NaN value, you can append a row in which every observation (column) has the same value, then call knnimpute on the updated matrix. Each NaN is then replaced with the average of the non-missing values in its row.

B = [NaN 2 1; 3 NaN 1; 1 8 NaN]

B = 3×3

   NaN     2     1
     3   NaN     1
     1     8   NaN

B(4,:) = ones(1,3)

B = 4×3

   NaN     2     1
     3   NaN     1
     1     8   NaN
     1     1     1

imputed = knnimpute(B)

imputed = 4×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
    1.0000    1.0000    1.0000

You can then remove the added row.

imputed(4,:) = []

imputed = 3×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
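
The imputed values in this example equal the mean of the non-missing entries in each row. As a rough cross-check, the sketch below reproduces the same numbers with base MATLAB only. It does not call knnimpute and does not claim to mirror how the function works internally; rowMeans, M, and filled are illustrative names.

```matlab
% Sketch: fill each NaN with the mean of the non-missing values in its row.
B = [NaN 2 1; 3 NaN 1; 1 8 NaN];

rowMeans = mean(B, 2, 'omitnan');        % [1.5; 2; 4.5]
M = repmat(rowMeans, 1, size(B, 2));     % one copy of the row mean per column

filled = B;
missing = isnan(B);
filled(missing) = M(missing);            % same values as the imputed matrix above
```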

Load a sample biological data set and impute the missing values in yeastvalues, where each row represents a gene and each column represents an experimental condition or observation.

load yeastdata

Remove data for empty spots where gene labels are set to 'EMPTY'.

emptySpots = strcmp('EMPTY',genes);

yeastvalues(emptySpots,:) = [];

knnimpute uses the next nearest column if the corresponding value from the nearest-neighbor column is also NaN. However, if every value in a row is NaN, the function generates a warning for that row and leaves the row unchanged in the returned output rather than deleting it. The sample data contains some rows with all NaNs. Remove those rows to avoid the warnings.

yeastvalues(~any(~isnan(yeastvalues),2),:) = [];
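
The double negation in that index can be hard to read. The sketch below uses a small made-up matrix X (not part of the yeast data) to show that ~any(~isnan(X),2) selects the same rows as all(isnan(X),2), that is, the rows in which every entry is NaN.

```matlab
% Sketch: two equivalent masks for rows that consist entirely of NaN.
X = [1 NaN 3; NaN NaN NaN; 4 5 NaN];     % hypothetical data; only row 2 is all NaN

mask1 = ~any(~isnan(X), 2);              % form used in the example above
mask2 = all(isnan(X), 2);                % equivalent, more direct form

sameMask = isequal(mask1, mask2);        % true
X(mask2, :) = [];                        % X keeps rows 1 and 3
```

Rows that are only partly NaN, such as rows 1 and 3 here, are kept so that knnimpute can fill them.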

Impute missing values.

imputedData1 = knnimpute(yeastvalues);

Check whether any NaN values remain after imputation.

sum(any(isnan(imputedData1),2))

ans = 0

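Use the 5 nearest-neighbor columns instead of a single nearest column. knnimpute then fills each NaN with a weighted mean of the corresponding values from those columns.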

imputedData2 = knnimpute(yeastvalues,5);

Change the distance metric to the Minkowski distance.

imputedData3 = knnimpute(yeastvalues,5,'Distance','minkowski');

You can also specify a parameter for the distance metric. For instance, specify a different exponent (say 5) for the Minkowski distance.

imputedData4 = knnimpute(yeastvalues,5,'Distance','minkowski','DistArgs',5);
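
To make the role of the exponent concrete, here is a small sketch of the Minkowski distance between two vectors, d = (sum(|x_i - y_i|^p))^(1/p). The anonymous function minkowski and the vectors x and y are illustrative and are not part of knnimpute. With p = 2 the formula reduces to the Euclidean distance.

```matlab
% Sketch: Minkowski distance between two vectors for a given exponent p.
minkowski = @(u, v, p) sum(abs(u - v).^p)^(1/p);

x = [1 4 7];
y = [5 7 0];

d2 = minkowski(x, y, 2);   % same as the Euclidean distance, sqrt(74)
d5 = minkowski(x, y, 5);   % a larger exponent gives more weight to the largest difference
```

As p grows, the distance approaches the largest absolute coordinate difference (7 for these vectors), so a larger exponent makes the neighbor ranking depend more on the single biggest discrepancy between columns.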
