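pandas offers a rich API for data processing, but users who need to train models with Apple's Create ML and deploy them on iOS or macOS devices have far fewer APIs to work with. Ideally, machine-learning samples would need little preprocessing, but real-world samples are likely to contain many missing values. If those missing values are not handled, model training cannot proceed at all.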
Source of the test data used in the examples:
Loading the training data with MLDataTable
import Cocoa
import CreateML
let trainFile = Bundle.main.url(forResource: "train", withExtension: "csv")!
var trainData = try MLDataTable(contentsOf: trainFile)
Processing the data manually
Getting the distribution of the data
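To compute the mode or the median by hand, you first need the distribution of the data, that is, how many times each value occurs. A simple loop over the column with a dictionary to keep the tallies is enough. A simple example follows, using the LotFrontage column: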
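let TYPE_INT = 0
let TYPE_STRING = 1
let missing = "missing" // key under which missing values are counted

// Count how many times each value occurs in a column.
func valueCounts(data: MLUntypedColumn, type: Int) -> [String: Int] {
    var vc = [String: Int]()
    for i in 0..<data.count {
        let value = data[i]
        if value.isValid {
            // Use the value itself, rendered as a string, as the dictionary key.
            if type == TYPE_INT, let n = value.intValue {
                vc[String(n), default: 0] += 1
            } else if type == TYPE_STRING, let s = value.stringValue {
                vc[s, default: 0] += 1
            }
        } else {
            // Invalid entries are tallied under the "missing" key.
            vc[missing, default: 0] += 1
        }
    }
    return vc
}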
let vc = valueCounts(data: trainData["LotFrontage"], type: TYPE_INT)
print(vc)
The output is as follows:
["32": 5, "30": 6, "68": 19, "61": 8, "118": 2, "84": 9, "50": 57, "24": 19, "110": 6, "59": 13, "49": 4, "45": 3, "96": 8, "51": 15, "85": 40, "21": 23, "56": 5, "95": 7, "74": 15, "98": 8, "78": 25, "75": 53, "79": 17, "100": 16, "46": 1, "104": 3, "86": 10, "missing": 259, "57": 12, "124": 2, "114": 2, "76": 11, "122": 2, "115": 2, "80": 69, "55": 17, "130": 3, "102": 4, "72": 17, "60": 143, "54": 6, "36": 6, "81": 6, "92": 10, "106": 1, "47": 5, "89": 6, "35": 9, "42": 4, "69": 11, "94": 6, "144": 1, "141": 1, "107": 7, "129": 2, "150": 1, "120": 7, "105": 6, "116": 2, "182": 1, "62": 9, "93": 8, "65": 44, "112": 1, "63": 17, "137": 1, "138": 1, "101": 2, "108": 3, "140": 1, "82": 12, "66": 15, "71": 12, "70": 70, "58": 7, "64": 19, "67": 12, "48": 6, "160": 1, "174": 2, "103": 3, "99": 3, "37": 5, "149": 1, "
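Once a value-count dictionary like the one above is available, the mode and median can be derived from it without touching the original column again. The following is a minimal sketch in plain Swift, not a Create ML API: `numericCounts`, `mode(of:)`, `median(of:)`, and `missingKey` are hypothetical names introduced here for illustration, and the sketch assumes every non-missing key is an integer rendered as a string, as in the LotFrontage counts.

```swift
import Foundation

// Hypothetical helpers for illustration; none of these come from Create ML.
// Assumption: missing values were tallied under the key "missing".
let missingKey = "missing"

/// Parses the dictionary into (value, count) pairs sorted by value,
/// skipping the missing-value key and any key that is not an integer.
func numericCounts(_ counts: [String: Int]) -> [(value: Int, count: Int)] {
    var result: [(value: Int, count: Int)] = []
    for (key, count) in counts {
        guard key != missingKey, let value = Int(key), count > 0 else { continue }
        result.append((value: value, count: count))
    }
    return result.sorted { $0.value < $1.value }
}

/// Most frequent non-missing value; ties resolve to the smaller value.
func mode(of counts: [String: Int]) -> Int? {
    return numericCounts(counts).max { a, b in
        a.count != b.count ? a.count < b.count : a.value > b.value
    }?.value
}

/// Median of the non-missing values, found by accumulating counts over
/// the sorted distinct values instead of expanding the full sample.
func median(of counts: [String: Int]) -> Double? {
    let sorted = numericCounts(counts)
    let total = sorted.reduce(0) { $0 + $1.count }
    guard total > 0 else { return nil }

    // 0-based positions of the middle element(s) in the expanded sample.
    let lower = (total - 1) / 2
    let upper = total / 2
    var seen = 0
    var lowerValue: Int?
    for entry in sorted {
        seen += entry.count
        if lowerValue == nil, lower < seen { lowerValue = entry.value }
        if upper < seen, let lo = lowerValue {
            return Double(lo + entry.value) / 2
        }
    }
    return nil
}

// Usage with a small made-up dictionary (not the real LotFrontage counts).
let sample = ["60": 3, "70": 2, "80": 1, "missing": 4]
print(mode(of: sample) as Any, median(of: sample) as Any)
```

Working from the counts keeps the cost proportional to the number of distinct values rather than the number of rows. Whether the mode or the median is the better fill value for a given column is a separate judgment that depends on how skewed the distribution is.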