Splitting the train/validation set by hash value

Jim_Sun_Jing

Category: Deep Learning. Tags: data cleaning

Article link: https://blog.csdn.net/Jim_Sun_Jing/article/details/103218691

The train set and val set can be split with NumPy:

import numpy as np
np.random.seed(42)
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

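Setting the random seed guarantees that the train/val split is identical on every run.
This stops working once new data is added or the original file is modified: the permutation changes, rows move between the val set and the train set, and part of the val set ends up being seen by the algorithm.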
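The leakage described above can be reproduced with a small sketch. It uses only the standard library and a plain list of row ids instead of a DataFrame; the helper name `split_by_shuffle` and the row counts are assumptions made for illustration, not part of the original code.

```python
import random

def split_by_shuffle(ids, test_ratio, seed=42):
    """Seeded shuffle split, in the spirit of split_train_test, on a list of ids."""
    rng = random.Random(seed)          # same seed on every call
    shuffled = list(ids)
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    # Return (train ids, test ids) as sets for easy comparison.
    return set(shuffled[test_size:]), set(shuffled[:test_size])

old_ids = list(range(1000))            # original dataset
new_ids = list(range(1100))            # same rows plus 100 new ones

_, old_test = split_by_shuffle(old_ids, 0.2)
new_train, _ = split_by_shuffle(new_ids, 0.2)

# Rows that were held out before the update but are trained on after it.
leaked = old_test & new_train
print(f"{len(leaked)} of {len(old_test)} old test rows are now in the train set")
```

Even with a fixed seed, shuffling a list of a different length gives an unrelated permutation, so most of the previously held-out rows should show up in the new train set.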

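Hands-On Machine Learning offers a way to avoid this: compute a hash of each row's identifier and split according to the hash value. A row's assignment then depends only on its own identifier, not on how many other rows are in the dataset.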

First approach:

import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
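To make the last-byte check more concrete, here is a minimal standard-library sketch. It assumes integer ids and encodes them as 8 little-endian bytes, which is meant to mimic the buffer of `np.int64` on a little-endian machine; the helper name `md5_last_byte` is hypothetical.

```python
import hashlib

def md5_last_byte(identifier):
    """Last byte (0..255) of the MD5 digest of an integer id."""
    # Assumption: 8-byte little-endian signed encoding, similar to np.int64.
    data = int(identifier).to_bytes(8, "little", signed=True)
    return hashlib.md5(data).digest()[-1]

test_ratio = 0.2
threshold = 256 * test_ratio           # 51.2, so last bytes 0..51 pass the check

for identifier in range(5):
    last = md5_last_byte(identifier)
    print(identifier, last, last < threshold)
```

Because only one byte is compared, the ratio is quantized: with `test_ratio = 0.2`, 52 of the 256 possible byte values pass, so roughly 20.3% of ids are expected in the test set rather than exactly 20%.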

Second approach:

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
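The property that matters is that a row never changes sides when more rows are added. The sketch below checks this with the standard library only; it assumes integer ids encoded as 8 little-endian bytes (to approximate `np.int64`), and the name `in_test_set` and the id ranges are illustrative assumptions.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    """True if the CRC32 of the id falls in the lowest test_ratio share of hash values."""
    # Assumption: 8-byte little-endian signed encoding, similar to np.int64.
    data = int(identifier).to_bytes(8, "little", signed=True)
    return crc32(data) < test_ratio * 2**32

old_ids = range(1000)                  # original dataset
new_ids = range(1100)                  # same rows plus 100 new ones

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}
new_train = set(new_ids) - new_test

# Every old test row is still a test row, and none has moved into the train set.
print(old_test <= new_test, old_test.isdisjoint(new_train))
```

The assignment of each id is a pure function of the id and `test_ratio`, which is why appending rows cannot move existing ones. Deleting or editing rows is also safe as long as the ids themselves stay unchanged.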

Applying the split:

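# Assumes `housing` is the DataFrame loaded earlier; work on a copy so it stays unmodified.
housing_with_id = housing.copy()
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")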
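One thing to keep in mind with this id: it is a float, and converting it to an integer (as `np.int64` does inside `test_set_check`) truncates the fractional part, so nearby locations can share an id and will always land in the same set. A small sketch with made-up coordinates, using plain `int()` as a stand-in for `np.int64`; the helper name `coarse_id` is hypothetical.

```python
def coarse_id(longitude, latitude):
    """Integer id built as in the snippet above, truncated toward zero."""
    # int() truncates toward zero; np.int64 applied to a float behaves the same way.
    return int(longitude * 1000 + latitude)

# Two hypothetical locations with the same longitude and slightly different latitude.
a = coarse_id(-122.23, 37.88)
b = coarse_id(-122.23, 37.85)
print(a, b, a == b)
```

Shared ids do not break the split, since the assignment stays stable, but they do mean the ids are not unique per row.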