1. Dataset splitting
# method-01: split with sklearn's train_test_split (PyTorch and Keras offer similar utilities)
from sklearn.model_selection import train_test_split
import numpy as np
sample_data = np.random.random(size=(100, 5))
# print(sample_data)
train, test = train_test_split(sample_data, train_size=0.8)
# print(train)
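For classification data, train_test_split can also preserve class proportions in both splits via its stratify parameter. A minimal sketch, assuming a hypothetical imbalanced label array y (not part of the original example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.random(size=(100, 5))
y = np.array([0] * 70 + [1] * 30)  # hypothetical labels: 70% class 0, 30% class 1

# stratify=y keeps the 70/30 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)          # (80, 5) (20, 5)
print((y_tr == 1).sum(), (y_te == 1).sum())  # 24 and 6: the 30% ratio is preserved
```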
# method-02: split by sampling row indices with numpy
# replace=False samples without replacement, so no row appears twice in the train set
indices = np.random.choice(len(sample_data), size=int(0.8 * len(sample_data)), replace=False)
train_np, test_np = sample_data[indices], np.delete(sample_data, indices, axis=0)
print(indices)
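Since PyTorch is mentioned above but not shown, here is a minimal sketch of the same 80/20 split using torch.utils.data.random_split, assuming PyTorch is installed:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split

# wrap the numpy array in a Dataset so torch utilities can index it
data = torch.from_numpy(np.random.random(size=(100, 5)))
dataset = TensorDataset(data)

# the lengths must sum to len(dataset); random_split shuffles the indices internally
train_ds, test_ds = random_split(dataset, [80, 20])
print(len(train_ds), len(test_ds))  # 80 20
```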
2. Normalization (min-max scaling to a target range, usually [0, 1])
test_large_data = [123, 234, 46, 209, 345, 99, 560, 850]
test_small_data = [0.000023, 0.00056, 0.0043, 0.00094, 0.00013, 0.0049, 0.00082, 0.0031]
def normalize(x):
    return (np.array(x) - np.min(x)) / (np.max(x) - np.min(x))
print(normalize(test_large_data))
print(normalize(test_small_data))
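The hand-rolled normalize above computes the same thing as sklearn's MinMaxScaler. A sketch using the large-value test data (sklearn is already imported earlier in this section):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# scalers expect a 2-D array of shape (n_samples, n_features)
column = np.array([123, 234, 46, 209, 345, 99, 560, 850], dtype=float).reshape(-1, 1)

scaler = MinMaxScaler()                # default feature_range=(0, 1)
scaled = scaler.fit_transform(column)  # column-wise (x - min) / (max - min)
print(scaled.ravel())
```

The smallest value maps to 0 and the largest to 1, matching the output of normalize(test_large_data).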
3. Standardization (standardization / z-score)
Geometrically, standardization first shifts the origin onto the mean and then rescales, so it involves exactly two operations: a translation and a scaling. As a result, for each attribute (each column) the values cluster around 0 with a variance of 1.
test_stand_data = [24354651, 24354632, 24354613, 24354694, 24354625, 24354656, 24354687]
def standardize(x):
    return (np.array(x) - np.mean(x)) / np.std(x)
print(standardize(test_stand_data))
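The standardize function above is what sklearn's StandardScaler does per column; a sketch on the same data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# scalers expect shape (n_samples, n_features), so reshape the 1-D list
column = np.array(
    [24354651, 24354632, 24354613, 24354694, 24354625, 24354656, 24354687],
    dtype=float,
).reshape(-1, 1)

z = StandardScaler().fit_transform(column)  # (x - mean) / std, per column
print(z.ravel())  # values cluster around 0 with unit standard deviation
```

Note that StandardScaler divides by the population standard deviation (ddof=0), the same as np.std, so it matches standardize exactly.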