1.ALiPy
Official site: http://parnec.nuaa.edu.cn/huangsj/alipy/
GitHub: https://github.com/NUAA-AL/ALiPy
CSDN: https://blog.csdn.net/weixin_44575152/article/details/100783835
2.modAL
Official documentation: https://modal-python.readthedocs.io/en/latest/index.html
GitHub: https://github.com/modAL-python/modAL
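2.1 modAL basics
modAL is an active learning (AL) framework built on top of scikit-learn, the machine learning package. You can get started quickly with its built-in methods, and you can also plug in your own models. modAL currently supports scikit-learn, Keras, and PyTorch models.
2.1.1 Basic workflow
1. Import the packages
2. Load the data
3. Define and initialize the learner. Two parameters matter most:
·The model (a scikit-learn estimator)
·The query strategy (it can be user-defined): if none is given, the default, maximum uncertainty sampling, is used.
4. Query instances from the pool
5. Label them by hand
6. Retrain the model on the queried instances and their new labels
A summary of the methods used in each of these steps:
https://modal-python.readthedocs.io/en/latest/content/models/ActiveLearner.html#initialization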
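#1. Import the packages
from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling
from sklearn.ensemble import RandomForestClassifier
#2. Load the data (the strings below are placeholders for real arrays)
X_training="labelled data"
y_training="labels of the labelled data"
X_pool="unlabelled data"
#3. Define and initialize the learner
learner = ActiveLearner(
    #the learner wraps the model
    estimator=RandomForestClassifier(),
    #query strategy used by the learner; if it is not set here, the learner's default strategy is used
    #query_strategy=entropy_sampling,
    #if training data is passed, the model is fitted automatically when the learner is initialized; without it, every attempt so far has raised an error
    X_training=X_training, y_training=y_training
)
#4. Query instances from the pool
query_idx, query_inst = learner.query(X_pool)
#5. Label them by hand
#label the queried instances and store the labels in y_new
#6. Retrain the model on the queried instances and their new labels
learner.teach(X_pool[query_idx], y_new)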
Note: some remarks on the X_training argument of ActiveLearner() above. The documentation describes it as follows.
X_training:
If the model hasn't been fitted yet it is None, otherwise it contains the samples which the model has been trained on. If provided, the method fit() of estimator is called during __init__()
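To make the six steps concrete, here is a hand-rolled sketch of the same loop using only NumPy and scikit-learn. It does not call modAL and is not modAL's implementation; the iris dataset, the seed size of 10, the 5 rounds, and the use of the true labels as a stand-in for the human annotator are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Step 2: a small labelled seed set; everything else is the "unlabelled" pool.
initial_idx = rng.choice(len(X), size=10, replace=False)
X_labelled, y_labelled = X[initial_idx], y[initial_idx]
X_pool = np.delete(X, initial_idx, axis=0)
y_pool = np.delete(y, initial_idx, axis=0)  # hidden labels, play the annotator

# Step 3: fit the model on the seed set.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_labelled, y_labelled)

for _ in range(5):
    # Step 4: query the instance the model is least sure about.
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)
    query_idx = int(np.argmax(uncertainty))

    # Step 5: "label" it (here: look up the hidden label).
    y_new = y_pool[query_idx]

    # Step 6: add it to the labelled set, retrain, and drop it from the pool.
    X_labelled = np.vstack([X_labelled, X_pool[query_idx]])
    y_labelled = np.append(y_labelled, y_new)
    model.fit(X_labelled, y_labelled)
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
```

In modAL, step 4 corresponds to learner.query() and step 6 to learner.teach(); the bookkeeping of the pool stays with the caller in both cases.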
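2.1.2 Defining the query strategy
Using the default query strategy
learner = ActiveLearner(
    #the learner wraps the model
    estimator=RandomForestClassifier(),
    #query strategy used by the learner; if it is not set here, the learner's default strategy is used
    #query_strategy=entropy_sampling,
    #if training data is passed, the model is fitted automatically when the learner is initialized; without it, every attempt so far has raised an error
    X_training=X_training, y_training=y_training
)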
Using a custom query strategy
The query strategy is set when the learner is defined, so only the learner definition needs to change.
Using one of the framework's built-in strategies
Using entropy sampling:
from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling
from sklearn.ensemble import RandomForestClassifier
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    #query strategy used by the learner; if it is not set here, the learner's default strategy is used
    query_strategy=entropy_sampling,
    X_training=X_training, y_training=y_training
)
A fully custom strategy
A query strategy in modAL is a function taking (at least) two arguments (an estimator object [the trained model] and a pool of examples [the unlabelled dataset]), outputting the index of the queried instance and the instance itself.
import numpy as np
#a user-defined random sampling (query) function
#classifier must be in the signature even though the function does not use it; X_pool is the unlabelled dataset
def random_sampling(classifier, X_pool):
    n_samples = len(X_pool)
    query_idx = np.random.choice(range(n_samples))
    return query_idx, X_pool[query_idx]
#pass the function itself as query_strategy
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=random_sampling,
    X_training=X_training, y_training=y_training
)
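Because a query strategy is an ordinary function, it can be called on its own before it is handed to a learner. The sketch below does that for random_sampling with a toy pool; the pool contents, the seed, and passing None as the classifier (which works only because this strategy ignores it) are illustrative assumptions.

```python
import numpy as np

def random_sampling(classifier, X_pool):
    # The classifier is ignored; one pool index is drawn uniformly at random.
    n_samples = len(X_pool)
    query_idx = np.random.choice(range(n_samples))
    return query_idx, X_pool[query_idx]

# Toy pool of 6 instances with 2 features each (made up for this check).
X_pool = np.arange(12).reshape(6, 2)

np.random.seed(0)
query_idx, query_inst = random_sampling(None, X_pool)
print(query_idx, query_inst)
```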
A regression example with Gaussian processes
import numpy as np
X = np.random.choice(np.linspace(0, 20, 10000), size=200, replace=False).reshape(-1, 1)
y = np.sin(X) + np.random.normal(scale=0.3, size=X.shape)
#custom query strategy function: pick the point with the largest predictive standard deviation
def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    query_idx = np.argmax(std)
    return query_idx, X[query_idx]
from modAL.models import ActiveLearner
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF
n_initial = 5
initial_idx = np.random.choice(range(len(X)), size=n_initial, replace=False)
X_training, y_training = X[initial_idx], y[initial_idx]
kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
    + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))
regressor = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel=kernel),
    query_strategy=GP_regression_std,
    X_training=X_training.reshape(-1, 1), y_training=y_training.reshape(-1, 1)
)
#active learning loop
n_queries = 10
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    print(query_idx,query_instance)
    regressor.teach(X[query_idx].reshape(1, -1), y[query_idx].reshape(1, -1))
62 [2.46024602]
36 [11.58315832]
147 [0.0080008]
152 [19.8359836]
70 [7.81478148]
10 [1.02210221]
148 [5.89658966]
21 [15.71157116]
193 [17.15971597]
167 [10.59905991]
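The reason argmax over the standard deviation is a sensible query rule is that a Gaussian process is most uncertain far away from the points it has already seen. The sketch below shows this with scikit-learn alone; the three training points, the four candidates, and the fixed kernel (optimizer=None, so the length scale is not tuned) are assumptions chosen to keep the result easy to reason about.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Three observed points close together.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()

# Fixed RBF kernel; optimizer=None keeps the hyperparameters as given.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None, alpha=1e-6)
gp.fit(X_train, y_train)

# Candidates: the three training locations plus one far-away point.
X_candidates = np.array([[0.0], [1.0], [2.0], [10.0]])
_, std = gp.predict(X_candidates, return_std=True)

# Same selection rule as GP_regression_std: largest predictive std wins.
query_idx = int(np.argmax(std))
print(query_idx, std)
```

The far-away candidate at x = 10 is selected, because the predictive standard deviation is near zero at the training locations and close to the prior value elsewhere.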
Custom strategies in full
1. A query strategy function must take two arguments (the model and the unlabelled dataset).
2. A query strategy function has two main parts: (1) a utility measure that scores each instance, and (2) instance selection.
e.g.:
def custom_query_strategy(classifier, X, a_keyword_argument=42):
    # measure the utility of each instance in the pool
    utility = utility_measure(classifier, X)
    # select the indices of the instances to be queried
    query_idx = select_instances(utility)
    # return the indices and the instances
    return query_idx, X[query_idx]
The utility measure
A utility measure takes a pool of examples and returns a one dimensional array containing the utility score for each example.
The framework's modAL.uncertainty module ships with scoring functions such as classifier_uncertainty, classifier_margin, and classifier_entropy.
So we can use a built-in scoring function directly, write our own, or combine the built-in ones linearly.
Building a utility measure by linear combination
Linear combinations use the modAL.utils.combination module.
from modAL.utils.combination import make_linear_combination, make_product
from modAL.uncertainty import classifier_uncertainty, classifier_margin
# build new utility measures as a linear combination and as a product
# linear_combination will return 1.0*classifier_uncertainty + 1.0*classifier_margin
linear_combination = make_linear_combination(
    classifier_uncertainty, classifier_margin,
    weights=[1.0, 1.0]
)
# product will return (classifier_uncertainty**0.5)*(classifier_margin**0.1)
product = make_product(
    classifier_uncertainty, classifier_margin,
    exponents=[0.5, 0.1]
)
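The arithmetic behind the two combinations can be checked by hand on a few scores. The sketch below uses plain NumPy arrays in place of the outputs of classifier_uncertainty and classifier_margin; the three score values are made up, and the code only reproduces the formulas stated in the comments above, not modAL's internals.

```python
import numpy as np

# Made-up scores for three pool instances.
uncertainty = np.array([0.10, 0.40, 0.25])
margin = np.array([0.80, 0.20, 0.50])

# Linear combination with weights [1.0, 1.0].
linear = 1.0 * uncertainty + 1.0 * margin

# Product with exponents [0.5, 0.1].
product = (uncertainty ** 0.5) * (margin ** 0.1)

print(linear)
print(product)
```

Each result is again a one dimensional array with one score per instance, so it can be fed to any selection function.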
Instance selection
Once every instance has a utility score, the query strategy decides which instances to send for labelling. The modAL.utils.selection module provides two functions for this:
multi_argmax(values, n_instances=1): selects the n_instances instances with the highest utility scores.
weighted_random(weights, n_instances=1): selects instances at random, using the supplied weights.
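The two selection ideas can be sketched in a few lines of NumPy. The helper names top_n_indices and weighted_pick are hypothetical stand-ins written for this illustration; they are not modAL functions and may differ from multi_argmax and weighted_random in return type and tie handling.

```python
import numpy as np

def top_n_indices(values, n_instances=1):
    # Indices of the n_instances largest values, best first.
    order = np.argsort(values)[::-1]
    return order[:n_instances]

def weighted_pick(weights, n_instances=1, rng=None):
    # Draw distinct indices with probability proportional to the weights.
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    p = weights / weights.sum()
    return rng.choice(len(weights), size=n_instances, replace=False, p=p)

utility = np.array([0.2, 0.9, 0.1, 0.7])

best = top_n_indices(utility, n_instances=2)
picked = weighted_pick(utility, n_instances=2, rng=np.random.default_rng(0))
print(best, picked)
```

Greedy top-n selection always returns the same instances for the same scores, while the weighted draw keeps some exploration: low-scoring instances can still be chosen, just less often.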
A custom strategy example
First, define the utility measure
from modAL.utils.combination import make_linear_combination, make_product
from modAL.uncertainty import classifier_uncertainty, classifier_margin
# build new utility measures as a linear combination and as a product
# linear_combination will return 1.0*classifier_uncertainty + 1.0*classifier_margin
linear_combination = make_linear_combination(
    classifier_uncertainty, classifier_margin,
    weights=[1.0, 1.0]
)
# product will return (classifier_uncertainty**0.5)*(classifier_margin**0.1)
product = make_product(
    classifier_uncertainty, classifier_margin,
    exponents=[0.5, 0.1]
)
Then define the query strategy function
from modAL.utils.selection import multi_argmax
def custom_query_strategy(classifier, X, n_instances=1):
    #the utility measure is the linear_combination defined above
    utility = linear_combination(classifier, X)
    #instance selection uses the framework's multi_argmax
    query_idx = multi_argmax(utility, n_instances=n_instances)
    return query_idx, X[query_idx]
Define the learner
custom_query_learner = ActiveLearner(
    estimator="a model that classifies the data",
    query_strategy=custom_query_strategy,
    X_training=X_training, y_training=y_training
)
2.1.3 Models
scikit-learn models (used directly)
As long as your classifier follows the scikit-learn API, you can use it in your modAL workflow. (Really, all it needs is a .fit(X, y) and a .predict(X) method.) For instance, the ensemble model implemented in Committee can be given to an ActiveLearner.
from modAL.models import ActiveLearner, Committee
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
# initializing the learners
n_learners = 3
learner_list = []
for _ in range(n_learners):
    learner = ActiveLearner(
        estimator=GaussianProcessClassifier(1.0 * RBF(1.0)),
        X_training=X_training, y_training=y_training,
        bootstrap_init=True
    )
    learner_list.append(learner)
# assembling the Committee
committee = Committee(learner_list)
# ensemble active learner from the Committee
ensemble_learner = ActiveLearner(
    estimator=committee
)
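To show what "follows the scikit-learn API" means in practice, here is a minimal hand-written classifier with fit, predict, and predict_proba. The class NearestCentroidProba and its distance-based probabilities are invented for this sketch; predict_proba is included on the assumption that uncertainty-based query strategies need class probabilities, and whether modAL accepts this exact class unchanged has not been verified here.

```python
import numpy as np

class NearestCentroidProba:
    """Toy classifier: one centroid per class, softmax over negative distances."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One mean vector per class.
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Distance from every sample to every centroid: shape (n_samples, n_classes).
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        scores = np.exp(-dist)
        return scores / scores.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])

clf = NearestCentroidProba().fit(X_train, y_train)
pred = clf.predict(np.array([[0.1, 0.0], [2.9, 3.0]]))
proba = clf.predict_proba(np.array([[1.5, 1.5]]))
print(pred, proba)
```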
PyTorch models
Using an untrained PyTorch model
The official example: 1. define the model, 2. wrap it in a NeuralNetClassifier, 3. pass it to ActiveLearner
import torch
from torch import nn
from skorch import NeuralNetClassifier
#1. build class for the skorch API
class Torch_Model(nn.Module):
    def __init__(self,):
        super(Torch_Model, self).__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1,32,3),
            nn.ReLU(),
            nn.Conv2d(32,64,3),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.25)
        )
        self.fcs = nn.Sequential(
            nn.Linear(12*12*64,128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128,10),
        )
    def forward(self, x):
        out = x
        out = self.convs(out)
        out = out.view(-1,12*12*64)
        out = self.fcs(out)
        return out
#2. create the classifier
device = "cuda" if torch.cuda.is_available() else "cpu"
classifier = NeuralNetClassifier(Torch_Model,
                                 criterion=nn.CrossEntropyLoss,
                                 optimizer=torch.optim.Adam,
                                 train_split=None,
                                 verbose=1,
                                 device=device)
#3. load data
import numpy as np
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from torchvision.datasets import MNIST
mnist_data = MNIST('.', download=True, transform=ToTensor())
dataloader = DataLoader(mnist_data, shuffle=True, batch_size=60000)
X, y = next(iter(dataloader))
# read training data
X_train, X_test, y_train, y_test = X[:50000], X[50000:], y[:50000], y[50000:]
X_train = X_train.reshape(50000, 1, 28, 28)
X_test = X_test.reshape(10000, 1, 28, 28)
# assemble initial data
n_initial = 1000
initial_idx = np.random.choice(range(len(X_train)), size=n_initial, replace=False)
X_initial = X_train[initial_idx]
y_initial = y_train[initial_idx]
# generate the pool
# remove the initial data from the training dataset
X_pool = np.delete(X_train, initial_idx, axis=0)[:5000]
y_pool = np.delete(y_train, initial_idx, axis=0)[:5000]
#4. build the ActiveLearner
from modAL.models import ActiveLearner
# initialize ActiveLearner
learner = ActiveLearner(
    estimator=classifier,
    X_training=X_initial, y_training=y_initial,
)
#5. query data
# the active learning loop
n_queries = 10
for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(X_pool, n_instances=100)
    # use the new data to train the model
    learner.teach(
        X=X_pool[query_idx], y=y_pool[query_idx], only_new=True,
    )
    # remove queried instance from pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
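The last two lines of the loop matter: the queried rows have to be removed from X_pool and y_pool with the same indices, otherwise the instances and labels drift out of alignment and the same instances can be queried again. The sketch below shows the bookkeeping on a tiny made-up pool; the array contents and the chosen query_idx are assumptions for illustration.

```python
import numpy as np

# Toy pool: 5 instances with 2 features, and their (hidden) labels.
X_pool = np.arange(10).reshape(5, 2)
y_pool = np.array([0, 1, 0, 1, 1])

# Pretend the query strategy returned rows 1 and 3.
query_idx = np.array([1, 3])

# Take the queried rows out first, while the indices are still valid...
X_new, y_new = X_pool[query_idx], y_pool[query_idx]

# ...then delete the same rows from both arrays.
X_pool = np.delete(X_pool, query_idx, axis=0)
y_pool = np.delete(y_pool, query_idx, axis=0)

print(X_new, y_new)
print(X_pool, y_pool)
```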
Loading a trained model for AL
Define the model
class Torch_Model(nn.Module):
    def __init__(self,vocab_size,embedding_dim,hidden_dim,output_dim,n_layers=2,bidirectional=True,dropout=0.2,pad_idx=0):
        super().__init__()
        self.embedding=nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        #bidirectional=True makes this a bidirectional LSTM
        self.lstm=nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        self.linear = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.lstm(embedded)
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        return self.linear(hidden.squeeze(0))
Load the model
vocabulary_size = len(text_field.vocab)
embedding_dim = text_field.vocab.vectors.size()[-1]
class_num = len(label_field.vocab)
device = "cuda" if torch.cuda.is_available() else "cpu"
best_model=Torch_Model(vocabulary_size,embedding_dim,hidden_dim=128,output_dim=class_num)
best_model=best_model.to(device)
best_model.load_state_dict(torch.load("model/newsLSTM.pth"))
Wrap it in a NeuralNetClassifier
classifier = NeuralNetClassifier(best_model,
                                 criterion=nn.CrossEntropyLoss,
                                 optimizer=torch.optim.Adam,
                                 train_split=None,
                                 verbose=1,
                                 device=device)
Define the learner
from modAL.models import ActiveLearner
# initialize ActiveLearner
learner = ActiveLearner(
    estimator=classifier,
    X_training=X_initial, y_training=y_initial
)
Query data with the learner
query_idx, query_instance = learner.query(X_pool, n_instances=100)
print(query_idx)
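2.2 Individual methods
Section 2.1 covered the overall workflow; this section picks out the methods used in each step and describes them.
The learner's query strategy
The learner's default query strategy is maximum uncertainty sampling, that is, it picks the instance with the highest uncertainty score.
The framework ships with three scoring functions:
modAL.uncertainty.classifier_entropy()
·The entropy of the predicted class probabilities
modAL.uncertainty.classifier_margin()
·The difference between the probabilities of the most likely and the second most likely class
modAL.uncertainty.classifier_uncertainty()
·One minus the probability of the most likely class; the instance with the largest value is the most uncertain
What the three functions do: https://modal-python.readthedocs.io/en/latest/content/query_strategies/uncertainty_sampling.html
Their parameters: https://modal-python.readthedocs.io/en/latest/content/apireference/uncertainty.html
Retraining the model after a query
The framework offers two ways to retrain:
1. fit()
·Trains the model from scratch
·Discards all earlier training data and trains only on the data passed to fit. Avoid it where possible.
2. teach()
·Besides the new data X and its labels y, the key parameter is only_new
·only_new=True: train only on the newly acquired data
·only_new=False: add the new data to the existing training set and retrain on all of it
Reference: https://modal-python.readthedocs.io/en/latest/content/models/ActiveLearner.html#initialization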
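The three scores are easy to compute by hand from a matrix of predicted class probabilities. The sketch below does so with NumPy only; the probability matrix is made up, the natural logarithm is assumed for the entropy, and the code illustrates the definitions rather than calling the modAL functions.

```python
import numpy as np

# Predicted class probabilities for three instances (rows) and three classes.
proba = np.array([
    [0.5, 0.5, 0.0],   # torn between two classes
    [0.8, 0.1, 0.1],   # fairly confident
    [1.0, 0.0, 0.0],   # fully confident
])

# Uncertainty: one minus the probability of the most likely class.
uncertainty = 1.0 - proba.max(axis=1)

# Margin: most likely minus second most likely class probability.
sorted_proba = np.sort(proba, axis=1)[:, ::-1]
margin = sorted_proba[:, 0] - sorted_proba[:, 1]

# Entropy: -sum(p * log(p)), with 0 * log(0) treated as 0.
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(proba > 0, proba * np.log(proba), 0.0)
entropy = -terms.sum(axis=1)

print(uncertainty, margin, entropy)
```

The first row has the highest uncertainty, the highest entropy, and the smallest margin, so all three views agree that it is the instance the model is least sure about. Note the direction: a large margin signals confidence, whereas large uncertainty or entropy signals doubt.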
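The difference between the retraining options comes down to which data the estimator's fit() sees. The sketch below mimics the behaviour described above with a hypothetical TinyLearner and a dummy CountingEstimator that only records sample counts. Neither class is part of modAL, and the choice not to store the new samples when only_new=True is an assumption of this sketch, not a statement about modAL's source.

```python
import numpy as np

class CountingEstimator:
    """Dummy estimator that records how many samples each fit() call saw."""
    def __init__(self):
        self.fit_sizes = []

    def fit(self, X, y):
        self.fit_sizes.append(len(X))
        return self

class TinyLearner:
    """Hypothetical stand-in that follows the fit()/teach() description."""
    def __init__(self, estimator, X_training, y_training):
        self.estimator = estimator
        self.X_training, self.y_training = X_training, y_training
        self.estimator.fit(X_training, y_training)

    def fit(self, X, y):
        # Replace the stored training data, then train on the new data only.
        self.X_training, self.y_training = X, y
        self.estimator.fit(X, y)

    def teach(self, X, y, only_new=False):
        if only_new:
            # Train on the new samples only (storage not modelled here).
            self.estimator.fit(X, y)
        else:
            # Extend the stored training set and retrain on all of it.
            self.X_training = np.vstack([self.X_training, X])
            self.y_training = np.concatenate([self.y_training, y])
            self.estimator.fit(self.X_training, self.y_training)

est = CountingEstimator()
learner = TinyLearner(est, np.zeros((5, 2)), np.zeros(5))
learner.teach(np.ones((2, 2)), np.ones(2), only_new=False)  # sees 5 + 2 samples
learner.teach(np.ones((3, 2)), np.ones(3), only_new=True)   # sees 3 samples
learner.fit(np.ones((4, 2)), np.ones(4))                    # sees 4 samples
print(est.fit_sizes)
```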
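2.3 Open questions
1. When a trained PyTorch model is loaded and passed to the learner, are the loaded parameters re-initialized while the learner is being initialized?
I asked the framework's developers about this but have not received a reply so far. From my own experiments, re-initialization seems unlikely, though I cannot rule it out.
2. teach(only_new=True) and fit() discard the earlier training data and train on the new data from scratch; do they discard the learned model parameters as well?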
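One way to probe both questions empirically is to snapshot the parameters before a call and compare them afterwards. The sketch below shows only the comparison step, on plain dictionaries of NumPy arrays; the helper names snapshot and changed_parameters and the toy parameter values are hypothetical. For a PyTorch model, such a dictionary could be built from model.state_dict() by converting each tensor with .detach().cpu().numpy().

```python
import numpy as np

def snapshot(params):
    # Deep-copy every array so later in-place updates cannot alter the snapshot.
    return {name: np.array(value, copy=True) for name, value in params.items()}

def changed_parameters(before, after, atol=0.0):
    # Names of parameters whose values differ between the two snapshots.
    return [
        name for name in before
        if not np.allclose(before[name], after[name], rtol=0.0, atol=atol)
    ]

# Toy "model parameters" standing in for a real state dict.
params = {
    "linear.weight": np.array([[0.1, 0.2], [0.3, 0.4]]),
    "linear.bias": np.array([0.0, 0.0]),
}

before = snapshot(params)

# Simulate a training step that only touches the weights.
params["linear.weight"] += 0.05

after = snapshot(params)
print(changed_parameters(before, after))
```

If the list is empty after initializing the learner, the loaded weights were left untouched. If every parameter changed to values unrelated to the loaded ones, re-initialization is the likely cause; small changes are what one would expect from continued training.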