Prototype Tool Development: Automated Traceability Generation Based on Active Learning

Dataset download: https://download.csdn.net/download/A1342772/12200967

Source code download: https://download.csdn.net/download/A1342772/12201064

Weka input format download: https://download.csdn.net/download/A1342772/12201111

1 Building the Dataset

For a given project with source artifact set S1 and target artifact set S2, traceability generation typically requires developers to identify the valid trace links among the |S1| × |S2| candidate trace links. Approaches based on Active Learning (AL) use the project's existing trace links to train a classifier and then use that classifier to identify valid trace links. Training the classifier requires features that capture the characteristics of trace links, mainly Information Retrieval (IR)-based features and Query Quality (QQ) features.
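As a minimal sketch of this search space, the snippet below enumerates the |S1| × |S2| candidate links for two toy artifact sets; the artifact names are hypothetical placeholders, not data from the tool.

# Enumerate the candidate-link search space; the artifact IDs are
# hypothetical placeholders.
from itertools import product

S1 = ["CC1", "CC2", "CC3"]  # source artifacts, e.g. classes
S2 = ["UC1", "UC2"]         # target artifacts, e.g. use cases

# Every (source, target) pair is a candidate trace link,
# so there are |S1| x |S2| = 6 candidates to classify.
candidate_links = list(product(S1, S2))
print(len(candidate_links))  # 6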

(1) IR-based features

IR techniques capture the textual similarity between artifacts and produce a list of candidate trace links ranked by similarity. A candidate link's rank in this list reflects its validity to some extent, so the rank can serve as the first group of features.
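As an illustration, the sketch below derives this rank feature from TF-IDF cosine similarities with scikit-learn (cf. the cc_Rank_tfidf/uc_Rank_tfidf attributes in the ARFF header later); the artifact texts are hypothetical.

# A sketch of the IR-based rank feature: rank each candidate link by the
# TF-IDF cosine similarity of its two artifacts. The texts are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_texts = ["login user account password", "generate summary report"]
target_texts = ["user enters password to log in",
                "system prints the summary report",
                "administrator deletes an account"]

# Fit one vocabulary over both artifact sets so the vectors are comparable.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(source_texts + target_texts)
src_vecs, tgt_vecs = tfidf[:len(source_texts)], tfidf[len(source_texts):]

# For each source artifact, rank the targets by descending similarity; the
# position of a candidate link in this list is its rank feature.
for i, sims in enumerate(cosine_similarity(src_vecs, tgt_vecs)):
    order = sims.argsort()[::-1]
    ranks = {int(t): r + 1 for r, t in enumerate(order)}
    print("source", i, "target ranks:", ranks)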

(2) QQ features

QQ metrics were used in prior work to evaluate the quality of queries. They are used here because they complement the IR-based features and provide the classifier with additional context about the trace link between two artifacts. The table below shows the detailed calculation formulas for these metrics.
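As a hedged illustration of this family of metrics, the sketch below computes three of them over a query's IDF values (cf. the cc_avgIdf, cc_maxIdf, and cc_devIdf attributes in the ARFF header later); the toy corpus and the exact IDF formula are assumptions, since the formulas vary across the QQ literature.

# Three QQ metrics over a query's IDF values: avgIdf, maxIdf, devIdf.
# The toy corpus and the exact IDF formula are illustrative assumptions.
import math
import statistics

corpus = [["user", "login", "password"],
          ["report", "summary"],
          ["user", "account"]]

def idf(term):
    # log(N / df), where df is the number of documents containing the term
    # (assumed > 0 here for simplicity).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

query = ["user", "password"]       # the artifact's terms act as the query
idfs = [idf(t) for t in query]

avg_idf = statistics.mean(idfs)    # cf. cc_avgIdf / uc_avgIdf
max_idf = max(idfs)                # cf. cc_maxIdf / uc_maxIdf
dev_idf = statistics.pstdev(idfs)  # cf. cc_devIdf / uc_devIdf
print(avg_idf, max_idf, dev_idf)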

2 Building the Initial Training Set

Between a system's source artifact set S1 and target artifact set S2 there are |S1| × |S2| candidate trace links, which form a sample set D. A certain number of samples are randomly selected from D and manually labeled, forming the labeled sample set DL; DL is the initial training set. Because the initial training set contains far more negative samples than positive ones, a rebalancing technique is needed to balance the initial training data.

(1) Building the initial training set

The initial training set for active learning is set to 3% of the dataset size.

import pandas as pd
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    data = pd.read_csv("feature.csv")
    # Split the samples into x (the 34 features) and y (the class label).
    x, y = data.iloc[:, :34], data.iloc[:, [34]]
    # Hold out 97% as the unlabeled pool; the remaining 3% is the initial training set.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.97, random_state=2)
    train_samples = pd.concat([pd.DataFrame(x_train), pd.DataFrame(y_train)], axis=1)
    test_samples = pd.concat([pd.DataFrame(x_test), pd.DataFrame(y_test)], axis=1)
    train_samples.to_csv("train.csv", index=False, header=False)
    test_samples.to_csv("test.csv", index=False, header=False)

(2) Balancing the data with SMOTE (Weka also provides this function, so this code is optional)

import pandas as pd
from imblearn.over_sampling import SMOTE

data = pd.read_csv('feature1.csv')
columns = data.columns
# All columns except the last one ('Class') are features.
features_columns = columns.delete(len(columns) - 1)
features = data[features_columns]
labels = data['Class']
# Oversample the minority (positive) class until the classes are balanced.
oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features, labels)
print(len(os_labels[os_labels == 1]))  # positive samples after balancing
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
train_samples = pd.concat([os_features, os_labels], axis=1)
train_samples.to_csv("train_samples.csv")

3 Training the Classifier with Active Learning

import csv

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Set our RNG seed for reproducibility.
RANDOM_STATE_SEED = 123
np.random.seed(RANDOM_STATE_SEED)

# Load the initial training set; its size also sets the number of queries below.
train_features = []
train_labels = []
with open('train.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    rows = [row for row in reader]
    for row in rows:
        # All columns except the last one are numeric features.
        train_features.append([float(v) for v in row[:-1]])
        train_labels.append(row[-1])
add_rows = len(rows)
print("add_rows=", add_rows)
train_features = np.array(train_features)
train_labels = np.array(train_labels)

# Load the held-out pool of (treated as unlabeled) candidate links.
test_features = []
test_labels = []
with open('test.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    rows = [row for row in reader]
    for row in rows:
        test_features.append([float(v) for v in row[:-1]])
        test_labels.append(row[-1])
test_features = np.array(test_features)
test_labels = np.array(test_labels)

X_raw = test_features
y_raw = test_labels
from sklearn.decomposition import PCA

# Define our PCA transformer and fit it onto our raw dataset.
pca = PCA(n_components=2, random_state=RANDOM_STATE_SEED)
transformed = pca.fit_transform(X=X_raw)

# Jupyter magic; omit this line when running as a plain script.
%matplotlib inline
import matplotlib.pyplot as plt

# Isolate the data we'll need for plotting.
x_component, y_component = transformed[:, 0], transformed[:, 1]

# Plot our dimensionality-reduced (via PCA) dataset.
plt.figure(figsize=(8.5, 6), dpi=130)
plt.scatter(x=x_component, y=y_component, c=y_raw.astype(float), cmap='viridis', s=50, alpha=8/10)
plt.title('Trace-link classes after PCA transformation')
plt.show()
# The labeled examples are the initial training set.
X_train = train_features
y_train = train_labels
# Isolate the non-training examples we'll be querying.
X_pool = test_features
y_pool = test_labels
from modAL.models import ActiveLearner

# Specify our core estimator along with its active learning model
# (modAL's default query strategy is uncertainty sampling).
randomForest = RandomForestClassifier()
learner = ActiveLearner(estimator=randomForest, X_training=X_train, y_training=y_train)
# Isolate the data we'll need for plotting.
predictions = learner.predict(X_raw)
is_correct = (predictions == y_raw)

# Record our learner's score on the raw data.
unqueried_score = learner.score(X_raw, y_raw)

# Plot our classification results.
fig, ax = plt.subplots(figsize=(8.5, 6), dpi=130)
ax.scatter(x=x_component[is_correct],  y=y_component[is_correct],  c='g', marker='+', label='Correct',   alpha=8/10)
ax.scatter(x=x_component[~is_correct], y=y_component[~is_correct], c='r', marker='x', label='Incorrect', alpha=8/10)
ax.legend(loc='lower right')
ax.set_title("ActiveLearner class predictions (Accuracy: {score:.3f})".format(score=unqueried_score))
plt.show()
N_QUERIES = add_rows
performance_history = [unqueried_score]

# Allow our model to query our unlabeled dataset for the most
# informative points according to our query strategy (uncertainty sampling).
features = []
labels = []
for index in range(N_QUERIES):
    query_index, query_instance = learner.query(X_pool)
    # Keep a copy of each queried sample so it can be exported afterwards.
    features.append(X_pool[query_index][0])
    labels.append(y_pool[query_index][0])
    # Teach our ActiveLearner model the record it has requested.
    X, y = X_pool[query_index].reshape(1, -1), y_pool[query_index].reshape(1, )
    learner.teach(X=X, y=y)
    # Remove the queried instance from the unlabeled pool.
    X_pool, y_pool = np.delete(X_pool, query_index, axis=0), np.delete(y_pool, query_index)
    # Calculate and report our model's accuracy.
    model_accuracy = learner.score(X_raw, y_raw)
    print('Accuracy after query {n}: {acc:0.4f}'.format(n=index + 1, acc=model_accuracy))
    # Save our model's performance for plotting.
    performance_history.append(model_accuracy)

# Export the queried (newly labeled) samples and the remaining pool.
add_samples = pd.concat([pd.DataFrame(features), pd.DataFrame(labels)], axis=1)
add_samples.to_csv("add_samples.csv")
test_samples = pd.concat([pd.DataFrame(X_pool), pd.DataFrame(y_pool)], axis=1)
test_samples.to_csv("test_samples.csv")

4 Classifying Unlabeled Candidate Trace Links with the Classifier

The steps for using Weka are as follows:

Step 1: Adjust the dataset to the ARFF format shown below (a conversion sketch follows the example).

@relation eanci_CCUC_normalize-weka.filters.supervised.instance.StratifiedRemoveFolds-S0-N10-F2

@attribute cc_Rank_tfidf numeric
@attribute uc_Rank_tfidf numeric
@attribute cc_avgIdf numeric
@attribute cc_maxIdf numeric
@attribute cc_devIdf numeric
@attribute cc_avgIctf numeric
@attribute cc_maxIctf numeric
@attribute cc_devIctf numeric
@attribute cc_avgEntropy numeric
@attribute cc_medEntropy numeric
@attribute cc_maxEntropy numeric
@attribute cc_devEntropy numeric
@attribute cc_qs numeric
@attribute cc_scs numeric
@attribute cc_avgVar numeric
@attribute cc_maxVar numeric
@attribute cc_sumVar numeric
@attribute cc_cs numeric
@attribute uc_avgIdf numeric
@attribute uc_maxIdf numeric
@attribute uc_devIdf numeric
@attribute uc_avgIctf numeric
@attribute uc_maxIctf numeric
@attribute uc_devIctf numeric
@attribute uc_avgEntropy numeric
@attribute uc_medEntropy numeric
@attribute uc_maxEntropy numeric
@attribute uc_devEntropy numeric
@attribute uc_qs numeric
@attribute uc_scs numeric
@attribute uc_avgVar numeric
@attribute uc_maxVar numeric
@attribute uc_sumVar numeric
@attribute uc_cs numeric
@attribute IsLinkValid {1,0}

@data
2.070979,3.401197,1.368211,1.682427,3.401197,1.797134,0.826321,0.969023,0.969023,0.239529,1,-3.93082,0.002011,0.012221,0.084466,0.3023,17,2.182807,3.850148,1.19499,1.666404,3.850148,1.504016,0.563447,0.603326,0.603326,0.341622,0.978723,-8.739475,0.005993,0.030301,0.503405,0.216804,20,0
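Step 1 can be scripted. Below is a minimal sketch that wraps the headerless train.csv produced earlier in the ARFF header above; the relation name is arbitrary, and it assumes the CSV columns follow the exact attribute order of the header.

# Wrap a headerless feature CSV in the ARFF header shown above.
# Assumes 34 numeric feature columns followed by the IsLinkValid class.
import csv

# Attribute names copied from the ARFF header above.
feature_names = [
    "cc_Rank_tfidf", "uc_Rank_tfidf",
    "cc_avgIdf", "cc_maxIdf", "cc_devIdf",
    "cc_avgIctf", "cc_maxIctf", "cc_devIctf",
    "cc_avgEntropy", "cc_medEntropy", "cc_maxEntropy", "cc_devEntropy",
    "cc_qs", "cc_scs", "cc_avgVar", "cc_maxVar", "cc_sumVar", "cc_cs",
    "uc_avgIdf", "uc_maxIdf", "uc_devIdf",
    "uc_avgIctf", "uc_maxIctf", "uc_devIctf",
    "uc_avgEntropy", "uc_medEntropy", "uc_maxEntropy", "uc_devEntropy",
    "uc_qs", "uc_scs", "uc_avgVar", "uc_maxVar", "uc_sumVar", "uc_cs",
]

with open("train.csv") as src, open("train.arff", "w") as dst:
    dst.write("@relation trace_links\n\n")
    for name in feature_names:
        dst.write("@attribute %s numeric\n" % name)
    dst.write("@attribute IsLinkValid {1,0}\n\n@data\n")
    # ARFF data rows are comma-separated, like the CSV itself.
    for row in csv.reader(src):
        dst.write(",".join(row) + "\n")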

Step 2: Open Weka and click the Explorer button to enter the main interface.

Step 3: Select the training data and balance it.

Step 4: Choose a classification algorithm, select the test set, and view the results.

5 Analyzing the Results

Compute the final results (recall, precision, F-score) from the confusion matrix.

(1) Confusion matrix

(2) Computing precision, recall, and F-score

 

# Confusion matrix layout:
# TP FP
# FN TN
TP = 136
FP = 27
FN = 14
TN = 2068
# Precision, recall, and F-score.
precision = TP / (TP + FP)
recall = TP / (TP + FN)
fscore = 2 * (precision * recall) / (precision + recall)

print("precision:%f, recall:%f, F-score:%f" % (precision, recall, fscore))

 
