How to use Keras model.predict() with the best model from cross-validation output?

weixin_39992760

I have a data frame of DNA sequences like this:

Feature           Label
GCTAGATGACAGT     0
TTTTAAAACAG       1
TAGCTATACT        2
TGGGGCAAAAAAAA    0
AATGTCG           3
AATGTCG           0
AATGTCG           1

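There is one column holding the DNA sequence, and the label can be 0, 1, 2 or 3 (i.e. the category of that DNA sequence). I want to develop a NN that predicts the probability of each sequence being classified into class 1, 2 or 3 (not 0, I don't care about 0). Each sequence can appear multiple times in the data frame, and each sequence may appear in several (or all) of the categories. So the output should look like this: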

GCTAGATGACAGT     (0.9, 0.1, 0.2)
TTTTAAAACAG       (0.7, 0.6, 0.3)
TAGCTATACT        (0.3, 0.3, 0.2)
TGGGGCAAAAAAAA    (0.1, 0.5, 0.6)

The numbers in each tuple are the probabilities of finding that sequence in classes 1, 2 and 3.

This is my code:

import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn.model_selection import StratifiedKFold
from keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib
from matplotlib import pyplot
import os
from random import random
from numpy import array
from numpy import cumsum
import pandas as pd
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
%matplotlib
from sklearn.feature_extraction.text import CountVectorizer

# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

# read in the file
df = pd.read_csv('dna_sequences.txt')
X = list(df['dna_sequence'])
y = list(df['class'])

# convert the sequences to integers for learning
tokenizer = Tokenizer(num_words=5, char_level=True)
tokenizer.fit_on_texts(X)
data_encoded = tokenizer.texts_to_matrix(X, mode='count')

kf = kfold.get_n_splits(data_encoded)
cvscores = []

# for each train, test in cross validation sub-set
for train, test in kfold.split(data_encoded, y):
    X_train, X_test = data_encoded[train], data_encoded[test]
    y_train, y_test = data_encoded[train], data_encoded[test]

    # add layers to model
    model = Sequential()
    model.add(Embedding(3000, 32, input_length=5))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(5, 1)))
    model.add(LSTM(100))
    model.add(Dropout(0.2))
    model.add(Dense(5, activation='sigmoid'))

    # compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # check the model
    print(model.summary())

    # monitor val accuracy and perform early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)

    # fit the model
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)  # change values

    # evaluate the model
    scores = model.evaluate(X_test, y_test, verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
    cvscores.append(scores[1] * 100)

# check the accuracy
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

# predict for new set of seqs
pred_list = ['GTGTGCGCT', 'GGGGGTCGCTCCCCCC', 'AAATGTTGT', 'GTGTGTGGG', 'CCCCTATATA']

# output a probability of sequence being found in each class as described above, and plot accuracy and loss

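It runs and prints the accuracy as expected (the accuracy is not great, 62%, but I can work on that; this is my first NN and I just want to get an example running).

My question is about predicting. Can someone show me an example of the jump from fitting the model to actually predicting? I think the algorithm involves:

1. Find the best model by cross validation (I tried to incorporate this with the "monitor val accuracy" section).

2. The list of sequences to predict the classes for is in pred_list.

3. The best model from training is applied to pred_list.

4. Return the probabilities as described in the question.

I know from other questions (e.g. here) that this exists: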

prediction = model.predict(np.array(tk.texts_to_sequences(text)))
print(prediction)

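...but I don't know how to combine this with the cross validation, or how to get the output I want (i.e. for each sequence, the probability of being assigned to class 1, 2 or 3, based on a three-class training data set in which each sequence can appear in more than one class).

Edit 1: Based on the comments below, I changed the end of the code to:

(inside the cross-validation loop, so this should be indented)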

    # evaluate the model
    scores = model.evaluate(X_test, y_test, verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
    cvscores.append(scores[1] * 100)

    # predict for new set of seqs
    pred_list = ['GTGTGCGCT', 'GGGGGTCGCTCCCCCC', 'AAATGTTGT', 'GTGTGTGGG', 'CCCCTATATA', 'GGGGGGGGGTTTTTTTT']
    prediction = model.predict(np.array(tokenizer.texts_to_sequences(pred_list)))
    predcvscores.append(prediction)

(outside the cross-validation loop)

print(predcvscores)

# check the accuracy
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

I get the error:

Error when checking input: expected embedding_3_input to have shape (5,) but got array with shape (1,)

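I guess this is saying that I can't just read in a set of sequences like pred_list? Is this not possible, or am I not taking the right approach? Also, I'm not sure that this method would give me, for each item in pred_list, an output with the probability of appearing in class 1, 2 or 3, but maybe I'm wrong and it will.

Solution

You have asked too many, rather unrelated, things in a single question, and there are several issues with it. I'll try to address the ones I consider most serious.

To start with, if you have cases of the form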

Feature    Label
AATGTCG    3
AATGTCG    0
AATGTCG    1

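i.e. the same single feature can belong to class 0, 1 or 3 with no other feature present, then the message is that supervised classification may not be quite suitable for your problem; for it to be suitable, you would need additional features.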
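To check how widespread this is in a given data set, a quick tally of the labels seen per sequence can help. The following is a minimal sketch using only the standard library and the rows from the example table in the question; the variable names are illustrative and not from the original post.

```python
from collections import Counter, defaultdict

# Rows copied from the example table in the question: (sequence, label).
rows = [
    ("GCTAGATGACAGT", 0),
    ("TTTTAAAACAG", 1),
    ("TAGCTATACT", 2),
    ("TGGGGCAAAAAAAA", 0),
    ("AATGTCG", 3),
    ("AATGTCG", 0),
    ("AATGTCG", 1),
]

# Collect the labels observed for each distinct sequence.
labels_by_sequence = defaultdict(Counter)
for sequence, label in rows:
    labels_by_sequence[sequence][label] += 1

# Sequences that occur with more than one label give the classifier
# contradictory targets for an identical input.
ambiguous = {
    seq: dict(counts)
    for seq, counts in labels_by_sequence.items()
    if len(counts) > 1
}
print(ambiguous)
```

On these rows only AATGTCG is reported, with one occurrence each of labels 3, 0 and 1.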

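If, as you say, you are only interested in classes 1, 2 and 3, and

not 0, I don't care about 0

then the very first thing you should do in the data preparation phase is remove all class 0 instances from your dataset; it is not clear whether you do that here, and even if you do, it is equally unclear why you still leave class 0 in the discussion.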
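As a rough illustration of that filtering step, here is a sketch on a few rows from the question's example table, written in plain Python so that it runs without the data file. The names rows and kept are made up for the example.

```python
# (sequence, label) pairs copied from the example table in the question.
rows = [
    ("GCTAGATGACAGT", 0),
    ("TTTTAAAACAG", 1),
    ("TAGCTATACT", 2),
    ("TGGGGCAAAAAAAA", 0),
    ("AATGTCG", 3),
    ("AATGTCG", 0),
    ("AATGTCG", 1),
]

# Keep only the classes of interest (1, 2, 3) before any encoding or splitting.
kept = [(seq, label) for seq, label in rows if label != 0]
X = [seq for seq, _ in kept]
y = [label for _, label in kept]

print(X)
print(y)

# With the pandas data frame from the question, the equivalent filter
# would be: df = df[df['class'] != 0]
```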

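Second (and provided that you are indeed left with only 3 classes in your classification problem), what you show as the expected output from your model:

GCTAGATGACAGT     (0.9, 0.1, 0.2)
TTTTAAAACAG       (0.7, 0.6, 0.3)
TAGCTATACT        (0.3, 0.3, 0.2)
TGGGGCAAAAAAAA    (0.1, 0.5, 0.6)

is not correct; in multi-class classification, the returned probabilities (i.e. the numbers in parentheses) must add up to exactly 1, which is not the case here.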
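The difference is easy to see numerically. Below is a small standard-library sketch contrasting a softmax output, whose entries sum to 1, with per-class sigmoid outputs, which are computed independently and generally do not. The raw scores are arbitrary values chosen for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Squash one raw score into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-score))

raw_scores = [2.0, 0.5, -1.0]  # arbitrary illustrative values

softmax_probs = softmax(raw_scores)
sigmoid_probs = [sigmoid(s) for s in raw_scores]

print(softmax_probs, sum(softmax_probs))  # sums to 1 up to rounding
print(sigmoid_probs, sum(sigmoid_probs))  # does not sum to 1 here
```

Tuples such as (0.9, 0.1, 0.2) in the question resemble the second kind of output rather than the first.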

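Third, since you have a multi-class classification problem, your loss should be categorical_crossentropy, not binary_crossentropy, which is meant only for binary classification problems.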
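For intuition, the categorical cross-entropy of a single sample with a one-hot target reduces to the negative log of the probability assigned to the true class. The sketch below computes it by hand in plain Python; it follows the textbook definition for one sample and is not Keras' own implementation, which also averages over the batch.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0.0, 1.0, 0.0]      # one-hot target: the second of three classes
confident = [0.1, 0.8, 0.1]   # most probability mass on the true class
uncertain = [0.4, 0.2, 0.4]   # little probability mass on the true class

loss_confident = categorical_crossentropy(y_true, confident)
loss_uncertain = categorical_crossentropy(y_true, uncertain)

# The loss is lower when more probability is placed on the true class.
print(loss_confident, loss_uncertain)
```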

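Fourth, and again provided that you are left with only 3 classes, the last layer of your model should be

model.add(Dense(3, activation='softmax'))  # no. of units here should be equal to the no. of classes

while your labels y should be one-hot encoded (you can do that easily using the Keras function to_categorical).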
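One detail worth spelling out: once class 0 has been removed, the remaining labels are 1, 2 and 3, but one-hot encoding into 3 columns expects class indices 0, 1 and 2. The sketch below shows the shift and the resulting encoding in plain Python; one_hot is a hypothetical helper written for this example.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

labels = [1, 3, 2, 1]                      # classes of interest after dropping 0
shifted = [label - 1 for label in labels]  # map classes 1..3 to indices 0..2
encoded = one_hot(shifted, num_classes=3)

for label, row in zip(labels, encoded):
    print(label, row)

# With Keras, to_categorical(shifted, num_classes=3) should give the same
# rows as a NumPy array.
```

Column i of the encoded labels, and therefore of the model's softmax output, then corresponds to class i + 1.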

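Fifth, taking a closer look at your data definition at the beginning of your loop:

X_train, X_test = data_encoded[train], data_encoded[test]
y_train, y_test = data_encoded[train], data_encoded[test]

you can easily see that you are passing the features both as features and as labels. I can only guess that this is a typo on your part; the labels should be:

y_train, y_test = y[train], y[test]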
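Note that in the question y is built with list(df['class']), and a plain Python list cannot be indexed with the integer index arrays that StratifiedKFold.split yields. A minimal sketch of the conversion, assuming NumPy is available; the index arrays here are hand-written stand-ins for one fold.

```python
import numpy as np

y = [1, 2, 3, 1, 2, 3]           # labels as a plain list, as in the question
train = np.array([0, 2, 3, 5])   # stand-in for the train indices of one fold
test = np.array([1, 4])          # stand-in for the test indices of one fold

y = np.array(y)                  # convert once, before the CV loop
y_train, y_test = y[train], y[test]

print(y_train, y_test)
```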

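Regarding your predict-time error

Error when checking input: expected embedding_3_input to have shape (5,) but got array with shape (1,)

this is due to the argument input_length=5 in your embedding layer. I'll confess that I am not at all familiar with the Keras Embedding layer; you may want to check the documentation to make sure that this argument and the value assigned to it actually do what you think or intend them to do.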
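One plausible reading of the error, judging only from the code in the question: the training data is built with texts_to_matrix(X, mode='count'), which gives one fixed-width row per sequence, while the prediction call uses texts_to_sequences, which gives one variable-length list per sequence. The sketch below imitates the two encodings in plain Python to show the shape mismatch; count_encode and index_encode are hypothetical helpers, not Keras functions, and the fixed A/C/G/T ordering is an assumption.

```python
ALPHABET = "ACGT"  # assumed alphabet; index 0 is left unused, as in Keras tokenizers

def count_encode(sequence):
    """Fixed-width row: how many times each base occurs (column 0 stays 0)."""
    row = [0] * (len(ALPHABET) + 1)
    for base in sequence:
        row[ALPHABET.index(base) + 1] += 1
    return row

def index_encode(sequence):
    """Variable-length list: one integer per base."""
    return [ALPHABET.index(base) + 1 for base in sequence]

pred_list = ['GTGTGCGCT', 'GGGGGTCGCTCCCCCC', 'AAATGTTGT']

count_rows = [count_encode(s) for s in pred_list]
index_rows = [index_encode(s) for s in pred_list]

print({len(r) for r in count_rows})  # every row has width 5
print({len(r) for r in index_rows})  # widths follow the sequence lengths
```

If that reading is right, encoding pred_list with tokenizer.texts_to_matrix(pred_list, mode='count') would at least give the model inputs of the shape it was trained on. Whether count vectors are a sensible input for an Embedding layer is a separate question.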

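Other than that, regarding your specific question:

My question is about predicting. Can someone show me an example of the jump from fitting the model to actually predicting?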

you should just re-compile and re-fit the model outside the CV loop (possibly using the "best" number of epochs found during CV) on the whole of your data, and then use it for predictions.
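A minimal sketch of that final step, assuming TensorFlow's bundled Keras is installed. Everything specific here is an assumption made for illustration: the toy data, the hand-rolled encode helper, MAX_LEN, the layer sizes, and best_epochs (a placeholder for whatever epoch count the cross validation suggested). It is not the asker's model, only the general pattern of rebuilding, refitting on all data, and predicting.

```python
import numpy as np
from tensorflow import keras

# Toy data standing in for the real file; classes already limited to 1, 2, 3.
sequences = ['TTTTAAAACAG', 'TAGCTATACT', 'AATGTCG',
             'GCTAGATGACAGT', 'TGGGGCAAAAAAAA', 'AATGTCGTT']
labels = np.array([1, 2, 3, 1, 2, 3])

BASES = {'A': 1, 'C': 2, 'G': 3, 'T': 4}  # 0 is reserved for padding
MAX_LEN = 20                              # assumed maximum sequence length

def encode(seqs):
    """Map each base to an integer and right-pad with zeros to MAX_LEN."""
    out = np.zeros((len(seqs), MAX_LEN), dtype='int32')
    for i, seq in enumerate(seqs):
        ids = [BASES[b] for b in seq[:MAX_LEN]]
        out[i, :len(ids)] = ids
    return out

def build_model():
    """Build and compile a fresh model, as would be done for each CV fold."""
    model = keras.Sequential([
        keras.layers.Embedding(input_dim=5, output_dim=8, mask_zero=True),
        keras.layers.LSTM(16),
        keras.layers.Dense(3, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# Use ALL the data for the final fit; the CV folds were only for evaluation.
X_all = encode(sequences)
y_all = keras.utils.to_categorical(labels - 1, num_classes=3)

best_epochs = 5  # placeholder for the epoch count chosen during CV
final_model = build_model()
final_model.fit(X_all, y_all, epochs=best_epochs, batch_size=2, verbose=0)

# New sequences go through exactly the same encoding as the training data.
pred_list = ['GTGTGCGCT', 'GGGGGTCGCTCCCCCC', 'AAATGTTGT',
             'GTGTGTGGG', 'CCCCTATATA']
probabilities = final_model.predict(encode(pred_list), verbose=0)

for seq, probs in zip(pred_list, probabilities):
    print(seq, probs)  # columns correspond to classes 1, 2, 3
```

Each output row holds one probability per class and sums to 1, because the last layer is a softmax. With so little toy data the actual values carry no meaning.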

I guess it should be clear by now that, given the above issues, your reported accuracy of 62% does not actually mean anything; for better or worse, Keras will not "protect" you if you attempt things that are not meaningful from a modeling perspective (as with most of the issues mentioned above), such as using binary cross-entropy in a multi-class problem or using accuracy in a regression setting.