Kaggle Competition: Predicting a Biological Response

sanfendi

Article link: https://blog.csdn.net/laozhaokun/article/details/41446759

Predict a biological response of molecules from their chemical properties.

The objective of the competition is to help us build as good a model as possible so that we can, as optimally as this data allows, relate molecular information, to an actual biological response.

We have shared the data in the comma separated values (CSV) format. Each row in this data set represents a molecule. The first column contains experimental data describing an actual biological response; the molecule was seen to elicit this response (1), or not (0). The remaining columns represent molecular descriptors (d1 through d1776), these are calculated properties that can capture some of the characteristics of the molecule - for example size, shape, or elemental constitution. The descriptor matrix has been normalized.

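In short: the data is a CSV file in which each row represents a molecule. The first column is the observed biological response: 1 if the molecule elicited the response, 0 if it did not. Columns 2 through 1777 are the molecular descriptors, for example size, shape, or elemental constitution.

The competition ended long ago, but you can still submit results 5 times and see your score and ranking. All you need to upload is a result file in CSV format.

The 0/1 labels tell us this is a binary classification problem.

For a binary classification problem with many features, logistic regression is the first thing that comes to mind.

Below is the Python code that makes predictions with Logistic Regression: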
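As a rough illustration of what such a result file looks like, here is a minimal sketch that writes one using only the standard library. The column names `MoleculeId` and `PredictedProbability` follow the SVM script in this post; the function name `write_submission` and the two probabilities are made up for the example.

```python
import csv
import io

def write_submission(probabilities, out):
    """Write one row per test molecule: 1-based ID and predicted probability."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['MoleculeId', 'PredictedProbability'])
    for molecule_id, p in enumerate(probabilities, start=1):
        writer.writerow([molecule_id, '%f' % p])

# Hypothetical probabilities for two test molecules
buffer = io.StringIO()
write_submission([0.25, 0.9], buffer)
print(buffer.getvalue())
```

To write a real file, pass an open file handle instead of the `StringIO` buffer.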

#!/usr/bin/env python
# coding: utf-8
'''
Created on 2014-11-24
@author: zhaohf
'''
from sklearn.linear_model import LogisticRegression
from numpy import genfromtxt, savetxt

def main():
    # skip_header=1 drops the CSV header row instead of parsing it into NaNs
    dataset = genfromtxt('../Data/train.csv', delimiter=',', dtype='f8', skip_header=1)
    test = genfromtxt('../Data/test.csv', delimiter=',', dtype='f8', skip_header=1)
    target = dataset[:, 0]   # first column: observed response (0 or 1)
    train = dataset[:, 1:]   # remaining columns: descriptors d1..d1776
    lr = LogisticRegression()
    lr.fit(train, target)
    # Column 1 of predict_proba is P(response = 1); molecule IDs start at 1
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(lr.predict_proba(test))]
    savetxt('../Submissions/lr_benchmark.csv', predicted_probs, delimiter=',', fmt='%d,%f', header='MoleculeId,PredictedProbability', comments='')

if __name__ == '__main__':
    main()

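Scored with the competition's loss function (lower is better), the final public score is 0.59425. That is a very poor score, several hundred places down the leaderboard.

Next, let's try an SVM. The code is very similar.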
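To put that number in context, here is a small pure-Python sketch of how such a score can be computed, assuming the loss function in question is the binary log loss. The labels and probabilities below are made up for illustration; they are not competition data.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; y_true holds 0/1 labels, y_prob holds P(label = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 1, 0]  # made-up labels for illustration
uninformed = log_loss(labels, [0.5] * len(labels))        # always predict 0.5
informative = log_loss(labels, [0.9, 0.1, 0.8, 0.7, 0.2])  # predictions that track the labels
print('always 0.5:  %.5f' % uninformed)
print('informative: %.5f' % informative)
```

Under this assumption, always predicting 0.5 scores ln 2 ≈ 0.693 whatever the labels are, so a score of 0.59425 is only a modest improvement over an uninformed guess.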

#!/usr/bin/env python
# coding: utf-8
'''
Created on 2014-11-24
@author: zhaohf
'''
from sklearn import svm
from numpy import genfromtxt, savetxt

def main():
    # skip_header=1 drops the CSV header row instead of parsing it into NaNs
    dataset = genfromtxt('../Data/train.csv', delimiter=',', dtype='f8', skip_header=1)
    test = genfromtxt('../Data/test.csv', delimiter=',', dtype='f8', skip_header=1)
    target = dataset[:, 0]   # first column: observed response (0 or 1)
    train = dataset[:, 1:]   # remaining columns: descriptors d1..d1776
    # probability=True is required for predict_proba to be available on SVC
    svc = svm.SVC(probability=True)
    svc.fit(train, target)
    # Column 1 of predict_proba is P(response = 1); molecule IDs start at 1
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(svc.predict_proba(test))]
    savetxt('../Submissions/svm_benchmark.csv', predicted_probs, delimiter=',', fmt='%d,%f', header='MoleculeId,PredictedProbability', comments='')

if __name__ == '__main__':
    main()

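The SVM scores 0.52553, slightly better than LR (a lower score is better).

The best score on the leaderboard is 0.37356, so getting a competitive result in this competition still takes some work.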
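With only a few submissions available, one option is to estimate the score locally before uploading. The sketch below does this with stratified k-fold cross-validation in scikit-learn. It is not part of the original experiments: it assumes log loss as the metric, and it uses synthetic data from `make_classification` as a stand-in for `train.csv`. The helper name `cv_log_loss` is made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for train.csv (assumption): 300 molecules, 50 descriptors
X, y = make_classification(n_samples=300, n_features=50, n_informative=10, random_state=0)

def cv_log_loss(model, X, y, n_splits=5):
    """Mean log loss over stratified folds (lower is better)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    # scikit-learn reports the negated loss so that higher is better; flip it back
    scores = cross_val_score(model, X, y, cv=folds, scoring='neg_log_loss')
    return -scores.mean()

lr_loss = cv_log_loss(LogisticRegression(max_iter=1000), X, y)
print('LR cross-validated log loss: %.4f' % lr_loss)
```

On the real data, `X` and `y` would be the descriptor columns and the first column of `train.csv`. Any candidate model that provides `predict_proba` can be passed in the same way and compared before spending a submission.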