Prediction of siRNA functionality using generalized string kernel and support vector machine
Abstract:
GSK+SVM
GSK: generalized string kernel
Result: effective and ineffective siRNAs are classified using GSK + SVM
Introduction
Useful design rules known before this paper:
(1) siRNA duplexes should be 21 nt long, with a 2-nt 3’ overhang at each end
(2) the target sequence should begin 50-100 nt downstream of the start codon of the mRNA
(3) G/C content should be about 50%
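Rules (2) and (3) are simple enough to check mechanically. The following is a minimal sketch, not code from the paper: the function names, the 0-based coordinate convention, and the `gc_tolerance` parameter (how far from 50% still counts as acceptable) are all assumptions made for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a non-empty RNA/DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def passes_basic_rules(target_start: int, start_codon: int, seq: str,
                       gc_tolerance: float = 0.1) -> bool:
    """Check rules (2) and (3).

    target_start and start_codon are 0-based positions on the mRNA
    (an assumed convention); gc_tolerance is a hypothetical parameter.
    """
    offset = target_start - start_codon
    in_window = 50 <= offset <= 100                      # rule (2)
    gc_ok = abs(gc_content(seq) - 0.5) <= gc_tolerance   # rule (3)
    return in_window and gc_ok
```

For example, `passes_basic_rules(175, 100, "GGCCAAUU")` accepts a target that starts 75 nt downstream of the start codon with 50% G/C.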
Materials and methods
DATASET:
Dataset: Khvorova’s dataset, 94 siRNAs in total (53 effective, 41 ineffective)
Effective: 90% or more gene silencing activity
Ineffective: less than 50% gene silencing activity
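The two thresholds translate directly into a labeling function. This is only an illustrative sketch: the function name and the +1/-1 label encoding are assumptions, and returning `None` for activities between 50% and 90% reflects that the notes define no class for that range.

```python
from typing import Optional


def label_sirna(silencing_percent: float) -> Optional[int]:
    """Map gene silencing activity (in percent) to a class label.

    +1 = effective (>= 90%), -1 = ineffective (< 50%),
    None = falls between the two thresholds (no class defined).
    """
    if silencing_percent >= 90:
        return 1
    if silencing_percent < 50:
        return -1
    return None
```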
FEATURE map for siRNAs
GSK is based on the mismatch string kernel (MSK), which reduces to the spectrum kernel when $m = 0$
$k$: the length of the subsequences (k-mers) of a string
$m$: the maximum number of mismatches allowed
MSK:
$$K_{(k,m)}(x,y) = \langle \Phi_{(k,m)}(x), \Phi_{(k,m)}(y) \rangle$$
Normalization:
$$K_{(k,m)}(x,y) \leftarrow \frac{K_{(k,m)}(x,y)}{\sqrt{K_{(k,m)}(x,x)}\sqrt{K_{(k,m)}(y,y)}}$$
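To make the definitions concrete, here is a small sketch of the (k, m)-mismatch kernel and its normalization. It is a direct, unoptimized reading of the formulas and not the paper's implementation: the function names and the RNA alphabet `ACGU` are assumptions. The feature map counts, for every k-mer in the string, all k-mers within Hamming distance at most m of it.

```python
from collections import Counter
from itertools import combinations, product
from math import sqrt

ALPHABET = "ACGU"  # assumed RNA alphabet


def neighbors(kmer, m):
    """All strings within Hamming distance <= m of kmer."""
    result = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(ALPHABET, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))  # the set removes duplicates
    return result


def feature_map(x, k, m):
    """Phi_(k,m)(x): each k-mer of x adds 1 to every k-mer in its neighborhood."""
    phi = Counter()
    for i in range(len(x) - k + 1):
        for beta in neighbors(x[i:i + k], m):
            phi[beta] += 1
    return phi


def mismatch_kernel(x, y, k, m):
    """Inner product of the two (k, m) feature vectors."""
    phi_x, phi_y = feature_map(x, k, m), feature_map(y, k, m)
    return sum(count * phi_y[beta] for beta, count in phi_x.items())


def normalized_mismatch_kernel(x, y, k, m):
    """Kernel value rescaled so that K(x, x) = 1."""
    return mismatch_kernel(x, y, k, m) / sqrt(
        mismatch_kernel(x, x, k, m) * mismatch_kernel(y, y, k, m))
```

With `k=2, m=0`, the strings `"ACG"` and `"ACU"` share only the 2-mer `AC`, so the raw kernel is 1 and the normalized value is 1 / sqrt(2 * 2) = 0.5.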
GSK is a sum of all the $(k_i,m_i)$-mismatch kernels:
$$K_{k_1,m_1,...,k_s,m_s}(x,y) = \sum_{i=1}^{s} \langle \Phi_{(k_i,m_i)}(x), \Phi_{(k_i,m_i)}(y) \rangle = \sum_{i=1}^{s} K_{(k_i,m_i)}(x,y)$$
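The sum over (k_i, m_i) pairs can be sketched as follows. This is an illustrative brute-force version, assuming an `ACGU` alphabet and hypothetical function names; it enumerates all 4^k candidate k-mers, so it is only practical for small k. It sums the raw mismatch kernels exactly as written in the formula; whether each term is normalized before summing is not stated in these notes.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def feature_map(x, k, m):
    """Brute-force Phi_(k,m)(x); cost grows as 4**k."""
    kmers = [x[i:i + k] for i in range(len(x) - k + 1)]
    phi = Counter()
    for beta in map("".join, product(ALPHABET, repeat=k)):
        count = sum(hamming(alpha, beta) <= m for alpha in kmers)
        if count:
            phi[beta] = count
    return phi


def mismatch_kernel(x, y, k, m):
    """Inner product of the two (k, m) feature vectors."""
    phi_x, phi_y = feature_map(x, k, m), feature_map(y, k, m)
    return sum(c * phi_y[b] for b, c in phi_x.items())


def gsk(x, y, params):
    """Sum of (k_i, m_i)-mismatch kernels; params is a list of (k, m) pairs."""
    return sum(mismatch_kernel(x, y, k, m) for k, m in params)
```

For instance, `gsk("ACG", "ACU", [(1, 0), (2, 0)])` adds the 1-mer term (shared `A` and `C`, value 2) to the 2-mer term (shared `AC`, value 1).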
SVM implementation:
linear kernel and soft margin
RESULT
Subsequence->Weight
TP,TN,FP,FN,Acc
LOOCV of the GSK/SVM algorithm
Validation of predictive performance of GSK/SVM algorithm against other genes
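The evaluation quantities listed above (TP, TN, FP, FN, Acc) and the leave-one-out loop can be sketched generically. Everything here is an assumption for illustration: the function names, the +1/-1 label encoding, and the `fit_predict` callback interface. The `majority_vote` classifier is only a stand-in so the example runs on its own; it is not an SVM and not the paper's method.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), with +1 = effective and -1 = ineffective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == -1 and p == -1 for t, p in pairs)
    fp = sum(t == -1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == -1 for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def loocv_predictions(samples, labels, fit_predict):
    """Leave-one-out CV: train on all samples but one, predict the held-out one.

    samples and labels are lists; fit_predict(train_x, train_y, test_x)
    is a hypothetical callback returning a predicted label.
    """
    preds = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        preds.append(fit_predict(train_x, train_y, samples[i]))
    return preds


def majority_vote(train_x, train_y, test_x):
    """Stand-in classifier (NOT an SVM): predicts the most common training label."""
    return max(set(train_y), key=train_y.count)
```

In practice `fit_predict` would wrap the SVM trained on the GSK Gram matrix of the training siRNAs; the loop and the metrics stay the same.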
Discussion
Advantage: without prior knowledge, the contribution of each parameter to siRNA functionality can be determined, and the method can be applied to siRNAs shorter or longer than 21 nt
Disadvantage: the method cannot deduce the sequences of other effective siRNAs