Should PSL files be filtered by percent identity or by score? ~~pslReps

[Genome] questions about parsing PSL file from BLAT
http://genome.ucsc.edu/FAQ/FAQblat

"try filtering your Blat PSL output using either the pslReps or pslCDnaFilter program available in the Genome Browser source code"

qNumInsert and tNumInsert are a count of the number of gaps; and these lower the score a little without penalizing greatly for long introns.

The score displayed by hgBlat is not really a formal part of PSL at all, just something meant to help make it more useful to interactive human BLAT users. pslReps and pslCDnaFilter provide much more comprehensive filtering capabilities and are what most people use.

You are perfectly welcome to invent your own scoring method that suits your research. There is no need to use pslCalcMilliBad.

By the way, because we don't run our dna gfServer with masking options, our hgBlat PSL results will never have a non-zero value for repMatch, so you can ignore that.

If you have more matches, the score is higher; mismatches and gaps lower your score. The ID that hgBlat returns does incorporate the length of the aligning parts into the value, so it provides some measure of that coverage.

    total = (sizeMul * ( psl->match + psl->repMatch + psl->misMatch));
    milliBad = (1000 * ( psl->misMatch*sizeMul + insertFactor + round(3*log(1+sizeDif)) ) ) / total;

-Galt

On Tue, 21 Apr 2009, Xianjun Dong wrote:

> Hi,
>
> I have two questions about parsing PSL file from BLAT:
>
> 1. How can I understand the percent ID and score calculation, intuitively?
>
> From the FAQBlat (http://genome.ucsc.edu/FAQ/FAQblat#blat4), I can
> understand the formula (for DNA alignment)
>     ID = 100.0 - pslCalcMilliBad( psl, TRUE) * 0.1
> as
>     ID = 100.0 - 100 * (misMatch+qNumInsert)/(match+repMatch+misMatch)
>        = 100 * (match+repMatch-qNumInsert) / (match+repMatch+misMatch),
> Right?
>
> If my understanding is correct, could you help me understand the meaning
> of this percent ID in a simple way?
> I tried to understand this ID as coverage of matched bases relative to
> the aligned part in query, but it's not. From the PSL output file, I
> can see that the alignment length in query sequence (L) satisfies
>     L = (qEnd-qStart) = match + repMatch + misMatch + nCount + qNumInsert.
> "match+repMatch+misMatch" is the aligned part of full L. But what does
> "match+repMatch-qNumInsert" represent?
>
> The same question for score, which is {match + int(repMatch/2) -
> misMatch - qNumInsert - tNumInsert} as I understand. What does this mean?
>
>
> 2. What can be a better threshold to filter hits from multiple queries?
>
> Percent ID and score are two criteria to assess a BLAT hit, but I found
> it's hard to use either of these alone to define a threshold for filtering
> short hits from multiple queries. Obviously, 95% ID (for example) alone
> is not correct, since some hits are very short, but 100% matched.
> Meanwhile, the score seems to be an absolute value for each query. It's not
> possible to define a common score for all queries (which have different
> lengths themselves). I am wondering whether we can use, say
>     score / querySize
> to define a common threshold (for example 95%) to filter out those hits
> with small score. I was also thinking of using the highest score (for each
> query) as reference to filter out hits with a more-than-threshold
> decreased score. For example, all hits with score 60% lower than the
> highest score will be removed. The highest score is calculated for each
> query.
>
> Does anyone have experience in this problem?
>
> Thanks in advance
>
> Xianjun
> _______________________________________________
> Genome maillist - Genome@soe.ucsc.edu
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
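To make the two questions in the thread concrete, here is a minimal Python sketch of the percent ID and score calculations for DNA alignments (sizeMul = 1), plus the two filtering ideas Xianjun proposes (score / querySize, and score relative to the best hit of each query). Only the `total` and `milliBad` lines come from Galt's reply; the handling of `sizeDif` and `insertFactor` is my reading of the C listing in the UCSC BLAT FAQ and should be treated as an assumption. The names `Psl`, `parse_psl_line`, `filter_hits`, `min_score_frac` and `max_drop` are made up for this illustration, and the default thresholds are arbitrary placeholders, not recommended values.

```python
import math
from typing import List, NamedTuple


class Psl(NamedTuple):
    """Subset of PSL columns needed for scoring (DNA alignments only)."""
    match: int
    mis_match: int
    rep_match: int
    q_num_insert: int
    t_num_insert: int
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_start: int
    t_end: int


def parse_psl_line(line: str) -> Psl:
    """Parse one tab-separated PSL data line (header lines must be skipped first)."""
    f = line.rstrip("\n").split("\t")
    return Psl(int(f[0]), int(f[1]), int(f[2]), int(f[4]), int(f[6]),
               f[9], int(f[10]), int(f[11]), int(f[12]), int(f[15]), int(f[16]))


def milli_bad(p: Psl, is_mrna: bool = True) -> int:
    """Badness in parts per thousand, following the formula quoted by Galt.

    Assumption: sizeDif and insertFactor are derived as in the FAQ listing.
    """
    q_ali = p.q_end - p.q_start
    t_ali = p.t_end - p.t_start
    if min(q_ali, t_ali) <= 0:
        return 0
    size_dif = q_ali - t_ali
    if size_dif < 0:
        # For mRNA, a longer target span (introns) is not penalized.
        size_dif = 0 if is_mrna else -size_dif
    insert_factor = p.q_num_insert + (0 if is_mrna else p.t_num_insert)
    total = p.match + p.rep_match + p.mis_match
    if total == 0:
        return 0
    # C round() rounds half away from zero; the argument is non-negative here.
    gap_term = math.floor(3 * math.log(1 + size_dif) + 0.5)
    return 1000 * (p.mis_match + insert_factor + gap_term) // total


def percent_identity(p: Psl, is_mrna: bool = True) -> float:
    return 100.0 - milli_bad(p, is_mrna) * 0.1


def score(p: Psl) -> int:
    """Score as described in the thread: matches minus mismatches and gap counts."""
    return p.match + (p.rep_match >> 1) - p.mis_match - p.q_num_insert - p.t_num_insert


def filter_hits(hits: List[Psl], min_score_frac: float = 0.5,
                max_drop: float = 0.6) -> List[Psl]:
    """Keep hits whose score is at least min_score_frac of the query size and
    no more than max_drop (as a fraction) below the best score of that query."""
    best = {}
    for p in hits:
        s = score(p)
        best[p.q_name] = max(best.get(p.q_name, s), s)
    kept = []
    for p in hits:
        s = score(p)
        if s >= min_score_frac * p.q_size and s >= (1 - max_drop) * best[p.q_name]:
            kept.append(p)
    return kept
```

Note that the simplified formula in Xianjun's question drops the `round(3*log(1+sizeDif))` term, so it only agrees with `pslCalcMilliBad` when the query and target spans have the same length. For production filtering, pslReps or pslCDnaFilter, as Galt recommends, are probably the safer choice; a hand-rolled filter like this one is mainly useful for understanding what the numbers mean.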
