[Genome] questions about parsing PSL file from BLAT
http://genome.ucsc.edu/FAQ/FAQblat

"try filtering your Blat PSL output using either the pslReps or
pslCDnaFilter program available in the Genome Browser source code"

qNumInsert and tNumInsert are counts of the number of gaps; these
lower the score a little without penalizing greatly for long introns.

The score displayed by hgBlat is not really a formal part of PSL at all,
just something meant to make it more useful to interactive human
BLAT users. pslReps and pslCDnaFilter provide much more comprehensive
filtering capabilities and are what most people use.

You are perfectly welcome to invent your own scoring method that suits
your research. There is no need to use pslCalcMilliBad.

By the way, because we don't run our DNA gfServer with masking options,
our hgBlat PSL results will never have a non-zero value for repMatch,
so you can ignore that.

If you have more matches, the score is higher; mismatches and gaps
lower your score. The ID that hgBlat returns does incorporate the
length of the aligning parts into the value, so it provides some
measure of that coverage.

    total = (sizeMul * (psl->match + psl->repMatch + psl->misMatch));
    milliBad = (1000 * (psl->misMatch*sizeMul + insertFactor
                        + round(3*log(1+sizeDif)))) / total;

-Galt

On Tue, 21 Apr 2009, Xianjun Dong wrote:

> Hi,
>
> I have two questions about parsing PSL file from BLAT:
>
> 1. How can I understand the percent ID and score calculation, intuitively?
>
> From the FAQBlat (http://genome.ucsc.edu/FAQ/FAQblat#blat4), I can
> understand the formula (for DNA alignment)
>     ID = 100.0 - pslCalcMilliBad(psl, TRUE) * 0.1
> as
>     ID = 100.0 - 100 * (misMatch+qNumInsert)/(match+repMatch+misMatch)
>        = 100 * (match+repMatch-qNumInsert) / (match+repMatch+misMatch),
> Right?
>
> If my understanding is correct, could you help me understand the meaning
> of this percent ID in a simple way?
> I tried to understand this ID as coverage of matched bases relative to
> the aligned part in query, but it's not. From the PSL output file, I
> can see that the alignment length in query sequence (L) satisfies
>     L = (qEnd-qStart) = match + repMatch + misMatch + nCount + qNumInsert.
> "match+repMatch+misMatch" is the aligned part of full L. But what does
> "match+repMatch-qNumInsert" represent?
>
> The same question for score, which is {match + int(repMatch/2) -
> misMatch - qNumInsert - tNumInsert} as I understand. What does this mean?
>
>
> 2. What can be a better threshold to filter hits from multiple queries?
>
> Percent ID and score are two criteria to assess a BLAT hit, but I found
> it's hard to use either of these alone to define a threshold for filtering
> short hits from multiple queries. Obviously, 95% ID (for example) alone
> is not correct, since some hits are very short, but 100% matched.
> Meanwhile, the score seems to be an absolute value for each query. It's not
> possible to define a common score for all queries (which have different
> lengths themselves). I am wondering if we can use, say
>     score / querySize
> to define a common threshold (for example 95%) to filter out those hits
> with small score. I was also thinking of using the highest score (for each
> query) as a reference to filter out hits with a more-than-threshold
> decreased score. For example, all hits with score 60% lower than the
> highest score will be removed. The highest score is calculated for each
> query.
>
> Does anyone have experience with this problem?
>
> Thanks in advance
>
> Xianjun
> _______________________________________________
> Genome maillist - Genome@soe.ucsc.edu
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>
_______________________________________________
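For readers who want to reproduce the numbers outside the browser, here is a minimal Python sketch of the percent ID and score discussed in this thread. The `total` and `milliBad` lines follow the two C statements quoted in the reply, and the score follows the formula stated in the question. How `sizeDif` and `insertFactor` are derived is not shown in the thread; the versions below are an assumption based on the pslCalcMilliBad listing in the BLAT FAQ, so check them against the FAQ before relying on them. The `Psl` class and the function names are hypothetical, for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Psl:
    """The subset of PSL columns used here (hypothetical container)."""
    match: int
    mis_match: int
    rep_match: int
    q_num_insert: int
    t_num_insert: int
    q_start: int
    q_end: int
    t_start: int
    t_end: int


def milli_bad(psl, is_mrna=True, size_mul=1):
    """Parts-per-thousand "badness"; size_mul is 1 for DNA, 3 for protein."""
    q_ali = size_mul * (psl.q_end - psl.q_start)
    t_ali = psl.t_end - psl.t_start
    if min(q_ali, t_ali) <= 0:
        return 0
    # Assumption (from the FAQ listing): for mRNA a longer target span is
    # treated as introns and not penalized.
    size_dif = q_ali - t_ali
    if size_dif < 0:
        size_dif = 0 if is_mrna else -size_dif
    # Assumption (from the FAQ listing): target gaps only count when the
    # query is not mRNA.
    insert_factor = psl.q_num_insert + (0 if is_mrna else psl.t_num_insert)
    total = size_mul * (psl.match + psl.rep_match + psl.mis_match)
    if total == 0:
        return 0
    # floor(x + 0.5) mimics C round() for the non-negative values seen here.
    size_penalty = math.floor(3 * math.log(1 + size_dif) + 0.5)
    bad = psl.mis_match * size_mul + insert_factor + size_penalty
    return int(1000 * bad / total)  # truncate, as assigning to a C int would


def percent_identity(psl, is_mrna=True):
    return 100.0 - milli_bad(psl, is_mrna) * 0.1


def score(psl, size_mul=1):
    """Matches raise the score; mismatches and gap counts lower it."""
    return (size_mul * (psl.match + psl.rep_match // 2)
            - size_mul * psl.mis_match
            - psl.q_num_insert - psl.t_num_insert)


# 100 aligned bases with 5 mismatches and no gaps.
plain = Psl(95, 5, 0, 0, 0, 0, 100, 1000, 1100)
# 100 perfectly matching bases split by one 5 kb intron on the target.
spliced = Psl(100, 0, 0, 0, 1, 0, 100, 0, 5100)
print(percent_identity(plain), score(plain))
print(percent_identity(spliced), score(spliced))
```

The second example illustrates the point about introns: the gap is counted once (tNumInsert = 1) regardless of its length, so the score drops by one and, under the mRNA assumption, the percent ID is unaffected.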