Replicating web-based Blat percent identity and score calculations |
Question: Response: To calculate the percent ID, incorporate the following code and function into a program that processes your Blat PSL output. The parameter isMrna should be set to TRUE, regardless of whether the input sequence is mRNA or protein. The percent identity score is calculated like this:
100.0 - pslCalcMilliBad(psl, TRUE) * 0.1 Here is the source for pslCalcMilliBad:
int pslCalcMilliBad(struct psl *psl, boolean isMrna) /* Calculate badness in parts per thousand. */ { int sizeMul = pslIsProtein(psl) ? 3 : 1; int qAliSize, tAliSize, aliSize; int milliBad = 0; int sizeDif; int insertFactor; int total; qAliSize = sizeMul * (psl->qEnd - psl->qStart); tAliSize = psl->tEnd - psl->tStart; aliSize = min(qAliSize, tAliSize); if (aliSize <= 0) return 0; sizeDif = qAliSize - tAliSize; if (sizeDif < 0) { if (isMrna) sizeDif = 0; else sizeDif = -sizeDif; } insertFactor = psl->qNumInsert; if (!isMrna) insertFactor += psl->tNumInsert; total = (sizeMul * (psl->match + psl->repMatch + psl->misMatch)); if (total != 0) milliBad = (1000 * (psl->misMatch*sizeMul + insertFactor + round(3*log(1+sizeDif)))) / total; return milliBad; } The complexity in milliBad arises primarily from how it handles inserts. Ignoring the inserts, the calculation is simply mismatches expressed as parts per thousand. However, the algorithm factors in insertion penalties as well, which are relatively weak compared to say blasts but still present. When huge inserts are allowed (which is necessary to accommodate introns), it is typically necessary to resort to logarithms like this calculation does. The pslIsProtein function called by pslCalcMilliBad is: boolean pslIsProtein(const struct psl *psl) /* is psl a protein psl (are it's blockSizes and scores in protein space) */ { int lastBlock = psl->blockCount - 1; return (((psl->strand[1] == '+' ) && (psl->tEnd == psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock])) || ((psl->strand[1] == '-') && (psl->tStart == (psl->tSize-(psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock]))))); } This function automatically determines whether or not the PSL output file contains alignment information for a protein query. Alternatively, you could write the program such that the user specifies if the query is a protein or not. The score calculation is generated by the following function: int pslScore(const struct psl *psl) /* Return score for psl. */ { int sizeMul = pslIsProtein(psl) ? 3 : 1; return sizeMul * (psl->match + ( psl->repMatch>>1)) - sizeMul * psl->misMatch - psl->qNumInsert - psl->tNumInsert; } For help with creating a C program to do perform these calculations, you may want to use the libraries from the Genome Browser source code. See our FAQ on source code licensing and downloads for information on obtaining the source. The file kent/src/lib/psl.c contains the pslCalcMilliBad, pslIsProtein and pslScore functions and also a useful function called pslLoadAll that loads the psl file into a linked list structure. The definition of the psl struct can be found in kent/src/inc/psl.h.
perl:
#!/usr/bin/perl # $insertFactor += $cols[6]; # 若为蛋白质 my $milliBad = (1000 * ($cols[1]*$sizeMul + $insertFactor + |