Blat percent identity and score calculations

Replicating web-based Blat percent identity and score calculations
 
 

Question:
"Using my own command-line Blat server, how can I replicate the percent identity and score calculations produced by web-based Blat?"

Response:
There isn't an option to command-line Blat that gives you the percent ID and the score. Instead, you will have to write your own program to produce the calculations, incorporating some of the functions from the Genome Browser source code.

To calculate the percent ID, incorporate the following code and function into a program that processes your Blat PSL output. The parameter isMrna should be set to TRUE, regardless of whether the input sequence is mRNA or protein.

The percent identity score is calculated like this:

 

100.0 - pslCalcMilliBad(psl, TRUE) * 0.1

Here is the source for pslCalcMilliBad:

 

int pslCalcMilliBad(struct psl *psl, boolean isMrna) /* Calculate badness in parts per thousand. */ { int sizeMul = pslIsProtein(psl) ? 3 : 1; int qAliSize, tAliSize, aliSize; int milliBad = 0; int sizeDif; int insertFactor; int total; qAliSize = sizeMul * (psl->qEnd - psl->qStart); tAliSize = psl->tEnd - psl->tStart; aliSize = min(qAliSize, tAliSize); if (aliSize <= 0) return 0; sizeDif = qAliSize - tAliSize; if (sizeDif < 0) { if (isMrna) sizeDif = 0; else sizeDif = -sizeDif; } insertFactor = psl->qNumInsert; if (!isMrna) insertFactor += psl->tNumInsert; total = (sizeMul * (psl->match + psl->repMatch + psl->misMatch)); if (total != 0) milliBad = (1000 * (psl->misMatch*sizeMul + insertFactor + round(3*log(1+sizeDif)))) / total; return milliBad; }

The complexity in milliBad arises primarily from how it handles inserts. Ignoring the inserts, the calculation is simply mismatches expressed as parts per thousand. However, the algorithm factors in insertion penalties as well, which are relatively weak compared to say blasts but still present. When huge inserts are allowed (which is necessary to accommodate introns), it is typically necessary to resort to logarithms like this calculation does.

The pslIsProtein function called by pslCalcMilliBad is:

boolean pslIsProtein(const struct psl *psl) /* is psl a protein psl (are it's blockSizes and scores in protein space) */ { int lastBlock = psl->blockCount - 1; return (((psl->strand[1] == '+' ) && (psl->tEnd == psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock])) || ((psl->strand[1] == '-') && (psl->tStart == (psl->tSize-(psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock]))))); }

This function automatically determines whether or not the PSL output file contains alignment information for a protein query. Alternatively, you could write the program such that the user specifies if the query is a protein or not.

The score calculation is generated by the following function:

int pslScore(const struct psl *psl) /* Return score for psl. */ { int sizeMul = pslIsProtein(psl) ? 3 : 1; return sizeMul * (psl->match + ( psl->repMatch>>1)) - sizeMul * psl->misMatch - psl->qNumInsert - psl->tNumInsert; }

For help with creating a C program to do perform these calculations, you may want to use the libraries from the Genome Browser source code. See our FAQ on source code licensing and downloads for information on obtaining the source. The file kent/src/lib/psl.c contains the pslCalcMilliBad, pslIsProtein and pslScore functions and also a useful function called pslLoadAll that loads the psl file into a linked list structure. The definition of the psl struct can be found in kent/src/inc/psl.h.

 

 

 

 

perl:

 

#!/usr/bin/perl
use strict;
open(IN,"try.psl") or die "Cannot open file gi"; 
open(OUT, "> out") or die "Cannot create file!";

while(<IN>) {
        chomp $_;
        my @v = split(//t/,$_);
        get_pid(@_);

}
sub get_pid {
         my @line = @_;
         my $pid = (100.0 - (&pslCalcMilliBad(@line) * 0.1));
         print "The percentage: $pid/n";
         return $pid;
}

sub pslCalcMilliBad {
         my @cols = @_;

         # sizeMul depens of dna/Prot
         my $sizeMul = 1;                  #核酸等于1,蛋白质等于3
        
     # cols[0]  matches
         # cols[1]  misMatches
         # cols[2]  repMaches
         # cols[4]  qNumInsert
         # cols[6]  tNumInsert
         # cols[11] qStart
         # cols[12] qEnd
         # cols[15] tStart
         # cols[16] tEnd

         my $qAliSize = $sizeMul * ($cols[12] - $cols[11]);
         my $tAliSize = $cols[16] - $cols[15];

         # I want the minimum of qAliSize and tAliSize
         my $aliSize;
         $qAliSize < $tAliSize ? $aliSize = $qAliSize : $aliSize =  
$tAliSize;

         # return 0 is AliSize == 0
         return 0 if ($aliSize <= 0);

         # size diff
         my $sizeDiff = $qAliSize - $tAliSize;
         if ($sizeDiff < 0) {
             $sizeDiff = 0;  
         }

         # insert Factor
         my $insertFactor = $cols[4]; 

# $insertFactor += $cols[6]; # 若为蛋白质

         my $milliBad = (1000 * ($cols[1]*$sizeMul + $insertFactor +  
&round(3*log( 1 + $sizeDiff)))) / ($sizeMul * ($cols[0] + $cols[2]
+ $cols[1]));
                return $milliBad;

}

sub round {
         my $number = shift;
         return int($number + .5);

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
好的,这是一个比较复杂的程序,需要用到许多库和数学公式,我将逐步给出程序实现的步骤和代码示例。 首先,我们需要导入所需的库,包括math、numpy、matplotlib等。其中,math库提供了一些常用的数学函数,numpy库提供了高性能的数组运算,matplotlib库用于绘制图形。 ```python import math import numpy as np import matplotlib.pyplot as plt ``` 接下来,我们需要定义一些常量和参数,包括地球半径、经纬度转换参数、A点和B点的经纬度和海拔等信息。 ```python # 地球半径,单位为千米 EARTH_RADIUS = 6371.0 # 经纬度转换参数 DEG2RAD = math.pi / 180.0 RAD2DEG = 180.0 / math.pi # A点经纬度和海拔 ALAT, ALON, AALT = 30.0, 120.0, 50.0 # B点经纬度和海拔 BLAT, BLON, BALT = 31.0, 121.0, 100.0 # A点检测距离,单位为千米 DISTANCE = 200.0 # 方位辐射角度,单位为度 AZIMUTH_ANGLE = 40.0 # 俯仰角度,单位为度 ELEVATION_ANGLE = 15.0 ``` 接下来,我们需要定义一些辅助函数,包括经纬度转换函数、距离计算函数、方位角计算函数、俯仰角计算函数等。 ```python def haversine(lat1, lon1, lat2, lon2): """ 计算两点之间的距离 :param lat1: 第一个点的纬度,单位为度 :param lon1: 第一个点的经度,单位为度 :param lat2: 第二个点的纬度,单位为度 :param lon2: 第二个点的经度,单位为度 :return: 两点之间的距离,单位为千米 """ dlat = math.radians(lat2 - lat1) dlon = math.radians(lon2 - lon1) a = math.sin(dlat / 2) ** 2 + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2 c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)) return EARTH_RADIUS * c def bearing(lat1, lon1, lat2, lon2): """ 计算两点之间的方位角 :param lat1: 第一个点的纬度,单位为度 :param lon1: 第一个点的经度,单位为度 :param lat2: 第二个点的纬度,单位为度 :param lon2: 第二个点的经度,单位为度 :return: 两点之间的方位角,单位为度 """ dlat = math.radians(lat2 - lat1) dlon = math.radians(lon2 - lon1) y = math.sin(dlon) * math.cos(math.radians(lat2)) x = math.cos(math.radians(lat1)) * math.sin(math.radians(lat2)) - math.sin(math.radians(lat1)) * \ math.cos(math.radians(lat2)) * math.cos(dlon) return math.atan2(y, x) * RAD2DEG def elevation(lat1, lon1, alt1, lat2, lon2, alt2): """ 计算两点之间的俯仰角 :param lat1: 第一个点的纬度,单位为度 :param lon1: 第一个点的经度,单位为度 :param alt1: 第一个点的海拔,单位为千米 :param lat2: 第二个点的纬度,单位为度 :param lon2: 第二个点的经度,单位为度 :param alt2: 第二个点的海拔,单位为千米 :return: 两点之间的俯仰角,单位为度 """ dist = haversine(lat1, lon1, lat2, lon2) elv = math.atan2(alt2 - alt1, dist) return elv * RAD2DEG ``` 接下来,我们需要根据A点的经纬度和海拔,计算出A点在图上的坐标,并用五角星表示。 ```python # 计算A点在图上的坐标 ax, ay = np.array([ALON, ALAT]) * DEG2RAD ax, ay = EARTH_RADIUS * math.cos(ay) * math.sin(ax), EARTH_RADIUS * math.sin(ay) ax, ay = ax / 1000.0, ay / 1000.0 # 绘制A点 plt.plot(ax, ay, marker='*', markersize=10, markerfacecolor='yellow', markeredgecolor='black') ``` 接下来,我们需要根据B点的经纬度和海拔,计算出B点在图上的坐标,并用三角号表示。 ```python # 计算B点在图上的坐标 bx, by = np.array([BLON, BLAT]) * DEG2RAD bx, by = EARTH_RADIUS * math.cos(by) * math.sin(bx), EARTH_RADIUS * math.sin(by) bx, by = bx / 1000.0, by / 1000.0 # 绘制B点 plt.plot(bx, by, marker='^', markersize=10, markerfacecolor='green', markeredgecolor='black') ``` 接下来,我们需要根据A点的检测距离、方位辐射角度和俯仰角度,计算出扇形阴影面的边界,并在图上绘制出来。 ```python # 计算扇形阴影面的边界 theta = np.linspace(-AZIMUTH_ANGLE / 2, AZIMUTH_ANGLE / 2, 50) * DEG2RAD phi = np.linspace(0, ELEVATION_ANGLE, 50) * DEG2RAD r = DISTANCE * 1000.0 x = r * np.outer(np.sin(theta), np.cos(phi)) + ax y = r * np.outer(np.cos(theta), np.cos(phi)) + ay z = r * np.outer(np.ones(np.size(theta)), np.sin(phi)) # 绘制扇形阴影面 ax = plt.gca(projection='3d') ax.plot_surface(x, y, z, color='gray', alpha=0.5) ``` 最后,我们需要设置一些图形参数,如坐标轴范围、标题、标签等,并显示图形。 ```python # 设置坐标轴范围 plt.xlim([min(ax, bx) - 1, max(ax, bx) + 1]) plt.ylim([min(ay, by) - 1, max(ay, by) + 1]) # 设置标题和标签 plt.title('Coordinate System') plt.xlabel('Longitude (km)') plt.ylabel('Latitude (km)') # 显示图形 plt.show() ``` 完整代码如下:

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值