Rosalind Exercise 28

# Problem

# An array is a structure containing an ordered collection of objects (numbers, strings, other arrays, etc.). We let A[k] denote the k-th value in array A. You may like to think of an array as simply a matrix having only one row.

# A random string is constructed so that the probability of choosing each subsequent symbol is based on a fixed underlying symbol frequency.

# GC-content offers us natural symbol frequencies for constructing random DNA strings. If the GC-content is x, then we set the symbol frequencies of C and G equal to x/2 and the symbol frequencies of A and T equal to (1−x)/2. For example, if the GC-content is 40%, then as we construct the string, the next symbol is 'G'/'C' with probability 0.2, and the next symbol is 'A'/'T' with probability 0.3.
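To make the frequency rule concrete, here is a minimal sketch that maps a GC-content value to per-symbol probabilities. The helper name `symbol_probabilities` is made up for illustration and is not part of the problem statement.

```python
def symbol_probabilities(gc_content):
    """Return per-symbol probabilities for a given GC-content (hypothetical helper)."""
    gc = gc_content / 2          # probability of 'G', and likewise of 'C'
    at = (1 - gc_content) / 2    # probability of 'A', and likewise of 'T'
    return {'A': at, 'C': gc, 'G': gc, 'T': at}

# The 40% example from the text: 'G'/'C' get 0.2 each, 'A'/'T' get 0.3 each
probs = symbol_probabilities(0.4)
print(probs)
```

The four probabilities always sum to 1, which is a quick sanity check on any GC-content value.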

# In practice, many probabilities wind up being very small. In order to work with small probabilities, we may plug them into a function that "blows them up" for the sake of comparison. Specifically, the common logarithm of x (defined for x>0 and denoted log10(x)) is the exponent to which we must raise 10 to obtain x.

# See Figure 1 for a graph of the common logarithm function y=log10(x). In this graph, we can see that the logarithm of x-values between 0 and 1 always winds up mapping to y-values between −∞ and 0: x-values near 0 have logarithms close to −∞, and x-values close to 1 have logarithms close to 0. Thus, we will select the common logarithm as our function to "blow up" small probability values for comparison.
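A short sketch of why logarithms help in practice: a product of many small probabilities can drop below the smallest positive floating-point number, while the sum of their logarithms stays easy to represent. The all-'G' string and the GC-content of 0.001 are assumptions chosen to sit at the edge of the problem's limits, not values from the dataset.

```python
import math

p = 0.001 / 2   # per-symbol probability of 'G' when the GC-content is 0.001
n = 100         # length of a hypothetical all-'G' string

direct = p ** n                 # the true value is far below the smallest positive float
log_sum = n * math.log10(p)     # log10(p ** n) == n * log10(p)

print(direct)    # underflows to 0.0, so math.log10(direct) would raise ValueError
print(log_sum)
```

Working in log space from the start avoids this failure mode entirely.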

# Given: A DNA string s of length at most 100 bp and an array A containing at most 20 numbers between 0 and 1.

# Return: An array B having the same length as A in which B[k] represents the common logarithm of the probability that a random string constructed with the GC-content found in A[k] will match s exactly.
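The quantity to return can be written out explicitly, assuming each symbol of the random string is drawn independently. The counts n_AT (number of A and T symbols in s) and n_GC (number of G and C symbols in s) are notation introduced here, not in the original statement.

```latex
\begin{aligned}
P(s \mid x) &= \left(\frac{1-x}{2}\right)^{n_{AT}} \left(\frac{x}{2}\right)^{n_{GC}}, \\
B[k] = \log_{10} P\bigl(s \mid A[k]\bigr) &= n_{AT}\,\log_{10}\frac{1-A[k]}{2} + n_{GC}\,\log_{10}\frac{A[k]}{2}.
\end{aligned}
```

For the sample string ACGATACAA, n_AT = 6 and n_GC = 3, so a GC-content of 0.129 gives 6·log10(0.4355) + 3·log10(0.0645) ≈ −5.737.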

# Sample Dataset

# ACGATACAA

# 0.129 0.287 0.423 0.476 0.641 0.742 0.783

# Sample Output

# -5.737 -5.217 -5.263 -5.360 -5.958 -6.628 -7.009

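# In short: given a DNA string s of at most 100 bp and an array A of at most 20 numbers between 0 and 1, return an array B of the same length as A, where B[k] is the common logarithm of the probability that a random string built with the GC-content A[k] matches s exactly.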

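import math

def compute_log(s, A):
    # Count how often each base occurs in s
    freq = {'A': 0, 'T': 0, 'G': 0, 'C': 0}
    for base in s:
        freq[base] += 1

    # Result array B
    B = []

    # Compute the log-probability for each GC-content value
    for gc_content in A:
        # Per-symbol probabilities: A and T each get (1 - x) / 2, G and C each get x / 2
        at_prob = (1 - gc_content) / 2
        gc_prob = gc_content / 2

        # Sum the logarithms instead of multiplying the probabilities,
        # so the product cannot underflow to 0 for extreme GC-content values
        log_prob = ((freq['A'] + freq['T']) * math.log10(at_prob)
                    + (freq['G'] + freq['C']) * math.log10(gc_prob))
        B.append(log_prob)

    return B

# Test with the sample dataset
s = "ACGATACAA"
A = [0.129, 0.287, 0.423, 0.476, 0.641, 0.742, 0.783]
result = compute_log(s, A)

# Print three decimals, space-separated, to match the sample output format
print(" ".join(f"{x:.3f}" for x in result))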
