Rosalind Problem 34: Error Correction in Reads

Latest recommended article published on 2021-01-19 09:55:16

automan_huyaoge

Original link: http://rosalind.info/problems/corr/

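This post shows how to handle sequencing data in bioinformatics: given a FASTA file whose reads may contain single-nucleotide errors, it separates repeated reads from unique ones and uses the Hamming distance to find the correct sequence for each erroneous read. For the sample data the corrections are TTCAT->TTGAT, GAGGA->GATGA, and TTTCC->TTTCA.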

Problem

As is the case with point mutations, the most common type of sequencing error occurs when a single nucleotide from a read is interpreted incorrectly.

Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format. Some of these reads were generated with a single-nucleotide error. For each read in the dataset, one of the following applies:

was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to exactly one correct read in the dataset (or its reverse complement).

Return: A list of all corrections in the form "[old read]->[new read]". (Each correction must be a single symbol substitution, and you may return the corrections in any order.)
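
The two conditions above suggest a direct approach: count every read together with its reverse complement, treat anything seen at least twice as correct, and match each remaining read to the correct sequence at Hamming distance 1. The following is a minimal sketch under that reading of the problem, not necessarily the code used in this post; the names `reverse_complement`, `hamming`, and `find_corrections` are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Complement every base, then reverse the string."""
    return read.translate(_COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_corrections(reads):
    """Return corrections as 'old->new' strings (hypothetical helper)."""
    # Count each read and its reverse complement as the same observation.
    counts = Counter()
    for read in reads:
        counts[read] += 1
        rc = reverse_complement(read)
        if rc != read:  # avoid double-counting a self-complementary read
            counts[rc] += 1

    # Sequences seen at least twice (in either orientation) are correct.
    # Sorting only makes the search order deterministic.
    correct = sorted(seq for seq, n in counts.items() if n >= 2)

    corrections = []
    for read in reads:
        if counts[read] != 1:
            continue  # this read is already correct
        # The problem guarantees exactly one correct neighbour at distance 1.
        match = next((c for c in correct if hamming(read, c) == 1), None)
        if match is not None:
            corrections.append(f"{read}->{match}")
    return corrections


if __name__ == "__main__":
    # Reads taken from the Sample Dataset of this problem.
    sample = ["TCATC", "TTCAT", "TCATC", "TGAAA", "GAGGA",
              "TTTCA", "ATCAA", "TTGAT", "TTTCC"]
    print("\n".join(find_corrections(sample)))
```

Because the set of correct sequences includes reverse complements, a read such as GAGGA can be corrected to GATGA even though only TCATC (the reverse complement of GATGA) appears in the input. With at most 1000 reads of at most 50 bp, the pairwise comparison is cheap enough that no indexing is needed.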

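In short: every read in the dataset is either correct, meaning that it (or its reverse complement) occurs at least twice, or erroneous, meaning that it occurs exactly once and differs at exactly one position from exactly one correct read (or from that read's reverse complement).

The answer lists each erroneous read next to its corrected form as "[old read]->[new read]"; every correction is a single-symbol substitution, and the order does not matter.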

Sample Dataset

>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
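
The dataset above is in FASTA format: each line starting with `>` is a label, and the lines that follow it, up to the next label, make up the sequence. As a minimal sketch (the function name `parse_fasta` is hypothetical, and a library such as Biopython could be used instead), the reads can be extracted like this:

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new label closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        else:
            # A sequence may be wrapped across several lines.
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# The sample dataset from this problem, embedded as a string.
SAMPLE = """\
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
"""

reads = [seq for _, seq in parse_fasta(SAMPLE)]
print(reads)
```

Only the sequences matter for this problem, so the labels are discarded once the records are parsed.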