Rosalind Java| Finding a Shared Motif_rosalind find a share-CSDN博客

本文链接：https://blog.csdn.net/m0_64240043/article/details/123027907

Rosalind编程问题之寻找共有的motif。

Finding a Shared Motif

Problem
A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.
Sample input：

Rosalind_1
GATTACA
Rosalind_2
TAGACCA
Rosalind_3
ATACA

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

AC (此处如果输出时TA也可以)

这道问题需要我们读取十条FASTA格式的碱基序列，并且输出最长的motif（如果最长的共有motif有很多条，则输出任一最长的motif即可）。解决这项问题首先要明确处理步骤！！不然会浪费宝贵的计算资源。

我们的输入文件包括了十条DNA序列，并且是fasta格式的文件。为了避免手动输入十条序列（因为懒）我们需要首先定义两个方法：**读取fasta格式文件为文本文件；从文本文件中分离的到碱基序列。**在此之上，我们才能拿到十条DNA序列，并进行后继处理。

下面重点来了：关键问题在于最长的motif怎么找？
在十条序列中，最长的共有motif再长也不会长过最短的输入序列。因此我们要从输入文件中找到最短的fasta序列作为突破口。因此我们需要定义方法获取十条序列中的最短序列。再在最短序列中逐一缩短碱基长度，遍历最短序列，获取最短序列的子序列。将获得的一系列子序列在剩余九条序列中逐一比对。如果有一条子序列在剩余九条序列中全部match成功，那么这条子序列就是我们要的共有motif，随即结束遍历最短序列。

我们梳理一下全部流程方法：
1.下载并读取Rosalind输入文件为字符串格式。
2.提取输入文件中的碱基序列，去掉Rosalind标签。
3.获取碱基序列中的最短序列。
4.遍历最短序列的子序列。
5.将子序列与剩余序列进行比对，全部比对成功即为shared motif。

下面是代码流程：

子方法1：读取rosalind下载的文本文件

传入的字符串类型变量String fileName是文件rosalind_lcsm.txt所在的地址，例如：C:/Users/Administrator/Desktop/rosalind_lcsm.txt

    //method1.读取文本文件
    public static String readFileContent(String fileName) {
        File file = new File(fileName);
        BufferedReader reader = null;
        StringBuffer sbf = new StringBuffer();
        try {
            reader = new BufferedReader(new FileReader(file));
            String tempStr;
            while ((tempStr = reader.readLine()) != null) {
                sbf.append(tempStr);
            }
            reader.close();
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
        return sbf.toString();
    }

子方法2：移除fasta格式rosalind标签行并保存为数组形式

    //method2.rosalind移除fasta格式的核酸标签行，并保存为数组形式
    public static String[] removeIndexRosalind(String[] arr) {
        //需要删除的数的索引index
        int index = 0;
        //定义存储删除元素后的arr的新数组newArr
        String[] newArr = new String[arr.length - 1];
        //定义新数组的索引newArr
        int j = 0;
        //for循环将未删除的元素按原保存顺序保存到新数组
        for (int i = 0; i < arr.length; i++) {
            //判断数组arr中需删除的索引index与循环到的i是否相同,不同时持续赋值j++,相同时则跳过继续比较
            if (index != i) {
                //index不等于i持续赋值
                newArr[j] = arr[i];
                //newArr[]向后移动一个元素
                j++;
            }
        }
        //输出新数组newArr
        return newArr;

    }

经过上述两个方法处理的Rosalind输入文件，会在内存中形成一个数组newArr。其中保存了十条核酸序列。那么下面我们就需要定义方法来获取并遍历十条序列中的最短序列。

子方法3：输出核酸集合中最短的序列

输入数据是刚刚得到的数组类型文件String[] NEWDNA，里面保存了十条碱基序列。可以通过遍历数组获取某一索引下元素的字符串长度。比较十条序列的长度即可得到最短序列。

    //method3.输出核酸集合中最短的序列
    public static String findmin(String[] NEWDNA) {
        for (String s1 : NEWDNA) {
            int count = 0;
            for (String s2 : NEWDNA) {
                if (s1.length() <= s2.length()) {
                    count++;
                }
                if (count == NEWDNA.length) {
                    String seq = s1;
                    return seq;
                }
            }
        }
        return "None";
    }

子方法4：遍历得到序列的子序列，并进行比对

运行完上一步之后，我们就可以获得子序列中的最短序列，下面我们要遍历这条最短序列并与剩余序列进行比对。因此要向方法四中传入最短序列String seq，和保存有剩余DNA序列的数组String[] DNA。

此外，我们需要定义方法判断子序列是否出现在其余的核酸序列，我们可以参考上一篇博文。在此之上稍加修改即可判断motif存在与否。

   //method4.输出共有的motif并进行比对
    public static String shared_motif(String seq, String[] DNA) {
        //1.找到所有序列中长度最小的序列seq
        for (int k = seq.length(); k > 0; k--) {
            for (int i = 0; i < seq.length() - k + 1; i++) {
                //2.获取最短序列的子序列
                String frag = seq.substring(i, i + k);
                int count = 0;
                //3.在其余序列中比较上述子序列
                for (String s : DNA) {
                    if (!findMotif(s, frag)) {
                        break;//一旦不匹配就终止！！此法能大量节省计算资源。
                    } else {
                        count++;
                        //4.如果在其余所有序列中都检索到，则返回此子序列
                        if (count == DNA.length) {
                            return frag;
                        }
                    }
                }
            }
        }
        return "无公共motif";
    }


    //定义一个检索motif方法，检索到则返回true，反之false
    public static boolean findMotif(String DNA, String motif) {
        //1.总共要比对DNA.length()-motif.length()+1次，故为循环次数
        for (int i = 0; i < DNA.length() - motif.length() + 1; i++) {
            //2.切割DNA，一旦切割后的序列匹配于motif则输出位置
            if (DNA.substring(i, i + motif.length()).equals(motif)) {
                return true;
            }
        }
        return false;
    }

main方法

最终我们将四个子方法连接在一起形成main方法（附上全部子方法）。

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class Finding_a_shared_motif {
    public static void main(String[] args) {
        //1.读取fasta格式数据为字符串类型
        String s = readFileContent("C:/Users/Administrator/Desktop/rosalind_lcsm.txt");
        //2.按照数据特点切分字符串并将切割后的序列存储入数组中
        String[] DNA = s.split(">Rosalind_\\d*");
        String[] NEWDNA = removeIndexRosalind(DNA);

        //4.遍历NEWDNA集合并输出长度最短的字符
        String seq = findmin(NEWDNA);
        System.out.println("最短核酸序列为：" + seq);

        //5.比对公共motif
        String sharedMotif = shared_motif(seq, NEWDNA);
        System.out.println("公共motif为：" + sharedMotif);
    }

    //method4.输出共有的motif
    public static String shared_motif(String seq, String[] DNA) {
        //1.找到所有序列中长度最小的序列seq
        for (int k = seq.length(); k > 0; k--) {
            for (int i = 0; i < seq.length() - k + 1; i++) {
                //2.获取最短序列的子序列
                String frag = seq.substring(i, i + k);
                int count = 0;
                //3.在其余序列中比较上述子序列
                for (String s : DNA) {
                    if (!findMotif(s, frag)) {
                        break;
                    } else {
                        count++;
                        //4.如果在其余所有序列中都检索到，则返回此子序列
                        if (count == DNA.length) {
                            return frag;
                        }
                    }
                }
            }
        }
        return "无公共motif";
    }


    //method3.输出核酸集合中最短的序列
    public static String findmin(String[] NEWDNA) {
        for (String s1 : NEWDNA) {
            int count = 0;
            for (String s2 : NEWDNA) {
                if (s1.length() <= s2.length()) {
                    count++;
                }
                if (count == NEWDNA.length) {
                    String seq = s1;
                    return seq;
                }
            }
        }
        return "None";
    }

    //copy上一章节的findMotif方法，并改为输出布尔类型数据。
    public static boolean findMotif(String DNA, String motif) {
        //1.总共要比对DNA.length()-motif.length()+1次，故为循环次数
        for (int i = 0; i < DNA.length() - motif.length() + 1; i++) {
            //2.切割DNA，一旦切割后的序列匹配于motif则输出位置
            if (DNA.substring(i, i + motif.length()).equals(motif)) {
                return true;
            }
        }
        return false;
    }

    //method1.读取文本文件
    public static String readFileContent(String fileName) {
        File file = new File(fileName);
        BufferedReader reader = null;
        StringBuffer sbf = new StringBuffer();
        try {
            reader = new BufferedReader(new FileReader(file));
            String tempStr;
            while ((tempStr = reader.readLine()) != null) {
                sbf.append(tempStr);
            }
            reader.close();
            return sbf.toString();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
        return sbf.toString();
    }

    //method2.rosalind移除fasta格式的核酸标签行，并保存为数组形式
    public static String[] removeIndexRosalind(String[] arr) {
        //需要删除的数的索引index
        int index = 0;
        //定义存储删除元素后的arr的新数组newArr
        String[] newArr = new String[arr.length - 1];
        //定义新数组的索引newArr
        int j = 0;
        //for循环将未删除的元素按原保存顺序保存到新数组
        for (int i = 0; i < arr.length; i++) {
            //判断数组arr中需删除的索引index与循环到的i是否相同,不同时持续赋值j++,相同时则跳过继续比较
            if (index != i) {
                //index不等于i持续赋值
                newArr[j] = arr[i];
                //newArr[]向后移动一个元素
                j++;
            }
        }
        //输出新数组newArr
        return newArr;
    }
}