# Problem
# A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, "CG" is a common substring of "ACGTACGT" and "AACCGTATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "ACGTACGT" and "AACCGTATA".
# Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".
#
# Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.
# Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)
#
# Sample Dataset
# >Rosalind_1
# GATTACA
# >Rosalind_2
# TAGACCA
# >Rosalind_3
# ATACA
# Sample Output
# AC
# 题目要求我们找到一个 DNA 字符串集合中的最长公共子串,即motif。
def find_common_motif(strings):
shortest_string = min(strings, key=len)
for length in range(len(shortest_string), 0, -1):
for start in range(len(shortest_string) - length + 1):
substring = shortest_string[start:start + length]
if all(substring in string for string in strings):
return substring
return ""
# 读入数据
strings = []
with open("input4.txt", "r") as f:
current_string = ""
for line in f:
if line.startswith(">"): # FASTA格式下一行为数据描述,跳过
continue
current_string += line.strip()
if current_string[-1] == "": # 处理最后一个字符为空格的情况
current_string = current_string[:-1]
strings.append(current_string)
current_string = ""
# 寻找最长公共子串并打印输出
result = find_common_motif(strings)
print(result)