Biopython Chinese Guide

When you hear the word Biopython, what is the first thing that comes to mind? A Python library for handling biological data? You are correct! Biopython provides a set of tools for bioinformatics computations on biological data such as DNA and protein sequences. I have been using Biopython ever since I started studying bioinformatics, and it has never let me down. It is a great library that covers a wide range of tasks, from reading large files of biological data to aligning sequences. In this article, I will introduce some basic Biopython functions that can reduce an implementation to a single call.

Getting started

At the time of writing, the latest available version is biopython-1.77, released in May 2020.

You can install Biopython using pip:

pip install biopython

or using conda:

conda install -c conda-forge biopython

You can test whether Biopython is properly installed by executing the following line in the Python interpreter:

import Bio

If you get an error such as ImportError: No module named Bio then you haven’t installed Biopython properly in your working environment. If no error messages appear, we are good to go.
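
If you would rather check from a script than from the interpreter, the following is a minimal sketch that uses only the standard library to look for the package and prints its version when it is found. The variable name biopython_installed is my own choice for illustration.

```python
import importlib.util

# find_spec returns None when the package cannot be located
spec = importlib.util.find_spec("Bio")
biopython_installed = spec is not None

if biopython_installed:
    import Bio
    print("Biopython version:", Bio.__version__)
else:
    print("Biopython is not installed in this environment")
```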

In this article, I will be walking you through some examples where Seq, SeqRecord and SeqIO come in handy. We will go through the functions that perform the following tasks.

Creating a sequence
Get the reverse complement of a sequence
Count the number of occurrences of a nucleotide
Find the starting index of a subsequence
Reading a sequence file
Writing sequences to a file
Convert a FASTQ file to FASTA file
Separate sequences by ids from a list of ids

1. Creating a sequence

To create your own sequence, you can use the Biopython Seq object. Here is an example.

>>> from Bio.Seq import Seq
>>> my_sequence = Seq("ATGACGTTGCATG")
>>> print("The sequence is", my_sequence)
The sequence is ATGACGTTGCATG
>>> print("The length of the sequence is", len(my_sequence))
The length of the sequence is 13

2. Get the reverse complement of a sequence

You can get the reverse complement of a sequence with a single call to reverse_complement().

>>> print("The reverse complement of the sequence is", my_sequence.reverse_complement())
The reverse complement of the sequence is CATGCAACGTCAT
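
To make clear what this call computes, here is a plain-Python sketch of the same operation: swap each base for its partner, then reverse the string. It is an illustration for unambiguous DNA only, not Biopython's implementation, and the names COMPLEMENT and reverse_complement below are my own.

```python
# Each base mapped to its Watson-Crick partner (unambiguous DNA only)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGACGTTGCATG"))  # CATGCAACGTCAT
```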

3. Count the number of occurrences of a nucleotide

You can get the number of occurrences of a particular nucleotide using the count() function.

>>> print("The number of As in the sequence", my_sequence.count("A"))
The number of As in the sequence 3
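
count() answers the question for one nucleotide at a time. If you want every base tallied in a single pass, a plain-Python sketch with collections.Counter works on the same string. The GC fraction computed at the end is my own addition for illustration.

```python
from collections import Counter

dna = "ATGACGTTGCATG"

# Tally every character of the sequence in one pass
base_counts = Counter(dna)
print(dict(base_counts))

# Fraction of the sequence made up of G and C
gc_fraction = (base_counts["G"] + base_counts["C"]) / len(dna)
print(round(gc_fraction, 3))  # 0.462
```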

4. Find the starting index of a subsequence

You can find the starting index of a subsequence using the find() function.

>>> print("Found TTG in the sequence at index", my_sequence.find("TTG"))
Found TTG in the sequence at index 6
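
find() follows the behaviour of Python's own str.find(): it reports only the first match and returns -1 when the subsequence is absent. The sketch below uses a plain string to show both cases and adds a small helper, find_all (a hypothetical name of my own, not a Biopython function), that collects every starting index.

```python
def find_all(sequence, subsequence):
    """Return the start index of every occurrence, overlapping ones included."""
    positions = []
    start = sequence.find(subsequence)
    while start != -1:  # find() returns -1 once nothing more is found
        positions.append(start)
        start = sequence.find(subsequence, start + 1)
    return positions

dna = "ATGACGTTGCATG"
print(dna.find("TTG"))       # 6
print(dna.find("CCC"))       # -1
print(find_all(dna, "ATG"))  # [0, 10]
```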

5. Reading a sequence file

Biopython’s SeqIO (Sequence Input/Output) interface can be used to read sequence files. The parse() function takes a file (a file handle or filename) together with a format name and returns an iterator of SeqRecord objects. The following is an example of how to read a FASTA file.

from Bio import SeqIO

for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id)

record.id will return the identifier of the sequence. record.seq will return the sequence itself. record.description will return the sequence description.
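
To see these three attributes side by side without needing a file on disk, here is a small sketch that parses FASTA text held in memory through io.StringIO. The two records (seq1, seq2) are made-up examples, not data from this article.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records kept in a string instead of a file
fasta_text = ">seq1 first example\nATGACG\n>seq2 second example\nTTGCATG\n"

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for record in records:
    # For FASTA input, id is the first word of the header line and
    # description is the whole header line
    print(record.id, record.seq, record.description, sep=" | ")
```

Wrapping the iterator in list() is convenient for small inputs; for large files it is usually better to loop over SeqIO.parse() directly so that only one record is held in memory at a time.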

6. Writing sequences to a file

Biopython’s SeqIO (Sequence Input/Output) interface can also be used to write sequences to files. The following is an example where a list of sequences is written to a FASTA file.

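from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Bio.Alphabet (generic_dna) was removed in Biopython 1.78, so each Seq is built from a plain string
sequences = ["AAACGTGG", "TGAACCG", "GGTGCA", "CCAATGCG"]

# Use each sequence's position in the list as its id; an empty description keeps the FASTA header to the id alone
records = (SeqRecord(Seq(seq), id=str(index), description="") for index, seq in enumerate(sequences))

with open("example.fasta", "w") as output_handle:
    SeqIO.write(records, output_handle, "fasta")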

This code will result in a FASTA file with sequence ids starting from 0. If you want to give a custom id and a description you can create the records as follows.

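sequences = ["AAACGTGG", "TGAACCG", "GGTGCA", "CCAATGCG"]
new_sequences = []

# The id pattern and description text are illustrative placeholders; choose your own
for i, sequence in enumerate(sequences, start=1):
    record = SeqRecord(Seq(sequence), id="Seq_" + str(i), description="Example sequence " + str(i))
    new_sequences.append(record)

with open("example.fasta", "w") as output_handle:
    SeqIO.write(new_sequences, output_handle, "fasta")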

The SeqIO.write() function will return the number of sequences written.
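
The return value is handy as a quick sanity check. The following sketch writes two made-up records (demo_1 and demo_2 are hypothetical ids) to an in-memory buffer instead of a file, so you can inspect both the count and the FASTA text that was produced.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records; an empty description keeps the header to the id alone
records = [
    SeqRecord(Seq("AAACGTGG"), id="demo_1", description=""),
    SeqRecord(Seq("TGAACCG"), id="demo_2", description=""),
]

buffer = StringIO()
written = SeqIO.write(records, buffer, "fasta")

print("Records written:", written)
print(buffer.getvalue())
```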

7. Convert a FASTQ file to FASTA file

Some applications require sequence data in a different file format from the one we have. For example, we can convert a FASTQ file to FASTA as follows.

from Bio import SeqIO

with open("path/to/fastq/file.fastq", "r") as input_handle, open("path/to/fasta/file.fasta", "w") as output_handle:
    sequences = SeqIO.parse(input_handle, "fastq")
    count = SeqIO.write(sequences, output_handle, "fasta")

print("Converted %i records" % count)

If you want to convert a GenBank file to FASTA format, pass "genbank" as the input format instead:

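from Bio import SeqIO

# Placeholder paths: substitute your own GenBank input and FASTA output files
with open("path/to/genbank/file.gb", "r") as input_handle, open("path/to/fasta/file.fasta", "w") as output_handle:
    sequences = SeqIO.parse(input_handle, "genbank")
    count = SeqIO.write(sequences, output_handle, "fasta")

print("Converted %i records" % count)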
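
The parse-then-write pattern can also be collapsed into one call with SeqIO.convert(), which takes the input file and format followed by the output file and format, and returns the number of records converted. The sketch below runs it on two made-up FASTQ reads held in memory, so the read names and quality strings are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTQ reads: header, sequence, separator, quality string
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"

fasta_out = StringIO()
count = SeqIO.convert(StringIO(fastq_text), "fastq", fasta_out, "fasta")

print("Converted %i records" % count)
print(fasta_out.getvalue())
```

With real data you can pass filenames in place of the StringIO objects.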

8. Separate sequences by ids from a list of ids

Assume that you have a file named list.lst containing sequence identifiers, one per line, and you want to pull the corresponding sequences out of a FASTQ file. You can run the following to write those sequences to a separate file.

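from Bio import SeqIO

path = "path/to/data/"  # placeholder directory; point this at the folder holding your files

# Read one identifier per line, stripping the trailing newline
with open(path + "list.lst") as id_handle:
    ids = set(line.strip() for line in id_handle)

# Append every record whose id is in the list to the output file
with open(path + "list.fq", mode="a") as my_output:
    for seq in SeqIO.parse(path + "list_sequences.fq", "fastq"):
        if seq.id in ids:
            my_output.write(seq.format("fastq"))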
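
The same filtering can be expressed as a generator handed straight to SeqIO.write(), which also reports how many records were kept. This is a sketch on made-up in-memory data: the read names, the wanted_ids set and the StringIO buffers are all stand-ins for your own files.

```python
from io import StringIO

from Bio import SeqIO

# Three made-up FASTQ reads
fastq_text = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
    "@read3\nGGCC\n+\nIIII\n"
)
wanted_ids = {"read1", "read3"}

# Lazily yield only the records whose id is in the wanted set
selected = (
    record
    for record in SeqIO.parse(StringIO(fastq_text), "fastq")
    if record.id in wanted_ids
)

output = StringIO()
kept = SeqIO.write(selected, output, "fastq")
print("Kept %i records" % kept)
```

Keeping the identifiers in a set makes each membership test fast, which matters when the list of ids is long.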