NCBI assembly_summary_genbank.txt, assembly_summary_refseq.txt和assembly_summary.txt文件内容格式解析

################################################################################

README for the assembly_summary_genbank.txt, assembly_summary_refseq.txt and assembly_summary.txt files found on the NCBI genomes FTP site:  ftp://ftp.ncbi.nlm.nih.gov/genomes

Last updated: February 11, 2020

################################################################################

1. ASSEMBLY SUMMARY FILES

The assembly_summary files report metadata for the genome assemblies on the NCBI genomes FTP site.

Four master files reporting data for either GenBank or RefSeq genome assemblies are available under ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

  • assembly_summary_genbank.txt            - current GenBank genome assemblies
  • assembly_summary_genbank_historical.txt - replaced and suppressed GenBank genome assemblies
  • assembly_summary_refseq.txt             - current RefSeq genome assemblies
  • assembly_summary_refseq_historical.txt  - replaced and suppressed RefSeq genome assemblies

assembly_summary_genbank.txt and assembly_summary_genbank_historical.txt are also available at:

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank_historical.txt

assembly_summary_refseq.txt and assembly_summary_refseq_historical.txt are also available at:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq_historical.txt

The assembly_summary.txt files in the directories named for taxonomic groups or species contain the relevant subsets of the data from the master files.

2. COLUMN SPECIFICATIONS

The assembly_summary.txt files have 22 tab-delimited columns. Header rows begin with '#".

Column  1: "assembly_accession":组装版本号

   Assembly accession: the assembly accession.version reported in this field is a unique identifier for the set of sequences in this particular version of the genome assembly.

  Column  2: "bioproject":BioProject编号

   BioProject: accession for the BioProject which produced the sequences in the genome assembly. A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a

   consortium. A BioProject record provides users a single place to find links

   to the diverse data types generated for that project. The record can be

   retrieved from the NCBI BioProject resource:

   https://www.ncbi.nlm.nih.gov/bioproject/  

Column  3: "biosample":

   BioSample: accession for the BioSample from which the sequences in the genome assembly were obtained. A BioSample record contains a description of the biological source material used in experimental assays. The record can be retrieved from the NCBI BioSample resource: https://www.ncbi.nlm.nih.gov/biosample/

Column  4: "wgs_master":GenBank核酸版本号

   WGS-master: the GenBank Nucleotide accession and version for the master record of the Whole Genome Shotgun (WGS) project for the genome assembly. The master record can be retrieved from the NCBI Nucleotide resource: https://www.ncbi.nlm.nih.gov/nuccore

   Genome assemblies that are complete genomes, and those that are clone-based, do not have WGS-master records in which case this field will be empty.

Column  5: "refseq_category"

   RefSeq Category: whether the assembly is a reference or representative genome in the NCBI Reference Sequence (RefSeq) project classification.

   Values:

           reference genome      - a manually selected high quality genome

                                   assembly that NCBI and the community have

                                   identified as being important as a standard

                                   against which other data are compared

           representative genome - a genome computationally or manually selected

                                   as a representative from among the best

                                   genomes available for a species or clade that

                                   does not have a designated reference genome

           na                    - no RefSeq category assigned to this assembly

   Prokaryotes may have more than one reference or representative genome per

   species. For more information see:

   https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#referencegenome

   Eukaryotes have no more than one reference or representative genome per

   species. If there are no assemblies in RefSeq for a particular eukaryotic

   species, then the GenBank assembly that RefSeq would select as the best

   available for that species will be designated as the representative genome.

   Viruses may have one or more reference genomes per species. The

   representative genome designation is not applied to viruses.

Column  6: "taxid":NCBI分类ID号

   Taxonomy ID: the NCBI taxonomy identifier for the organism from which the genome assembly was derived. The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. The taxonomy record can be retrieved from the NCBI Taxonomy resource:

   https://www.ncbi.nlm.nih.gov/taxonomy/

Column  7: "species_taxid":

   Species taxonomy ID: the NCBI taxonomy identifier for the species from which

   the genome assembly was derived. The species taxid will differ from the organism taxid (column 6) only when the organism was reported at a sub-species or strain level.

Column  8: "organism_name":物种名称,与第6列的分类ID对应

   Organism name: the scientific name of the organism from which the sequences in the genome assembly were derived. This name is taken from the NCBI Taxonomy record for the taxid specified in column 6. Some older taxids were assigned at the strain level and for these the organism name will include the strain. Current practice is only to assign taxids at the species level; for these the organism name will be just the species, however, the strain name will be reported in the infraspecific_name field (column 9).

Column  9: "infraspecific_name":种之下的分类名称,例如菌株、品系、品种等等

   Infraspecific name: the strain, breed, cultivar or ecotype of the organism

   from which the sequences in the genome assembly were derived. Data are

   reported in the form tag=value, e.g. strain=AF16. Strain, breed, cultivar

   and ecotype are not expected to be used together, however, if they are then

   they will be reported in a list separated by ", /". Empty if no strain,

   breed, cultivar or ecotype is specified on the genomic sequence records.

Column 10: "isolate":

   Isolate: the individual isolate from which the sequences in the genome assembly were derived. Empty if no isolate is specified on the genomic sequence records.

Column 11: "version_status":组装版本状态信息,包括最新版本、已经被替换、已删除版本

   Version status: the release status for the genome assembly version.

   Values:

           latest     - the most recent of all the versions for this assembly chain

           replaced   - this version has been replaced by a newer version of the assembly in the same chain

           suppressed - this version of the assembly has been suppressed 

  An assembly chain is the collection of all versions for the same assembly accession.

Column 12: "assembly_level":组装级别,包括完整基因组、染色体、Scaffold、Contig

   Assembly level: the highest level of assembly for any object in the genome assembly.

   Values:

      Complete genome - all chromosomes are gapless and have no runs of 10 or more ambiguous bases (Ns), there are no unplaced or unlocalized scaffolds, and all the expected chromosomes

                        are present (i.e. the assembly is not noted as having

                        partial genome representation). Plasmids and organelles

                        may or may not be included in the assembly but if

                        present then the sequences are gapless.

      Chromosome      - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome without gaps or a chromosome containing scaffolds or contigs with gaps between them. There may also be unplaced or unlocalized scaffolds.

      Scaffold        - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized.

      Contig          - nothing is assembled beyond the level of sequence contigs

Column 13: "release_type":组装版本发布类型,主要根据变化、更新程度分为major、minor和patch

   Release type: whether this version of the genome assembly is a major, minor or patch release.

   Values:

           Major - changes from the previous assembly version result in a

                   significant change to the coordinate system. The first

                   version of an assembly is always a major release. Most

                   subsequent genome assembly updates are also major releases.

           Minor - changes from the previous assembly version are limited to the

                   following changes, none of which result in a significant

                   change to the coordinate system of the primary assembly-unit:

                   - adding, removing or changing a non-nuclear assembly-unit

                   - dropping unplaced or unlocalized scaffolds

                   - adding up to 50 unplaced or unlocalized scaffolds which are

                     shorter than the current scaffold-N50 value

                   - replacing a component with a gap of the same length

           Patch - the only change is the addition or modification of a patch assembly-unit.

   See the NCBI Assembly model web page (https://www.ncbi.nlm.nih.gov/assembly/model/#asmb_def) for definitions of assembly-units and genome patches.

Column 14: "genome_rep":目标是全基因组组装还是部分组装,包括full和partial

   Genome representation: whether the goal for the assembly was to represent the whole genome or only part of it.

   Values:

      Full    - the data used to generate the assembly was obtained from the whole genome, as in Whole Genome Shotgun (WGS) assemblies for example. There may still be gaps in the assembly.

      Partial - the data used to generate the assembly came from only part of the genome.

   Most assemblies have full genome representation with a minority being partial genome representation. See the Assembly help web page (https://www.ncbi.nlm.nih.gov/assembly/help/) for reasons that the genome representation would be set to partial.

Column 15: "seq_rel_date":序列在数据库发布的日期

   Sequence release date: the date the sequences in the genome assembly were released in the International Nucleotide Sequence Database Collaboration (INSDC) databases, i.e. DDBJ, ENA or GenBank.

Column 16: "asm_name":基因组提交者姓名或者NCBI指定的ASM开头的编号

   Assembly name: the submitter's name for the genome assembly, when one was provided, otherwise a default name, in the form ASM#####v#, is provided by NCBI. Assembly names are not unique.

Column 17: "submitter":基因组提交组织名称,当有多个组织时只列出第一单位

   Submitter: the submitting consortium or first position if a list of organizations. The full submitter information is available in the NCBI BioProject resource: www.ncbi.nlm.nih.gov/bioproject/

Column 18: "gbrs_paired_asm":GenBank版本号和RefSeq版本号之间的对应关系,na表示没有匹配上

   GenBank/RefSeq paired assembly: the accession.version of the GenBank assembly that is paired to the given RefSeq assembly, or vice-versa. "na" is reported if the assembly is unpaired.

Column 19: "paired_asm_comp":对应的GenBank和RefSeq组装是否一致

   Paired assembly comparison: whether the paired GenBank & RefSeq assemblies are identical or different.

   Values:

      identical - GenBank and RefSeq assemblies are identical

      different - GenBank and RefSeq assemblies are not identical

      na        - not applicable since the assembly is unpaired

Column 20: "ftp_path":FTP下载地址(最重要的一列,可以下载到序列)

   FTP path: the path to the directory on the NCBI genomes FTP site from which data for this genome assembly can be downloaded.

Column 21: "excluded_from_refseq":从RefSeq中删除的原因

   Excluded from RefSeq: reasons the assembly was excluded from the NCBI Reference Sequence (RefSeq) project, including any assembly anomalies. See: https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/

注:NCBI RefSeq旨在提供高质量的参考序列,一些partial的基因序列可能会被删除,因此原因可能是:partial

Column 22: "relation_to_type_material":基因组组装所用样本的来源

   Relation to type material: contains a value if the sequences in the genome assembly were derived from type material.

   Values:

      assembly from type material - the sequences in the genome assembly were derived from type material

      assembly from synonym type material - the sequences in the genome assembly

         were derived from synonym type material

      assembly from pathotype material - the sequences in the genome assembly

         were derived from pathovar material

      assembly designated as neotype - the sequences in the genome assembly

         were derived from neotype material

      assembly designated as reftype - the sequences in the genome assembly

         were derived from reference material where type material never was

         available and is not likely to ever be available

      ICTV species exemplar - the International Committee on Taxonomy of Viruses (ICTV, 国际病毒分类学委员会) designated the genome assembly as the exemplar for the virus species

      ICTV additional isolate - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly an additional isolate for the virus species

3. 如何使用assembly_summary.txt文件

The metadata provided in the assembly_summary.txt files can be used to identify assemblies of interest for subsequent download.

The Genomes FTP FAQ provides examples of how to use the assembly_summary.txt files to download sets of assemblies. See:

How can I download only the current version of each assembly?

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#current

How can I download RefSeq data for all complete bacterial genomes?

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete

Other sets of assemblies of interest can be downloaded using variations on these instructions.

________________________________________________________________________________

National Center for Biotechnology Information (NCBI)

National Library of Medicine

National Institutes of Health

8600 Rockville Pike

Bethesda, MD 20894, USA

tel: (301) 496-2475

fax: (301) 480-9241

e-mail: info@ncbi.nlm.nih.gov

________________________________________________________________________________

 

  • 1
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值