Introduction to the NCBI Pathogen Detection Project

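The NCBI Pathogen Detection project is a centralized system that integrates bacterial pathogen sequence data originating from food, environmental sources, and patients.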

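It has two main components:

  1. Real-time analysis of isolates from ongoing pathogen surveillance, to determine whether they are clonally related to isolates already in the database.
  2. Comprehensive information on the antimicrobial resistance genes found in these pathogens.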

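Public health agencies in the United States and internationally collect samples from clinical cases, retail food, industrial production facilities, and environmental sites to support active, real-time surveillance of pathogens and foodborne disease. These agencies sequence the samples and submit the data to NCBI, which analyzes them against the other sequences in its database (including all genomes in GenBank) to identify closely related sequences. The goal is to uncover potential sources of contamination by linking food or environmental isolates to human illness, and to report sequence relationships quickly to public health scientists to support traceback investigations and outbreak response.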

1 Data sources

1.1 Isolation type (epi_type)

The isolation type is derived from the BioSample record and takes one of three values: clinical, environmental/other, or NULL.

  • If attribute_package=Pathogen.cl.1.0 then isolation type is clinical.
  • If attribute_package=Pathogen.env.1.0 then isolation type is environmental/other, unless host or isolation_source indicates that it was isolated from a human subject in which case isolation type is clinical.
  • If neither of these packages is used then isolation type is NULL.
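
The three rules above can be sketched as a small function. This is only an illustration, not NCBI's implementation: the source does not say how "isolated from a human subject" is detected, so the keyword list HUMAN_HINTS and the function name isolation_type are assumptions made for this example.

```python
from typing import Optional

# Hypothetical keyword list. The source only says that host or
# isolation_source "indicates" a human subject; it does not define the test.
HUMAN_HINTS = ("homo sapiens", "human")


def isolation_type(attribute_package: Optional[str],
                   host: Optional[str] = None,
                   isolation_source: Optional[str] = None) -> Optional[str]:
    """Return 'clinical', 'environmental/other', or None (NULL)."""
    if attribute_package == "Pathogen.cl.1.0":
        return "clinical"
    if attribute_package == "Pathogen.env.1.0":
        # Environmental package, unless the metadata points to a human subject.
        text = f"{host or ''} {isolation_source or ''}".lower()
        if any(hint in text for hint in HUMAN_HINTS):
            return "clinical"
        return "environmental/other"
    # Neither package was used.
    return None


print(isolation_type("Pathogen.env.1.0", host="Homo sapiens"))
print(isolation_type("Pathogen.env.1.0", isolation_source="retail flour"))
```

A plain substring test is crude (it would also match a value such as "non-human"), so a real pipeline would need a more careful check.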

1.2 Organism group

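There are 34 organism groups. The "organism group" page lists links to all organism groups, together with statistics for each group. Note that the species name shown in the organism group table reflects the most common species in each group, not every species in it. For example, the Salmonella enterica organism group includes the important Salmonella enterica isolates as well as Salmonella bongori isolates. To see every isolate present in a group, see the "scientific_name" column on the "isolate" page.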

 

Appendix 1 Accession numbering rules

  1. PDG - Accession number prefix for a Pathogen Detection Organism Group.

Technical note: An organism group (PDG) contains one or more targets (PDTs). A PDT is a member of zero or one SNP cluster (PDS), and never more than one cluster. A SNP cluster is composed of two or more PDTs, and each PDS is completely contained within a PDG.
(Read more about organism groups in the data fields section of this document.)
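
The containment rules in the technical note can be written as a consistency check over a table of targets. The sketch below is hypothetical: the function name check_hierarchy, the input layout (PDT mapped to a PDG and an optional PDS), and the accession values are made up for illustration. Because each PDT maps to a single optional PDS, the "zero or one cluster" rule holds by construction, and the function checks the other two rules.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def check_hierarchy(targets: Dict[str, Tuple[str, Optional[str]]]) -> List[str]:
    """targets maps PDT -> (PDG, PDS or None). Returns a list of violations."""
    members = defaultdict(set)  # PDS -> set of PDTs in that cluster
    groups = defaultdict(set)   # PDS -> set of PDGs its members belong to
    for pdt, (pdg, pds) in targets.items():
        if pds is not None:
            members[pds].add(pdt)
            groups[pds].add(pdg)

    problems = []
    for pds in sorted(members):
        if len(members[pds]) < 2:
            problems.append(f"{pds}: fewer than two targets")
        if len(groups[pds]) > 1:
            problems.append(f"{pds}: spans more than one organism group")
    return problems


# Made-up accessions, for illustration only.
example = {
    "PDT000000001.1": ("PDG000000001.1", "PDS000000001.1"),
    "PDT000000002.1": ("PDG000000001.1", "PDS000000001.1"),
    "PDT000000003.1": ("PDG000000001.1", None),  # not in any cluster
}
print(check_hierarchy(example))
```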

 

2. PDS - Accession number prefix for a Pathogen Detection SNP Cluster.
(Read more about SNP clusters in the data fields section of this document.)

SNP cluster (erd_group) 

Pathogen SNP cluster accession. A SNP cluster is a group of isolates whose genome assemblies are closely related, depending on the clustering methodology used (as noted in the data processing section of this document).

The SNP cluster accession data field name is erd_group, in which "ERD" stands for Epidemiologically Related Distance.

Each SNP cluster can be viewed as a phylogenetic distance tree in the SNP Tree Viewer.

Data field names and values are case sensitive, as shown in the examples below.

The first sample search below includes an accession.version number. If you don't know the latest version number for a SNP cluster, you can use an asterisk * as a wildcard, as in the second example below. If you enter an older version number that has since been superseded by a newer version of the SNP cluster, the Isolates Browser will display a message that links to the newer version. The PDS version changes when the membership of a SNP cluster changes.

A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project, and the data retention and history tracking section describes the use of accession.versions to track changes to the data.

Examples:

To search this field directly, enter a query such as:   erd_group:searchterm

Search for:   erd_group:PDS000003441.73

Search for:   erd_group:PDS000003441.*
with an asterisk (*) serving as a wildcard, if you don't know the version number of the SNP cluster accession.

Note: Because the SNP cluster accession is unique, it is not necessary to include the data field name in searches. It is sufficient to enter just the SNP cluster accession, if desired. For example, the first search above can simply be entered as PDS000003441.73 into the Isolates Browser, and the second search can be entered as PDS000003441.*.

Either one of the search examples above will retrieve isolates that belong to a SNP cluster associated with an E. coli and Shigella outbreak that was traced to All-Purpose Flour. In that tree, the short branches that connect clinical and environmental samples indicate a high degree of similarity in the genome sequences of those isolates. (For more information about the All-Purpose Flour outbreak, see the section of this document on "How to identify the possible source of an outbreak.")
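
When working with a downloaded list of accessions rather than the Isolates Browser, the same accession.version and wildcard idea can be imitated locally. The sketch below is an assumption-laden illustration, not the browser's search engine: the helper names split_accession and latest_version are hypothetical, the regular expression covers only the PDG, PDS, and PDT prefixes mentioned in this document, and version 72 in the sample list is invented.

```python
import re
from fnmatch import fnmatchcase
from typing import Iterable, Optional, Tuple

# prefix, zero-padded number, version: e.g. PDS000003441.73
ACCESSION_RE = re.compile(r"^(PDG|PDS|PDT)(\d+)\.(\d+)$")


def split_accession(acc: str) -> Tuple[str, str, int]:
    """Split 'PDS000003441.73' into ('PDS', '000003441', 73)."""
    m = ACCESSION_RE.match(acc)
    if m is None:
        raise ValueError(f"not a versioned accession: {acc!r}")
    prefix, number, version = m.groups()
    return prefix, number, int(version)


def latest_version(pattern: str, accessions: Iterable[str]) -> Optional[str]:
    """Return the highest-versioned accession matching a '*' wildcard pattern.

    fnmatchcase is case sensitive, mirroring the case-sensitive fields.
    """
    hits = [a for a in accessions if fnmatchcase(a, pattern)]
    if not hits:
        return None
    return max(hits, key=lambda a: split_accession(a)[2])


# Sample list; only PDS000003441.73 comes from the text above.
known = ["PDS000003441.72", "PDS000003441.73", "PDS000009999.1"]
print(latest_version("PDS000003441.*", known))
```

Comparing versions as integers matters here: a plain string comparison would rank version 9 above version 73.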

3. Serovar (serovar) 

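A combined field of subspecies, serotype, or serovar, if provided by the submitter. This field contains the value exactly as entered by the data submitter. Data field names and values are case sensitive.

A separate section of this document provides tips on using quotation marks for phrase searches, and on the special characters that appear in subspecies, serotype, or serovar names.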

Examples:

To search this field directly, enter a query such as:   serovar:searchterm

Search for:   serovar:"4,[5],12:b:-"

Search for:   serovar:"Shigella sonnei"

Search for:   serovar:Enteritidis

4. Strain (strain) 
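Microbial strain name, if provided by the submitter. This field contains the value exactly as entered by the data submitter. Data field names and values are case sensitive.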

Examples:

To search this field directly, enter a query such as:   strain:searchterm

Search for:   strain:FDA00010279

Search for:   strain:KCRI-598A

Search for:   strain:PNUSA*

5. Isolate (target_acc) 

Pathogen Detection accession of the isolate. The accession begins with the prefix "PDT", which stands for Pathogen Detection Target. This database is the primary resource issuing PDT accessions.

Each target is the genome assembly for a single pathogen isolate. There are several types of genome assemblies:

  • isolate genomes assembled by the NCBI Pathogens data processing pipeline from sequence reads, but not published as genome sequence records in GenBank
  • isolates submitted directly to GenBank as assembled genomes, which therefore have a corresponding "GCA" accession
  • isolate genomes assembled by the NCBI data processing pipeline and then submitted to GenBank either by the submitter or on behalf of the submitter with their permission, or without their permission into the Third Party Annotation (TPA) database

Data field names and values are case sensitive, and the letters in the accession prefix must be in upper case, as shown in the example below. (A separate section of this document provides a list of accession prefixes that appear in the Pathogen Detection project.)

Examples:

To search this field directly, enter a query such as:   target_acc:searchterm

Search for:   target_acc:PDT000133982

6. Min-diff (mindiff) 
Minimum SNP distance from this isolate to one of a different isolation type. For example, the minimum SNP distance from a clinical isolate to an environmental isolate, or vice versa.

A value will appear in the "Min-diff" column only if an isolate has been found, by the Pathogen Detection Project data processing pipeline, to belong to a SNP cluster and another isolate in that cluster has a different "Isolation type" that is not NULL. If it has, the isolate will contain a "PDS*" accession number in the "SNP cluster" column of the Isolates Browser, along with a value in the "Min-diff" and/or "Min-same" columns (depending upon the composition of the SNP cluster).

To view the SNP cluster for an isolate of interest, click on either the "PDT*" accession number in the "Isolate" column, or the "PDS*" accession number in the "SNP cluster" column. In the SNP Tree Viewer display, the branch lengths are proportional to the number of SNPs among the isolates in the cluster. Mouse over any branch to see its length.

Note that the value of Min-diff is n/a where the isolate does not have a value for isolation type. It is also n/a where no other isolate in the cluster has an isolation type opposite to this isolate's, or where the isolate is not in any SNP cluster.

To search for a range of values, enter a query such as:   mindiff:[value1 TO value2]   with square brackets surrounding the query string, and with the word "TO" written in upper case. Data field names and values are case sensitive, and this data field name should be written in all lower case. Alternatively, Filters are a convenient way to search for ranges of values.

Examples:

  1. To search this field directly, enter a query such as:   mindiff:[value1 TO value2]
  2. Search for:   mindiff:[0 TO 6]
    to retrieve isolates that are no more than 6 SNPs away from other isolates of the opposite isolation type within the same cluster. In other words, retrieve clinical isolates that have a distance of no more than 6 SNPs from environmental isolates in the same cluster, or vice versa.
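
To make the Min-diff definition concrete, here is a minimal sketch that computes it for one SNP cluster from a table of pairwise SNP distances, then applies the equivalent of a mindiff:[0 TO 6] range. The function name min_diff, the data layout, and the isolate labels and distances are all invented for the example. None stands in for n/a, and the range is assumed to be inclusive at both ends, in line with "no more than 6 SNPs".

```python
from typing import Dict, FrozenSet, Optional


def min_diff(isolate: str,
             cluster_types: Dict[str, Optional[str]],
             snp_distance: Dict[FrozenSet[str], int]) -> Optional[int]:
    """Minimum SNP distance to a cluster member of a different, non-NULL type."""
    own = cluster_types.get(isolate)
    if own is None:
        return None  # no isolation type (or not in this cluster): n/a
    distances = [
        snp_distance[frozenset((isolate, other))]
        for other, other_type in cluster_types.items()
        if other != isolate and other_type is not None and other_type != own
    ]
    return min(distances) if distances else None


# Hypothetical cluster of four isolates with made-up distances.
types = {"A": "clinical", "B": "environmental/other", "C": "clinical", "D": None}
dist = {
    frozenset(("A", "B")): 4, frozenset(("A", "C")): 1, frozenset(("A", "D")): 0,
    frozenset(("B", "C")): 7, frozenset(("B", "D")): 2, frozenset(("C", "D")): 3,
}

# Isolates whose Min-diff falls in the inclusive range 0..6.
hits = [i for i in types
        if (md := min_diff(i, types, dist)) is not None and 0 <= md <= 6]
print(hits)
```

Isolate D has no isolation type, so it neither receives a Min-diff value nor counts as the "different type" for any other isolate, even though it is only 0 SNPs from A.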

Data processing

1. Genome assembly

Illumina submissions are assembled de novo with SKESA.

454 and Ion Torrent sequencing data are assembled with a reference-based assembly approach.

2. Clustering

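Two different clustering pipelines are used in this project.

  • A reference wgMLST scheme is used to determine the loci and alleles in each assembled genome. Related isolates are clustered using a 25-allele cut-off.
  • k-mer distances are used to group related isolates, followed by SNP analysis. Clusters are formed by 50-SNP single-linkage clustering. This method is gradually being replaced by the wgMLST method (except for organisms with fewer than 1,000 isolates).

In both pipelines, once the clusters have been created, the corresponding reference assembly, related isolates, SNPs, and phylogenetic tree are determined for each cluster.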
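
Single-linkage clustering at a fixed SNP threshold can be illustrated with a small union-find routine. This is a generic textbook sketch, not the NCBI pipeline: the function name single_linkage and the sample distances are invented, and treating "50-SNP" as "distance less than or equal to 50" is an assumption. Singletons are dropped, in line with the rule that a SNP cluster contains two or more targets.

```python
from typing import Dict, List, Tuple


def single_linkage(isolates: List[str],
                   distance: Dict[Tuple[str, str], int],
                   threshold: int = 50) -> List[List[str]]:
    """Group isolates so that any pair within `threshold` shares a cluster."""
    parent = {i: i for i in isolates}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distance.items():
        if d <= threshold:  # assumption: the threshold is inclusive
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for i in isolates:
        clusters.setdefault(find(i), []).append(i)
    # Keep only clusters with at least two members.
    return sorted(sorted(c) for c in clusters.values() if len(c) >= 2)


# Made-up pairwise SNP distances.
snps = {("A", "B"): 10, ("B", "C"): 45, ("A", "C"): 55, ("C", "D"): 80}
print(single_linkage(["A", "B", "C", "D"], snps))
```

The example shows the chaining behaviour of single linkage: A and C are 55 SNPs apart, yet they share a cluster because each is within 50 SNPs of B.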

3. Phylogenetic tree construction

For each cluster, a phylogenetic tree is reconstructed from the SNPs using the maximum compatibility criterion (PMID: 28231758).

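4. Annotation and identification of resistance genes/proteins

Assembled genomes are annotated with the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Antimicrobial resistance (AMR) genes are identified with AMRFinderPlus.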
