isoformDB Data Collection

Qiusc1999

Last modified 2024-03-27 10:19:38

First published 2022-05-25 11:19:23

Article link: https://blog.csdn.net/Qiusc1999/article/details/124962694

Preface

This post records the data collection process.

Data collection by species

Plants

GOA
Ensembl Plants
http://plants.ensembl.org/index.html

Mouse (Mus musculus)

Genome annotation data (Ensembl IDs)
https://asia.ensembl.org/Mus_musculus/Info/Index

Mus_musculus.GRCm39.106.gtf.gz                     28-Feb-2022 07:42            31334959

UniProt GOA (UniProt IDs)
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/

ID mapping (UniProt -> Ensembl)
Map IDs from UniProt to Ensembl.
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/

MOUSE_10090_idmapping.dat.gz                                    2022-05-25 10:00   15M
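
The mapping step could be sketched as follows. This is a minimal sketch, assuming the file uses a three-column tab-separated layout (UniProtKB accession, ID type, external ID) and that Ensembl gene IDs carry the ID type label `Ensembl`. The function name `load_uniprot_to_ensembl` is hypothetical and not part of any library.

```python
import gzip
from collections import defaultdict


def load_uniprot_to_ensembl(path, id_type="Ensembl"):
    """Return {UniProt accession: set of external IDs of the given type}.

    Assumes each line is "accession<TAB>id_type<TAB>external_id".
    """
    # Read gzipped files transparently; fall back to plain text otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    mapping = defaultdict(set)
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # skip malformed lines
            accession, kind, external_id = fields
            if kind == id_type:
                mapping[accession].add(external_id)
    return dict(mapping)
```

A set is used per accession because one UniProt entry may map to more than one Ensembl ID.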

Wheat (Triticum aestivum)

Wheat genome annotation
https://www.wheatgenome.org/Resources/Annotations/RefSeq-v2.1-Assembly-and-Annotation-now-freely-available-at-URGI-and-NCBI#
Ensembl Plants
http://plants.ensembl.org/index.html

GOA
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/

04/29/2022 01:36 PM 16,916,743,414 goa_uniprot_all.gaf.gz

Soybean (Glycine max)

Data formats

UniProt GOA

Download: https://www.ebi.ac.uk/GOA/downloads (use IE mode in the browser)
Paper: UniProt-GOA: A Central Resource for Data Integration and GO Annotation.

File format

3. Data types
-------------

a) DB
Database from which annotated entity has been taken.
Examples: UniProtKB, PDB

b) DB_Object_ID
A unique identifier in the database for the item being annotated.
Examples: O00165, 10GS_B

c) DB_Object_Symbol
A unique and valid symbol (gene name) that corresponds to the DB_Object_ID.
An officially approved gene symbol will be used in this field when available.
Alternatively, other gene symbols or locus names are applied.
If no symbols are available, the DB_Object_ID will be used.
Examples: G6PC
CYB561
MGCQ309F3
10GS_B

d) Qualifier
In the GAF format, this column is used for flags that modify the interpretation of an annotation. The values that may be present in this field are: NOT, colocalizes_with, contributes_to, NOT|contributes_to, NOT|colocalizes_with.

In the GPAD format, this column is used for explicit relations between the entity and the GO term. An entry in this column is required in this file format.
The default relations are part_of (for Cellular Component), involved_in (for Biological Process) or enables (for Molecular Function). Other values that may be present in this field are: colocalizes_with and contributes_to. Any of these relations can be additionally qualified with 'NOT'.
Example: NOT|involved_in

e) GO ID
The GO identifier for the term attributed to the DB_Object_ID.
Example: GO:0005634

f) DB:Reference
A single reference cited to support an annotation. Where an annotation cannot reference a paper, this field will contain a GO_REF identifier. See
http://www.geneontology.org/doc/GO.references for an explanation of the reference types used.
Examples: PMID:9058808
DOI:10.1046/j.1469-8137.2001.00150.x
GO_REF:0000002
GO_REF:0000020

g) Evidence Code
In the GAF format, this column is used for one of the evidence codes supplied by the GO Consortium (http://www.geneontology.org/GO.evidence.shtml).
Example: IDA

In the GPAD format, this column is used for identifiers from the Evidence Code Ontology (http://evidenceontology.googlecode.com/svn/trunk/eco.obo)
Example: ECO:0000320

h) With (or) From
Additional identifier(s) to support annotations using certain evidence codes (including IEA, IPI, IGI, IMP, IC and ISS evidences).
Examples: UniProtKB:O00341
InterPro:IPR001878
RGD:123456
CHEBI:12345
Ensembl:ENSG00000136141
GO:0000001
EC:3.1.22.1

i) Aspect
One of the three ontologies, corresponding to the GO identifier applied.
P (biological process), F (molecular function) or C (cellular component).
Example: P

j) DB_Object_Name
The full entity name will be present here, if available from the resource that supplies the object identifier. If a name cannot be added, this field will be left empty.
Examples: Glucose-6-phosphatase
Cellular tumor antigen p53
Coatomer subunit beta

k) DB_Object_Synonym
Alternative gene symbol(s) or identifiers are provided pipe-separated, if available from the supplying resource. If none of these identifiers have been supplied, the field will be left empty.
Example: RNF20|BRE1A|BRE1A_BOVIN
MMP-16

l) DB_Object_Type
The kind of entity being annotated.
Examples: protein, protein_structure, complex

m) Taxon
Identifier for the species being annotated or the gene product being defined. In the GAF format, an interacting taxon ID (see n) below) may be included in this column using a pipe to separate it from the primary taxon ID.
Example: taxon:9606

n) Interacting_Taxon_ID
This field is only supplied by the goa_uniprot_all.gpa and goa_uniprot_gcrp.gpa files, and has been separated from the dual taxon ID format allowed in the goa_uniprot_all.gaf and goa_uniprot_gcrp.gaf files.
This taxon ID should inform on the other organism involved in a multi-species interaction. An interacting taxon identifier can only be used in conjunction with terms that have the biological process term 'GO:0051704; multi-organism process' or the cellular component term 'GO:0044215; other organism' as an ancestor. For further information please see: http://geneontology.org/page/go-annotation-conventions#interactions
Example: taxon:9606

o) Date
The date of last annotation update in the format 'YYYYMMDD'
Example: 20050101

p) Assigned_By
Attribution for the source of the annotation.
Examples: UniProtKB, AgBase

q) Annotation_Extension
Contains cross references to other ontologies/databases that can be used to qualify or enhance the GO term applied in the annotation.
The cross-reference is prefaced by an appropriate GO relationship; references to multiple ontologies can be entered as linked (comma separated) or independent (pipe separated) statements.
Examples: part_of(CL:0000084)
occurs_in(GO:0009536)
has_input(CHEBI:15422)
has_output(CHEBI:16761)
has_regulation_target(UniProtKB:P12345)|has_regulation_target(UniProtKB:P54321)
part_of(CL:0000017),part_of(MA:0000415)

r) Gene_Product_Form_ID
The unique identifier of a specific spliceform of the DB_Object_ID.
Example: O43526-2

s) Annotation_Properties
This column is reserved for internal use; it will not be populated in public files

t) Parent_Object_ID
This field supplies the relationship between the DB_Object_ID and the canonical UniProtKB accession number or Complex Portal macromolecular complex identifier, where the DB_Object_ID is an isoform identifier or the subunit of a complex.
Examples:
UniProtKB:P21678
ComplexPortal:CP-2342163

u) DB_Xref(s)
This field supplies alternative identifiers (cross-references) for the DB_Object_ID.
This field will not be populated in the GOA files.

v) Gene_Product_Properties
This field can be populated with information concerning the DB_Object_ID. The syntax of the field will conform to a pipe-separated list of "property_name=property_value". There is a controlled vocabulary for the property names. The GOA files will use this field to indicate:

i) DB_Subset
The database subset from which the entity being described has been taken. This information will only be supplied for UniProtKB, where this field will be one of Swiss-Prot or TrEMBL.
Examples:
db_subset=Swiss-Prot
db_subset=TrEMBL

ii) Annotation_Target_Set
A description of the list in which the entity has been included for prioritized annotation.
Examples:
target_set=BHF-UCL
target_set=KRUK
target_set=ReferenceGenome
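
The field descriptions above can be put to use with a small parser. This is a minimal sketch, assuming the tab-separated 17-column GAF 2.x layout and that comment lines start with `!`. The column order in `GAF_COLUMNS` is an assumption based on the GAF specification rather than on the excerpt above, and the names `parse_gaf_line` and `isoform_go_pairs` are hypothetical.

```python
# Assumed GAF 2.x column order (17 tab-separated columns).
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_Code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_By", "Annotation_Extension", "Gene_Product_Form_ID",
]


def parse_gaf_line(line):
    """Parse one GAF line into a dict; return None for comment or blank lines."""
    if not line.strip() or line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so every column name is present in the result.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


def isoform_go_pairs(lines):
    """Yield (isoform ID or accession, GO ID), skipping NOT-qualified rows."""
    for line in lines:
        record = parse_gaf_line(line)
        if record is None:
            continue
        if "NOT" in record["Qualifier"].split("|"):
            continue
        # Prefer the isoform-level ID when the row provides one.
        target = record["Gene_Product_Form_ID"] or record["DB_Object_ID"]
        yield target, record["GO_ID"]
```

Rows qualified with NOT are skipped because they state that the gene product does not have the given GO term.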

Obtaining KO annotations with KofamKOALA

Installing KofamKOALA

https://zhuanlan.zhihu.com/p/375740435
Install HMMER version 3.3; newer versions cannot handle amino acid sequences longer than 100K.
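
To see whether a protein FASTA file contains any such long sequences, something like the following could be used. This is a minimal sketch that reads "100K" as 100,000 residues, which is an assumption; the function name `sequences_longer_than` is hypothetical.

```python
def sequences_longer_than(lines, max_len=100_000):
    """Return {sequence ID: length} for FASTA records longer than max_len."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-separated token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return {seq_id: n for seq_id, n in lengths.items() if n > max_len}
```

The function accepts any iterable of lines, so an open file handle can be passed directly.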

before running:
source activate kofam
export PATH=/home/common/scqiu/kegg/kofamKOALA/env/ruby/bin/:$PATH

Extracting isoform sequences with seqkit

Tutorial

https://blog.csdn.net/weixin_45044758/article/details/118097119

Usage:
  seqkit subseq [flags]

Flags:
      --bed string        by tab-delimited BED file
      --chr strings       select limited sequence with sequence IDs when using --gtf or --bed (multiple value supported, case ignored)
  -d, --down-stream int   down stream length
      --feature strings   select limited feature types (multiple value supported, case ignored, only works with GTF)
      --gtf string        by GTF (version 2.2) file
      --gtf-tag string    output this tag as sequence comment (default "gene_id")
  -h, --help              help for subseq
  -f, --only-flank        only return up/down stream sequence
  -r, --region string     by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
  -u, --up-stream int     up stream length

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)


# Extract genes based on a BED or GTF file
seqkit subseq --bed bedfile.bed -o gene.fa genomefile.fa
seqkit subseq --gtf gtffile.gtf -o gene.fa genomefile.fa

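Usage

Note: by default, seqkit labels each extracted isoform sequence with its gene ID, so the output file contains multiple records with the same gene ID and no transcript ID. You therefore need to specify both the feature that seqkit extracts (--feature transcript) and the tag it writes as the ID (--gtf-tag transcript_id).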

using seqkit to extract sequence:
./seqkit subseq --feature transcript --gtf-tag transcript_id --gtf data/human/Homo_sapiens.GRCh38.107.chr.gtf -o data/human/extracted_dna_seqs.fa data/human/Homo_sapiens.GRCh38.dna.primary_assembly.fa

Note that the annotation file must be in GTF format.
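
A quick way to confirm that the extracted file no longer contains repeated IDs is sketched below. It assumes headers of the form `>location tag`, with the tag (gene or transcript ID) as the second whitespace-separated field; the function name `duplicated_ids` is hypothetical.

```python
from collections import Counter


def duplicated_ids(lines, field=1):
    """Return {ID: count} for header IDs that occur more than once.

    The ID is read from the given whitespace-separated field of each
    FASTA header line; headers without that field are ignored.
    """
    counts = Counter(
        line.split()[field]
        for line in lines
        if line.startswith(">") and len(line.split()) > field
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

An empty result means every header carries a distinct ID, which is what the transcript-level extraction above is meant to achieve.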

Converting GFF3 to GTF

https://zhuanlan.zhihu.com/p/260832132

conda install -c bioconda gffread
# GFF3 to GTF
gffread gencode.v19.annotation.gff3 -T -o gencode.v19.gtf
# GTF to GFF3
gffread gencode.vM13.annotation.gtf -o gencode.vM13.annotation.gff3

Processing sequence data

Normalizing IDs and sequences

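# Rewrite each FASTA header so that it contains only the transcript ID.
# seqkit headers look like ">location transcript_id", so the ID is the second field.
# The output is opened with 'w' so that rerunning the script does not append duplicate records.
with open('extracted_dna_seqs.fa') as src, open('processed_dna_seqs.fa', 'w') as dst:
    for line in src:
        if line.startswith('>'):
            dst.write('>' + line.split()[1] + '\n')
        else:
            # Sequence lines are copied unchanged.
            dst.write(line)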

Translating DNA sequences to amino acid sequences

Using seqkit

using seqkit to translate dna seq to protein seq:
./seqkit translate data/human/processed_dna_seqs.fa -o data/human/processed_protein_seqs_clean.fa --clean

Running KO annotation

Be sure to set --tmp-dir so that temporary files are not overwritten when several species are annotated at the same time.

./exec_annotation -o human_kofam.out --cpu 32 --tmp-dir=./human_tmp -e 0.01 ../data/human/processed_protein_seqs_clean.fa
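
When several species are annotated in parallel, the command lines could be generated so that each run gets its own temporary directory. This is a minimal sketch using only the flags from the command above; the `../data/<species>/` layout for species other than human and the function name `kofam_command` are assumptions.

```python
import shlex


def kofam_command(species, cpu=32, e_value=0.01):
    """Build an exec_annotation argument list with a per-species temp directory."""
    return [
        "./exec_annotation",
        "-o", f"{species}_kofam.out",
        "--cpu", str(cpu),
        f"--tmp-dir=./{species}_tmp",  # distinct per species, so runs do not collide
        "-e", str(e_value),
        f"../data/{species}/processed_protein_seqs_clean.fa",
    ]


# Print one shell-ready command per species (the species names are examples).
for species in ("human", "mouse"):
    print(shlex.join(kofam_command(species)))
```

Returning a list of arguments keeps the command usable with `subprocess.run` without shell quoting issues.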

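Progress

The Python processing code has not been written yet, and the database platform does not yet support the corresponding feature.
Human: annotation in progress
Mouse: file processing in progress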

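Obtaining COG annotations with eggNOG-mapper

See the corresponding article.
The difference is that COG annotation of isoforms directly uses the amino acid sequences from the KO annotation step above, with --itype proteins.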

python emapper.py -i sequences/triticum_aestivum/processed_protein_seqs_clean.fa --output output/triticum_aestivum_protein_out --tax_scope root --itype proteins --cpu 16

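Progress

The SQL generation code has not been written yet, and the database has not yet been adapted accordingly.
Human: annotation complete
Maize: annotation complete
Mouse: annotation complete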
