NCBI Datasets:
Datasets - NCBI: https://www.ncbi.nlm.nih.gov/datasets/
Installation
Windows download link:
https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/win64/datasets.exe
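Add the directory containing datasets.exe to the PATH environment variable, then type datasets in cmd; if the usage help is printed, the installation succeeded.
Installation with conda: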
conda create -n ncbi_datasets
conda activate ncbi_datasets
conda install -c conda-forge ncbi-datasets-cli
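Either install route can be checked from a shell before going further. The following is a minimal sketch, assuming a POSIX shell (for example inside the activated conda environment); the helper name have_tool is made up for illustration and is not part of the datasets tool.

```shell
# Hypothetical helper: succeeds if the named program is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool datasets; then
    echo "datasets found: $(command -v datasets)"
else
    echo "datasets not found on PATH; check the install or activate the conda environment"
fi
```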
Usage
Examples
datasets download genome accession GCF_000001405.39 --chromosomes X,Y --exclude-gff3 --exclude-rna
datasets download genome taxon "bos taurus"
datasets download gene gene-id 672
datasets download gene symbol brca1 --taxon mouse
datasets download gene accession NP_000483.3
datasets download virus genome taxon sars-cov-2 --host dog
datasets download virus protein S --host dog --filename SARS2-spike-dog.zip
datasets download --input-json request_file.json --filename output.zip
Genome download options:
Use these flags to download only the data you need.
Flags
-a, --annotated only include genomes with annotation
--assembly-level string restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
--assembly-source string restrict assemblies to refseq or genbank only
--chromosomes strings limit to a specified, comma-delimited list of chromosomes (default [all])
--dehydrated download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
--exclude-genomic-cds exclude cds_from_genomic.fna (genomic cds file)
--exclude-gff3 exclude genomic.gff (gff3 annotation file)
--exclude-protein exclude protein.faa (protein sequence file)
--exclude-rna exclude rna.fna (transcript sequence file)
--exclude-seq exclude genomic.fna (genomic sequence file)
-h, --help help for genome
--include-gbff include genomic.gbff (GenBank flat file sequence and annotation), if available
--include-gtf include genomic.gtf (gtf annotation file), if available
--reference limit to reference and representative (GCF_ and GCA_) assemblies
--released-before string only include genomes that have been released before a specified date (MM/DD/YYYY)
--released-since string only include genomes that have been released after a specified date (MM/DD/YYYY)
--search strings only include genomes that have the specified text in the
searchable fields: species and infraspecies, assembly name and submitter
To provide multiple strings '--search' can be included multiple times
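The two date flags above expect MM/DD/YYYY, which is easy to get wrong when dates are kept in ISO form. A small sketch of a converter follows; the function name to_ncbi_date and the example date are assumptions for illustration, and the command is only printed, not run.

```shell
# Hypothetical helper: convert an ISO date (YYYY-MM-DD) to MM/DD/YYYY.
# Input that does not match the ISO pattern is passed through unchanged.
to_ncbi_date() {
    printf '%s\n' "$1" | sed -E 's#^([0-9]{4})-([0-9]{2})-([0-9]{2})$#\2/\3/\1#'
}

since="$(to_ncbi_date 2021-01-01)"
# Print the command for review rather than running it.
echo "datasets download genome taxon \"bos taurus\" --released-since $since --dehydrated"
```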
For example, to download fungal genome data (taxid: 4751):
(The taxid can be found by searching NCBI. Other download options are listed by running the command datasets download.)
datasets download genome taxon "4751" --dehydrated --filename fungi_genome_dataset.zip --api-key 123456789abcdefghijk
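Because the dataset is large, --dehydrated first downloads a zip archive that holds only the JSON data report and the locations of the data files. The trailing --api-key option helps keep the server from blocking your IP for sending too many requests in a short time; an API key can be obtained by registering an NCBI account.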
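To avoid pasting the key into every command, one option is to keep it in an environment variable and assemble the command from it. This is only a sketch: the variable name MY_NCBI_KEY and the function build_download_cmd are assumptions made for this example, not something the datasets tool defines.

```shell
# Hypothetical helper: print a dehydrated genome download command.
# $1 = taxon (name or taxid), $2 = output zip file name.
# --api-key is appended only when MY_NCBI_KEY (an assumed variable name) is set.
build_download_cmd() {
    if [ -n "${MY_NCBI_KEY:-}" ]; then
        printf 'datasets download genome taxon "%s" --dehydrated --filename %s --api-key %s\n' \
            "$1" "$2" "$MY_NCBI_KEY"
    else
        printf 'datasets download genome taxon "%s" --dehydrated --filename %s\n' "$1" "$2"
    fi
}

# Print the command for review; run it by hand once it looks right.
build_download_cmd 4751 fungi_genome_dataset.zip
```

Printing instead of executing keeps the sketch safe to try, but note that the key appears in the terminal output.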
Once fungi_genome_dataset.zip has finished downloading, unzip it into the current directory. The file structure is as follows:
Archive: fungi_genome_dataset.zip
inflating: fungi_genome_dataset/README.md
inflating: fungi_genome_dataset/ncbi_dataset/data/*/assembly_data_report.jsonl
inflating: fungi_genome_dataset/ncbi_dataset/data/dataset_catalog.json
inflating: fungi_genome_dataset/ncbi_dataset/fetch.txt
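Before fetching the actual data it can help to see how many files the dehydrated archive refers to. The sketch below assumes that each non-empty line of fetch.txt describes one file still to be retrieved; that layout is an assumption, and count_fetch_entries is a made-up helper name.

```shell
# Hypothetical helper: count the non-empty lines in a fetch.txt file.
count_fetch_entries() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

fetch_file="fungi_genome_dataset/ncbi_dataset/fetch.txt"
if [ -f "$fetch_file" ]; then
    echo "files still to fetch: $(count_fetch_entries "$fetch_file")"
fi
```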
Download the data files
## If it reports "not find", check the path format carefully
datasets rehydrate --directory fungi_genome_dataset/
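Path mistakes can be caught before calling rehydrate. Going by the unzip listing earlier, the directory passed to --directory is the one that contains ncbi_dataset/fetch.txt; the sketch below checks for that file. Both this reading and the helper name looks_dehydrated are assumptions for illustration.

```shell
# Hypothetical helper: succeeds only if DIR/ncbi_dataset/fetch.txt exists.
# A trailing slash on DIR is tolerated.
looks_dehydrated() {
    [ -f "${1%/}/ncbi_dataset/fetch.txt" ]
}

dir="fungi_genome_dataset/"
if looks_dehydrated "$dir"; then
    echo "ok: run  datasets rehydrate --directory $dir"
else
    echo "no ncbi_dataset/fetch.txt under $dir; check the path"
fi
```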
datasets download genome:
Download a genome dataset including genome, transcript and protein sequence, annotation and a detailed data report.
Genome datasets can be specified by NCBI Assembly or BioProject accession or taxon. Datasets are downloaded as a zip file.
The default genome dataset includes the following files (if available):
* genomic.fna (genomic sequences)
* rna.fna (transcript sequences)
* protein.faa (protein sequences)
* genomic.gff (genome annotation in gff3 format)
* data_report.jsonl (data report with genome assembly and annotation metadata)
* dataset_catalog.json (a list of files and file types included in the dataset)
Refer to NCBI's [command line quickstart](https://www.ncbi.nlm.nih.gov/datasets/docs/quickstarts/command-line-tools/) documentation for information about getting started with the command-line tools.
Usage
datasets download genome [command]
Examples
datasets download genome accession GCF_000001405.39 --chromosomes X,Y --exclude-gff3 --exclude-rna
datasets download genome taxon "bos taurus" --dehydrated
datasets download genome taxon human --assembly-level chromosome,complete_genome --dehydrated
datasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydrated
Available Commands
accession download a genome dataset by NCBI Assembly or BioProject accession
taxon download a genome dataset by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)
Flags
-a, --annotated only include genomes with annotation
--assembly-level string restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
--assembly-source string restrict assemblies to refseq or genbank only
--chromosomes strings limit to a specified, comma-delimited list of chromosomes (default [all])
--dehydrated download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
--exclude-genomic-cds exclude cds_from_genomic.fna (genomic cds file)
--exclude-gff3 exclude genomic.gff (gff3 annotation file)
--exclude-protein exclude protein.faa (protein sequence file)
--exclude-rna exclude rna.fna (transcript sequence file)
--exclude-seq exclude genomic.fna (genomic sequence file)
-h, --help help for genome
--include-gbff include genomic.gbff (GenBank flat file sequence and annotation), if available
--include-gtf include genomic.gtf (gtf annotation file), if available
--reference limit to reference and representative (GCF_ and GCA_) assemblies
--released-before string only include genomes that have been released before a specified date (MM/DD/YYYY)
--released-since string only include genomes that have been released after a specified date (MM/DD/YYYY)
--search strings only include genomes that have the specified text in the
searchable fields: species and infraspecies, assembly name and submitter
To provide multiple strings '--search' can be included multiple times
Global Flags
--api-key string NCBI Datasets API Key
--filename string specify a custom file name for the downloaded dataset (default "ncbi_dataset.zip")
--no-progressbar hide progress bar
Use datasets download genome help <command> for detailed help about a command.
datasets download gene:
Usage
datasets download gene [flags]
datasets download gene [command]
Examples
datasets download gene gene-id 672
datasets download gene symbol brca1 --taxon mouse
datasets download gene accession NP_000483.3
datasets download gene gene-id 2778 --fasta-filter NC_000020.11,NM_001077490.3,NP_001070958.1
Available Commands
gene-id download a gene dataset by NCBI Gene ID
symbol download a gene dataset by gene symbol
accession download a gene dataset by RefSeq nucleotide or protein accession
taxon download a gene dataset by taxon
Flags
--exclude-gene exclude gene.fna (gene sequence file)
--exclude-protein exclude protein.faa (protein sequence file)
--exclude-rna exclude rna.fna (transcript sequence file)
--fasta-filter strings limit gene fasta download to a specific list of accessions
--fasta-filter-file string file of accessions to limit gene fasta download
-h, --help help for gene
--include-3p-utr include 3p_utr.fna (3'-UTR sequence file)
--include-5p-utr include 5p_utr.fna (5'-UTR sequence file)
--include-cds include cds.fna (CDS sequence file)
Global Flags
--api-key string NCBI Datasets API Key
--filename string specify a custom file name for the downloaded dataset (default "ncbi_dataset.zip")
--no-progressbar hide progress bar
Use datasets download gene help <command> for detailed help about a command.