Batch-downloading large bioinformatics datasets with NCBI Datasets

NCBI Datasets:

Datasets - NCBI: https://www.ncbi.nlm.nih.gov/datasets/

Installation

Windows download link:

https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/win64/datasets.exe

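Add the path of the exe to the PATH environment variable, then type datasets in cmd; if the usage text appears, the installation succeeded.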
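The same check can be scripted. The sketch below only asks whether an executable named datasets is visible on PATH; the function name find_datasets_executable is made up for this example and is not part of the NCBI tool.

```python
import shutil
from typing import Optional


def find_datasets_executable(name: str = "datasets") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    # On Windows, shutil.which also tries the extensions in PATHEXT,
    # so "datasets" matches datasets.exe.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_datasets_executable()
    if path is None:
        print("datasets was not found on PATH; check the environment variable")
    else:
        print(f"datasets found at {path}")
```

If the script reports that the executable is missing right after you edited PATH, open a new terminal first, since an already open cmd window keeps the old environment.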

Installing with conda:

conda create -n ncbi_datasets

conda activate ncbi_datasets

conda install -c conda-forge ncbi-datasets-cli

Usage

Examples
  datasets download genome accession GCF_000001405.39 --chromosomes X,Y --exclude-gff3 --exclude-rna
  datasets download genome taxon "bos taurus"
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download virus genome taxon sars-cov-2 --host dog
  datasets download virus protein S --host dog --filename SARS2-spike-dog.zip
  datasets download --input-json request_file.json --filename output.zip

Options for genome downloads:

Pick the flags that match the data you need.

Flags
  -a, --annotated                only include genomes with annotation
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
      --assembly-source string   restrict assemblies to refseq or genbank only
      --chromosomes strings      limit to a specified, comma-delimited list of chromosomes (default [all])
      --dehydrated               download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-genomic-cds      exclude cds_from_genomic.fna (genomic cds file)
      --exclude-gff3             exclude genomic.gff (gff3 annotation file)
      --exclude-protein          exclude protein.faa (protein sequence file)
      --exclude-rna              exclude rna.fna (transcript sequence file)
      --exclude-seq              exclude genomic.fna (genomic sequence file)
  -h, --help                     help for genome
      --include-gbff             include genomic.gbff (GenBank flat file sequence and annotation), if available
      --include-gtf              include genomic.gtf (gtf annotation file), if available
      --reference                limit to reference and representative (GCF_ and GCA_) assemblies
      --released-before string   only include genomes that have been released before a specified date (MM/DD/YYYY)
      --released-since string    only include genomes that have been released after a specified date (MM/DD/YYYY)
      --search strings           only include genomes that have the specified text in the
                                 searchable fields: species and infraspecies, assembly name and submitter
                                 To provide multiple strings '--search' can be included multiple times

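For example, to download genome data for fungi (taxid: 4751):

(The taxid can be found by searching NCBI. The other download options can be listed with the command datasets download.)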

datasets download genome taxon "4751" --dehydrated --filename fungi_genome_dataset.zip --api-key 123456789abcdefghijk

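Because the data volume is large, first download a dehydrated zip archive, which holds the JSON data report and the locations of the data files rather than the sequences themselves. The trailing --api-key keeps the server from blocking your IP for sending too many requests in a short time; an API key can be obtained by registering an NCBI account.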
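To avoid pasting the key into scripts or shell history, the command can be assembled programmatically. This is a minimal sketch that only builds the argument list shown above; the helper name build_dehydrated_download and the environment variable name NCBI_API_KEY are choices made for this example.

```python
import os
from typing import List, Optional


def build_dehydrated_download(
    taxon: str, filename: str, api_key: Optional[str] = None
) -> List[str]:
    """Build the argument list for a dehydrated genome download by taxon."""
    args = [
        "datasets", "download", "genome", "taxon", taxon,
        "--dehydrated", "--filename", filename,
    ]
    # Only add --api-key when a key was actually supplied.
    if api_key:
        args += ["--api-key", api_key]
    return args


if __name__ == "__main__":
    # NCBI_API_KEY is the variable name assumed by this sketch.
    key = os.environ.get("NCBI_API_KEY")
    cmd = build_dehydrated_download("4751", "fungi_genome_dataset.zip", key)
    # Mask the key so it does not end up in logs.
    shown = ["<api-key>" if key and part == key else part for part in cmd]
    print(" ".join(shown))
```

The returned list can be passed to subprocess.run(cmd, check=True) once datasets is installed; passing a list rather than a single string avoids shell quoting problems with taxon names that contain spaces.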

Once fungi_genome_dataset.zip has finished downloading, extract it under the current directory. The resulting file structure looks like this:

Archive:  fungi_genome_dataset.zip
  inflating: fungi_genome_dataset/README.md
  inflating: fungi_genome_dataset/ncbi_dataset/data/*/assembly_data_report.jsonl
  inflating: fungi_genome_dataset/ncbi_dataset/data/dataset_catalog.json
  inflating: fungi_genome_dataset/ncbi_dataset/fetch.txt
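Before fetching the sequence files it can help to see how many records the data report holds. The sketch below relies only on the JSON Lines convention (one JSON object per non-empty line) and does not read any field names; treating each record as one assembly is an assumption, and count_jsonl_records is a name invented for this example.

```python
import json
from pathlib import Path


def count_jsonl_records(
    dataset_dir: str, report_name: str = "assembly_data_report.jsonl"
) -> int:
    """Count JSON records in every report file found under dataset_dir."""
    total = 0
    # rglob searches all subdirectories, so the exact depth of the report
    # inside ncbi_dataset/data does not matter.
    for report in Path(dataset_dir).rglob(report_name):
        with report.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    total += 1
    return total


if __name__ == "__main__":
    print(count_jsonl_records("fungi_genome_dataset"))
```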

Downloading the data

## If a "not find" message appears, double-check the path format
datasets rehydrate --directory fungi_genome_dataset/
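Path mistakes can be caught before the tool is called. The sketch below assumes that the directory passed to --directory must be the folder that contains ncbi_dataset/fetch.txt, as in the listing above; whether a missing fetch.txt is what triggers the "not find" message is also an assumption, and rehydrate_command is a helper name made up here.

```python
from pathlib import Path
from typing import List


def rehydrate_command(directory: str) -> List[str]:
    """Validate the dataset folder and build the rehydrate argument list."""
    root = Path(directory)
    fetch_list = root / "ncbi_dataset" / "fetch.txt"
    if not fetch_list.is_file():
        raise FileNotFoundError(
            f"{fetch_list} does not exist; pass the folder that contains ncbi_dataset/"
        )
    return ["datasets", "rehydrate", "--directory", str(root)]


if __name__ == "__main__":
    try:
        print(" ".join(rehydrate_command("fungi_genome_dataset/")))
    except FileNotFoundError as err:
        print(err)
```

The list returned on success can be handed to subprocess.run(cmd, check=True) to start the actual download.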

datasets download genome:


Download a genome dataset including genome, transcript and protein sequence, annotation and a detailed data report.
Genome datasets can be specified by NCBI Assembly or BioProject accession or taxon. Datasets are downloaded as a zip file.

The default genome dataset includes the following files (if available):
* genomic.fna (genomic sequences)
* rna.fna (transcript sequences)
* protein.faa (protein sequences)
* genomic.gff (genome annotation in gff3 format)
* data_report.jsonl (data report with genome assembly and annotation metadata)
* dataset_catalog.json (a list of files and file types included in the dataset)

Refer to NCBI's [command line quickstart](https://www.ncbi.nlm.nih.gov/datasets/docs/quickstarts/command-line-tools/) documentation for information about getting started with the command-line tools.

Usage
  datasets download genome [command]

Examples
  datasets download genome accession GCF_000001405.39 --chromosomes X,Y --exclude-gff3 --exclude-rna
  datasets download genome taxon "bos taurus" --dehydrated
  datasets download genome taxon human --assembly-level chromosome,complete_genome --dehydrated
  datasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydrated

Available Commands
  accession   download a genome dataset by NCBI Assembly or BioProject accession
  taxon       download a genome dataset by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
  -a, --annotated                only include genomes with annotation
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
      --assembly-source string   restrict assemblies to refseq or genbank only
      --chromosomes strings      limit to a specified, comma-delimited list of chromosomes (default [all])
      --dehydrated               download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-genomic-cds      exclude cds_from_genomic.fna (genomic cds file)
      --exclude-gff3             exclude genomic.gff (gff3 annotation file)
      --exclude-protein          exclude protein.faa (protein sequence file)
      --exclude-rna              exclude rna.fna (transcript sequence file)
      --exclude-seq              exclude genomic.fna (genomic sequence file)
  -h, --help                     help for genome
      --include-gbff             include genomic.gbff (GenBank flat file sequence and annotation), if available
      --include-gtf              include genomic.gtf (gtf annotation file), if available
      --reference                limit to reference and representative (GCF_ and GCA_) assemblies
      --released-before string   only include genomes that have been released before a specified date (MM/DD/YYYY)
      --released-since string    only include genomes that have been released after a specified date (MM/DD/YYYY)
      --search strings           only include genomes that have the specified text in the
                                 searchable fields: species and infraspecies, assembly name and submitter
                                 To provide multiple strings '--search' can be included multiple times


Global Flags
      --api-key string    NCBI Datasets API Key
      --filename string   specify a custom file name for the downloaded dataset (default "ncbi_dataset.zip")
      --no-progressbar    hide progress bar

Use datasets download genome help <command> for detailed help about a command.
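Two of the flags above are easy to get wrong by hand: --assembly-level accepts only four values, and the date flags expect MM/DD/YYYY. The sketch below builds a filtered taxon download using only flags from this help output, which reflects the CLI version described here and may differ in other releases; build_filtered_download is a hypothetical helper name.

```python
from datetime import date
from typing import Iterable, List, Optional

# Values listed in the --assembly-level help text above.
ASSEMBLY_LEVELS = ("chromosome", "complete_genome", "contig", "scaffold")


def build_filtered_download(
    taxon: str,
    assembly_levels: Iterable[str] = (),
    released_since: Optional[date] = None,
    annotated: bool = False,
) -> List[str]:
    """Build a genome download command restricted by level, date and annotation."""
    args = ["datasets", "download", "genome", "taxon", taxon]
    levels = list(assembly_levels)
    unknown = [level for level in levels if level not in ASSEMBLY_LEVELS]
    if unknown:
        raise ValueError(f"unsupported assembly level(s): {', '.join(unknown)}")
    if levels:
        # The flag takes one comma-separated value.
        args += ["--assembly-level", ",".join(levels)]
    if released_since is not None:
        # The help text asks for MM/DD/YYYY.
        args += ["--released-since", released_since.strftime("%m/%d/%Y")]
    if annotated:
        args.append("--annotated")
    return args


if __name__ == "__main__":
    cmd = build_filtered_download(
        "human", ["chromosome", "complete_genome"], date(2020, 1, 5), annotated=True
    )
    print(" ".join(cmd))
```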

datasets download gene:

Usage
  datasets download gene [flags]
  datasets download gene [command]

Examples
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download gene gene-id 2778 --fasta-filter NC_000020.11,NM_001077490.3,NP_001070958.1

Available Commands
  gene-id     download a gene dataset by NCBI Gene ID
  symbol      download a gene dataset by gene symbol
  accession   download a gene dataset by RefSeq nucleotide or protein accession
  taxon       download a gene dataset by taxon

Flags
      --exclude-gene               exclude gene.fna (gene sequence file)
      --exclude-protein            exclude protein.faa (protein sequence file)
      --exclude-rna                exclude rna.fna (transcript sequence file)
      --fasta-filter strings       limit gene fasta download to a specific list of accessions
      --fasta-filter-file string   file of accessions to limit gene fasta download
  -h, --help                       help for gene
      --include-3p-utr             include 3p_utr.fna (3'-UTR sequence file)
      --include-5p-utr             include 5p_utr.fna (5'-UTR sequence file)
      --include-cds                include cds.fna (CDS sequence file)


Global Flags
      --api-key string    NCBI Datasets API Key
      --filename string   specify a custom file name for the downloaded dataset (default "ncbi_dataset.zip")
      --no-progressbar    hide progress bar

Use datasets download gene help <command> for detailed help about a command.
