Downloading GEO data from the Linux terminal

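Several commonly used functions in Bioconductor's GEOquery package can download GEO data, but sometimes I want to download directly from the terminal instead of opening RStudio and running a script. So I wrote simple shell wrappers around two GEOquery download functions, getGEO() and getGEOSuppFiles().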

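Installation

Use the clone command:

git clone https://github.com/ShixiangWang/mytoolkit/

Alternatively, click the "Clone or download" button at the top right of the repository page.

Prerequisites and help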

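R must be installed on your Linux system. If the GEOquery package is not installed, the scripts detect this and download and install it automatically.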
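The prerequisite check described above could look roughly like the sketch below. It is not the code from the repository: the function names `have_cmd` and `geoquery_install_expr` are hypothetical, and the use of BiocManager and the CRAN mirror URL are assumptions for current Bioconductor releases (the log later in this post shows the older biocLite installer).

```shell
# Return success if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print an R snippet that installs GEOquery only when it is missing.
# BiocManager and the CRAN mirror URL are assumptions, not taken from the scripts.
geoquery_install_expr() {
    cat <<'EOF'
if (!requireNamespace("GEOquery", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager", repos = "https://cloud.r-project.org")
    BiocManager::install("GEOquery")
}
EOF
}

# Only report what would be run; nothing is installed by this sketch.
if have_cmd Rscript; then
    echo "Rscript found; the install check could be passed to Rscript -e"
else
    echo "Rscript not found; install R first"
fi
```

To actually run the check, one could call `Rscript -e "$(geoquery_install_expr)"`.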

View each script's help:

./getGEOSuppFiles.sh -h

./getGEO.sh -h

./bulkGEO.sh -h

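Downloading GEO supplementary files

GEO supplementary files are usually the raw microarray data.

Usage:

Usage: ./getGEOSuppFiles.sh -n GEO -d directory

GEO: a GEO accession number, e.g. GPL1073 or GSM1137

directory: the directory to download into; defaults to your current directory.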
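As a rough illustration of how such a wrapper can map the two options onto GEOquery, the sketch below parses `-n` and `-d` with `getopts` and prints the R call instead of running it. The function name `supp_expr` is hypothetical and this is not the repository's implementation; `getGEOSuppFiles()` and its `baseDir` argument are part of GEOquery.

```shell
# Parse -n/-d and print the matching GEOquery call (nothing is downloaded).
# Runs in a subshell so variables and "exit" do not leak into the caller.
supp_expr() (
    geo=""
    dir="."
    OPTIND=1
    while getopts "n:d:" opt; do
        case "$opt" in
            n) geo=$OPTARG ;;
            d) dir=$OPTARG ;;
            *) echo "Usage: supp_expr -n GEO [-d directory]" >&2; exit 2 ;;
        esac
    done
    # The accession is mandatory; the directory defaults to ".".
    [ -n "$geo" ] || { echo "missing -n GEO" >&2; exit 2; }
    printf 'GEOquery::getGEOSuppFiles("%s", baseDir = "%s")\n' "$geo" "$dir"
)
```

The printed expression could then be handed to R, for example `Rscript -e "$(supp_expr -n GSM1137 -d ~/geo)"`.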

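Downloading GEO expression matrix files

This is the most commonly used feature: it downloads the expression matrix file for an array dataset. The data have already been preprocessed by the submitting researchers and can be used directly.

Usage:

Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL

Detail of Options

==================

-n GEO: a character string naming the GEO object (e.g. 'GDS505', 'GSE2', 'GSM2', 'GPL96')

-d destdir: destination directory for downloads; defaults to the current directory.

-M TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.

-A TRUE or FALSE: whether to use the Annotation GPL files (they will be downloaded). These files contain recently remapped Gene IDs and other basic information, but they do not exist for every platform. Defaults to TRUE.

-P TRUE or FALSE: whether to download the GPL information along with the GSEMatrix file. If you know you will use a Bioconductor annotation package instead, you can set this to FALSE. Defaults to TRUE.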

Minimal Use Method

==================

If you do not know how to use these options, just set -n option is OK

Like

./getGEO.sh -n GEO

change the 'GEO' above to name of GSE you want to download
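The option handling described above can be sketched as follows. This is an illustration under assumptions, not the script itself: `geo_expr` is a hypothetical name, the defaults follow the option list above, and the function only prints the `getGEO()` call (whose `destdir`, `GSEMatrix`, `AnnotGPL` and `getGPL` arguments exist in GEOquery) rather than running it.

```shell
# Translate -n/-d/-M/-A/-P into a GEOquery::getGEO() call and print it.
geo_expr() (
    geo=""
    dir="."
    matrix=TRUE
    annot=TRUE
    gpl=TRUE
    OPTIND=1
    while getopts "n:d:M:A:P:" opt; do
        case "$opt" in
            n) geo=$OPTARG ;;
            d) dir=$OPTARG ;;
            M) matrix=$OPTARG ;;
            A) annot=$OPTARG ;;
            P) gpl=$OPTARG ;;
            *) exit 2 ;;
        esac
    done
    [ -n "$geo" ] || { echo "missing -n GEO" >&2; exit 2; }
    # Accept only R logical literals for the three switches.
    for flag in "$matrix" "$annot" "$gpl"; do
        case "$flag" in
            TRUE|FALSE) ;;
            *) echo "expected TRUE or FALSE, got: $flag" >&2; exit 2 ;;
        esac
    done
    printf 'GEOquery::getGEO("%s", destdir = "%s", GSEMatrix = %s, AnnotGPL = %s, getGPL = %s)\n' \
        "$geo" "$dir" "$matrix" "$annot" "$gpl"
)
```

Validating the logical flags before building the call keeps a typo such as `-M true` from reaching R as an undefined symbol.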

Bulk download of expression matrix files and raw files

This feature builds on the previous two scripts and calls them in a loop.

Usage:

Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp

Detail of Options

==================

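-n GEO: a character string naming the GEO object (e.g. 'GDS505', 'GSE2', 'GSM2', 'GPL96')

-d destdir: destination directory for downloads; defaults to the current directory.

-M TRUE or FALSE: whether to download the GSE Series Matrix file. Defaults to TRUE.

-A TRUE or FALSE: whether to use the Annotation GPL files (they will be downloaded). These files contain recently remapped Gene IDs and other basic information, but they do not exist for every platform. Defaults to TRUE.

-f filename: you can put the names of the GEO objects to download in a file and point to it with this option. Note: if you use -f, do not also set -n, otherwise it will be overridden.

-s supp: TRUE or FALSE: whether to also download the raw supplementary files. Defaults to FALSE.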

Minimal Use Method

==================

If you do not know how to use these options, just set -n option is OK

Like

./bulkGEO.sh -n 'GEO1 GEO2 GEO3'

change the 'GEO' above to name of GSE you want to download
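The looping idea can be sketched as below. The helper names (`names_from_string`, `names_from_file`, `bulk_plan`) are hypothetical, and the assumption that the names file holds accessions separated by spaces or newlines is mine; the post does not specify the file format. The sketch prints the commands it would run instead of executing them.

```shell
# Split a space-separated accession string into one name per line.
names_from_string() {
    printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Read accessions from a file (assumed: separated by spaces or newlines).
names_from_file() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d'
}

# Read names on stdin and print one command per download step.
# $1: destination directory, $2: TRUE to also fetch supplementary files.
bulk_plan() {
    while IFS= read -r name; do
        echo "./getGEO.sh -n $name -d $1"
        if [ "$2" = "TRUE" ]; then
            echo "./getGEOSuppFiles.sh -n $name -d $1"
        fi
    done
}
```

For example, `names_from_string 'GSE1 GSE2' | bulk_plan ~/geo FALSE` lists two `getGEO.sh` calls; piping the result to `sh` would execute them from the repository directory.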

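I wrote this code yesterday to spare myself what felt like a tedious download process. I am not yet very proficient with Linux scripting, so the scripts may have problems. Basic downloads work without errors; I have tested them. If you run into problems or want other features, feel free to ask, and I will try to help.

Thanks for reading~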

------------------------------------------------------------------------------------------------------------

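Today I happened to be downloading GEO data on a new machine with only a few basic R packages installed, so here is what it looks like in practice.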

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -h

Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp

Detail of Options

==================

-n GEO: A character string representing GEO objects for download and parsing. (eg., 'GDS505','GSE2','GSM2','GPL96'), you can use space to seperate multiple objects. Or you can use the -f option to locate the file where you put names of GEO object.

-d destdir: The destination directory for any downloads. Defaults to the current directory. You may want to specify a different directory if you want to save the file for later use. Doing so is a good idea if you have a slow connection, as some of the GEO files are HUGE!

-M A boolean telling GEOquery whether or not to use GSE Series Matrix files from GEO. The parsing of these files can be many orders-of-magnitude faster than parsing the GSE SOFT format files. Defaults to TRUE, meaning that the SOFT format parsing will not occur; set to FALSE if you for some reason need other columns from the GSE records.

-A A boolean defaulting to TRUE as to whether or not to use the Annotation GPL information. These files are nice to use because they contain up-to-date information remapped from Entrez Gene on a regular basis. However, they do not exist for all GPLs; in general, they are only available for GPLs referenced by a GDS

-f filename: a character string specify the filename where GEO names stored.

-s supp: A boolean defaulting to FALSE as to whether or not to download supplementary files.

Minimal Use Method

==================

If you do not know how to use these options, just set -n option is OK

Like

./bulkGEO.sh -n 'GEO1 GEO2 GEO3'

change the 'GEO*' above to name of GSE you want to download

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -d ~/workspace/GEO_data/igcc_cnv/ -f ~/workspace/GEO_data/igcc_cnv/geo_names.txt

Package GEOquery not available. Atempting to install it.

Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help

BioC_mirror: https://bioconductor.org

Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).

Installing package(s) ‘GEOquery’

also installing the dependencies ‘BiocGenerics’, ‘Biobase’

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/BiocGenerics_0.24.0.tar.gz'

Content type 'application/x-gzip' length 43393 bytes (42 KB)

==================================================

downloaded 42 KB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/Biobase_2.38.0.tar.gz'

Content type 'application/x-gzip' length 1656734 bytes (1.6 MB)

==================================================

downloaded 1.6 MB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/GEOquery_2.46.13.tar.gz'

Content type 'application/x-gzip' length 13745245 bytes (13.1 MB)

==================================================

downloaded 13.1 MB

* installing *source* package ‘BiocGenerics’ ...

** R

** inst

** preparing package for lazy loading

Creating a new generic function for ‘append’ in package ‘BiocGenerics’

Creating a new generic function for ‘as.data.frame’ in package ‘BiocGenerics’

Creating a new generic function for ‘cbind’ in package ‘BiocGenerics’

Creating a new generic function for ‘rbind’ in package ‘BiocGenerics’

Creating a new generic function for ‘do.call’ in package ‘BiocGenerics’

Creating a new generic function for ‘duplicated’ in package ‘BiocGenerics’

Creating a new generic function for ‘anyDuplicated’ in package ‘BiocGenerics’

Creating a new generic function for ‘eval’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmax’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmin’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmax.int’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmin.int’ in package ‘BiocGenerics’

Creating a new generic function for ‘Reduce’ in package ‘BiocGenerics’

Creating a new generic function for ‘Filter’ in package ‘BiocGenerics’

Creating a new generic function for ‘Find’ in package ‘BiocGenerics’

Creating a new generic function for ‘Map’ in package ‘BiocGenerics’

Creating a new generic function for ‘Position’ in package ‘BiocGenerics’

Creating a new generic function for ‘get’ in package ‘BiocGenerics’

Creating a new generic function for ‘mget’ in package ‘BiocGenerics’

Creating a new generic function for ‘grep’ in package ‘BiocGenerics’

Creating a new generic function for ‘grepl’ in package ‘BiocGenerics’

Creating a new generic function for ‘is.unsorted’ in package ‘BiocGenerics’

Creating a new generic function for ‘lapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘sapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘lengths’ in package ‘BiocGenerics’

Creating a new generic function for ‘mapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘match’ in package ‘BiocGenerics’

Creating a new generic function for ‘rowSums’ in package ‘BiocGenerics’

Creating a new generic function for ‘colSums’ in package ‘BiocGenerics’

Creating a new generic function for ‘rowMeans’ in package ‘BiocGenerics’

Creating a new generic function for ‘colMeans’ in package ‘BiocGenerics’

Creating a new generic function for ‘order’ in package ‘BiocGenerics’

Creating a new generic function for ‘paste’ in package ‘BiocGenerics’

Creating a new generic function for ‘rank’ in package ‘BiocGenerics’

Creating a new generic function for ‘rownames’ in package ‘BiocGenerics’

Creating a new generic function for ‘colnames’ in package ‘BiocGenerics’

Creating a new generic function for ‘union’ in package ‘BiocGenerics’

Creating a new generic function for ‘intersect’ in package ‘BiocGenerics’

Creating a new generic function for ‘setdiff’ in package ‘BiocGenerics’

Creating a new generic function for ‘sort’ in package ‘BiocGenerics’

Creating a new generic function for ‘table’ in package ‘BiocGenerics’

Creating a new generic function for ‘tapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘unique’ in package ‘BiocGenerics’

Creating a new generic function for ‘unsplit’ in package ‘BiocGenerics’

Creating a new generic function for ‘var’ in package ‘BiocGenerics’

Creating a new generic function for ‘sd’ in package ‘BiocGenerics’

Creating a new generic function for ‘which’ in package ‘BiocGenerics’

Creating a new generic function for ‘which.max’ in package ‘BiocGenerics’

Creating a new generic function for ‘which.min’ in package ‘BiocGenerics’

Creating a new generic function for ‘IQR’ in package ‘BiocGenerics’

Creating a new generic function for ‘mad’ in package ‘BiocGenerics’

Creating a new generic function for ‘xtabs’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterCall’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterApply’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterApplyLB’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterEvalQ’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterExport’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterMap’ in package ‘BiocGenerics’

Creating a new generic function for ‘parLapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parSapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parApply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parRapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parCapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parLapplyLB’ in package ‘BiocGenerics’

Creating a new generic function for ‘parSapplyLB’ in package ‘BiocGenerics’

** help

*** installing help indices

** building package indices

** testing if installed package can be loaded

* DONE (BiocGenerics)

* installing *source* package ‘Biobase’ ...

** libs

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c Rinit.c -o Rinit.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c anyMissing.c -o anyMissing.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c envir.c -o envir.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c matchpt.c -o matchpt.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c rowMedians.c -o rowMedians.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c sublist_extract.c -o sublist_extract.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -shared -L/public/home/wangshx/anaconda3/lib/R/lib -Wl,-O2,--sort-common,--as-needed,-z,relro,-z,now -L/public/home/wangshx/anaconda3/lib -o Biobase.so Rinit.o anyMissing.o envir.o matchpt.o rowMedians.o sublist_extract.o -L/public/home/wangshx/anaconda3/lib/R/lib -lR

installing to /public/home/wangshx/anaconda3/lib/R/library/Biobase/libs

** R

** data

** inst

** preparing package for lazy loading

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

* DONE (Biobase)

* installing *source* package ‘GEOquery’ ...

** R

** inst

** preparing package for lazy loading

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

* DONE (GEOquery)

The downloaded source packages are in

‘/tmp/Rtmptc9bgw/downloaded_packages’

Updating HTML index of packages in '.Library'

Making 'packages.html' ... done

Old packages: 'hms', 'limma', 'Rcpp', 'tibble', 'xml2'

GEO: GSE76730

destdir: /public/home/wangshx/workspace/GEO_data/igcc_cnv/

GSEMatrix: TRUE

AnnotGPL: TRUE

getGPL: TRUE

Found 1 file(s)

GSE76730_series_matrix.txt.gz

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz'

Content type 'application/x-gzip' length 262447098 bytes (250.3 MB)

==================================================

downloaded 250.3 MB

Parsed with column specification:

cols(

.default = col_double(),

ID_REF = col_character()

)

See spec(...) for full column specifications.

Annotation GPL not available, so will use submitter GPL instead

File stored at:

/public/home/wangshx/workspace/GEO_data/igcc_cnv//GPL3718.soft

$GSE76730_series_matrix.txt.gz

ExpressionSet (storageMode: lockedEnvironment)

assayData: 261981 features, 190 samples

element names: exprs

protocolData: none

phenoData

sampleNames: GSM2036728 GSM2036729 ... GSM2036917 (190 total)

varLabels: title geo_accession ... who performance status:ch1 (61

total)

varMetadata: labelDescription

featureData

featureNames: SNP_A-1780270 SNP_A-1780272 ... SNP_A-4241299 (261981

total)

fvarLabels: ID Affy SNP ID ... SPOT_ID (27 total)

fvarMetadata: Column Description labelDescription

experimentData: use 'experimentData(object)'

Annotation: GPL3718

Warning message:

In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :

cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL3nnn/GPL3718/annot/GPL3718.annot.gz': HTTP status was '404 Not Found'

The files of GSE76730 download successfully!

The files of GSE76730 download successfully!
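The series matrix URL in the log above follows a visible pattern: the accession's last three digits are replaced by "nnn" to form the parent directory. The sketch below reproduces that pattern; `geo_matrix_url` is a hypothetical helper, and the file name it builds assumes a series with a single matrix file, as in this log (series spanning several platforms use different file names).

```shell
# Build the series matrix URL for a GSE accession, following the pattern
# seen in the log (GSE76730 -> series/GSE76nnn/GSE76730/matrix/...).
geo_matrix_url() (
    acc=$1
    prefix=${acc%%[0-9]*}       # leading letters, e.g. GSE
    digits=${acc#"$prefix"}     # numeric part, e.g. 76730
    if [ "${#digits}" -gt 3 ]; then
        stub="${prefix}${digits%???}nnn"   # drop the last three digits
    else
        stub="${prefix}nnn"                # short accessions share one directory
    fi
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s_series_matrix.txt.gz\n' \
        "$stub" "$acc" "$acc"
)
```

With a URL in hand, a plain `wget` or `curl -O` call is an alternative when only the raw matrix file is needed and no parsing in R is required.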
