Downloading GEO data from the Linux terminal

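This post describes shell scripts that wrap the getGEO() and getGEOSuppFiles() functions of the Bioconductor GEOquery package to simplify downloading GEO data. The scripts download GEO data directly from the terminal, including expression matrix files and supplementary files.

A few commonly used functions in Bioconductor's GEOquery can download GEO data, but sometimes I want to download straight from the terminal instead of opening RStudio and running a script. So I wrote simple shell wrappers around GEOquery's two download functions, getGEO() and getGEOSuppFiles().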
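To make the idea of wrapping concrete, here is a minimal sketch of how a shell function could hand a getGEO() call to Rscript. It is not taken from the mytoolkit scripts: the function names build_getgeo_expr and run_getgeo are hypothetical, and the arguments are pasted into the R expression without escaping, so it assumes plain accession names and simple paths.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not copied from the mytoolkit repository.

# Build the R expression for one GEOquery::getGEO() call.
# Arguments: accession, destination directory, GSEMatrix, AnnotGPL, getGPL.
build_getgeo_expr() {
    local geo="$1" destdir="${2:-.}"
    local gsematrix="${3:-TRUE}" annotgpl="${4:-TRUE}" getgpl="${5:-TRUE}"
    printf 'GEOquery::getGEO("%s", destdir = "%s", GSEMatrix = %s, AnnotGPL = %s, getGPL = %s)\n' \
        "$geo" "$destdir" "$gsematrix" "$annotgpl" "$getgpl"
}

# Evaluate the expression with Rscript (needs R and GEOquery installed).
run_getgeo() {
    Rscript -e "$(build_getgeo_expr "$@")"
}
```

With R and GEOquery available, `run_getgeo GSE2 ./data` would download GSE2 into ./data. Keeping the expression builder separate from the Rscript call makes it possible to inspect the generated R code without starting R.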

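Installation

Use the clone command:

git clone https://github.com/ShixiangWang/mytoolkit/

Alternatively, click the "Clone or download" button at the top right of the repository page.

Prerequisites and help

R must be installed on your Linux system. If the GEOquery package is not installed, the scripts detect this and download and install it automatically.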
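The automatic check could look roughly like the sketch below. This is an assumption about the approach, not the scripts' actual code: the run shown later in this post used the older BiocInstaller (biocLite), while this sketch uses BiocManager, the installer for current Bioconductor releases. The name ensure_geoquery and the CRAN mirror URL are illustrative choices.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real scripts may perform this check differently.

# R code that installs GEOquery through BiocManager only when it is missing.
R_ENSURE_GEOQUERY='
if (!requireNamespace("GEOquery", quietly = TRUE)) {
    message("Package GEOquery not available. Attempting to install it.")
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager", repos = "https://cloud.r-project.org")
    }
    BiocManager::install("GEOquery", ask = FALSE, update = FALSE)
}
'

ensure_geoquery() {
    # Fail early with a clear message when R itself is absent.
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found: please install R first" >&2
        return 1
    fi
    Rscript -e "$R_ENSURE_GEOQUERY"
}
```

Calling `ensure_geoquery` once at the top of a download script keeps the installation step out of the download logic itself.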

To view each script's help:

./getGEOSuppFiles.sh -h

./getGEO.sh -h

./bulkGEO.sh -h

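Downloading GEO supplementary files

GEO supplementary files are usually the raw microarray data.

Command syntax:

Usage: ./getGEOSuppFiles.sh -n GEO -d directory

GEO: a GEO accession, for example GPL1073 or GSM1137

directory: the directory to download into; defaults to your current directory.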
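Because a mistyped accession only fails after R has started, it can help to check the argument first. The helper below is a hypothetical addition, not part of the original scripts. It assumes an accession is one of the prefixes GSE, GSM, GPL or GDS followed by digits, which matches every example in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the original scripts.

# Succeed only when the argument looks like a GEO accession:
# one of the prefixes GSE, GSM, GPL or GDS followed by one or more digits.
is_geo_accession() {
    printf '%s\n' "$1" | grep -Eq '^(GSE|GSM|GPL|GDS)[0-9]+$'
}
```

A script could then run `is_geo_accession "$geo" || { echo "bad accession: $geo" >&2; exit 1; }` before calling R.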

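Downloading GEO expression matrix files

This is the most commonly used feature: it downloads the expression matrix file for a microarray dataset. The data have already been preprocessed by the submitting researchers and can be used directly.

Command syntax: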

Usage: ./getGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -P getGPL

Detail of Options

==================

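-n GEO: a character string naming the GEO object (for example 'GDS505', 'GSE2', 'GSM2', 'GPL96').

-d destdir: the destination directory for downloads; defaults to the current directory.

-M: logical TRUE or FALSE, telling the script whether to download the GSE Series Matrix file. Defaults to TRUE.

-A: logical TRUE or FALSE, telling the script whether to use the Annotation GPL information files (they will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they do not exist for every platform. Defaults to TRUE.

-P: logical TRUE or FALSE, telling the script whether to download the GPL information when it downloads the GSEMatrix file. If you know you will use Bioconductor annotation packages instead, you can set this to FALSE. Defaults to TRUE.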

Minimal Use Method

==================

If you are unsure how to use these options, setting only the -n option is enough:

./getGEO.sh -n GEO

Replace 'GEO' above with the accession of the GSE you want to download.
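The options above map naturally onto the shell's getopts builtin. The following is a sketch of how such parsing might be written, with the defaults taken from the option list. The function name and the upper-case variable names are illustrative assumptions, and the -h help option is left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of option parsing; names are illustrative only.

parse_getgeo_args() {
    local opt
    # Defaults as described in the option list.
    GEO=""
    DESTDIR="."
    GSEMATRIX="TRUE"
    ANNOTGPL="TRUE"
    GETGPL="TRUE"

    OPTIND=1   # reset so the function can be called more than once
    while getopts "n:d:M:A:P:" opt; do
        case "$opt" in
            n) GEO="$OPTARG" ;;
            d) DESTDIR="$OPTARG" ;;
            M) GSEMATRIX="$OPTARG" ;;
            A) ANNOTGPL="$OPTARG" ;;
            P) GETGPL="$OPTARG" ;;
            *) return 2 ;;   # unknown option or missing argument
        esac
    done

    # -n is the only option without a usable default.
    if [ -z "$GEO" ]; then
        echo "error: -n GEO is required" >&2
        return 2
    fi
}
```

After `parse_getgeo_args "$@"` succeeds, the five variables hold either the user's values or the defaults, which is what makes the minimal call with only -n work.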

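Bulk download of expression matrix files and raw files

This feature builds on the previous two scripts by calling them in a loop.

Command syntax: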

Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp

Detail of Options

==================

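-n GEO: a character string naming the GEO objects (for example 'GDS505', 'GSE2', 'GSM2', 'GPL96'); separate multiple objects with spaces.

-d destdir: the destination directory for downloads; defaults to the current directory.

-M: logical TRUE or FALSE, telling the script whether to download the GSE Series Matrix file. Defaults to TRUE.

-A: logical TRUE or FALSE, telling the script whether to use the Annotation GPL information files (they will be downloaded). These files contain up-to-date Gene ID mappings and other basic information, but they do not exist for every platform. Defaults to TRUE.

-f filename: you can list the GEO object names to download in a file and pass that file here. Note that if you use -f, you should not also set -n, otherwise the value will be overridden.

-s supp: logical TRUE or FALSE, setting whether to download the raw supplementary files. The script's help text gives FALSE as the default.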

Minimal Use Method

==================

If you are unsure how to use these options, setting only the -n option is enough:

./bulkGEO.sh -n 'GEO1 GEO2 GEO3'

Replace each 'GEO*' name above with the accession of a GSE you want to download.
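The looping idea can be sketched as follows. This is an assumption about the structure rather than the actual bulkGEO.sh code: the function names are hypothetical, only the -n option is forwarded to the downloader, and the -d, -M, -A and -s options are omitted to keep the example short.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real bulkGEO.sh may be organised differently.

# Print one accession per line, taken from the file given as second
# argument when there is one, otherwise from the space-separated string.
list_accessions() {
    local names="$1" file="${2:-}"
    if [ -n "$file" ]; then
        cat "$file"
    else
        printf '%s\n' "$names"
    fi | tr -s '[:space:]' '\n' | grep -v '^$'
}

# Call a downloader command once per accession.
bulk_download() {
    local downloader="$1" names="$2" file="${3:-}" geo
    list_accessions "$names" "$file" | while IFS= read -r geo; do
        # Redirect stdin so the downloader cannot swallow the accession list.
        "$downloader" -n "$geo" < /dev/null
    done
}
```

For example, `bulk_download ./getGEO.sh 'GSE1 GSE2'` would run ./getGEO.sh once for each of the two series, and `bulk_download ./getGEO.sh '' geo_names.txt` would read the names from a file instead.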

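I wrote this code yesterday to spare myself what felt like a tedious download process. I am not yet very proficient with Linux shell scripting, so the scripts may contain problems. Basic downloads work without errors; I have tested them. If you run into problems or want other features, feel free to ask and I will try to address them.

Thanks for reading~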

------------------------------------------------------------------------------------------------------------

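Today I happened to be downloading GEO data on a new machine that had only a few basic R packages installed, so here is what a run looks like.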

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -h

Usage: ./bulkGEO.sh -n GEO -d destdir -M GSEMatrix -A AnnotGPL -f filename -s supp

Detail of Options

==================

-n GEO: A character string representing GEO objects for download and parsing. (eg., 'GDS505','GSE2','GSM2','GPL96'), you can use space to seperate multiple objects. Or you can use the -f option to locate the file where you put names of GEO object.

-d destdir: The destination directory for any downloads. Defaults to the current directory. You may want to specify a different directory if you want to save the file for later use. Doing so is a good idea if you have a slow connection, as some of the GEO files are HUGE!

-M A boolean telling GEOquery whether or not to use GSE Series Matrix files from GEO. The parsing of these files can be many orders-of-magnitude faster than parsing the GSE SOFT format files. Defaults to TRUE, meaning that the SOFT format parsing will not occur; set to FALSE if you for some reason need other columns from the GSE records.

-A A boolean defaulting to TRUE as to whether or not to use the Annotation GPL information. These files are nice to use because they contain up-to-date information remapped from Entrez Gene on a regular basis. However, they do not exist for all GPLs; in general, they are only available for GPLs referenced by a GDS

-f filename: a character string specify the filename where GEO names stored.

-s supp: A boolean defaulting to FALSE as to whether or not to download supplementary files.

Minimal Use Method

==================

If you do not know how to use these options, just set -n option is OK

./bulkGEO.sh -n 'GEO1 GEO2 GEO3'

change the 'GEO*' above to name of GSE you want to download

[wangshx@HPC-login mytoolkit]$ ./bulkGEO.sh -d ~/workspace/GEO_data/igcc_cnv/ -f ~/workspace/GEO_data/igcc_cnv/geo_names.txt

Package GEOquery not available. Atempting to install it.

Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help

BioC_mirror: https://bioconductor.org

Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).

Installing package(s) ‘GEOquery’

also installing the dependencies ‘BiocGenerics’, ‘Biobase’

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/BiocGenerics_0.24.0.tar.gz'

Content type 'application/x-gzip' length 43393 bytes (42 KB)

==================================================

downloaded 42 KB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/Biobase_2.38.0.tar.gz'

Content type 'application/x-gzip' length 1656734 bytes (1.6 MB)

==================================================

downloaded 1.6 MB

trying URL 'https://bioconductor.org/packages/3.6/bioc/src/contrib/GEOquery_2.46.13.tar.gz'

Content type 'application/x-gzip' length 13745245 bytes (13.1 MB)

==================================================

downloaded 13.1 MB

* installing *source* package ‘BiocGenerics’ ...

** R

** inst

** preparing package for lazy loading

Creating a new generic function for ‘append’ in package ‘BiocGenerics’

Creating a new generic function for ‘as.data.frame’ in package ‘BiocGenerics’

Creating a new generic function for ‘cbind’ in package ‘BiocGenerics’

Creating a new generic function for ‘rbind’ in package ‘BiocGenerics’

Creating a new generic function for ‘do.call’ in package ‘BiocGenerics’

Creating a new generic function for ‘duplicated’ in package ‘BiocGenerics’

Creating a new generic function for ‘anyDuplicated’ in package ‘BiocGenerics’

Creating a new generic function for ‘eval’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmax’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmin’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmax.int’ in package ‘BiocGenerics’

Creating a new generic function for ‘pmin.int’ in package ‘BiocGenerics’

Creating a new generic function for ‘Reduce’ in package ‘BiocGenerics’

Creating a new generic function for ‘Filter’ in package ‘BiocGenerics’

Creating a new generic function for ‘Find’ in package ‘BiocGenerics’

Creating a new generic function for ‘Map’ in package ‘BiocGenerics’

Creating a new generic function for ‘Position’ in package ‘BiocGenerics’

Creating a new generic function for ‘get’ in package ‘BiocGenerics’

Creating a new generic function for ‘mget’ in package ‘BiocGenerics’

Creating a new generic function for ‘grep’ in package ‘BiocGenerics’

Creating a new generic function for ‘grepl’ in package ‘BiocGenerics’

Creating a new generic function for ‘is.unsorted’ in package ‘BiocGenerics’

Creating a new generic function for ‘lapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘sapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘lengths’ in package ‘BiocGenerics’

Creating a new generic function for ‘mapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘match’ in package ‘BiocGenerics’

Creating a new generic function for ‘rowSums’ in package ‘BiocGenerics’

Creating a new generic function for ‘colSums’ in package ‘BiocGenerics’

Creating a new generic function for ‘rowMeans’ in package ‘BiocGenerics’

Creating a new generic function for ‘colMeans’ in package ‘BiocGenerics’

Creating a new generic function for ‘order’ in package ‘BiocGenerics’

Creating a new generic function for ‘paste’ in package ‘BiocGenerics’

Creating a new generic function for ‘rank’ in package ‘BiocGenerics’

Creating a new generic function for ‘rownames’ in package ‘BiocGenerics’

Creating a new generic function for ‘colnames’ in package ‘BiocGenerics’

Creating a new generic function for ‘union’ in package ‘BiocGenerics’

Creating a new generic function for ‘intersect’ in package ‘BiocGenerics’

Creating a new generic function for ‘setdiff’ in package ‘BiocGenerics’

Creating a new generic function for ‘sort’ in package ‘BiocGenerics’

Creating a new generic function for ‘table’ in package ‘BiocGenerics’

Creating a new generic function for ‘tapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘unique’ in package ‘BiocGenerics’

Creating a new generic function for ‘unsplit’ in package ‘BiocGenerics’

Creating a new generic function for ‘var’ in package ‘BiocGenerics’

Creating a new generic function for ‘sd’ in package ‘BiocGenerics’

Creating a new generic function for ‘which’ in package ‘BiocGenerics’

Creating a new generic function for ‘which.max’ in package ‘BiocGenerics’

Creating a new generic function for ‘which.min’ in package ‘BiocGenerics’

Creating a new generic function for ‘IQR’ in package ‘BiocGenerics’

Creating a new generic function for ‘mad’ in package ‘BiocGenerics’

Creating a new generic function for ‘xtabs’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterCall’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterApply’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterApplyLB’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterEvalQ’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterExport’ in package ‘BiocGenerics’

Creating a new generic function for ‘clusterMap’ in package ‘BiocGenerics’

Creating a new generic function for ‘parLapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parSapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parApply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parRapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parCapply’ in package ‘BiocGenerics’

Creating a new generic function for ‘parLapplyLB’ in package ‘BiocGenerics’

Creating a new generic function for ‘parSapplyLB’ in package ‘BiocGenerics’

** help

*** installing help indices

** building package indices

** testing if installed package can be loaded

* DONE (BiocGenerics)

* installing *source* package ‘Biobase’ ...

** libs

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -I/public/home/wangshx/anaconda3/lib/R/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -I/public/home/wangshx/anaconda3/include -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/public/home/wangshx/anaconda3/include -c Rinit.c -o Rinit.o

/public/home/wangshx/anaconda3/bin/x86_64-conda_cos6-linux-gnu-cc -shared -L/public/home/wangshx/anaconda3/lib/R/lib -Wl,-O2,--sort-common,--as-needed,-z,relro,-z,now -L/public/home/wangshx/anaconda3/lib -o Biobase.so Rinit.o anyMissing.o envir.o matchpt.o rowMedians.o sublist_extract.o -L/public/home/wangshx/anaconda3/lib/R/lib -lR

installing to /public/home/wangshx/anaconda3/lib/R/library/Biobase/libs

** R

** data

** inst

** preparing package for lazy loading

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

* DONE (Biobase)

* installing *source* package ‘GEOquery’ ...

** R

** inst

** preparing package for lazy loading

** help

*** installing help indices

** building package indices

** installing vignettes

** testing if installed package can be loaded

* DONE (GEOquery)

The downloaded source packages are in

‘/tmp/Rtmptc9bgw/downloaded_packages’

Updating HTML index of packages in '.Library'

Making 'packages.html' ... done

Old packages: 'hms', 'limma', 'Rcpp', 'tibble', 'xml2'

GEO: GSE76730

destdir: /public/home/wangshx/workspace/GEO_data/igcc_cnv/

GSEMatrix: TRUE

AnnotGPL: TRUE

getGPL: TRUE

Found 1 file(s)

GSE76730_series_matrix.txt.gz

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz'

Content type 'application/x-gzip' length 262447098 bytes (250.3 MB)

==================================================

downloaded 250.3 MB

Parsed with column specification:

cols(

.default = col_double(),

ID_REF = col_character()

)

See spec(...) for full column specifications.

Annotation GPL not available, so will use submitter GPL instead

File stored at:

/public/home/wangshx/workspace/GEO_data/igcc_cnv//GPL3718.soft

$GSE76730_series_matrix.txt.gz

ExpressionSet (storageMode: lockedEnvironment)

assayData: 261981 features, 190 samples

element names: exprs

protocolData: none

phenoData

sampleNames: GSM2036728 GSM2036729 ... GSM2036917 (190 total)

varLabels: title geo_accession ... who performance status:ch1 (61

total)

varMetadata: labelDescription

featureData

featureNames: SNP_A-1780270 SNP_A-1780272 ... SNP_A-4241299 (261981

total)

fvarLabels: ID Affy SNP ID ... SPOT_ID (27 total)

fvarMetadata: Column Description labelDescription

experimentData: use 'experimentData(object)'

Annotation: GPL3718

Warning message:

In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :

cannot open URL 'https://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL3nnn/GPL3718/annot/GPL3718.annot.gz': HTTP status was '404 Not Found'

The files of GSE76730 download successfully!
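The series matrix URL in the log above follows a visible pattern: the last three digits of the accession number are replaced by "nnn" to form the parent directory. The helper below reproduces that pattern as a sketch. The function name is hypothetical, and it assumes the series has a single matrix file named after the accession; a series can have more than one matrix file with different names, in which case this URL will not exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper that reproduces the URL pattern visible in the log.

# GSE76730 -> .../geo/series/GSE76nnn/GSE76730/matrix/GSE76730_series_matrix.txt.gz
geo_series_matrix_url() {
    local gse="$1" digits stub
    digits="${gse#GSE}"
    # Replace the last three digits with "nnn"; accessions with three
    # digits or fewer live under "GSEnnn".
    if [ "${#digits}" -gt 3 ]; then
        stub="GSE${digits%???}nnn"
    else
        stub="GSEnnn"
    fi
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s_series_matrix.txt.gz\n' \
        "$stub" "$gse" "$gse"
}
```

Such a URL could be passed to a plain downloader, for example `curl -fLO "$(geo_series_matrix_url GSE76730)"`, when only the compressed matrix file is needed and parsing in R is not.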
