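Sequencing Data Processing: Retrieving SRA Data
Introduction
pysradb is a tool written in Python that provides a set of simple commands for accessing the metadata of sequencing data in the SRA and ENA databases; it can also download the data itself.
Its ID conversions are especially useful when you download data in bulk and need to map each run back to its sample accession (GSMXXXXX). The sections below give a brief tour of its usage.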
Installation
Install with pip:
pip install pysradb
Install with conda:
conda install -c bioconda pysradb
Or install it into a dedicated virtual environment:
conda create -c bioconda -n pysradb python=3.10 pysradb
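Usage
Command line
After installation, run the following on the command line:
pysradb -h
The help output lists many subcommands, most of which convert between ID types.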
Metadata
Fetch the metadata for a study:
pysradb metadata SRP098789 | grep 'RNA-Seq'
SRP098789 Selective stalling of human translation through small molecule engagement of the ribosome nascent chain SRX2536422 GSM2476016: Vehicle, 60 min, rep 4-ribo-seq; Homo sapiens; RNA-Seq GSM2476016: Vehicle, 60 min, rep 4-ribo-seq; Homo sapiens; RNA-Seq 9606 Homo sapiens RNA-Seq TRANSCRIPTOMIC cDNA SINGLE SRS1956372 SAMN06293468 PRJNA369742 Illumina HiSeq 2500 Illumina HiSeq 2500 ILLUMINA 66480991 1564636133 SRR5227307 66480991 3390530541
SRP098789 Selective stalling of human translation through small molecule engagement of the ribosome nascent chain SRX2536424 GSM2476018: vehicle, 60 min, rep 5-Ribo-seq; Homo sapiens; RNA-Seq GSM2476018: vehicle, 60 min, rep 5-Ribo-seq; Homo sapiens; RNA-Seq 9606 Homo sapiens RNA-Seq TRANSCRIPTOMIC cDNA SINGLE SRS1956374 SAMN06293466 PRJNA369742 Illumina HiSeq 2500 Illumina HiSeq 2500 ILLUMINA 40062613 904488287 SRR5227309 40062613 2043193263
SRP098789 Selective stalling of human translation through small molecule engagement of the ribosome nascent chain SRX2536426 GSM2476020: vehicle, 60 min, rep 4-mRNAseq; Homo sapiens; RNA-Seq GSM2476020: vehicle, 60 min, rep 4-mRNAseq; Homo sapiens; RNA-Seq 9606 Homo sapiens RNA-Seq TRANSCRIPTOMIC cDNA SINGLE SRS1956376 SAMN06293496 PRJNA369742 Illumina HiSeq 2500 Illumina HiSeq 2500 ILLUMINA 63720205 1416818619 SRR5227311 63720205 3249730455
SRP098789 Selective stalling of human translation through small molecule engagement of the ribosome nascent chain SRX2536428 GSM2476022: vehicle, 60 min, rep 5-mRNAseq; Homo sapiens; RNA-Seq GSM2476022: vehicle, 60 min, rep 5-mRNAseq; Homo sapiens; RNA-Seq 9606 Homo sapiens RNA-Seq TRANSCRIPTOMIC cDNA SINGLE SRS1956378 SAMN06293494 PRJNA369742 Illumina HiSeq 2500 Illumina HiSeq 2500 ILLUMINA 69422931 1545681856 SRR5227313 69422931 3540569481
The result is printed straight to the console, which is hard to read; you can save it to a file instead:
pysradb metadata SRP098789 --saveto metadata.txt
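Once the table is in a file, it can be filtered by column instead of with grep. The sketch below is a minimal illustration that uses only the standard library. It assumes the saved file is tab-separated with a header row, and that the columns are named `experiment_title`, `library_strategy`, and `run_accession` (check the header of your own file). The inline rows are abbreviated from the output above and stand in for metadata.txt.

```python
import csv
import io

# Stand-in for metadata.txt: assumed to be tab-separated with a header row.
# Only a few columns are kept here; the column names are assumptions.
SAMPLE_TSV = (
    "experiment_accession\texperiment_title\tlibrary_strategy\trun_accession\n"
    "SRX2536422\tGSM2476016: Vehicle, 60 min, rep 4-ribo-seq; Homo sapiens; RNA-Seq\tRNA-Seq\tSRR5227307\n"
    "SRX2536424\tGSM2476018: vehicle, 60 min, rep 5-Ribo-seq; Homo sapiens; RNA-Seq\tRNA-Seq\tSRR5227309\n"
    "SRX2536426\tGSM2476020: vehicle, 60 min, rep 4-mRNAseq; Homo sapiens; RNA-Seq\tRNA-Seq\tSRR5227311\n"
)


def runs_matching(handle, strategy, title_keyword):
    """Return run accessions whose library strategy equals `strategy` and
    whose experiment title contains `title_keyword` (case-insensitive)."""
    reader = csv.DictReader(handle, delimiter="\t")
    keyword = title_keyword.lower()
    return [
        row["run_accession"]
        for row in reader
        if row["library_strategy"] == strategy
        and keyword in row["experiment_title"].lower()
    ]


ribo_runs = runs_matching(io.StringIO(SAMPLE_TSV), "RNA-Seq", "ribo-seq")
print(ribo_runs)
```

For a real file, pass `open("metadata.txt", newline="")` in place of the `io.StringIO` object. The case-insensitive match matters here because the titles spell the assay both as "ribo-seq" and as "Ribo-seq".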
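Search
pysradb can search three databases (ENA, GEO, and SRA) and exposes quite a few parameters. In practice it is rarely used for searching: the usual workflow is to find the accession you want on the database website and then query it with pysradb. For that reason the search options are not covered in depth here; if you are interested, see the full parameter list: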
pysradb search -h
usage: pysradb search [-h] [-o SAVETO] [-s] [-g [GRAPHS]] [-d {ena,geo,sra}]
[-v {0,1,2,3}] [--run-description] [--detailed] [-m MAX]
[-q QUERY [QUERY ...]] [-A ACCESSION]
[-O ORGANISM [ORGANISM ...]] [-L {SINGLE,PAIRED}]
[-M MBASES] [-D PUBLICATION_DATE]
[-P PLATFORM [PLATFORM ...]]
[-E SELECTION [SELECTION ...]] [-C SOURCE [SOURCE ...]]
[-S STRATEGY [STRATEGY ...]] [-T TITLE [TITLE ...]]
[-G GEO_QUERY [GEO_QUERY ...]]
[-Y GEO_DATASET_TYPE [GEO_DATASET_TYPE ...]]
[-Z GEO_ENTRY_TYPE [GEO_ENTRY_TYPE ...]]
Download
The download subcommand downloads data, selected by SRX, SRP, or GEO accession:
--out-dir OUT_DIR Output directory root
--srx SRX [SRX ...], -x SRX [SRX ...]
Download only these SRX(s)
--srp SRP [SRP ...], -p SRP [SRP ...]
SRP ID
--geo GEO [GEO ...], -g GEO [GEO ...]
GEO ID
--skip-confirmation, -y
Skip confirmation
--use_ascp, -a Use aspera instead of wget
--col COL Specify column to download
--threads THREADS, -t THREADS
Number of threads
By default pysradb downloads data with wget; with aspera-client installed, it can use Aspera for faster transfers:
pysradb download -t 8 --use_ascp -p SRP002605
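When downloads are scripted, the same command line can be assembled programmatically. The following is a sketch and not part of pysradb: `build_download_cmd` is a hypothetical helper that uses only the flags listed in the help text above, and it requests Aspera only when an `ascp` executable (assumed here to be the Aspera client's binary name) is found on PATH.

```python
import shutil


def build_download_cmd(srp, threads=1, out_dir=None, want_ascp=True):
    """Build an argument list for `pysradb download` (hypothetical helper).

    Only flags shown in the help output above are used.
    """
    cmd = ["pysradb", "download", "-y", "-t", str(threads)]
    if out_dir is not None:
        cmd += ["--out-dir", out_dir]
    # Request Aspera only if the `ascp` binary can be found on PATH
    # (assumption: that is how the Aspera client is installed).
    if want_ascp and shutil.which("ascp") is not None:
        cmd.append("--use_ascp")
    cmd += ["-p", srp]
    return cmd


cmd = build_download_cmd("SRP002605", threads=8, want_ascp=False)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. The `-y` flag is included so that an unattended script is not blocked by the confirmation prompt.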
Conversion
The conversion subcommands take essentially the same arguments, mainly four optional ones:
pysradb gse-to-gsm -h
usage: pysradb gse-to-gsm [-h] [--saveto SAVETO] [--detailed] [--desc] [--expand] gse_ids [gse_ids ...]
positional arguments:
gse_ids
optional arguments:
-h, --help show this help message and exit
--saveto SAVETO Save output to file
--detailed Output additional columns: [sample_accession (SRS), run_accession (SRR), sample_alias (GSM), run_alias (GSM_r)]
--desc Should sample_attribute be included
--expand Should sample_attribute be expanded
For example, convert a GSE to an SRP:
pysradb gse-to-srp GSE41637
study_alias study_accession
GSE41637 SRP016501
List how the sample accessions under a given GSEXXXX dataset map to experiment accessions:
pysradb gse-to-gsm GSE41637 | head
study_alias experiment_alias experiment_accession
SRP016501 GSM1020692 SRX196316
SRP016501 GSM1020691 SRX196315
SRP016501 GSM1020690 SRX196314
SRP016501 GSM1020689 SRX196313
SRP016501 GSM1020688 SRX196312
SRP016501 GSM1020687 SRX196311
SRP016501 GSM1020686 SRX196310
SRP016501 GSM1020685 SRX196309
SRP016501 GSM1020684 SRX196308
Convert a GSM to an SRR:
pysradb gsm-to-srr GSM1020686
experiment_alias run_accession
GSM1020686 SRR594439
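The conversion output is a plain table, so it is easy to turn into a lookup. A minimal sketch, assuming the columns are whitespace-separated with a single header line, as in the gse-to-gsm listing above; `parse_table` is a hypothetical helper, and the sample text is copied from that listing.

```python
SAMPLE_OUTPUT = """\
study_alias experiment_alias experiment_accession
SRP016501 GSM1020692 SRX196316
SRP016501 GSM1020691 SRX196315
SRP016501 GSM1020686 SRX196310
"""


def parse_table(text):
    """Parse whitespace-separated output into a list of dicts keyed by header."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


rows = parse_table(SAMPLE_OUTPUT)
# Map each GSM sample accession to its SRX experiment accession.
gsm_to_srx = {row["experiment_alias"]: row["experiment_accession"] for row in rows}
print(gsm_to_srx["GSM1020686"])
```

Splitting on whitespace only works while no field contains spaces. Output produced with `--detailed` or `--desc` may include free-text columns, in which case saving with `--saveto` and parsing the file by its delimiter is the safer route.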
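Python code
The command line is convenient, but it is awkward for processing the results directly. When you are writing code to automate downloads, calling pysradb from Python is much more convenient.
For example, a single function call returns a pandas DataFrame object: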
from pysradb.sraweb import SRAweb
db = SRAweb()
df = db.sra_metadata("SRP265425", detailed=True)
df.head()
# run_accession study_accession ... ena_fastq_ftp_1 ena_fastq_ftp_2
# 0 SRR11886735 SRP265425 ... <NA> <NA>
# 1 SRR11886736 SRP265425 ... <NA> <NA>
# 2 SRR11886737 SRP265425 ... <NA> <NA>
# 3 SRR11886738 SRP265425 ... <NA> <NA>
# 4 SRR11886739 SRP265425 ... <NA> <NA>
# [5 rows x 59 columns]
Download data:
from pysradb.sraweb import SRAweb
db = SRAweb()
db.download("SRP098789", use_ascp=True, threads=8)
Or search a database:
from pysradb.search import SraSearch
instance = SraSearch(verbosity=1, query="Yap1 KO liver", return_max=5)
instance.search()
instance.get_df()
# run_accession experiment_title
# 0 SRR21643127 GSM6594777: YKTH (YAP1 KO TAZ Heterozygote) [L...
# 1 SRR21643128 GSM6594776: YKTH (YAP1 KO TAZ Heterozygote) [L...
# 2 SRR21643129 GSM6594775: WT (mixed C57BL/6-FVB background) ...
# 3 SRR21643130 GSM6594774: WT (mixed C57BL/6-FVB background) ...
# 4 SRR21643131 GSM6594773: WT (mixed C57BL/6-FVB background) ...
ID conversion (reusing the db object created above):
db.gse_to_srp('GSE67305')
# study_alias study_accession
# 1 GSE67305 SRP056576
List detailed information:
df = db.gse_to_gsm('GSE67305', detailed=True)
df.head()
# experiment_alias gds ... experiment_accession study_alias
# 2 GSM1644123 ... SRX969124 SRP056576
# 3 GSM1644122 ... SRX969123 SRP056576
# 4 GSM1644121 ... SRX969122 SRP056576
# 5 GSM1644120 ... SRX969121 SRP056576
# 6 GSM1644119 ... SRX969120 SRP056576
# [5 rows x 29 columns]
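A common follow-up is to label downloaded files with their GSM accession. The sketch below assumes the detailed table has been converted to a list of dicts (for example with `df.to_dict("records")`) and that it carries `experiment_alias` (GSM) and `run_accession` (SRR) columns, as the `--detailed` help text suggests. The `.fastq.gz` naming and the `rename_plan` helper are purely illustrative, and the single record is taken from the gsm-to-srr example earlier.

```python
def rename_plan(records, suffix=".fastq.gz"):
    """Map run-based file names to GSM-prefixed names (hypothetical helper).

    `records` is a list of dicts with `experiment_alias` (GSM) and
    `run_accession` (SRR) keys; the file-name layout is an assumption.
    """
    plan = {}
    for rec in records:
        gsm, srr = rec["experiment_alias"], rec["run_accession"]
        plan[f"{srr}{suffix}"] = f"{gsm}_{srr}{suffix}"
    return plan


records = [{"experiment_alias": "GSM1020686", "run_accession": "SRR594439"}]
plan = rename_plan(records)
print(plan)
```

Building the plan as a dict first makes it possible to review the mapping before touching any files; applying it is then a loop over `plan.items()` with `os.rename`.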
That covers the basic usage; for more details, see: https://saket-choudhary.me/pysradb/index.html