Rcurl+Python selenium爬取NCBI启动子序列进行FIMO meme实战

啥都会一点的DJX

已于 2024-05-12 01:02:49 修改

阅读量535

点赞数 27

分类专栏：生信分析文章标签： python selenium 开发语言 r语言

于 2024-05-11 16:36:07 首次发布

本文链接：https://blog.csdn.net/DJXtxdy/article/details/138706767

版权

生信分析专栏收录该内容

6 篇文章 13 订阅

订阅专栏

#问题来源
博主近期阅读一篇文献，《Transcriptomics of Hirschsprung disease patient-derived enteric neural crest cells reveals a role for oxidative phosphorylation》，作者使用FIMO meme分析了关键转录因子HDAC的motif（可视为是该转录因子的结合位点的序列特征）可能的靶基因启动子结合位点，以确定后续的研究分子。图片描述：Motifenrichment of HDAC1 at the promoter region (−2 to 1kb relative to TSS) of its target genes.
HDAC1DNA-binding motif is shown in the left panel. TSS transcription start site. P
values (Pval) were calculated by meme FIMO software.
FIMO meme分析motif下游结合位点：
FIMO meme的使用说明参照https://www.jianshu.com/p/37eadb126bbf，需要准备1.转录因子motif，可自jaspar数据库下载（https://jaspar.elixir.no/）或自行下载序列分析，该数据库可下载ChIP-seq中相应蛋白质结合DNA序列的特征；
下载SOX10的motif，meme格式
2.靶基因的启动子序列fastq（转录因子起始位点（TSS）上游2kb-下游1kb），自NCBI网站下载，注意需要手动改序列范围
选择fastq
复制序列

但是，当做大通量筛选的时候需要一个一个复制序列文件，非常麻烦！于是博主查阅资料，发现大多数是下载全序列的，并不能指定序列的区域。于是博主写了一个自动下载的代码，主要使用Rcurl爬取静态网页和Python selenium爬取动态网页批量下载根据TSS指定范围的序列，整合后进行大批量FIMO分析。参考blog《R语言批量爬取NCBI基因注释数据》（https://www.jianshu.com/p/4236a40adeb9）、《selenium入门超详细教程——网页自动化操作》（https://blog.csdn.net/kobepaul123/article/details/128796839）。

#R环境准备
##加载包

library(RCurl)
library(stringr)
library(XML)
library(clusterProfiler)
library(httr)
library(reticulate)

##参数设置

#parmameter
#genes: a character contains gene SYMBOL
genes <- c("PHOX2B","RET","PHOX2A","NOTCH1","S100B") 
#start_: sequence start position compare to TSS
start_ <- -2000
#end_: sequence end position compare to TSS
end_ <- 1000

##Rcurl配置
下载ssl证书并指定存放位置，下载地址https://curl.se/ca/cacert.pem，设施访问header（从浏览器网页，按F12进入开发者模式获取）

#set SSL certificate position
options(RCurlOptions = list(cainfo = "D:/R/cacert.pem",
                            ssl.verifypeer=FALSE))
myheader <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0"

##一些函数，引用自前述的blog

# get node text accroding to xpath：
getNodesTxt <- function(html_txt1,xpath_p){
  els1 = getNodeSet(html_txt1, xpath_p)
  # remove " "：
  els1_txt <- sapply(els1,xmlValue)[!(sapply(els1,xmlValue)=="")]
  # remove "\n"：
  str_replace_all(els1_txt,"(\\n )+","")
}

# The format of the processing node is character and the value of length 0 is NA：
dealNodeTxt <- function(NodeTxt){
  ifelse(is.character(NodeTxt)==T && length(NodeTxt)!=0 , NodeTxt , NA)
}

##基因转换为Entrze ID，并生成NCBI网址

# change gene symbol to entrze ID:
genes <- bitr(genes, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
# get NCBI url
genes$NCBI_url <- paste("https://www.ncbi.nlm.nih.gov/gene/",genes$ENTREZID,sep="")
head(genes)

##使用循环下载数据，分为几个步骤
###1.使用Rcurl包进入NCBI对应基因的网站，获取Summary信息，并使用xpath定位并获取fasta的网址
定位fasta网址
###2.编辑fasta序列网址，以https://www.ncbi.nlm.nih.gov/nuccore/NC_000007.14?report=fasta&from=97732086&to=97740472为例，该网址from参数是转录起始位点（TSS），我们需要转录起始位点前2kb-后1kb的序列，因此获取from参数与start_相加得到新的起点，与end_相加得到新的终点，随后再拼接为新的网址

###3.随后进入新的网址，这里博主最初用Rcurl爬取后发现并无序列内容，查阅资料发现NCBI的序列是动态加载的，因此需要使用处理动态网页的工具进行爬取。博主最初打算在R中跑完，但发现Rwebdrive包和Rselenium包不是很好用，配置麻烦而且对新版本的selenium兼容性不是很好，因此采用了一小段python代码进行爬取，注意检查计算机的python环境。
####cmd中运行where python查看python位置，然后[python.exe路径] -m pip install selenium
python中安装selenium
####4.根据浏览器版本下载dirver，这里用的是edge浏览器 https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/?form=MA13LH
下载edge driver
####5.python脚本，构建爬取函数（缺少模块的话pip安装一下），大致流程是根据提供的url开启浏览器模拟动态访问，等待5秒加载之后获取网页内容，在定位序列位置

#python script
import time
import lxml.html as lh
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def getseq(url):
    options = webdriver.EdgeOptions()
    edge_path="D:/R/msedgedriver.exe" #edge driver path
    service = Service(executable_path=edge_path)
    options.add_argument('--headless') #not open browser window
    browser = webdriver.Edge(options=options,service=service)
    browser.get(url)
    time.sleep(5) #wait for page load
    xpath = '//*[@id="viewercontent1"]/pre/text()'
    html_source = browser.page_source
    html = (lh.fromstring(html_source))
    items = html.xpath(xpath)
    browser.close()
    return items

####6.在R中加载python脚本，这样就有构建的函数了，并且可以直接在R中调用python

source_python("D:/Python/pythonProject/NCBI-DNA-sequence/Main_script.py")

####7.在R中调用getseq函数获取序列及基因表达位置信息，整合到genes中

循环完整代码如下：

for(i in 1:nrow(genes)){
  # get gene web:
  doc <- getURL(genes[i,"NCBI_url"],header=myheader)
  html_txt1 = htmlParse(doc, asText = TRUE)
  
  # get summary：
  genes[i,"Summary"] <- dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[preceding-sibling::dt[text()="Summary" and position()=1 ] ]'))
  cat("write Summary\n")
  
  # set sequence web
  url_seq <- unlist(getNodeSet(html_txt1,"//a[@title=\"Nucleotide FASTA report\"]/@href"))
  url_seq <- paste0("https://www.ncbi.nlm.nih.gov",url_seq)
  url_split <- strsplit(url_seq,"=")
  TSS_point <- url_split[[1]][3]
  TSS_point <- gsub("&to","",TSS_point)
  if (exists("start_")){
    start_point <- as.numeric(TSS_point)+start_
    genes[i,"Start"] <- start_point
    url_split[[1]][3] <- paste0(start_point,"&to")
    url_seq <- paste(url_split[[1]],collapse = "=")
  }else{
    genes[i,"Start"] <- TSS_point
  }
  if (exists("end_")){
    end_point <- as.numeric(TSS_point)+end_
    genes[i,"End"] <- start_point
    url_split[[1]][4] <- end_point
    url_seq <- paste(url_split[[1]],collapse = "=")
  }else{
    genes[i,"End"] <- url_split[[1]][4]
  }
  print(url_seq)
  # get sequence web
  doc <- getURL(url_seq,header=myheader)
  html_txt2 = htmlParse(doc, asText = TRUE)
  gene <- getNodeSet(html_txt2,"//*[@id=\"viewercontent1\"]/pre/text()")
  
  #get sequence
  seq <- getseq(url_seq)
  seq_split <-  strsplit(seq," ")
  genes[i,"Ano"] <- gsub(">","",seq_split[[1]][1])
  genes[i,"chromosome"] <- gsub(",","",paste(seq_split[[1]][2:5],collapse = " "))
  genes[i,"Ref"] <- seq_split[[1]][6]
  genes[i,"Seq"] <- gsub("Assembly","",seq_split[[1]][8])
  cat("write sequence\t")
  print(paste("complete ",i,"genes!"))
}

###生成序列整合文本

#intergrate seq file
seq_inter <- paste(paste0(paste0(">",genes$SYMBOL),genes$Seq),collapse = "")
write.table(seq_inter,"genes_seq_intergrate.txt",col.names = F,row.names = F,quote = F)

序列文本
这样就可以直接上传到FIMO中进行scan了！

#小结
这篇教程中综合了Rcurl和Python selenium的方法，对动态网页的数据进行爬取和整合，可以任意指定基因序列进行下载。然而爬取的速度较慢，仍有优化的部分，如对每个url都调用了getseq方法，导致每个基因都会打开并关闭一次浏览器，增加了任务时间等。

啥都会一点的DJX

关注

27
点赞
踩
25

收藏

觉得还不错? 一键收藏
打赏
1
评论
Rcurl+Python selenium爬取NCBI启动子序列进行FIMO meme实战

博主查阅资料，发现大多数是下载全序列的，并不能指定序列的区域。于是博主写了一个自动下载的代码，主要使用Rcurl爬取静态网页和Python selenium爬取动态网页批量下载根据TSS指定范围的序列，整合后进行大批量FIMO分析。
复制链接

扫一扫