- What I need to do:
Submit sequences on this page (http://bioinf.cs.ucl.ac.uk/psipred/),
for example: MNYKELEKMLDVIFENSEIKEIDLFFDPEVEISKQEFEDLVKNADPLQKVVGDNYITETFEWWEFENQYLEFELDYYVKDEKIFVLEMHFWRKIR
- After submission, the site redirects to a results page (http://bioinf.cs.ucl.ac.uk/psipred/&uuid=68a83dbc-b8dc-11ea-b7bb-00163e100d53), where I need to click the "Get Svg" button to obtain the SVG image. But with 2,500 sequences, clicking through them one by one would take far too long, and the site loads very slowly.
- Here was my initial approach: treat it as an ordinary crawler, download the whole page, extract the contents of the svg tag with a regex, and save that to a file. But the SVG is rendered by JavaScript, so the HTML I fetched contained no SVG at all. That route failed for me; if anyone has made it work, please share.
- Next I wondered whether there was a hidden download link. After digging through the F12 developer tools and discussing with classmates, the conclusion was that the SVG is rendered client-side by the page's own JavaScript (which is also one reason it is so slow). So I needed a different angle: how do I make the browser click and download automatically? That led me to Selenium, which drives the browser to replay my manual clicks and download the images.
See the relevant Selenium documentation. The problems I ran into were: 1. how to locate the element to click; 2. how to rename the file after it downloads; 3. the page loads very slowly, so sleeps are needed; 4. splitting one FASTA file into one file per sequence as input; 5. iterating over the files in a folder and getting their file names.
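Splitting the multi-sequence FASTA file into single-sequence input files (problem 4 above) can be sketched as follows. This is my own minimal version, assuming each record starts with a `>` header line and that the first word of the header is the sequence ID:

```python
import os

def split_fasta(fasta_path, out_dir):
    """Split a multi-record FASTA file into one file per sequence.

    Each output file is named after the record's ID (the first word of the
    '>' header line) and contains only the bare sequence, which is the
    input format the submission script below expects.
    """
    os.makedirs(out_dir, exist_ok=True)
    seq_id, seq_lines = None, []

    def flush():
        # Write out the record collected so far, if any.
        if seq_id is not None:
            with open(os.path.join(out_dir, seq_id + ".txt"), "w") as out:
                out.write("".join(seq_lines))

    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                flush()
                seq_id, seq_lines = line[1:].split()[0], []
            elif line:
                seq_lines.append(line)
    flush()
```

The output file names double as the sequence IDs that the download loop later strips from the file names.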
```python
from selenium import webdriver
import time
import requests
import json
import os

def get_UUID(file):
    '''
    Build the submission payload for the REST endpoint and return the job UUID.
    :param file: sequence file containing only the sequence, no ID line,
                 e.g. MIDEKAQEKVINLSQKYVNSRKILAKVRRNRLERKIIKELAEFYGIRKDNIELFNNEIEFEFKRQYEIEDLKQEINVIVNLKIRKENNKLKLLEVKLEKSITSNS
    :return: the submission's UUID
    '''
    url = 'http://bioinf.cs.ucl.ac.uk/psipred/api/submission.json'
    # The input file holds a single bare sequence with no header line.
    payload = {'input_data': (file, open(file, 'rb'))}
    data = {'job': 'psipred',
            'submission_name': 'xu',
            'email': '2535690564@qq.com'}
    r = requests.post(url, data=data, files=payload)
    print(r.text)
    UUID = json.loads(r.text)["UUID"]
    return UUID

# Build the results-page URL from the UUID
def get_url(UUID):
    # UUID = "68a83dbc-b8dc-11ea-b7bb-00163e100d53"
    url_new = "http://bioinf.cs.ucl.ac.uk/psipred/&uuid=" + UUID
    print("url_new = " + url_new)
    return url_new

# # Optionally set the browser's download directory
# def set_chrome_pref():
#     chromeOptions = webdriver.ChromeOptions()
#     prefs = {"download.default_directory": "E:\\test0628\\"}
#     chromeOptions.add_experimental_option("prefs", prefs)
#     return chromeOptions

# Click the download buttons to fetch the SVGs from the page, then rename them
def get_svg(url_new, ID_name):
    # driver = webdriver.Chrome(chrome_options=set_chrome_pref())
    driver = webdriver.Chrome()
    driver.get(url_new)   # open the results page
    time.sleep(30)        # the page renders slowly; wait for the JS to finish
    svg_plot = driver.find_element_by_xpath('//*[@id="annotationGridsvgText"]')
    svg_cartoon = driver.find_element_by_xpath('//*[@id="psipredChartsvgText"]')
    svg_collapse = driver.find_element_by_xpath('//*[@id="psipred_cartoon"]/div[1]/div/button')
    svg_plot.click()
    time.sleep(5)         # give the download time to finish
    origin_file = "C:\\Users\\25356\\Downloads\\annotationGrid.svg"
    now_file = "C:\\Users\\25356\\Downloads\\PSIPRED_Plot_" + ID_name + ".svg"
    os.rename(origin_file, now_file)   # rename the downloaded file
    time.sleep(10)
    svg_collapse.click()  # expand the cartoon panel
    time.sleep(5)
    svg_cartoon.click()
    time.sleep(5)
    origin_file = "C:\\Users\\25356\\Downloads\\psipredChart.svg"
    now_file = "C:\\Users\\25356\\Downloads\\PSIPRED_Cartoon_" + ID_name + ".svg"
    os.rename(origin_file, now_file)   # rename the downloaded file
    time.sleep(5)
    driver.close()        # shut down the webdriver

file_dir = './test'   # folder of single-sequence input files
for files in os.walk(file_dir):
    print(files[2])   # list every file name in the folder
    for filename in files[2]:
        print("*****" + filename)
        input_file = os.path.join(file_dir, filename)
        UUID = get_UUID(input_file)
        url = get_url(UUID)
        ID_name = os.path.splitext(filename)[0]   # strip the .txt suffix
        print("sequence id: " + ID_name)
        get_svg(url, ID_name)
        print(filename + " done")
```
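The fixed `time.sleep` calls above are fragile: if a download takes longer than the sleep, `os.rename` raises `FileNotFoundError`. A more robust pattern (my own sketch, not part of the original script) is to poll for the downloaded file before renaming it:

```python
import os
import time

def wait_and_rename(src, dst, timeout=30.0, interval=0.2):
    """Wait until `src` exists (i.e. the browser finished the download),
    then rename it to `dst`. Returns True on success, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Chrome writes in-progress downloads as `src + ".crdownload"`,
        # so waiting for the final name to appear also waits for completion.
        if os.path.exists(src) and not os.path.exists(src + ".crdownload"):
            os.rename(src, dst)
            return True
        time.sleep(interval)
    return False
```

In `get_svg`, `wait_and_rename(origin_file, now_file)` would then replace each `time.sleep(...)` / `os.rename(...)` pair.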
- Thanks to the PSIPRED authors for their reply:
First of all, we do not store the results of server submissions. UUIDs for the results pages are only generated when a user submits a sequence to the server, and the UUID is a function of submission time (and some other items). They are just there to uniquely identify a given user's submission. Additionally, we delete results after 10 days, so they are not persistent.
With regard to your specific problem, there are two ways to approach this. Often the easiest for people is to download the PSIPRED software: https://github.com/psipred/psipred/. You are then free to run as many jobs as you would like. This can be done in parallel, so if your institute has HPC (or cloud) capabilities you can scale the runs as needed.
Alternatively, you can use the server's REST API to send jobs to the web server programmatically. You will need to be familiar with REST and web client software; we provide a brief tutorial of the PSIPRED REST API at http://bioinf.cs.ucl.ac.uk/web_servers/web_services/ Note that the REST API, like the web pages, will only let you run 10 concurrent jobs at a time. But programmatically you can likely get 2500 jobs completed inside of a week.
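For the REST route, the client-side logic is essentially a submit-then-poll loop. Below is a sketch of the polling half only; the `"state"` key and `"Complete"` value are my assumptions about the response format, so check them against the REST tutorial before relying on this. The status-fetching call is injected as a callable (in a real script it would wrap `requests.get(...).json()`):

```python
import time

def wait_for_job(fetch_status, poll_interval=0.0, max_polls=120):
    """Poll a job-status callable until it reports completion.

    fetch_status() returns one submission's status dict; the "state" key
    and "Complete" value are assumptions about the PSIPRED response
    format, not confirmed field names.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("state") == "Complete":
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not complete within the polling budget")
```

With a real interval of tens of seconds, `max_polls=120` gives the server ample time while still failing cleanly on stuck jobs.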
- The next step is to use a thread pool to speed things up, and to execute more of the page interaction via JavaScript from within Selenium.
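The thread-pool idea can be sketched with `concurrent.futures`; `max_workers=10` matches the 10-concurrent-job limit the server authors mention. `process_one` is a hypothetical stand-in for the submit-and-download pipeline above (note that with Selenium, each worker would need its own webdriver instance):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(filenames, process_one, max_workers=10):
    """Run process_one(filename) for every input file, at most
    `max_workers` at a time (the server allows 10 concurrent jobs).
    Returns a dict mapping filename -> result (or the raised exception)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, f): f for f in filenames}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                # Record the failure and keep processing the other jobs.
                results[name] = exc
    return results
```

Capturing exceptions per file means one failed submission out of 2,500 does not abort the whole run.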