Scraping papers with selenium

Latest recommended article published on 2024-03-15 17:00:00

三岁阿瑶

Article link: https://blog.csdn.net/ayaoyaobaofu/article/details/121794212

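How this started: a senior labmate asked me to download 25 papers for her and handed me a list of URLs. Doing it by hand means opening each page and clicking, over and over. Frankly, that is a waste of time.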

Then I discovered a module, selenium, that can do the clicking in place of a person.
Hmm... pretty nice.

# selenium has to be installed first
# skip this step if it is already installed
# pip install selenium

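import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Text file with one article URL per line
url_file = "E:/url.txt"

# Browser-like User-Agent sent with the requests call that fetches the PDF
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
}

with open(url_file, "r") as f:
    for url in f:
        # Strip the trailing newline before the URL is used anywhere
        url = url.strip()
        if not url:
            continue
        # Raw string, so the backslashes in the Windows path are not read as escapes
        driver = webdriver.Chrome(service=Service(r"C:\Program Files\Google\Chrome\Application\chromedriver"))
        try:
            # Unused earlier attempt to read the article title with BeautifulSoup:
            # ret = Request(url, headers=headers)
            # html = urlopen(ret)
            # bs = BeautifulSoup(html, "html.parser")
            # div = bs.find("h1", {"class": "item-meta-data__item-title"})
            # title = div.get_text()
            driver.get(url)  # load the article page
            # Selenium 4 style lookup (find_element_by_css_selector was removed)
            element = driver.find_element(By.CSS_SELECTOR, ".download-pdf")
            element.click()
            # Switch to the most recently opened window and read the PDF address from it
            window = driver.window_handles
            driver.switch_to.window(window[-1])
            pdf_url = driver.current_url
        finally:
            # Close the browser so a new Chrome instance is not left open per URL
            driver.quit()
        r = requests.get(pdf_url, headers=headers)
        print(url, r)
        # Name the file after the last two segments of the article URL
        file = "E:/{}.pdf".format(url.split("/")[-2] + "-" + url.split("/")[-1])
        with open(file, "wb") as code:
            code.write(r.content)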
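
The script above builds the file name by splitting the URL on "/" and writes whatever the server returns. Below is a minimal sketch of two helpers that could make those steps a little more robust. Both `pdf_filename` and `looks_like_pdf` are hypothetical names that are not part of the original script, and the example.org URL is made up. Note that `pdf_filename` uses only the URL path, so for a URL with a single path segment it will not include the host name the way the original split does.

```python
from urllib.parse import urlparse


def pdf_filename(url: str) -> str:
    """Build '<second-to-last>-<last>.pdf' from the URL path (hypothetical helper)."""
    # Use only the path, so a query string or fragment never ends up in the name
    segments = [s for s in urlparse(url.strip()).path.split("/") if s]
    if not segments:
        raise ValueError("URL has no path segments: {!r}".format(url))
    # Falls back to a single segment when the path is only one level deep
    return "-".join(segments[-2:]) + ".pdf"


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF signature '%PDF-'."""
    return content.startswith(b"%PDF-")


if __name__ == "__main__":
    # Made-up URL, with the trailing newline a line read from url.txt would have
    print(pdf_filename("https://example.org/articles/10.1000/abc123\n"))
    print(looks_like_pdf(b"%PDF-1.7 ..."))
    print(looks_like_pdf(b"<html>login required</html>"))
```

In the loop, `looks_like_pdf(r.content)` could be checked before opening the output file, so that an HTML login or error page is not saved with a .pdf extension.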