How to Download Academic Articles with Python

In this lesson, we will learn about the basics of regular expressions, HTML, and the requests module. We will build a simple program that accepts a DOI (Digital Object Identifier) as input, searches Sci-Hub, a library of over 88 million freely accessible research articles and books, and downloads the requested document.

The program

First, we need to import two modules: requests, which we will use to “read” a website and get its HTML source, and re (the regular expressions module), which we will use to find the PDF download link hidden inside the markup.

import requests
import re

Step 1. The Sci-Hub interface

If we go to https://sci-hub.se/, we are greeted with the search interface of Sci-Hub:

Image 1: Sci-Hub home page

Here we have an input field where we can enter our DOI. If the reference is in the database, clicking the “open” button will take us to the article page where we can “save” (download) the document:

Image 2: Download page

On this page, if we look at the address bar (#1), we can see that the URL changed from https://sci-hub.se to https://sci-hub.se/10.1386/ejpc.6.1.91_3. From this we know that, to get to the download page, we need two basic URL components: the base URL https://sci-hub.se and the DOI 10.1386/ejpc.6.1.91_3, separated by a forward slash /.

Structure of the download page URL:
https://sci-hub.se + / + 10.1386/ejpc.6.1.91_3
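
In Python, we can join the two components with an f-string. As a quick sketch, using the example DOI from above:

doi = "10.1386/ejpc.6.1.91_3"
source_url = f"https://sci-hub.se/{doi}"
# source_url is now "https://sci-hub.se/10.1386/ejpc.6.1.91_3"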

We will save this base URL as a variable in our program for later use. Because such settings typically don’t change during program execution, we can declare them as constants, which by Python convention are written in upper case.

BASE_URL = "https://sci-hub.st"

Step 2. Getting the page source

Because we are not working in a browser environment, the only way Python can interact with a webpage is by parsing the source it receives as text. To extract the download link, we first need to download the raw markup of the page. We define the function get_page_source that takes one parameter, url, the URL of the article download page (Image 2):

def get_page_source(url):
    response = requests.get(url)

    print("Status code:", response.status_code)
    if response.status_code != 200:
        return None

    return response.text

We use the get method of the requests module to fetch the URL. This gives us a Response object that we can query for all the information about the HTTP request we made; for example, we can check whether the request succeeded. This will help us better control the flow of our program. If response.status_code is not 200, either the page was not found or there was a server error, so we return None. If the status code is 200 (OK), we return the page source, which we can access through the text property of the response object.
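
For example, assuming the article from Image 2 is still available on Sci-Hub, we could fetch its page source like this:

html = get_page_source(f"{BASE_URL}/10.1386/ejpc.6.1.91_3")
if html is not None:
    print(html[:200])  # peek at the first 200 characters of the markup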

Step 3. Locating the PDF download URL

Now we can scan through the HTML of the download page and extract the PDF link. In a browser, if we right-click the “↓ save” button and choose Inspect (Image 3), the HTML viewer pops up on the right-hand side of the window (Image 4).
Image 3: Right-click menu

Image 4: HTML viewer

In HTML, every visual element on a page is defined with a <tag>. Here we have the definition of our “↓ save” button. onclick is a special event that fires when the button is clicked. In this case, it executes a line of JavaScript code that replaces the current URL in the address bar with the one written in single quotes: //moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true. By doing so, it makes the browser download the file. This is the PDF download link that we want to extract.

Step 4. Extracting the download link with RegEx

Based on what we know about the save button on Sci-Hub, the download link is hidden inside a piece of JavaScript code, location.href='', inside the onclick event of the button tag. Because there is only one such button on the page, this gives us a unique pattern that we can use to identify the link and extract the information we are looking for.

<button onclick="location.href='//moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true'">↓ save</button>

Moreover, we can simplify this nested structure and only look at the inner part of the pattern: location.href='//moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true' which makes it more readable. The pattern remains unique, as this is the only button on the page that executes this particular piece of JavaScript code.

There are multiple ways to parse a webpage in Python, and regular expressions (RegEx) are one of them. Although a bit complex at first, RegEx is a very powerful text search and replacement tool. A RegEx pattern in Python is usually written as a raw string: a string surrounded by quotes and prefixed with the letter r, as in r"".

First, we place our whole pattern inside such an expression: r"location.href='//moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true'". Then we remove the part that we want to extract, in this case the download link, leaving r"location.href=''", and place a so-called capture group in its place, a pair of parentheses: r"location.href='()'", to signify that we want to extract this inner part of the pattern.

As it stands, though, we are not matching anything yet; we still need to specify which characters we are looking for. URLs can contain upper- or lower-case letters A-Z and a-z, digits 0-9, and symbols like ? or %. Instead of listing them all, we can use the special character ., which in RegEx matches any single character (except a newline). One . matches exactly one character, so we also need to say how many to look for; the special character * means “the preceding element, repeated 0 or more times”. Our pattern now looks like this: r"location.href='(.*)'", which says: capture any character appearing 0 or more times inside location.href=''.

There is one problem remaining with this pattern. Because . is a special character in RegEx, the dot in the first part of the expression, location.href, would also match any character. We need to “escape” it, that is, prefix it with a backslash \, so that it matches a literal dot rather than acting as a wildcard: location\.href.

We then save our final pattern to a variable:

SCIHUB_REGEX = r"location\.href='(.*)'"
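
As a quick check, we can run the pattern against the button markup from Image 4; re.findall (explained below) returns the captured group:

sample = """<button onclick="location.href='//moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true'">↓ save</button>"""
print(re.findall(SCIHUB_REGEX, sample))
# ['//moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true']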

Now that we know how to extract the download link from HTML with RegEx, we can write a function to actually do that. We use the findall function from the re module and supply it with the pattern we prepared earlier, together with the page source. findall returns a list of strings with all the matches of the pattern it could find. As mentioned earlier, our pattern is unique, so we know it will return a list with at most one match in it.

If findall doesn’t find a match, it returns an empty list. This can happen when the requested document is not found: as you can see in Image 5, if we supply a wrong DOI, we get an error message and the save button is missing from the page, so the pattern no longer matches anything.

Image 5: Document not found

def extract_download_link(html):
    urls: list = re.findall(SCIHUB_REGEX, html)

    if urls:
        return urls[0]
    else:
        return None

If the list of urls is not empty, the if statement evaluates to True and we can safely return the first and only item in the list, the download link; otherwise we return None to signal that the document was not found.

Next, we need to look closely at the returned URL. Sci-Hub serves two kinds of download links, protocol-relative and site-relative, which can be told apart by the number of slashes at the start of the URL:

  • // - protocol-relative: //moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true
  • / - site-relative: /downloads/2019-01-13//9b/bartke2018.pdf?download=true

A browser resolves a protocol-relative URL by reusing the scheme (https:) of the current page, but requests refuses any URL without an explicit scheme, so we have to prepend https: ourselves. A site-relative URL is missing even more, the web address, so there we prepend the BASE_URL instead:

if link.startswith("//"):
    link = "https:" + link
elif link.startswith("/"):
    link = BASE_URL + link
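
A quick check with the two sample links from above shows that both now resolve to fully qualified URLs:

for link in ["//moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true",
             "/downloads/2019-01-13//9b/bartke2018.pdf?download=true"]:
    if link.startswith("//"):
        link = "https:" + link
    elif link.startswith("/"):
        link = BASE_URL + link
    print(link)
# https://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true
# https://sci-hub.se/downloads/2019-01-13//9b/bartke2018.pdf?download=true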

The next step is to actually download the document.

Step 5. Downloading the document

Here, we define a generic function called download_file that accepts a URL and a file name as parameters. Similar to how we downloaded the page source, we use the get method of the requests module. The difference is that while HTML is a string of text, a file is binary data: instead of reading response.text, we find the content of the file in response.content. We open a file with the supplied name in wb mode (“write binary”) and write the content of the response object to it.

def download_file(url, name):
    response = requests.get(url)

    with open(name, "wb") as f:
        f.write(response.content)
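
For example, assuming the sample link is still live, downloading it to roos2015.pdf would look like this (in the next step we derive the file name from the URL instead of hard-coding it):

download_file(
    "https://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true",
    "roos2015.pdf",
)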

For this to work, we need to provide a valid file name to the download function. If we look at a sample URL, //moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true, we can see that the file name we need, roos2015.pdf, is contained within the URL itself. To extract it from the middle of the string, we use the rsplit method, which splits a string by a delimiter starting from the right side. First, we split the URL by ?, which gives us the list ["http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf", "download=true"], of which we keep only the first part. Then we split that part again, this time by a forward slash /, and keep the last piece, which becomes our file name.

def extract_name_from_url(url):
    # Example URL: http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true

    # ["http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf", "download=true"]
    url = url.rsplit("?", 1)[0]

    # ["http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58", "roos2015.pdf"]
    name = url.rsplit("/", 1)[1]

    return name
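
A quick test with the sample URL confirms the result:

print(extract_name_from_url("http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true"))
# roos2015.pdf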

Here is the full program:

import requests
import re

BASE_URL = "https://sci-hub.st"
SCIHUB_REGEX = r"location\.href='(.*)'"


def get_page_source(url):
    response = requests.get(url)

    print("Status code:", response.status_code)
    if response.status_code != 200:
        return None

    return response.text


def extract_download_link(html):
    urls: list = re.findall(SCIHUB_REGEX, html)

    if urls:
        return urls[0]
    else:
        return None


def download_file(url, name):
    response = requests.get(url)

    with open(name, "wb") as f:
        f.write(response.content)


def extract_name_from_url(url):
    # Example URL: http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf?download=true

    # ["http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58/roos2015.pdf", "download=true"]
    url = url.rsplit("?", 1)[0]

    # ["http://moscow.sci-hub.se/4757/972ad1d618f019fd076db3139ff82a58", "roos2015.pdf"]
    name = url.rsplit("/", 1)[1]

    return name


def main():
    errors = []

    # Ask user for DOI
    doi = input("Please input a DOI: ")

    # Prepare source URL
    # E.g.: https://sci-hub.se + / + 10.5534/wjmh.180018
    source_url = f"{BASE_URL}/{doi}"  # equivalent to BASE_URL + "/" + doi

    # Get page source/html to extract download link from
    html = get_page_source(source_url)
    if html is None:
        errors.append("Could not fetch source URL")
    else:
        link = extract_download_link(html)

        if link is None:
            errors.append("A document with this DOI is not found on Sci-Hub")
        else:
            # Normalize protocol-relative (//) and site-relative (/) links
            if link.startswith("//"):
                link = "https:" + link
            elif link.startswith("/"):
                link = BASE_URL + link

            name = extract_name_from_url(link)
            download_file(link, name)

    if errors:
        print(errors)
    else:
        print("Done!")


main()

Finally, we can run the program and test it with some DOIs: 10.5534/wjmh.180018, 10.1386/ejpc.6.1.91_3

It is a good idea to enclose the entry point of a program in a main function. This function, as the name suggests, is where all the action happens: it is where we call all the other functions and handle the errors.
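
If the script might later be imported as a module, a common refinement (not used in the listing above) is to guard the call to main so it only runs when the file is executed directly:

if __name__ == "__main__":
    main()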
