pyppeteer error: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

蛋黄的噗噗

Published 2022-09-20 22:05:59

Original link: https://pythontechworld.com/issue/psf/requests-html/251

Original issue page: "Exception: Execution context was destroyed, most likely because of a navigation." (PythonTechWorld)

I got something working for a specific case of webpage redirect.

At the time of writing, my software and package versions are:

Python==3.7.3
requests-html==0.10.0
pyppeteer==0.0.25

# for ipython notebook asyncio issues
tornado==4.5.3

Here's an excerpt of the sample target page content, which redirects using both JavaScript and a meta tag:

<script>url="http://example.com/somewhereelse";window.location.assign(url)</script>
<noscript><meta http-equiv="refresh" content="0; url=http://example.com/somewhereelse"></noscript>

The code I ran which errored was:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://mysite.com")
r.html.render()

The above code results in:

NetworkError: Execution context was destroyed, most likely because of a navigation.

If we look carefully at the documentation:

>>> help(r.html.render)

Help on method render in module requests_html:

render(retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False) method of requests_html.HTML instance
    Reloads the response in Chromium, and replaces HTML content
    with an updated version, with JavaScript executed.
    
    :param retries: The number of times to retry loading the page in Chromium.
    :param script: JavaScript to execute upon page load (optional).
    :param wait: The number of seconds to wait before loading the page, preventing timeouts (optional).
    :param scrolldown: Integer, if provided, of how many times to page down.
    :param sleep: Integer, if provided, of how many long to sleep after initial render.
    :param reload: If ``False``, content will not be loaded from the browser, but will be provided from memory.
    :param keep_page: If ``True`` will allow you to interact with the browser page through ``r.html.page``.
    
    If ``scrolldown`` is specified, the page will scrolldown the specified
    number of times, after sleeping the specified amount of time
    (e.g. ``scrolldown=10, sleep=1``).
    
    If just ``sleep`` is provided, the rendering will wait *n* seconds, before
    returning.

The key parameter is "sleep".

A few points to note:

The target page sample above has a meta refresh with content="0;...", which means the page redirects after a 0 second wait.
The JavaScript code has no wait, sleep, or delay either.
With current hardware and internet access speeds, I don't expect headless Chromium to take longer than 1 second to redirect and load the target page (unless the page is large or there are several more redirects).

Therefore, a 1 second wait is a reasonable time to set before render() returns.
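
As a small aside, the delay in a meta refresh tag can be read programmatically when choosing a sleep value. The sketch below is only an illustration: parse_meta_refresh is a hypothetical helper name (not part of requests-html or pyppeteer), and it assumes the simple "seconds; url=..." form shown in the sample page.

```python
def parse_meta_refresh(content):
    """Split a meta refresh content value into (delay_seconds, url_or_None).

    Hypothetical helper; handles only the simple "N; url=..." form.
    """
    delay_part, _, rest = content.partition(";")
    delay = float(delay_part.strip())
    rest = rest.strip()
    url = None
    if rest.lower().startswith("url="):
        # Drop the "url=" prefix and any surrounding quotes.
        url = rest[4:].strip().strip("'\"")
    return delay, url


delay, url = parse_meta_refresh("0; url=http://example.com/somewhereelse")
print(delay, url)
```

A delay of 0 here only says the redirect starts immediately; the time the browser needs to load the target page still has to be covered by sleep.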

In addition, we have to use keep_page to extract crucial information, as shown later.

Changing the call to render() to:

r.html.render(sleep=1, keep_page=True)

allowed the code to run without issues. If it still errors (due to slow network speed, a busy CPU, etc.), try again with a higher sleep value.
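
The "try again with a higher sleep" step could be sketched roughly as follows. This is a minimal sketch, not something requests-html provides: render_with_backoff, its sleeps schedule, and the errors parameter are all hypothetical names. I have also not verified that render() can safely be called again after a failure with the versions listed above, so treat that as an assumption.

```python
def render_with_backoff(render, sleeps=(1, 2, 4), errors=(Exception,)):
    """Call render(sleep=s, keep_page=True) with growing sleeps until one works.

    `render` is any callable accepting `sleep` and `keep_page` keyword
    arguments, for example r.html.render. Returns the sleep value that worked.
    Hypothetical helper, not part of requests-html.
    """
    last_error = None
    for s in sleeps:
        try:
            render(sleep=s, keep_page=True)
            return s
        except errors as exc:
            # Remember the failure and try again with a longer sleep.
            last_error = exc
    raise RuntimeError(f"render failed for sleeps {tuple(sleeps)}") from last_error


# Intended use with the session from above (assumption, needs Chromium):
#     used_sleep = render_with_backoff(r.html.render)
#     print(used_sleep, r.html.page.url)
```

Passing a narrower exception type through errors (for instance pyppeteer's NetworkError, if that is what your version raises) avoids retrying on unrelated failures.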

To find out the redirected page's URL:

>>> r.html.page.url

http://example.com/somewhereelse

This issue is about page redirects raising errors. With that in mind: although the above solution works, it is clunky to implement a try-except loop that retries with an increasing sleep time.

I'm still trying to find an equivalent of the "window.onload" event, so that the wait becomes automatic or dynamic: the headless browser would "ping back" when it is done, instead of Python incrementally polling to check whether the redirect has completed and the target URL has been reached.

I'm all ears for better methods if anyone comes up with any.