Requests-HTML: HTML Parsing for Humans™

Installation

pip install requests-html

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

When using this library you automatically get:

  • Full JavaScript support!
  • CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection-pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async Support (see the sketch after this list).
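
As a quick taste of the async side, the library also provides an AsyncHTMLSession for use with coroutines. A minimal sketch, assuming the AsyncHTMLSession API described in the project documentation (output omitted):

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()

>>> async def get_python():
...     # fetch inside a coroutine and return the response
...     return await asession.get('https://python.org/')

>>> asession.run(get_python)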

Tutorial & Usage

Make a GET request to ‘python.org’, using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
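
The object you get back is an ordinary Requests response with a parsed HTML object attached. A minimal sketch of what you might inspect on it (output omitted):

>>> r.status_code   # standard Requests attribute
>>> r.html          # the parsed page, used throughout this tutorial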

Grab a list of all links on the page, as-is (anchors excluded):

>>> r.html.links

Grab a list of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links

Select an Element with a CSS Selector:

>>> about = r.html.find('#about', first=True)
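
find() with first=True returns a single Element; without it, the same call returns a list of every match. A minimal sketch (output omitted):

>>> r.html.find('#about')              # list of matching Elements
>>> r.html.find('#about', first=True)  # just the first match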

Grab an Element’s text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element’s attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

Render out an Element’s HTML:

>>> about.html

Select an Element list within an Element:

>>> about.find('a')

Search for links within an element:

>>> about.absolute_links

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
programming
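
To collect every occurrence of a pattern rather than just the first, there is also a search_all() variant; a minimal sketch, assuming search_all() as documented for the project (output omitted):

>>> r.html.search_all('Python is a {} language')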

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'

>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported:

>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]

You can also select only elements containing certain text:

>>> r = session.get('http://python-requests.org/')
>>> r.html.find('a', containing='kenneth')
[<Element 'a' href='http://kennethreitz.com/pages/open-projects.html'>, <Element 'a' href='http://kennethreitz.org/'>, <Element 'a' href='https://twitter.com/kennethreitz' class=('twitter-follow-button',) data-show-count='false'>, <Element 'a' class=('reference', 'internal') href='dev/contributing/#kenneth-reitz-s-code-style'>]

JavaScript Support

Let’s grab some text that’s rendered by JavaScript:

>>> r = session.get('http://python-requests.org/')

>>> r.html.render()

>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'

Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.
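
render() also takes keyword arguments for pages that need extra time or scrolling before their content appears; a hedged sketch, assuming the sleep and scrolldown parameters described in the project docs:

>>> r.html.render(sleep=1, scrolldown=3)  # pause and scroll so lazy-loaded content can appear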

Pagination

There’s also intelligent pagination support (always improving):

>>> r = session.get('https://reddit.com')
>>> for html in r.html:
...     print(html)
<HTML url='https://www.reddit.com/'>
<HTML url='https://www.reddit.com/?count=25&after=t3_81puu5'>
<HTML url='https://www.reddit.com/?count=50&after=t3_81nevg'>
<HTML url='https://www.reddit.com/?count=75&after=t3_81lqtp'>
<HTML url='https://www.reddit.com/?count=100&after=t3_81k1c8'>
<HTML url='https://www.reddit.com/?count=125&after=t3_81p438'>
<HTML url='https://www.reddit.com/?count=150&after=t3_81nrcd'>
…

You can also just request the next URL easily:

>>> r = session.get('https://reddit.com')
>>> r.html.next()
'https://www.reddit.com/?count=25&after=t3_81pm82'
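
You can combine next() with the session to step through pages yourself; a minimal sketch that follows a few "next" links manually:

>>> r = session.get('https://reddit.com')
>>> for _ in range(3):
...     url = r.html.next()
...     if not url:
...         break
...     r = session.get(url)
...     print(r.html.url)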

Using without Requests

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""

>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
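
If the markup contains relative links, you can also pass a base URL so that absolute_links can resolve them; a minimal sketch, assuming the url keyword argument of HTML (variable name is illustrative):

>>> relative_doc = """<a href='/get'>"""
>>> HTML(html=relative_doc, url='https://httpbin.org').absolute_links
{'https://httpbin.org/get'}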

You can also render JavaScript pages without Requests:

# ^^ proceeding from above ^^
>>> script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """
>>> val = html.render(script=script, reload=False)

>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>
