Installation
pip install requests-html
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
When using this library you automatically get:
- Full JavaScript support!
- CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async support.
Tutorial & Usage
Make a GET request to ‘python.org’, using Requests:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
Grab a set of all links on the page, as-is (anchors excluded):
>>> r.html.links
Grab a set of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
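To make "as-is" versus "absolute form" concrete, here is a minimal standard-library sketch. It is not the library's implementation: the `LinkCollector` class, the sample markup, and the `base_url` value are all assumptions made for illustration. It collects `href` values while skipping in-page anchors, then resolves each one against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags, skipping anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and pure in-page anchors such as "#content".
        if href and not href.startswith("#"):
            self.links.add(href)


# Sample markup invented for this sketch.
doc = """
<a href="/about/">About</a>
<a href="downloads/">Downloads</a>
<a href="#content">Skip to content</a>
<a href="https://docs.python.org/3/">Docs</a>
"""

collector = LinkCollector()
collector.feed(doc)

# Assumed page URL; relative links are resolved against it.
base_url = "https://www.python.org/"
absolute_links = {urljoin(base_url, link) for link in collector.links}

print(sorted(collector.links))
print(sorted(absolute_links))
```

The first set keeps links exactly as written in the markup; the second contains only fully qualified URLs, which is usually what a crawler wants to follow.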
Select an Element with a CSS Selector:
>>> about = r.html.find('#about', first=True)
Grab an Element’s text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element’s attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Render out an Element’s HTML:
>>> about.html
Select a list of Elements within an Element:
>>> about.find('a')
Search for links within an element:
>>> about.absolute_links
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
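The search template reads like a format string run in reverse: literal text must match, and each `{}` captures whatever sits in its place. The sketch below imitates that idea with a regular expression. The `search_template` helper and the sample sentence are hypothetical, and this is only an illustration of the concept, not how the library performs the search.

```python
import re


def search_template(template, text):
    """Hypothetical helper: return the values captured by each `{}`, or None."""
    # Escape the literal pieces, then join them with lazy capture groups.
    parts = [re.escape(piece) for piece in template.split("{}")]
    pattern = "(.+?)".join(parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Sample text invented for this sketch.
page_text = "Python is a programming language that lets you work quickly."

result = search_template("Python is a {} language", page_text)
print(result[0])

# A template that does not occur in the text yields None.
print(search_template("Ruby is a {} language", page_text))
```

Indexing the result with `[0]` mirrors the `[0]` in the example above: it picks the first captured placeholder.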
More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:
>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]
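For readers new to XPath, the following sketch shows two common expression shapes using only the standard library. Note the assumptions: `xml.etree.ElementTree` supports just a subset of XPath, the markup is invented for the example, and the returned objects are ElementTree elements rather than this library's `Element` objects.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup invented for this sketch.
doc = """<html><body>
  <div><a class="btn" href="https://example.org/help">Help</a></div>
  <div><a href="https://example.org/about">About</a></div>
</body></html>"""

root = ET.fromstring(doc)

# Every <a> element anywhere below the root.
all_links = root.findall(".//a")

# Only <a> elements whose class attribute is exactly "btn".
buttons = root.findall(".//a[@class='btn']")

print([a.get("href") for a in all_links])
print([a.get("href") for a in buttons])
```

The same kind of expression can be passed to `r.html.xpath(...)`; which nodes it matches depends on the structure of the page that was fetched.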
You can also select only elements containing certain text:
>>> r = session.get('http://python-requests.org/')
>>> r.html.find('a', containing='kenneth')
[<Element 'a' href='http://kennethreitz.com/pages/open-projects.html'>, <Element 'a' href='http://kennethreitz.org/'>, <Element 'a' href='https://twitter.com/kennethreitz' class=('twitter-follow-button',) data-show-count='false'>, <Element 'a' class=('reference', 'internal') href='dev/contributing/#kenneth-reitz-s-code-style'>]
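Filtering by contained text can be pictured as a selector match followed by a substring test on each element's text. The sketch below does this with the standard library; the `find_containing` helper, the sample markup, and the choice of a case-insensitive comparison are assumptions for illustration rather than a description of the library's internals.

```python
import xml.etree.ElementTree as ET

# Sample markup invented for this sketch.
doc = """<div>
  <a href="https://example.org/projects">Kenneth's projects</a>
  <a href="https://example.org/docs">Documentation</a>
  <a href="https://example.org/style">kenneth reitz code style</a>
</div>"""


def find_containing(root, tag, needle):
    """Hypothetical helper: elements named `tag` whose text contains `needle`."""
    needle = needle.lower()
    return [
        el for el in root.iter(tag)
        # itertext() also gathers text from nested child elements.
        if needle in "".join(el.itertext()).lower()
    ]


root = ET.fromstring(doc)
matches = find_containing(root, "a", "kenneth")
hrefs = [el.get("href") for el in matches]
print(hrefs)
```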
JavaScript Support
Let’s grab some text that’s rendered by JavaScript:
>>> r = session.get('http://python-requests.org/')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
Note: the first time you run the render() method, it downloads Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.
Pagination
There’s also intelligent pagination support (always improving):
>>> r = session.get('https://reddit.com')
>>> for html in r.html:
...     print(html)
<HTML url='https://www.reddit.com/'>
<HTML url='https://www.reddit.com/?count=25&after=t3_81puu5'>
<HTML url='https://www.reddit.com/?count=50&after=t3_81nevg'>
<HTML url='https://www.reddit.com/?count=75&after=t3_81lqtp'>
<HTML url='https://www.reddit.com/?count=100&after=t3_81k1c8'>
<HTML url='https://www.reddit.com/?count=125&after=t3_81p438'>
<HTML url='https://www.reddit.com/?count=150&after=t3_81nrcd'>
…
You can also just request the next URL easily:
>>> r = session.get('https://reddit.com')
>>> r.html.next()
'https://www.reddit.com/?count=25&after=t3_81pm82'
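The two pagination styles above (iterating over pages, or asking for the next URL) share one underlying pattern: keep following a "next" link until there is none. Here is a minimal sketch of that loop. The `PAGES` dictionary is a hypothetical stand-in for the network, and `paginate` is an illustrative name, not part of the library.

```python
# Hypothetical site map: each URL points to its "next" page, or None at the end.
PAGES = {
    "https://example.org/": "https://example.org/?page=2",
    "https://example.org/?page=2": "https://example.org/?page=3",
    "https://example.org/?page=3": None,
}


def paginate(start_url, get_next):
    """Yield each page URL in turn, following `get_next` until it returns None."""
    url = start_url
    seen = set()
    # The `seen` set guards against sites whose "next" links form a cycle.
    while url is not None and url not in seen:
        seen.add(url)
        yield url
        url = get_next(url)


visited = list(paginate("https://example.org/", PAGES.get))
for url in visited:
    print(url)
```

Because the loop is a generator, a caller can stop early (for example with `break`) without fetching the remaining pages.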
Using without Requests
You can also use this library without Requests:
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
You can also render JavaScript pages without Requests:
# ^^ proceeding from above ^^
>>> script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """
>>> val = html.render(script=script, reload=False)
>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>
Source 1: https://github.com/psf/requests-html
Source 2: http://html.python-requests.org/