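I recently started writing some small web crawlers and learned that a well-known developer has released a new library, requests_html. I wanted to get to know its basic methods and classes, so I tried using it to write a few small crawlers.
requests_html documentation: https://html.python-requests.org/
1. Installation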
$ pipenv install requests-html
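You can also install it manually; the process is not complicated.
2. Basic methods (parts of this section are translated from the examples in the author's documentation)
Send a GET request to the server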
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
links: returns all links on the current page, stored as a set
>>> r.html.links
absolute_links: returns all links as a set, with every link given as an absolute URL. It can be used on a whole page or on a selected element.
>>> r.html.absolute_links
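To make the difference between links and absolute_links concrete, here is a minimal standard-library sketch. It is not how requests_html is implemented; LinkCollector and collect_links are hypothetical names I made up, and the sample HTML and base URL are assumptions for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not part of requests_html)."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def collect_links(html_text, base_url):
    """Return (links exactly as written, the same links resolved against base_url)."""
    parser = LinkCollector()
    parser.feed(html_text)
    absolute = {urljoin(base_url, link) for link in parser.links}
    return parser.links, absolute


# Assumed sample page: one relative link and one link that is already absolute.
sample = '<a href="/about/">About</a> <a href="https://docs.python.org/3/">Docs</a>'
links, absolute_links = collect_links(sample, "https://www.python.org/")
print(sorted(links))
print(sorted(absolute_links))
```

The relative link /about/ is resolved against the base URL, while the link that was already absolute passes through unchanged.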
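find:
1. Selects content on the current page with a CSS selector. When the selector matches several elements, passing first=True returns only the first matching element instead of a list.
Note: the first argument of find is the selector. It can be a more complex CSS selector, and it can be passed as a string variable or a string literal.
Targeted search: set containing to find only the elements whose text contains that value.
2. Passing a tag name returns every element of that tag type.
>>> about = r.html.find('#about', first=True)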
-------------------------------------------------------------------------
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
---------------------------------------------------------------------------
>>> r = session.get('http://python-requests.org/')
>>> r.html.find('a', containing='kenneth')
[<Element 'a' href='http://kennethreitz.com/pages/open-projects.html'>, <Element 'a' href='http://kennethreitz.org/'>, <Element 'a' href='https://twitter.com/kennethreitz' class=('twitter-follow-button',) data-show-count='false'>, <Element 'a' class=('reference', 'internal') href='dev/contributing/#kenneth-reitz-s-code-style'>]
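The containing filter can be pictured as "keep the elements whose text includes the given string". The sketch below imitates that idea with the standard library only. It assumes the comparison is case-insensitive (the output above, where 'kenneth' also matches a capitalized heading link, suggests so), and AnchorTextCollector and find_links_containing are hypothetical helpers, not part of requests_html.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Record (href, text) for each <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # finished (href, text) pairs
        self._href = None     # href of the <a> that is currently open
        self._chunks = []     # text pieces seen inside the open <a>
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.anchors.append((self._href, "".join(self._chunks).strip()))
            self._inside = False


def find_links_containing(html_text, needle):
    """Return the hrefs of <a> elements whose text contains needle, ignoring case."""
    parser = AnchorTextCollector()
    parser.feed(html_text)
    return [href for href, text in parser.anchors if needle.lower() in text.lower()]


# Assumed sample markup, for illustration only.
sample_html = (
    '<a href="/a">Kenneth Reitz</a>'
    '<a href="/b">Docs</a>'
    '<a href="/c">Follow @kennethreitz</a>'
)
matches = find_links_containing(sample_html, "kenneth")
print(matches)
```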
text: returns the text content of the selected element
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
attrs: returns the attributes of the selected element
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
html: returns the HTML of the selected element
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
search: searches the page for text that matches a template and returns the part captured by the {} placeholder
>>> r.html.search('Python is a {} language')[0]
'programming'
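The template passed to search works like a format string in reverse: the literal text must match and {} captures whatever sits in between. As far as I know the library relies on the parse package for this; the sketch below only imitates the anonymous {} placeholder with a regular expression. search_template is a hypothetical helper and the sample sentence is an assumption, and named fields such as {months} are not handled here.

```python
import re


def search_template(template, text):
    """Return the captured groups of the first match of template in text, or None."""
    # Split on the anonymous placeholder "{}", escape the literal pieces,
    # and join them back together with a non-greedy capture group.
    pieces = [re.escape(piece) for piece in template.split("{}")]
    pattern = "(.+?)".join(pieces)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Assumed page text, for illustration only.
page_text = "Python is a programming language that lets you work quickly."
result = search_template("Python is a {} language", page_text)
print(result[0])
```

Indexing the result with [0] mirrors the [0] in the call above: it picks the first captured placeholder.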
xpath: selects elements with an XPath expression
>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]
JavaScript support: call render() first so that search can see the text generated by the page's scripts
>>> r = session.get('http://python-requests.org/')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
Pagination support
>>> r = session.get('https://reddit.com')
>>> for html in r.html:
...     print(html)
<HTML url='https://www.reddit.com/'>
<HTML url='https://www.reddit.com/?count=25&after=t3_81puu5'>
<HTML url='https://www.reddit.com/?count=50&after=t3_81nevg'>
<HTML url='https://www.reddit.com/?count=75&after=t3_81lqtp'>
<HTML url='https://www.reddit.com/?count=100&after=t3_81k1c8'>
<HTML url='https://www.reddit.com/?count=125&after=t3_81p438'>
<HTML url='https://www.reddit.com/?count=150&after=t3_81nrcd'>
You can also fetch the link to the next page one step at a time, as shown below:
>>> r = session.get('https://reddit.com')
>>> r.html.next()
'https://www.reddit.com/?count=25&after=t3_81pm82'
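How does a crawler know which link leads to the next page? One common convention is an anchor marked rel="next". The sketch below follows only that convention with the standard library. It is my own simplified heuristic, not the library's actual logic; NextLinkFinder and next_page_url are hypothetical names, and the sample markup is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose rel attribute contains "next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attrs = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "nofollow next".
        rel_tokens = (attrs.get("rel") or "").split()
        if "next" in rel_tokens and attrs.get("href"):
            self.next_href = attrs["href"]


def next_page_url(html_text, base_url):
    """Return the absolute URL of the next page, or None if no rel="next" link exists."""
    finder = NextLinkFinder()
    finder.feed(html_text)
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)


# Assumed sample markup, for illustration only.
page = '<a href="/about">about</a> <a href="/?count=25" rel="nofollow next">next</a>'
next_url = next_page_url(page, "https://www.reddit.com/")
print(next_url)
```

Calling next_page_url repeatedly on each newly fetched page, until it returns None, gives the same kind of page-by-page walk as the loop shown earlier.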
You can also work with HTML that you wrote yourself, without making a request through the Requests library
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
JavaScript rendering works here too
# ^^ proceeding from above ^^
>>> script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """
>>> val = html.render(script=script, reload=False)
>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>