Three Approaches to Web Scraping
-
Regular Expressions
import re
url = 'www.baidu.com'
html = download(url)  # download() is an assumed helper that fetches the page and returns its HTML as a string
# Non-greedy match: skip ahead to the places_area_row row, then capture the contents of its w2p_fw cell
re.findall(r'<tr id="places_area_row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)
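Since download() is not defined here, a self-contained sketch of the same extraction, run against a made-up HTML fragment that mirrors the structure the pattern assumes:

```python
import re

# Made-up fragment mimicking the table row the pattern above targets
html = ('<tr id="places_area_row"><td class="w2p_fl">Area</td>'
        '<td class="w2p_fw">244,820 square kilometres</td></tr>')

# Non-greedy .*? skips past the label cell; the group captures the value cell
pattern = r'<tr id="places_area_row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>'
print(re.findall(pattern, html))  # ['244,820 square kilometres']
```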
-
Beautiful Soup
This is a very popular Python module that parses web pages and provides a convenient interface for locating content.
from bs4 import BeautifulSoup
broken_html = '<ul class=country><li>Area<li>Population</ul>'
# Parse the HTML; the lxml-based parser (pip install lxml) repairs the unclosed <li> tags,
# which the standard library's 'html.parser' would leave broken
soup = BeautifulSoup(broken_html, 'lxml')
fixed_html = soup.prettify()
print(fixed_html)
Output:
<html>
<body>
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
</body>
</html>
Use the find() and find_all() methods to locate elements:
ul = soup.find('ul', attrs={'class': 'country'})
ul.find('li') # returns just the first match
ul.find_all('li') # returns all matches
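Putting the two lookups together in a runnable form (using the lxml-based parser, as above, so the broken markup is repaired before searching):

```python
from bs4 import BeautifulSoup

# Re-parse the broken snippet; the lxml-based parser closes the dangling <li> tags
broken_html = '<ul class=country><li>Area<li>Population</ul>'
soup = BeautifulSoup(broken_html, 'lxml')

ul = soup.find('ul', attrs={'class': 'country'})
print(ul.find('li').text)                      # first match only: Area
print([li.text for li in ul.find_all('li')])   # every match: ['Area', 'Population']
```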
-
lxml
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.7 to 3.9. (Source: https://lxml.de/)
The first step in using lxml is to parse the possibly malformed HTML into a consistent form, for example:
import lxml.html
broken_html = '<ul class=country><li>Area<li>Population</ul>'
tree = lxml.html.fromstring(broken_html)  # parse the HTML
# encoding='unicode' makes tostring() return a str rather than bytes, so it prints cleanly
fixed_html = lxml.html.tostring(tree, pretty_print=True, encoding='unicode')
print(fixed_html)
Output:
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
After parsing the input, the next step is selecting elements. lxml offers several ways to do this, such as XPath selectors, a find() method similar to Beautiful Soup's, and CSS selectors.
import lxml.html
tree = lxml.html.fromstring(html)  # html is the downloaded page source from earlier
# CSS selectors require the cssselect package (pip install cssselect)
td = tree.cssselect('tr#places_area_row > td.w2p_fw')[0]
area = td.text_content()
print(area)
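The XPath route mentioned above can be sketched the same way; the fragment below is made up to mirror the table row the CSS selector targets:

```python
import lxml.html

# Made-up fragment mirroring the row selected by 'tr#places_area_row > td.w2p_fw'
html = ('<table><tr id="places_area_row">'
        '<td class="w2p_fl">Area</td>'
        '<td class="w2p_fw">244,820 square kilometres</td>'
        '</tr></table>')

tree = lxml.html.fromstring(html)
# The same element as the CSS selector, expressed as an XPath query
td = tree.xpath('//tr[@id="places_area_row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())  # 244,820 square kilometres
```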
Examples of common selectors:
Select all tags: *
Select <a> tags: a
Select all elements with class="link": .link
Select <a> tags with class="link": a.link
Select the <a> tag with id="home": a#home
Select all <span> tags whose parent is an <a> tag: a > span
Select all <span> tags anywhere inside an <a> tag: a span
Select all <a> tags whose title attribute is "Home": a[title=Home]
Summary
Regular expressions are very useful for one-off scrapes and also avoid the overhead of parsing the entire page; Beautiful Soup offers a higher-level interface while keeping troublesome dependencies to a minimum; in most cases, lxml is the best choice because it is faster and more feature-rich.