Three Approaches to Web Scraping
-
Regular Expressions
import re
url = 'www.baidu.com'
html = download(url)  # download() is an assumed helper that fetches the page and returns its HTML as a string
# Non-greedy match: skip ahead to the places_area_row row, then capture the contents of its w2p_fw cell
re.findall(r'<tr id="places_area_row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)
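Since download() is not defined here, a self-contained sketch of the same extraction, run against a made-up HTML fragment that mirrors the structure the pattern assumes:

```python
import re

# Made-up fragment mimicking the table row the pattern above targets
html = ('<tr id="places_area_row"><td class="w2p_fl">Area</td>'
        '<td class="w2p_fw">244,820 square kilometres</td></tr>')

# Non-greedy .*? skips past the label cell; the group captures the value cell
pattern = r'<tr id="places_area_row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>'
print(re.findall(pattern, html))  # ['244,820 square kilometres']
```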
-
Beautiful Soup
This is a very popular Python module that parses web pages and provides a convenient interface for locating content.
from bs4 import BeautifulSoup
broken_html = '<ul class=country><li>Area<li>Population</ul>'
# Parse the HTML; the lxml-based parser (pip install lxml) repairs the unclosed <li> tags,
# which the standard library's 'html.parser' would leave broken
soup = BeautifulSoup(broken_html, 'lxml')
fixed_html = soup.prettify()
print(fixed_html)
Output:
<html>
<body>
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
</body>
</html>
Use the find() and find_all() methods to locate elements:
ul = soup.find('ul', attrs={'class': 'country'})
ul.find('li') # returns just the first match
ul.find_all('li') # returns all matches
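Putting the two lookups together in a runnable form (using the lxml-based parser, as above, so the broken markup is repaired before searching):

```python
from bs4 import BeautifulSoup

# Re-parse the broken snippet; the lxml-based parser closes the dangling <li> tags
broken_html = '<ul class=country><li>Area<li>Population</ul>'
soup = BeautifulSoup(broken_html, 'lxml')

ul = soup.find('ul', attrs={'class': 'country'})
print(ul.find('li').text)                      # first match only: Area
print([li.text for li in ul.find_all('li')])   # every match: ['Area', 'Population']
```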
-
lxml
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.7 to 3.9. (Source: https://lxml.de/)
The first step in using lxml is to parse the possibly malformed HTML into a consistent form, for example:
import lxml.html
broken_html = '<ul class=country><li>Area<li>Population</ul>'
tree = lxml.html.fromstring(broken_html)  # parse the HTML
# encoding='unicode' makes tostring() return a str rather than bytes, so it prints cleanly
fixed_html = lxml.html.tostring(tree, pretty_print=True, encoding='unicode')
print(fixed_html)
Output:
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
After parsing the input, the next step is selecting elements. lxml offers several ways to do this, such as XPath selectors, a find() method similar to Beautiful Soup's, and CSS selectors.
import lxml.html
tree = lxml.html.fromstring(html)  # html is the downloaded page source from earlier
# CSS selectors require the cssselect package (pip install cssselect)
td = tree.cssselect('tr#places_area_row > td.w2p_fw')[0]
area = td.text_content()
print(area)
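The XPath route mentioned above can be sketched the same way; the fragment below is made up to mirror the table row the CSS selector targets:

```python
import lxml.html

# Made-up fragment mirroring the row selected by 'tr#places_area_row > td.w2p_fw'
html = ('<table><tr id="places_area_row">'
        '<td class="w2p_fl">Area</td>'
        '<td class="w2p_fw">244,820 square kilometres</td>'
        '</tr></table>')

tree = lxml.html.fromstring(html)
# The same element as the CSS selector, expressed as an XPath query
td = tree.xpath('//tr[@id="places_area_row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())  # 244,820 square kilometres
```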
Examples of common selectors:
Select all tags: *
Select <a> tags: a
Select all elements with class="link": .link
Select <a> tags with class="link": a.link
Select the <a> tag with id="home": a#home
Select all <span> tags whose parent is an <a> tag: a > span
Select all <span> tags anywhere inside an <a> tag: a span
Select all <a> tags whose title attribute is "Home": a[title=Home]
Summary
Regular expressions are very useful for one-off scrapes and also avoid the overhead of parsing the entire page; Beautiful Soup offers a higher-level interface while keeping troublesome dependencies to a minimum; in most cases, lxml is the best choice because it is faster and more feature-rich.