[Python Crawler 2] Extracting Data from Web Pages

Latest recommended article published on 2024-04-30 13:19:06

Wu_Being

Category: Python web crawlers. Tags: python, crawler, regular expressions

Article link: https://blog.csdn.net/u014134180/article/details/55506973

Contents

1 Data extraction methods
2 Performance comparison
3 Adding a scrape callback to the link crawler

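We want this crawler to extract some data from each web page and then do something with it. This practice is known as scraping.
1 Data extraction methods

Regular expressions
The BeautifulSoup module (popular)
lxml (powerful)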

1.1 Regular expressions

The following example uses a regular expression to extract a country's area.
Regular expression documentation: https://docs.python.org/3/howto/regex.html

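# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and the print statement)
import urllib2
import re

def scrape(html):
    # Raw string so the regex escapes reach the re module unchanged.
    # \s* tolerates extra whitespace and ["\'] accepts either quote style.
    area = re.findall(r'<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/China-47').read()
    print scrape(html)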

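This regular expression tolerates some future changes to the site, such as extra whitespace or a different quote style around the class attribute. However, it is hard to construct and hard to read, and other small layout changes can still break it.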
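To make that trade-off concrete, here is a minimal sketch that runs the same pattern against three inline HTML snippets instead of a live page. The snippets, the area value, and the `scrape_area` helper are made up for illustration; only the pattern comes from the example above.

```python
import re

# Same pattern as in the example above, compiled once.
AREA_PATTERN = re.compile(
    r'<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>')

def scrape_area(html):
    """Return the captured area text, or None when the pattern does not match."""
    match = AREA_PATTERN.search(html)
    return match.group(1) if match else None

# Hypothetical sample rows (not fetched from the real site).
double_quoted = ('<tr id="places_area__row"><td class="w2p_fl">Area: </td>'
                 '<td class="w2p_fw">244,820 square kilometres</td></tr>')
single_quoted = double_quoted.replace('"w2p_fw"', "'w2p_fw'")
extra_attribute = double_quoted.replace('class="w2p_fw"',
                                        'class="w2p_fw" title="area"')

print(scrape_area(double_quoted))    # matches
print(scrape_area(single_quoted))    # still matches: either quote style is accepted
print(scrape_area(extra_attribute))  # None: an extra attribute breaks the pattern
```

Returning None instead of indexing `[0]` is a choice made for this sketch, so that a failed match is visible rather than raising IndexError.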

1.2 The popular BeautifulSoup module

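Install: pip install beautifulsoup4
Some web pages are not well-formed HTML. The HTML below, for example, is missing the quotes around an attribute value and has unclosed tags.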

<ul class=country>
	<li>Area
	<li>Population
</ul>

Extracting data from markup like this often does not give the expected result, but Beautiful Soup can be used to clean it up.

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> soup = BeautifulSoup(broken_html, 'html.parser')  # parse the HTML
>>> fixed_html = soup.prettify()
>>> print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>
>>> 
>>> ul=soup.find('ul',attrs={'class':'country'})
>>> ul.find('li')
<li>Area<li>Population</li></li>
>>> ul.find_all('li')
[<li>Area<li>Population</li></li>, <li>Population</li>]
>>>

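As the output shows, Beautiful Soup with html.parser quotes the attribute value and closes the tags, but it nests the second <li> inside the first instead of making them siblings.
Official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
The following example uses BeautifulSoup to extract a country's area.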

# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and the print statement)

import urllib2
from bs4 import BeautifulSoup

def scrape(html):
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')
    tr = soup.find(attrs={'id': 'places_area__row'})  # locate the area row
    # 'class' is a Python keyword, so it is passed through the attrs dict here;
    # tr.find(class_='w2p_fw') is the equivalent keyword form
    td = tr.find(attrs={'class': 'w2p_fw'})  # locate the area tag
    area = td.text  # extract the area contents from this tag
    return area

if __name__ == '__main__':
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    print scrape(html)

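Although the BeautifulSoup code is longer than the regular expression, it is easier to construct and understand, and we no longer need to worry about minor layout changes such as extra whitespace or additional tag attributes.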
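The claim about layout changes can be checked with a small sketch that parses two inline snippets: one tidy, one with extra whitespace, mixed quoting, and an extra attribute. Both snippets and the `scrape_area` helper are hypothetical and exist only for this illustration. The lookups are the same `find(attrs=...)` calls used above.

```python
from bs4 import BeautifulSoup

def scrape_area(html):
    """Return the text of the area cell, or None if the row or cell is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    tr = soup.find(attrs={'id': 'places_area__row'})  # locate the area row
    if tr is None:
        return None
    td = tr.find(attrs={'class': 'w2p_fw'})  # locate the area cell
    return td.text if td is not None else None

# Hypothetical sample markup (not fetched from the real site).
tidy = ('<table><tr id="places_area__row"><td class="w2p_fl">Area: </td>'
        '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
messy = ("<table><tr  id='places_area__row' ><td class=w2p_fl>Area: </td>"
         "<td   title=\"area\" class='w2p_fw' >244,820 square kilometres</td>"
         "</tr></table>")

print(scrape_area(tidy))
print(scrape_area(messy))  # same result despite the layout differences
```

The None checks are an addition for this sketch. The original scrape() would raise AttributeError if the row were missing.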

1.3 The powerful lxml module

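lxml is a Python wrapper around the libxml2 XML parsing library. The module is written in C, so it parses faster than Beautiful Soup, but it is also more complicated to install. The latest installation instructions are at http://Lxml.de/installation.html .
As with Beautiful Soup, the first step with lxml is to parse the potentially invalid HTML into a consistent format.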

>>> import lxml.html
>>> broken_html='<ul class=country><li>Area<li>Population</ul>'
>>> tree=lxml.html.fromstring(broken_html) #parse the HTML
>>> fixed_html=lxml.html.tostring(tree,pretty_print=True)
>>> print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

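lxml also correctly adds the missing quotes around the attribute value and closes the tags. After parsing the input, the next step is selecting elements, and lxml offers several different ways to do this:

XPath selectors (similar to Beautiful Soup's find() method)
CSS selectors (similar to jQuery selectors)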
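For comparison, here is a minimal sketch of the XPath option applied to the same area row. The sample markup and the `scrape_area` helper are assumptions made for this illustration. `lxml.html.fromstring`, `xpath()` and `text_content()` are standard lxml calls.

```python
import lxml.html

# Hypothetical sample markup (not fetched from the real site).
SAMPLE_HTML = (
    '<table>'
    '<tr id="places_area__row">'
    '<td class="w2p_fl">Area: </td>'
    '<td class="w2p_fw">244,820 square kilometres</td>'
    '</tr>'
    '</table>'
)

def scrape_area(html):
    """Return the text of the area cell, or None if no matching cell exists."""
    tree = lxml.html.fromstring(html)  # parse the HTML
    # Select the td with class w2p_fw directly under the area row.
    cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')
    return cells[0].text_content() if cells else None

print(scrape_area(SAMPLE_HTML))
```

Note that `@class="w2p_fw"` compares the whole attribute value, so a cell carrying a second class name would not match this expression.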

Here we use CSS selectors.
