Lightweight Python Crawler Tutorial: Web Page Parsers

Latest recommended article published on 2022-08-18 15:12:33

Bugggget

Article link: https://blog.csdn.net/Bugggget/article/details/76209158

Web page parser: a tool that extracts the data we want from a web page.

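Python offers several kinds of web page parsers:

Regular expressions (fuzzy matching against the raw text)

Structured parsing (the page is parsed into a tree of nodes):

html.parser (part of the standard library)

BeautifulSoup (third-party package)

lxml (third-party package)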
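To make the difference between fuzzy matching and structured parsing concrete, here is a minimal sketch that extracts link targets both ways using only the standard library. The sample markup and the names `SAMPLE_HTML`, `links_with_regex`, `LinkCollector`, and `links_with_parser` are made up for illustration and are not part of the original article.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, written only for this sketch.
SAMPLE_HTML = '<p><a href="/view/1.htm">One</a> <a class="x" href="/view/2.htm">Two</a></p>'


def links_with_regex(html):
    """Fuzzy matching: pull href values out of the raw text with a pattern."""
    return re.findall(r'<a\b[^>]*\bhref="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Structured parsing: react to each start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(links_with_regex(SAMPLE_HTML))
print(links_with_parser(SAMPLE_HTML))
```

The regular expression only works while the markup keeps the exact shape the pattern expects (double-quoted attributes, for instance), whereas the parser-based version sees tags and attributes as structure. That is the main reason the rest of this tutorial uses a structured parser.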

Web page parser: Beautiful Soup

First, check whether Beautiful Soup 4 is installed:

import bs4

print(bs4)

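If this prints the module information (a line beginning with <module 'bs4' from ...), Beautiful Soup 4 is already installed. If it raises an ImportError instead, you need to install it yourself, for example with pip install beautifulsoup4.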
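The same check can be done without triggering an exception. The following is a small sketch, assuming Python 3; the helper name `is_installed` is hypothetical and only serves this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4
    print("Beautiful Soup", bs4.__version__, "is installed")
else:
    print("bs4 not found; try: pip install beautifulsoup4")
```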

Beautiful Soup syntax

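# coding: utf8
import re
from bs4 import BeautifulSoup

# Placeholder document; substitute the HTML string you want to parse
html_doc = '<div class="abc">Python</div><a href="1.html">Python</a><a href="/view/123.htm">Python</a>'

# Create a BeautifulSoup object from the HTML string
soup = BeautifulSoup(
                    html_doc,              # HTML document string
                    'html.parser',         # HTML parser
                    # from_encoding='utf8' # document encoding; only applies when html_doc is bytes
                    )
# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link is exactly /view/123.htm, or matches the form /view/<digits>.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# Find all div nodes with class abc and text Python
# class is a Python keyword, so the argument is spelled class_
soup.find_all('div', class_='abc', string='Python')

# Suppose we found the node: <a href='1.html'>Python</a>
node = soup.find('a')
# Get the tag name of the node
node.name

# Get the href attribute of the a node
node['href']

# Get the link text of the a node
node.get_text()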
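The calls above can be tried end to end on a small document. This is a minimal sketch, assuming Python 3 and an installed beautifulsoup4; the markup in `html_doc` is hypothetical and written only for this example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup written for this sketch.
html_doc = """
<div class="abc">Python</div>
<div class="abc">Java</div>
<a href="/view/123.htm">Python</a>
<a href="/view/456.htm">Crawler</a>
<a href="/about.htm">About</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Every a tag in the document.
all_links = soup.find_all("a")
# Only the a tag whose href equals the given string.
exact = soup.find_all("a", href="/view/123.htm")
# Every a tag whose href matches the regular expression.
by_pattern = soup.find_all("a", href=re.compile(r"/view/\d+\.htm"))
# div tags with class abc whose text is exactly Python.
divs = soup.find_all("div", class_="abc", string="Python")

node = exact[0]
summary = (node.name, node["href"], node.get_text())
print(len(all_links), len(by_pattern), len(divs), summary)
```

Passing a string to `href` requires an exact match, while a compiled pattern matches anywhere in the attribute value, so the pattern form is usually the more practical one for crawling a family of URLs.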

Example code:

#coding:utf8
import bs4
import re
from bs4 import  BeautifulSoup

html_doc = """<div id="divAll">
	<div id="divPage">
	<div id="divMiddle">
		<div id="divTop">
			<div id="BlogTitle"><img src="http://www.cnpythoner.com/themes/ecworker/style/default/logo.gif" alt="python" width="230" height="60"></div>
		<div class="banner">
<a href="http://www.cnpythoner.com/peixun/info.html"><img src="http://www.cnpythoner.com/images/9354036.gif" alt="python视频教程" usemap="#AV-eggs" border="0" height="70" width="600"></a></div>
		</div>
		<div id="divNavBar">
			<div class="headLeft"></div>
		<div class="headRight"></div>
<ul>
<p class="title"><b>Python教程</b></p>
<li><a href="http://www.cnpythoner.com/" rel="nofollow">首页</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=11">入门教程</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=4">练习题</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=1">python教程</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=2">django教程</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=15">seo应用</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=16">linux</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=17">测试应用</a></li>
<li><a href="http://www.cnpythoner.com/pythonbook.html" target="_blank">书籍推荐</a></li>
"""

# Create a BeautifulSoup object from the HTML string
soup = BeautifulSoup(
                    html_doc,              # HTML document string
                    'html.parser',         # HTML parser
                    )
# from_encoding is omitted: html_doc is already a str, and the option only applies to bytes input
print('Get all links')
# Find all nodes whose tag is a
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())


print('Get a specific link')
link_node = soup.find('a', href='http://www.cnpythoner.com/')
print(link_node.name, link_node['href'], link_node.get_text())


print('Regex match')
link_node = soup.find('a', href=re.compile(r'tal'))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())
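One caveat about the example: `soup.find` returns None when nothing matches, so accessing `.name` on the result would raise an AttributeError. Below is a minimal sketch of a guard, assuming Python 3 and an installed beautifulsoup4; the helper `describe` and the shortened markup are hypothetical and not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup modelled on the example page.
html_doc = '<ul><li><a href="http://www.cnpythoner.com/" rel="nofollow">Home</a></li></ul>'
soup = BeautifulSoup(html_doc, "html.parser")


def describe(node):
    """Return (name, href, text) for a tag, or None when nothing matched."""
    if node is None:
        return None
    # Tag.get returns None instead of raising KeyError for a missing attribute.
    return (node.name, node.get("href"), node.get_text())


found = describe(soup.find("a", href="http://www.cnpythoner.com/"))
missing = describe(soup.find("a", href="http://example.invalid/"))
print(found)
print(missing)
```

Using `node.get("href")` rather than `node["href"]` is a design choice for crawlers: real pages often contain a tags without an href, and a missing attribute should usually be skipped rather than stop the whole run.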