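Web page parser: a tool that extracts the data we want from a web page
Web page parsers available in Python:
Regular expressions (fuzzy matching)
Structured parsing:
html.parser
BeautifulSoup (third-party library)
lxml (third-party library)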
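To make the difference between fuzzy matching and structured parsing concrete, here is a minimal sketch that uses only the standard library and assumes Python 3. The sample HTML string, the LinkCollector class name, and the href pattern are assumptions made up for this illustration. The regular expression matches the text pattern directly, while html.parser walks the tag structure.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, used only for this illustration
html_doc = '<ul><li><a href="/a.html">A</a></li><li><a class="x" href="/b.html">B</a></li></ul>'

# Approach 1: regular expression (fuzzy, text-level matching)
regex_links = re.findall(r'href="([^"]+)"', html_doc)


# Approach 2: structured parsing with the standard-library html.parser
class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html_doc)
parsed_links = collector.links

print(regex_links)
print(parsed_links)
```

On this small input both approaches find the same links. The regular expression would also match an href on any other tag, whereas the parser-based version only looks at a tags.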
Web page parsers: Beautiful Soup
First, check whether Beautiful Soup 4 is installed:
import bs4
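print(bs4)
If output like the following appears, Beautiful Soup 4 is already installed; otherwise you need to install it yourself (for example, with pip install beautifulsoup4). The exact path depends on your installation.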
<module 'bs4' from 'D:\Python2.7\lib\site-packages\bs4\__init__.pyc'>
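If you would rather not hit an ImportError when the package is missing, a check along these lines may help. This is a sketch assuming Python 3; the helper name is_installed is made up for this example, while importlib.util.find_spec comes from the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed('bs4'):
    import bs4
    print('Beautiful Soup version:', bs4.__version__)
else:
    print('bs4 not found; try: pip install beautifulsoup4')
```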
Beautiful Soup syntax
#coding:utf8
import re
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(
    html_doc,               # HTML document string
    'html.parser',          # HTML parser
    from_encoding='utf8'    # encoding of the HTML document (used only when html_doc is a byte string)
)

# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link has the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# Find all div nodes with class abc and text Python
# (class is a Python keyword, so the argument is spelled class_)
soup.find_all('div', class_='abc', string='Python')

# Given a node such as: <a href='1.html'>Python</a>
# Get the tag name of the node
node.name
# Get the href attribute of the a node
node['href']
# Get the link text of the a node
node.get_text()

Example code:
#coding:utf8
import bs4
import re
from bs4 import BeautifulSoup
html_doc = """<div id="divAll">
<div id="divPage">
<div id="divMiddle">
<div id="divTop">
<div id="BlogTitle"><img src="http://www.cnpythoner.com/themes/ecworker/style/default/logo.gif" alt="python" width="230" height="60"></div>
<div class="banner">
<a href="http://www.cnpythoner.com/peixun/info.html"><img src="http://www.cnpythoner.com/images/9354036.gif" alt="python视频教程" usemap="#AV-eggs" border="0" height="70" width="600"></a></div>
</div>
<div id="divNavBar">
<div class="headLeft"></div>
<div class="headRight"></div>
<ul>
<p class="title"><b>Python教程</b></p>
<li><a href="http://www.cnpythoner.com/" rel="nofollow">首页</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=11">入门教程</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=4">练习题</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=1">python教程</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=2">django教程</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=15">seo应用</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=16">linux</a></li>
<li><a href="http://www.cnpythoner.com/catalog.asp?cate=17">测试应用</a></li>
<li><a href="http://www.cnpythoner.com/pythonbook.html" target="_blank">书籍推荐</a></li>
"""
# Create a BeautifulSoup object from the HTML document string
soup = BeautifulSoup(
    html_doc,       # HTML document string
    'html.parser'   # HTML parser
    # from_encoding='utf-8' is only needed when html_doc is a byte string
)

print('All links:')
# Find all nodes whose tag is a
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('A specific link:')
link_node = soup.find('a', href='http://www.cnpythoner.com/')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regular-expression match:')
link_node = soup.find('a', href=re.compile(r"tal"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Text of the p paragraph:')
# Look up the p node itself, not an a node
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())
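The example above does not exercise the class_ and string filters from the syntax section, and it assumes every find() call succeeds. The following sketch covers both points. It assumes Python 3 and a Beautiful Soup version recent enough to accept the string argument; the sample markup and the /no-such-page.html link are made up for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
sample = """
<div class="abc">Python</div>
<div class="abc">Java</div>
<div class="xyz">Python</div>
<a href="/view/123.htm">entry</a>
<a href="/about.html">about</a>
"""

soup = BeautifulSoup(sample, 'html.parser')

# Combine a class filter with an exact text match
divs = soup.find_all('div', class_='abc', string='Python')
print(len(divs))

# Regular-expression match on the href attribute
view_links = [a['href'] for a in soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))]
print(view_links)

# find() returns None when nothing matches, so guard before using the result
missing = soup.find('a', href='/no-such-page.html')
text = missing.get_text() if missing is not None else ''
print(repr(text))
```

Checking for None before calling get_text() avoids an AttributeError when a page does not contain the node you expected.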