The bs4 package in Python parses web page source; crawler programs commonly use it to analyze the pages they fetch. This post covers its basic usage.
First install bs4: pip install beautifulsoup4 (the PyPI package name for bs4)
Creating a BeautifulSoup object
To parse page source, first create a BeautifulSoup object:
import requests
from bs4 import BeautifulSoup
html=requests.get('http://www.baidu.com')
html.encoding=html.apparent_encoding
soup=BeautifulSoup(html.text,'html.parser')
print(type(soup))
print(soup.prettify())  # pretty-print the page source
The prettified page source is printed (original screenshot omitted).
Parsing HTML nodes (part 1)
Parse this sample document:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Getting tags:
soup.head  # the first <head> tag
soup.head.title  # the first <title> tag under <head>
Getting a tag's name:
soup.head.name
Getting a tag's text:
soup.title.text  # the text of the <title> tag
soup.title.string  # also the text of <title>; if the tag contains multiple child nodes, this returns None, because bs4 cannot tell which child's text to use
Getting tag attributes:
soup.p.attrs  # all attributes of the <p> tag, returned as a dict
soup.p['class']  # the value of the <p> tag's class attribute
soup.p.get('class')  # same as above
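To make the lookups above concrete, here is a self-contained sketch (using a trimmed copy of the sample document) that can be run directly:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document above.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.head.title)    # first <title> under <head>
print(soup.title.name)    # the tag name: 'title'
print(soup.title.string)  # the tag's text
print(soup.p.attrs)       # all attributes, as a dict
print(soup.p['class'])    # class is multi-valued, so this is a list
print(soup.p.get('id'))   # get() returns None for a missing attribute
```

Note that `get()` is the safer form: indexing a missing attribute raises KeyError, while `get()` returns None.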
Source:
import requests
from bs4 import BeautifulSoup
html=requests.get('http://www.shinzenith.com')
soup=BeautifulSoup(html.text,'html.parser')
print(soup.title)  # the first <title> node
print(soup.link)  # the first <link> node
print(soup.link.name)  # the node's name
print(soup.link.attrs)  # the node's attributes, as a dict
print(soup.link['rel'])  # the <link> node's rel attribute
print(soup.link.get('rel'))  # same as above
soup.link['rel'] = 'update'  # modify the rel attribute
print(soup.link)
del soup.link['rel']  # delete the rel attribute
print(soup.link)
print(soup.i)  # the first <i> tag
print(soup.i.string)  # the <i> tag's text; returns None if the tag has multiple children
print(type(soup), type(soup.i), type(soup.i.string))  # print the types
Result:
<title>世泽资本</title>
<link href="/resources/project/images/favicon.ico" rel="icon" type="image/x-icon"/>
link
{'href': '/resources/project/images/favicon.ico', 'type': 'image/x-icon', 'rel': ['icon']}
['icon']
['icon']
<link href="/resources/project/images/favicon.ico" rel="update" type="image/x-icon"/>
<link href="/resources/project/images/favicon.ico" type="image/x-icon"/>
<i class="wechat">微信</i>
微信
<class 'bs4.BeautifulSoup'> <class 'bs4.element.Tag'> <class 'bs4.element.NavigableString'>
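One detail worth noting in this output: bs4 treats rel (like class) as a multi-valued attribute, which is why its value prints as the list ['icon'] rather than a plain string. A small illustration:

```python
from bs4 import BeautifulSoup

# rel and class are multi-valued HTML attributes, so bs4 returns them as lists;
# ordinary attributes like type come back as plain strings.
soup = BeautifulSoup(
    '<link rel="icon" type="image/x-icon" href="/favicon.ico"/>',
    'html.parser')

print(soup.link['rel'])            # a list: ['icon']
print(soup.link['type'])           # a plain string: 'image/x-icon'
print(' '.join(soup.link['rel']))  # join the list back into a string if needed
```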
Parsing HTML nodes (structural navigation)
soup.head.contents: the child nodes of <head> (including text nodes), returned as a list
soup.head.children: the same child nodes, but as an iterator
soup.head.descendants: all descendant nodes (children, grandchildren, ...), as a generator
soup.head.parent: the parent node
soup.head.parent.parent: the grandparent node
soup.head.parents: all ancestor nodes, as a generator
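Before walking through each of these, a quick sketch on a tiny document of my own (not the sample below) shows how they differ:

```python
from bs4 import BeautifulSoup

# A tiny document to exercise the navigation attributes listed above.
html = "<html><head><meta charset='utf-8'/><title>t</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(type(soup.head.contents))              # contents is a plain list
print(soup.head.children)                    # children is an iterator over the same nodes
print(len(list(soup.head.descendants)))      # descendants also walks grandchildren (text nodes too)
print(soup.title.parent.name)                # 'head'
print([p.name for p in soup.title.parents])  # every ancestor, up to the document itself
```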
Getting child nodes
from bs4 import BeautifulSoup
import requests
html = """
<html><head><meta charset="utf-8"/><title>The Dormouse's story</title>this is head</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")  # html can be a string of HTML
print(soup.head)
print('head children as a list:', soup.head.contents)
print('head children as an iterator:', soup.head.children)
for i in soup.head.children:
    print('child of head:', i)
Result:
<head><meta charset="utf-8"/><title>The Dormouse's story</title>this is head</head>
head children as an iterator: <list_iterator object at 0x000000000250CDD8>
child of head: <meta charset="utf-8"/>
child of head: <title>The Dormouse's story</title>
child of head: this is head
Descendant nodes
soup = BeautifulSoup(html, 'html.parser')
for i in soup.head.descendants:  # a generator
    print(i)
Result:
<meta charset="utf-8"/>
<title>The Dormouse's story</title>
The Dormouse's story
this is head
Parent nodes
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.parent)
content = soup.head.title.string
print(content.parent.parent)
print(content.parents)
for i in content.parent.parents:
    print(100 * '*')
    print(i)
Sibling nodes:
print(soup.title.next_sibling)
print(soup.title.previous_sibling)
Result:
this is head
<meta charset="utf-8"/>
soup = BeautifulSoup(html, 'html.parser')
print(soup.head.next_siblings)  # a generator
print(soup.title.previous_siblings)
for i in soup.p.next_siblings:
    print(100 * '*')
    print(i)
Next and previous parse elements
print(soup.head.next_element)
print(soup.title.previous_element)
Result:
<meta charset="utf-8"/>
<meta charset="utf-8"/>
for i in soup.head.next_elements:  # a generator
    print(100 * '*')
    print(i)
Text of multiple nodes
soup = BeautifulSoup(html, 'html.parser')
print('text of soup.body:', soup.body.string)  # body contains several nodes, so bs4 cannot pick one and returns None
print(soup.body.strings)  # all descendant strings, as a generator
print(soup.body.stripped_strings)  # same, with blank lines and surrounding whitespace stripped; also a generator
for i in soup.body.stripped_strings:
    print('text under soup.body includes:', i)
print('text of soup.head.title:', soup.head.title.string)
Result:
text of soup.body: None
<generator object _all_strings at 0x0000000003787678>
<generator object stripped_strings at 0x0000000003787678>
text under soup.body includes: aaa
text under soup.body includes: The Dormouse's story
text under soup.body includes: Once upon a time there were three little sisters; and their names were
text under soup.body includes: Lacie
text under soup.body includes: and
text under soup.body includes: Tillie
text under soup.body includes: and they lived at the bottom of a well.
text under soup.body includes: ...
text under soup.body includes: aaa
text of soup.head.title: The Dormouse's story
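As a convenience beyond strings and stripped_strings, the get_text() method joins all descendant text in one call; a small sketch on a toy document:

```python
from bs4 import BeautifulSoup

html = "<body><p>Hello <b>world</b></p><p>bye</p></body>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.body.string)                     # None: <body> has several children
print(soup.body.get_text())                 # every descendant string concatenated as-is
print(soup.body.get_text(' ', strip=True))  # strip each piece, then join with spaces
```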
Parsing HTML nodes (find and find_all)
find and find_all look up nodes by a variety of criteria and return them.
find: returns the first matching node
find_all: returns every matching node, as a list
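A point worth remembering about the return values: when nothing matches, find returns None while find_all returns an empty list. For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a id='x'>one</a></p>", 'html.parser')

print(soup.find('a'))        # first matching node (a Tag)
print(soup.find('div'))      # no match: find returns None
print(soup.find_all('div'))  # no match: find_all returns an empty list
```

So code that indexes into a find result should check for None first.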
Search by tag name:
soup.find('a')
Match tag names with a regular expression:
import re
for tag in soup.find_all(re.compile("^b")):
print(tag.name)
Passing a list:
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all(['head', 'title']):
    print(i.name)
Passing a function:
soup = BeautifulSoup(html, 'html.parser')
def condition(tag):
    return tag.has_attr('class') and tag.has_attr('name')
print(soup.find_all(condition))
Result:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
Search by attribute value:
import re
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(id='link1'))
print(soup.find_all(href=re.compile(r'lacie')))
print(soup.find_all(class_='sister', id='link3'))
print(soup.find_all('a', id='link3'))
print(soup.find_all(attrs={"id": 'link3', 'class': 'sister'}))
Result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Search by text content:
print(soup.find(text="Lacie"))  # search for one string
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))  # search for several strings at once
print(soup.find_all(text=re.compile("Dormouse")))  # search by regular expression
Result:
Lacie
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
These results are text nodes of type <class 'bs4.element.NavigableString'>; to get the enclosing tag node, just take the text node's parent:
print(type(soup.find(text="Lacie")))
print(soup.find(text="Lacie").parent)
Result:
<class 'bs4.element.NavigableString'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
find_all parameters
The limit parameter
soup.find_all("a", limit=2)  # return at most two matches
The recursive parameter
print(soup.html.find_all("b"))  # by default the search covers all descendants
print(soup.html.find_all("b", recursive=False))  # with recursive=False, only direct children are searched
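The recursive parameter can be seen in action on a tiny document of my own (not the sample above):

```python
from bs4 import BeautifulSoup

# <body> has one direct <b> child and one <b> nested inside a <p>.
html = "<body><b>direct child</b><p><b>nested</b></p></body>"
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.body.find_all('b')))                   # 2: searches all descendants
print(len(soup.body.find_all('b', recursive=False)))  # 1: direct children only
```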
CSS selectors
Select by tag name
print(soup.select('title'))  # a bare name selects by tag
Select by class
print(soup.select('.sister'))  # a leading dot selects by class name
Select by id
print(soup.select('#link1'))  # a leading '#' selects by id
Combined selectors
print(soup.select('p #link1'))  # the element with id link1 inside a <p> tag
print(soup.select('head > title'))  # <title> tags that are direct children of <head>
print(soup.select('p a[href="http://example.com/elsie"]'))  # <a> tags under <p> whose href equals the given value
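Two small extras beyond the examples above, shown on a trimmed copy of the sample document: select_one returns the first match directly, and attribute selectors also support substring matching with *=:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p></body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('head > title'))           # child combinator
print(soup.select('p .sister'))              # class selector scoped under <p>
print(soup.select('a[href*="lacie"]'))       # attribute substring match
print(soup.select_one('#link2').get_text())  # select_one returns the first match, not a list
```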