The bs4 package in Python parses web page source; crawler programs commonly use it to analyze the pages they fetch. This post covers its basic usage.
First install bs4: pip install beautifulsoup4 (the PyPI package name for bs4)
Creating a BeautifulSoup object
To parse page source, first create a BeautifulSoup object:
import requests
from bs4 import BeautifulSoup
html=requests.get('http://www.baidu.com')
html.encoding=html.apparent_encoding
soup=BeautifulSoup(html.text,'html.parser')
print(type(soup))
print(soup.prettify())  # pretty-print the page source
The prettified page source is printed (original screenshot omitted).
Parsing HTML nodes (part 1)
Parse this sample document:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Getting tags:
soup.head  # the first <head> tag
soup.head.title  # the first <title> tag under <head>
Getting a tag's name:
soup.head.name
Getting a tag's text:
soup.title.text  # the text of the <title> tag
soup.title.string  # also the text of <title>; if the tag contains multiple child nodes, this returns None, because bs4 cannot tell which child's text to use
Getting tag attributes:
soup.p.attrs  # all attributes of the <p> tag, returned as a dict
soup.p['class']  # the value of the <p> tag's class attribute
soup.p.get('class')  # same as above
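To make the lookups above concrete, here is a self-contained sketch (using a trimmed copy of the sample document) that can be run directly:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document above.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.head.title)    # first <title> under <head>
print(soup.title.name)    # the tag name: 'title'
print(soup.title.string)  # the tag's text
print(soup.p.attrs)       # all attributes, as a dict
print(soup.p['class'])    # class is multi-valued, so this is a list
print(soup.p.get('id'))   # get() returns None for a missing attribute
```

Note that `get()` is the safer form: indexing a missing attribute raises KeyError, while `get()` returns None.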
Source:
import requests
from bs4 import BeautifulSoup
html=requests.get('http://www.shinzenith.com')
soup=BeautifulSoup(html.text,'html.parser')
print(soup.title)  # the first <title> node
print(soup.link)  # the first <link> node
print(soup.link.name)  # the node's name
print(soup.link.attrs)  # the node's attributes, as a dict
print(soup.link['rel'])  # the <link> node's rel attribute
print(soup.link.get('rel'))  # same as above
soup.link['rel'] = 'update'  # modify the rel attribute
print(soup.link)
del soup.link['rel']  # delete the rel attribute
print(soup.link)
print(soup.i)  # the first <i> tag
print(soup.i.string)  # the <i> tag's text; returns None if the tag has multiple children
print(type(soup), type(soup.i), type(soup.i.string))  # print the types
Result:
<title>世泽资本</title>
<link href="/resources/project/images/favicon.ico" rel="icon" type="image/x-icon"/>
link
{'href': '/resources/project/images/favicon.ico', 'type': 'image/x-icon', 'rel': ['icon']}
['icon']
['icon']
<link href="/resources/project/images/favicon.ico" rel="update" type="image/x-icon"/>
<link href="/resources/project/images/favicon.ico" type="image/x-icon"/>
<i class="wechat">微信</i>
微信
<class 'bs4.BeautifulSoup'> <class 'bs4.element.Tag'> <class 'bs4.element.NavigableString'>
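One detail worth noting in this output: bs4 treats rel (like class) as a multi-valued attribute, which is why its value prints as the list ['icon'] rather than a plain string. A small illustration:

```python
from bs4 import BeautifulSoup

# rel and class are multi-valued HTML attributes, so bs4 returns them as lists;
# ordinary attributes like type come back as plain strings.
soup = BeautifulSoup(
    '<link rel="icon" type="image/x-icon" href="/favicon.ico"/>',
    'html.parser')

print(soup.link['rel'])            # a list: ['icon']
print(soup.link['type'])           # a plain string: 'image/x-icon'
print(' '.join(soup.link['rel']))  # join the list back into a string if needed
```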
Parsing HTML nodes (structural navigation)
soup.head.contents: the child nodes of <head> (including text nodes), returned as a list
soup.head.children: the same child nodes, but as an iterator
soup.head.descendants: all descendant nodes (children, grandchildren, ...), as a generator
soup.head.parent: the parent node
soup.head.parent.parent: the grandparent node
soup.head.parents: all ancestor nodes, as a generator
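Before walking through each of these, a quick sketch on a tiny document of my own (not the sample below) shows how they differ:

```python
from bs4 import BeautifulSoup

# A tiny document to exercise the navigation attributes listed above.
html = "<html><head><meta charset='utf-8'/><title>t</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(type(soup.head.contents))              # contents is a plain list
print(soup.head.children)                    # children is an iterator over the same nodes
print(len(list(soup.head.descendants)))      # descendants also walks grandchildren (text nodes too)
print(soup.title.parent.name)                # 'head'
print([p.name for p in soup.title.parents])  # every ancestor, up to the document itself
```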
Getting child nodes
from bs4 import BeautifulSoup
import requests
html = """
<html><head><meta charset="utf-8"/><title>The Dormouse's story</title>this is head</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")  # html can be a string of HTML
print(soup.head)
print('head children as a list:', soup.head.contents)
print('head children as an iterator:', soup.head.children)
for i in soup.head.children:
    print('child of head:', i)
Result:
<head><meta charset="utf-8"/><title>The Dormouse's story</title>this is head</head>
head children as an iterator: <list_iterator object at 0x000000000250CDD8>
child of head: <meta charset="utf-8"/>
child of head: <title>The Dormouse's story</title>
child of head: this is head
Descendant nodes
soup = BeautifulSoup(html, 'html.parser')
for i in soup.head.descendants:  # a generator
    print(i)
Result:
<meta charset="utf-8"/>
<title>The Dormouse's story</title>
The Dormouse's story
this is head
Parent nodes
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.parent)
content = soup.head.title.string
print(content.parent.parent)
print(content.parents)
for i in content.parent.parents:
    print(100 * '*')
    print(i)
Sibling nodes:
print(soup.title.next_sibling)
print(soup.title.previous_sibling)
Result:
this is head
<meta charset="utf-8"/>
soup = BeautifulSoup(html, 'html.parser')
print(soup.head.next_siblings)  # a generator
print(soup.title.previous_siblings)
for i in soup.p.next_siblings:
    print(100 * '*')
    print(i)
Next and previous parse elements
print(soup.head.next_element)
print(soup.title.previous_element)
Result:
<meta charset="utf-8"/>
<meta charset="utf-8"/>
for i in soup.head.next_elements:  # a generator
    print(100 * '*')
    print(i)
Text of multiple nodes
soup = BeautifulSoup(html, 'html.parser')
print('text of soup.body:', soup.body.string)  # body contains several nodes, so bs4 cannot pick one and returns None
print(soup.body.strings)  # all descendant strings, as a generator
print(soup.body.stripped_strings)  # same, with blank lines and surrounding whitespace stripped; also a generator
for i in soup.body.stripped_strings:
    print('text under soup.body includes:', i)
print('text of soup.head.title:', soup.head.title.string)
Result:
text of soup.body: None
<generator object _all_strings at 0x0000000003787678>
<generator object stripped_strings at 0x0000000003787678>
text under soup.body includes: aaa
text under soup.body includes: The Dormouse's story
text under soup.body includes: Once upon a time there were three little sisters; and their names were
text under soup.body includes: Lacie
text under soup.body includes: and
text under soup.body includes: Tillie
text under soup.body includes: and they lived at the bottom of a well.
text under soup.body includes: ...
text under soup.body includes: aaa
text of soup.head.title: The Dormouse's story
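As a convenience beyond strings and stripped_strings, the get_text() method joins all descendant text in one call; a small sketch on a toy document:

```python
from bs4 import BeautifulSoup

html = "<body><p>Hello <b>world</b></p><p>bye</p></body>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.body.string)                     # None: <body> has several children
print(soup.body.get_text())                 # every descendant string concatenated as-is
print(soup.body.get_text(' ', strip=True))  # strip each piece, then join with spaces
```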
Parsing HTML nodes (find and find_all)
find and find_all look up nodes by a variety of criteria and return them.
find: returns the first matching node
find_all: returns every matching node, as a list
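A point worth remembering about the return values: when nothing matches, find returns None while find_all returns an empty list. For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a id='x'>one</a></p>", 'html.parser')

print(soup.find('a'))        # first matching node (a Tag)
print(soup.find('div'))      # no match: find returns None
print(soup.find_all('div'))  # no match: find_all returns an empty list
```

So code that indexes into a find result should check for None first.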
Search by tag name:
soup.find('a')
Match tag names with a regular expression:
import re
for tag in soup.find_all(re.compile("^b")):
print(tag.name)
Passing a list:
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all(['head', 'title']):
    print(i.name)
Passing a function:
soup = BeautifulSoup(html, 'html.parser')
def condition(tag):
    return tag.has_attr('class') and tag.has_attr('name')
print(soup.find_all(condition))
Result:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
Search by attribute value:
import re
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(id='link1'))
print(soup.find_all(href=re.compile(r'lacie')))
print(soup.find_all(class_='sister', id='link3'))
print(soup.find_all('a', id='link3'))
print(soup.find_all(attrs={"id": 'link3', 'class': 'sister'}))
Result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Search by text content:
print(soup.find(text="Lacie"))  # search for one string
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))  # search for several strings at once
print(soup.find_all(text=re.compile("Dormouse")))  # search by regular expression
Result:
Lacie
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
These results are text nodes of type <class 'bs4.element.NavigableString'>; to get the enclosing tag node, just take the text node's parent:
print(type(soup.find(text="Lacie")))
print(soup.find(text="Lacie").parent)
Result:
<class 'bs4.element.NavigableString'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
find_all parameters
The limit parameter
soup.find_all("a", limit=2)  # return at most two matches
The recursive parameter
print(soup.html.find_all("b"))  # by default the search covers all descendants
print(soup.html.find_all("b", recursive=False))  # with recursive=False, only direct children are searched
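The recursive parameter can be seen in action on a tiny document of my own (not the sample above):

```python
from bs4 import BeautifulSoup

# <body> has one direct <b> child and one <b> nested inside a <p>.
html = "<body><b>direct child</b><p><b>nested</b></p></body>"
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.body.find_all('b')))                   # 2: searches all descendants
print(len(soup.body.find_all('b', recursive=False)))  # 1: direct children only
```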
CSS selectors
Select by tag name
print(soup.select('title'))  # a bare name selects by tag
Select by class
print(soup.select('.sister'))  # a leading dot selects by class name
Select by id
print(soup.select('#link1'))  # a leading '#' selects by id
Combined selectors
print(soup.select('p #link1'))  # the element with id link1 inside a <p> tag
print(soup.select('head > title'))  # <title> tags that are direct children of <head>
print(soup.select('p a[href="http://example.com/elsie"]'))  # <a> tags under <p> whose href equals the given value
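Two small extras beyond the examples above, shown on a trimmed copy of the sample document: select_one returns the first match directly, and attribute selectors also support substring matching with *=:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p></body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('head > title'))           # child combinator
print(soup.select('p .sister'))              # class selector scoped under <p>
print(soup.select('a[href*="lacie"]'))       # attribute substring match
print(soup.select_one('#link2').get_text())  # select_one returns the first match, not a list
```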