A Simple BeautifulSoup4 Example

sichuanwww

Category: Python. Tags: BeautifulSoup4, web scraping

Article link: https://blog.csdn.net/sichuanpb/article/details/102737715

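I. Obtaining the BeautifulSoup document object
1. Objects
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object.
All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

A Comment object is a special type of NavigableString; its output does not include the comment delimiters.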

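<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
# The content of the <a> tag is actually a comment, but when it is output through .string the comment delimiters have already been removed.
type(soup.a.string) --> <class 'bs4.element.Comment'>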
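
The Comment behaviour above can be checked with a short sketch. It assumes the built-in html.parser backend, and the markup is the sample link with an illustrative comment as its only child.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Sample markup: the <a> tag's only child is an HTML comment.
markup = '<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(markup, "html.parser")

text = soup.a.string

# Comment is a subclass of NavigableString, so both checks hold.
is_comment = isinstance(text, Comment)
is_navigable_string = isinstance(text, NavigableString)

# The <!-- and --> delimiters are gone; only the comment body remains.
print(type(text))
print(repr(text))
```

Because a Comment looks like ordinary text when printed, an isinstance check like the one above is a reasonable way to tell the two apart.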


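You can pass in either a string or an open file handle.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")  # from a file handle
soup = BeautifulSoup("<html>data</html>", "html.parser")  # or from a string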
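
As a runnable sketch of both ways of building the soup: the file name index.html and the markup are placeholders, and the file is written to a temporary directory only so the example is self-contained.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"

# 1) Build the soup directly from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Build the soup from an open file handle.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")  # placeholder file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Naming the parser explicitly ("html.parser" here) keeps the result the same across machines that may have different parsers installed.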

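II. Tag object attributes
name
attrs: returns a dictionary such as {'class': ['sister']} (class is multi-valued, so its value is a list); the entries can be read and modified
    def __getitem__(self, key): return self.attrs[key]
    def has_attr(self, key): return key in self.attrs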
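
A minimal sketch of name and attrs in use, assuming html.parser and a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

markup = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

tag_name = tag.name              # the tag's name
original_attrs = dict(tag.attrs) # snapshot of the attribute dictionary
href = tag["href"]               # __getitem__ reads from attrs
has_id = tag.has_attr("id")      # membership test on attrs

# Attributes can be changed or removed like dictionary entries.
tag["id"] = "first-link"
del tag["href"]

print(tag_name, original_attrs)
print(tag)
```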
contents
    Note that whitespace and newlines are also included in contents.
    def __len__(self): return len(self.contents)
    def __contains__(self, x): return x in self.contents
    def __iter__(self): return iter(self.contents)
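string
    if len(self.contents) != 1:  # .string is None unless the tag has exactly one child
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):  # the only child is a NavigableString: return it directly
        return child
    return child.string  # otherwise fall back to the child tag's own string attribute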
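
The rules for contents and string can be seen on a small hypothetical fragment (html.parser assumed). Note how the newlines between tags count as children.

```python
from bs4 import BeautifulSoup

html = "<div>\n<p><b>bold</b></p>\n<p>one<i>two</i></p>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div
first_p, second_p = div.find_all("p")

# contents holds the two <p> tags plus the three newline strings around them.
child_count = len(div)            # __len__ delegates to contents
contains_first = first_p in div   # __contains__ checks contents
newline_children = [c for c in div if c == "\n"]  # __iter__ walks contents

# A single child tag: .string recurses into <b>.
first_string = first_p.string
# Two children ("one" and <i>): .string is None.
second_string = second_p.string

print(child_count, first_string, second_string)
```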
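children
    Equivalent to iter(self.contents).
next_element and previous_element
    As the names suggest, the element parsed immediately after and immediately before the current one.
next_elements and previous_elements
    Iterators over the elements parsed after and before the current Tag.
descendants
    An iterator that starts at contents[0] and follows .next_element through the tag's last descendant (the element at .next_sibling.previous_element when a next sibling exists).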
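strings and stripped_strings
    Iterators, both built on the loop: for descendant in self.descendants:
    strings --> walks the descendants attribute and yields every NavigableString object
    stripped_strings --> walks the descendants attribute and yields, stripped, every NavigableString for which len(descendant.strip()) != 0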
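
The iterators listed above are easier to compare side by side. This sketch assumes html.parser and a hypothetical paragraph with nested tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big <i>wide</i></b> world</p>", "html.parser")
p = soup.p

# Direct children only: 'Hello ', <b>...</b>, ' world'
children = list(p.children)

# Every node below <p>, in parse order:
# 'Hello ', <b>, 'big ', <i>, 'wide', ' world'
descendants = list(p.descendants)

# next_element follows parse order, so it steps *into* <b>.
after_b = p.b.next_element
before_b = p.b.previous_element

# Text nodes only, raw and with whitespace stripped.
strings = list(p.strings)
stripped = list(p.stripped_strings)

print(strings)
print(stripped)
```

The difference between next_element and next_sibling shows up on <b>: its next element is its own first child, while its next sibling is the trailing ' world' string.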
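parent and parents
    parent: the element's parent node
    parents: an iterator that keeps following the .parent attribute
next_sibling and previous_sibling
    Look up sibling nodes.
    In real documents, a tag's .next_sibling and .previous_sibling are usually strings or whitespace.
    The next_siblings and previous_siblings attributes iterate over the current node's siblings.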
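
A short sketch of upward and sideways navigation, assuming html.parser and a hypothetical list. It also shows why siblings are often whitespace strings rather than tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><body><ul>\n"
    "<li id='a'>A</li>\n"
    "<li id='b'>B</li>\n"
    "<li id='c'>C</li>\n"
    "</ul></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

parent_name = b.parent.name
# parents climbs all the way to the BeautifulSoup object itself.
parent_names = [p.name for p in b.parents]

# The immediate sibling is the newline between the <li> tags.
raw_next = b.next_sibling

# Filter the sibling iterators down to real tags.
next_tags = [s for s in b.next_siblings if isinstance(s, Tag)]
prev_tags = [s for s in b.previous_siblings if isinstance(s, Tag)]

print(parent_names)
print(repr(raw_next), next_tags, prev_tags)
```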

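III. Getting Tag objects through the API
A Tag may contain several strings or other Tags; all of these are children of that Tag.
1. Looking up by tag name (result type: bs4.element.Tag)
tag_p = soup.p  # returns the first <p> Tag in the HTML document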

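2. Using find and find_all
find(name=None, attrs={}, recursive=True, text=None, **kwargs)
# internally calls find_all with limit=1 and returns the first matching Tag, or None if nothing matches
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
# recursive=True (the default) searches all descendants of the current tag; recursive=False searches direct children only
# limit caps the number of results returned
# in Beautiful Soup 4.4.0 and later, the text parameter is also available under the name string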
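
The effect of recursive and limit, and the relationship between soup.p, find, and find_all, can be sketched as follows. The markup is a hypothetical fragment and html.parser is assumed.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><section><p>two</p></section><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

first = div.find("p")                          # first match only
all_p = div.find_all("p")                      # every <p> below <div>
direct_p = div.find_all("p", recursive=False)  # direct children only
limited = div.find_all("p", limit=2)           # stop after two matches
missing = div.find("table")                    # no match

# Attribute-style access is a shortcut for find() with that tag name.
shortcut = soup.p

print([p.string for p in all_p])
print([p.string for p in direct_p])
```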

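# name selects tags by tag name. A keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. Because class is a reserved word in Python, write it as class_.
soup.find_all("title")
soup.find_all(class_="tb")
soup.find_all(id="link2")

# These two calls are equivalent:
soup.find_all("a", class_="sister")
soup.find_all("a", attrs={"class": "sister"})
# class_ also accepts the other filter types: a string, a regular expression, a function, or True
import re
soup.find_all(class_=re.compile("itl"))
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
soup.find_all(class_=has_six_characters)  # tags that have a class exactly six characters long
# The text parameter searches the strings in the document; it accepts a string, a regular expression, a list, a function, or True
soup.find_all(text=re.compile("Dormouse"))
soup.find_all("a", text="Elsie")  # <a> tags whose .string is "Elsie"
soup.find_all(text="Elsie")  # returns a list of the matching NavigableString objects

# In effect, find_all is an iterative search over .descendants (over .children when recursive=False)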
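
To make the filters above concrete, here is a sketch run against a small sample document in the style of the Beautiful Soup documentation. The document itself is an assumption, since the article does not show the one it queries. The sketch uses string=, the newer name for the text= argument (Beautiful Soup 4.4.0 and later).

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find_all(id="link2")
by_class = soup.find_all("a", class_="sister")
by_attrs = soup.find_all("a", attrs={"class": "sister"})
by_regex = soup.find_all(class_=re.compile("itl"))  # matches class "title"


def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6


six = soup.find_all(class_=has_six_characters)  # "sister" has six characters

dormouse = soup.find_all(string=re.compile("Dormouse"))  # strings, not tags
a_elsie = soup.find_all("a", string="Elsie")             # tags
elsie = soup.find_all(string="Elsie")                    # strings

print(by_regex)
print(dormouse)
```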
3. Calling soup() directly
Calling the object invokes find_all internally.
def __call__(self, *args, **kwargs): return self.find_all(*args, **kwargs)
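
A brief sketch of the call shortcut, on a hypothetical fragment with html.parser assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>1</a><a>2</a></p>", "html.parser")

called = soup("a")            # same as soup.find_all("a")
explicit = soup.find_all("a")

# Tags are callable too, and the arguments pass straight through.
first_only = soup.p("a", limit=1)

print(called, first_only)
```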
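4. Using find_parents() and find_parent()
find_parent(self, name=None, attrs={}, **kwargs)  # returns the nearest matching ancestor of the current node
find_parents(self, name=None, attrs={}, limit=None, **kwargs)  # returns all matching ancestors: parent, grandparent, and so on
# In effect, an iterative search over the .parents attribute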
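
The two methods differ only in how many ancestors they return, as this sketch on a hypothetical nested fragment shows (html.parser assumed):

```python
from bs4 import BeautifulSoup

html = "<div id='outer'><div id='inner'><p><a>link</a></p></div></div>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

nearest_div = a.find_parent("div")   # closest matching ancestor
all_divs = a.find_parents("div")     # nearest first, then outward
no_table = a.find_parent("table")    # no such ancestor

print(nearest_div["id"], [d["id"] for d in all_divs])
```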
5. Using find_previous_siblings() and find_previous_sibling():  # the former returns a list, the latter a single node
find_previous_siblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
# In effect, an iterative search over the .previous_siblings attribute
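
A sketch of the sibling search on a hypothetical list (html.parser assumed). Results come back nearest first, following the order of .previous_siblings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>A</li><li>B</li><li>C</li></ul>", "html.parser")
last = soup.find_all("li")[2]

prev_one = last.find_previous_sibling("li")   # single node
prev_all = last.find_previous_siblings("li")  # list, nearest first

print(prev_one.string, [li.string for li in prev_all])
```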
6. Using find_all_next() and find_next():
find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs)
# In effect, an iterative search over the .next_elements attribute
7. Using find_all_previous() and find_previous():
find_all_previous(self, name=None, attrs={}, text=None, limit=None, **kwargs)
# In effect, an iterative search over the .previous_elements attribute
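
Because these methods follow parse order rather than the tree hierarchy, they can cross tag boundaries. The sketch below uses a hypothetical two-paragraph fragment, assumes html.parser, and uses string=True (the newer spelling of text=True) to collect text nodes.

```python
from bs4 import BeautifulSoup

html = "<p>one <b>two</b></p><p>three <b>four</b></p>"
soup = BeautifulSoup(html, "html.parser")
first_b, second_b = soup.find_all("b")
first_p, second_p = soup.find_all("p")

# Forward from the first <b>: its own text, then the whole second paragraph.
next_p = first_b.find_next("p")
later_b = first_b.find_all_next("b")
later_strings = first_b.find_all_next(string=True)

# Backward from the second <b>: its enclosing <p> comes first.
prev_p = second_b.find_previous("p")
earlier_p = second_b.find_all_previous("p")

print(later_strings)
print(earlier_p)
```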

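IV. Parsing only part of a document
Parsing an entire document wastes memory and time when only a small part of it is needed. The fastest approach is to ignore everything except the wanted content from the very start.
The SoupStrainer class specifies which parts of a document to keep, so a search does not have to wait for the whole document to be parsed first.
Three SoupStrainer objects: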
from bs4 import SoupStrainer
only_a_tags = SoupStrainer("a")  # ignore everything except <a> tags

only_tags_with_id_link2 = SoupStrainer(id="link2")  # ignore everything except tags with id="link2"

def is_short_string(string):
    return string is not None and len(string) < 10
only_short_strings = SoupStrainer(text=is_short_string)  # ignore strings whose length is 10 or more

Pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument:
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
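
The call above relies on an html_doc that the article does not define. The sketch below supplies a small sample document (an assumption, modelled on the Beautiful Soup documentation) and checks what survives parsing with the tag-name and id strainers.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document; substitute your own markup.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""

only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")

a_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)
link2_soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2)

# Only the <a> tags were kept; <p> and <title> never entered the tree.
kept_links = [a["id"] for a in a_soup.find_all("a")]
kept_paragraph = a_soup.find("p")

# Only the single tag with id="link2" was kept.
kept_by_id = link2_soup.find_all(True)

print(a_soup.prettify())
```

Note that the strained soup is not a complete document: the kept tags sit directly under the BeautifulSoup object, without their original parents.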

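# Complete example: download a page and keep only its <a> tags
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import urllib.request

url = "http://www.xhu.edu.cn"
# read() returns the raw bytes of the response body
with urllib.request.urlopen(url) as r:
    demo = r.read()

# Parse only the <a> tags, then print the resulting tree
only_tag = SoupStrainer("a")
soup = BeautifulSoup(demo, "html.parser", parse_only=only_tag)
print(soup.prettify())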