Python Web Crawler Study Notes (Part 2)

柠檬汽水橘子汁

Article link: https://blog.csdn.net/sinat_39665351/article/details/105231346

Course video:
"Python Web Crawling and Information Extraction", MOOC, Beijing Institute of Technology

The BeautifulSoup library

HTML document <=> tag tree <=> BeautifulSoup class

Example:

from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup('<p>data</p>', 'html.parser')

Beautiful Soup parsers

Parser	Usage	Requirement
bs4 HTML parser	BeautifulSoup(mk, 'html.parser')	Install bs4 (pip install beautifulsoup4)
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
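
Which parser is available depends on what is installed, so a script may want to fall back gracefully. The following is a minimal sketch of that idea, not part of the course material: the helper name `make_soup` and the preference order are assumptions made for illustration. It relies on bs4 raising `FeatureNotFound` when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib")):
    """Parse markup with the first installed parser, else use html.parser.

    make_soup is a hypothetical helper; the preference order is an assumption.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    # html.parser ships with Python, so it is always available
    return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, used = make_soup("<p>data</p>")
print(used)           # depends on which parsers are installed
print(soup.p.string)  # data
```

Note that different parsers may build slightly different trees from the same markup (for example, lxml and html5lib wrap a fragment in html and body tags), so it is usually safer to pick one parser and use it consistently.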

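Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag
Name	The tag's name, accessed as .name
Attributes	The tag's attributes, accessed as .attrs
NavigableString	The non-attribute string inside a tag, accessed as .string
Comment	The comment part of the string inside a tag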
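
The elements in the table can be tried offline on a small inline document. This is only an illustrative sketch; the markup below is made up for the example and is not the course's demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up markup for illustration
html = '<p class="title" id="intro"><b>Hello</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(type(tag) is Tag)   # True
print(tag.name)           # p
print(tag.attrs)          # {'class': ['title'], 'id': 'intro'}
print(tag.b.string)       # Hello
print(isinstance(tag.b.string, NavigableString))  # True
```

The `class` attribute comes back as a list because HTML allows several class names on one tag, while `id` is a plain string.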

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> print(soup.prettify())
>>> tag = soup.a

>>> tag.attrs  # get the tag's attributes (a dict)
>>> tag.attrs['class']  # get the value of the class attribute
>>> tag.attrs['href']  # get the tag's link (href) attribute
>>> type(tag.attrs)  # type of the attributes container
>>> type(tag)  # type of the tag itself
>>> soup.a
>>> soup.a.string
>>> type(soup.a.string)
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # make a fresh pot of "soup"
>>> newsoup.b.string
>>> type(newsoup.b.string)
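
Because `.string` returns a comment's text without the `<!-- -->` markers, a comment and ordinary text look the same when printed. One way to tell them apart is a type check, sketched below on the same `newsoup` markup; the helper name `is_comment` is an assumption made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
newsoup = BeautifulSoup(markup, "html.parser")


def is_comment(node):
    """Return True if node is an HTML comment (hypothetical helper)."""
    return isinstance(node, Comment)


print(newsoup.b.string)              # This is a comment
print(is_comment(newsoup.b.string))  # True
print(is_comment(newsoup.p.string))  # False
```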

>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.head  # view the head tag
>>> soup.head.contents  # child nodes of the head tag; .contents returns a list
>>> len(soup.body.contents)  # number of child nodes of the body tag
>>> soup.body.contents[1]

Downward traversal of the tag tree: .contents, .children, .descendants

Iterating over child nodes

>>> for child in soup.body.children:
...     print(child)

Iterating over descendant nodes

>>> for child in soup.body.descendants:
...     print(child)
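
The difference between direct children and all descendants is easier to see on a tiny document. The sketch below uses made-up inline markup rather than the demo page, and counts what each of the three attributes yields.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup for illustration
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a list; .children is an iterator over the same direct children
print(len(body.contents))        # 2
print(len(list(body.children)))  # 2

# .descendants walks the whole subtree, text nodes included
print(len(list(body.descendants)))  # 6

# Keep only tags to skip the text nodes
tag_names = [d.name for d in body.descendants if isinstance(d, Tag)]
print(tag_names)  # ['p', 'b', 'p']
```

On a real page such as the demo document, the newlines between tags also count as text nodes, which is why `len(soup.body.contents)` is often larger than the number of visible tags.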

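Upward traversal of the tag tree: .parent is the node's parent tag; .parents iterates over all of the node's ancestor tags

Upward traversal code:

>>> soup = BeautifulSoup(demo, "html.parser")
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)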
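
To see what the upward walk yields without fetching the demo page, here is a small sketch on made-up inline markup. It collects the ancestor names of the first a tag.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)  # p

# .parents walks from the nearest ancestor up to the BeautifulSoup object
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the root, so it has no parent
print(soup.parent)  # None
```

The last ancestor is the BeautifulSoup object itself, whose name is `[document]`. Its own `.parent` is `None`, which is the case a `None` check guards against when walking upward by hand.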

Sibling traversal of the tag tree (between child nodes of the same parent): .next_sibling, .previous_sibling, .next_siblings, .previous_siblings

>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.a.next_sibling
>>> soup.a.next_sibling.next_sibling
>>> soup.a.previous_sibling
>>> soup.a.previous_sibling.previous_sibling

Iterating over following siblings

>>> for sibling in soup.a.next_siblings:
...     print(sibling)

Iterating over preceding siblings

>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
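
Siblings are not always tags: the text between two tags is a sibling node too. The sketch below, on made-up inline markup, shows this and filters the siblings down to tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up markup for illustration
html = "<p><a id='1'>A</a> and <a id='2'>B</a> end</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
# The node right after the first <a> is the text ' and ', not a tag
print(isinstance(first.next_sibling, NavigableString))  # True
print(isinstance(first.next_sibling, Tag))              # False

# Keep only tag siblings that follow the first <a>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
next_ids = [t["id"] for t in tag_siblings]
print(next_ids)  # ['2']

# previous_siblings walks backwards, nearest node first
second = tag_siblings[0]
prev_ids = [s["id"] for s in second.previous_siblings if isinstance(s, Tag)]
print(prev_ids)  # ['1']
```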
