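Python Web Scraping Study Notes (Part 2)
Course video:
"Python Web Crawler and Information Extraction" (Python网络爬虫与信息提取), MOOC, Beijing Institute of Technology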
- BeautifulSoup library
HTML document <=> tag tree <=> BeautifulSoup class
Example:
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup('data', 'html.parser')
- Beautiful Soup parsers
Parser | Usage | Requirement |
---|---|---|
bs4's HTML parser | BeautifulSoup(mk, 'html.parser') | install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk, 'lxml') | pip install lxml |
lxml's XML parser | BeautifulSoup(mk, 'xml') | pip install lxml |
html5lib's parser | BeautifulSoup(mk, 'html5lib') | pip install html5lib |
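
Because three of the four parsers in the table need an extra install, it can help to fall back to whichever one is available. The sketch below is only an illustration: `make_soup` and its `preferred` order are hypothetical names introduced here, not part of bs4. It relies on bs4 raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The library backing this parser is missing; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Since 'html.parser' ships with Python, the last entry should always succeed once bs4 itself is installed.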
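- Basic elements of the BeautifulSoup class
Basic element | Description |
---|---|
Tag | A tag |
Name | The tag's name; format: .name |
Attributes | The tag's attributes; format: .attrs |
NavigableString | The non-attribute string inside a tag; format: .string |
Comment | The comment part of a string inside a tag |
- Sibling (parallel) traversal of the tag tree: takes place among the children of the same parent node
.next_sibling
.previous_sibling
.next_siblings
.previous_siblings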
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
>>> tag = soup.a
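>>> tag.attrs  # get the tag's attributes
>>> tag.attrs['class']  # get the value of the class attribute
>>> tag.attrs['href']  # get the tag's link (href) attribute
>>> type(tag.attrs)  # type of the tag's attributes
>>> type(tag)  # type of the tag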
>>> soup.a
>>> soup.a.string
>>> type(soup.a.string)
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # make a fresh pot of "soup"
>>> newsoup.b.string
>>> type(newsoup.b.string)
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.head  # inspect the head tag
>>> soup.head.contents  # child nodes of the head tag; ".contents" returns a list
>>> len(soup.body.contents)  # number of child nodes of the body tag
>>> soup.body.contents[1]
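
In the session above, `newsoup.b.string` prints the comment text without its `<!-- -->` markers, so a comment and ordinary text look alike unless their types are checked. A minimal sketch of that check follows; `describe` is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
newsoup = BeautifulSoup(markup, "html.parser")


def describe(node):
    """Hypothetical helper: label a .string value as comment, text, or none."""
    # Comment is a subclass of NavigableString, so it must be tested first.
    if isinstance(node, Comment):
        return "comment"
    if isinstance(node, NavigableString):
        return "text"
    return "none"


b_kind = describe(newsoup.b.string)
p_kind = describe(newsoup.p.string)
print(b_kind, p_kind)
```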
- Downward traversal of the tag tree:
.contents, .children, .descendants
- Iterate over child nodes
>>> for child in soup.body.children: print(child)
- Iterate over descendant nodes
>>> for child in soup.body.descendants: print(child)
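
To make the difference between the three downward attributes concrete, here is a small sketch on an inline HTML snippet. The snippet is an assumption chosen for illustration, not the course's demo page. Text nodes are filtered out with `isinstance(..., Tag)` so that only tag names are compared.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: body has two <p> children, the first wraps a <b>.
html = "<body><p><b>bold</b></p><p>plain</p></body>"
soup = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children.
direct_count = len(soup.body.contents)

# .children iterates over the same direct children.
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks every nested level, in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(direct_count, child_names, descendant_names)
```

Only `.descendants` reaches the nested `<b>` tag; the other two stop at the direct children.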
- Upward traversal of the tag tree:
.parent: the node's parent tag
.parents: all of the node's ancestor tags
- Upward traversal code:
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
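
The upward walk can also be tried on a small inline document (an assumed snippet, not the course's demo page). As a sketch: the last ancestor yielded is the BeautifulSoup object itself, whose name is "[document]", and the `.parent` of that object is None.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a single <a> nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect ancestor names from the nearest parent up to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The root object has no parent of its own.
print(soup.parent)
```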
- Sibling (parallel) traversal of the tag tree (takes place among the children of the same parent node):
.next_sibling, .previous_sibling
.next_siblings, .previous_siblings
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.a.next_sibling
>>> soup.a.next_sibling.next_sibling
>>> soup.a.previous_sibling
>>> soup.a.previous_sibling.previous_sibling
- Iterate over following siblings
>>> for sibling in soup.a.next_siblings: print(sibling)
- Iterate over preceding siblings
>>> for sibling in soup.a.previous_siblings: print(sibling)
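
One thing to watch in sibling traversal is that siblings are not always tags: the text between two tags is a NavigableString node and appears in the sequence too. The sketch below uses an assumed inline snippet to show this and filters down to Tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: two <a> tags separated by plain text inside one <p>.
html = (
    '<p>Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Everything after the first <a> under the same parent, text nodes included.
after = list(first.next_siblings)

# Keep only the siblings that are tags.
tag_ids = [s["id"] for s in after if isinstance(s, Tag)]

# Everything before the first <a>, nearest sibling first.
before = list(first.previous_siblings)

print(after, tag_ids, before)
```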