Task Two. Web Scraping Study

Latest recommended article published on 2024-03-21 17:05:36

domerose

Article link: https://blog.csdn.net/weixin_42964914/article/details/105709155

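2.1 Getting started with the Beautiful Soup library

1. Beautiful Soup basics
2. Parsing HTML pages with Beautiful Soup
	* Beautiful Soup is an HTML/XML parser.
	* How it works: it is DOM-based and loads the whole document before parsing.
	* Pros: parsing HTML is simple and the API is friendly; it supports CSS selectors, the parser in the Python standard library, and the HTML and XML parsers from lxml.
	* Cons: because the whole document is loaded, time and memory overhead are relatively high. Locating key resources is less efficient than with regular expressions or XPath, so it is generally not recommended for performance-sensitive scraping.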
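
To make the points about parsers and CSS selectors concrete, here is a minimal sketch. The HTML snippet, the `news` id, and the `item` class are invented for illustration and do not come from any real page; `html.parser` is the standard-library parser, chosen so that the example needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this example
html = """
<html><body>
  <div id="news">
    <a class="item" href="/a">First</a>
    <a class="item" href="/b">Second</a>
    <a href="/c">Third</a>
  </div>
</body></html>
"""

# "html.parser" is the parser from Python's standard library;
# "lxml" or "xml" could be passed instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
items = soup.select("div#news a.item")
links = [a["href"] for a in items]
titles = [a.get_text() for a in items]

print(links)
print(titles)
```

Only the two links carrying the `item` class match the selector, so the third link is left out.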
	
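3. Working with Beautiful Soup
	The library's main job is to parse, traverse, and maintain the DOM (the tag tree).
    soup = BeautifulSoup(html, 'html.parser')
	Basic BeautifulSoup elements, taking <a> as an example:
		* Tag: a tag, soup.a
		* Name: the tag's name, soup.a.name
		* Attributes: the tag's attributes, soup.a.attrs
		* NavigableString: the non-attribute string inside a tag, soup.a.string
		* Comment: the comment part of the string inside a tag
	The .prettify() method inserts \n (and indentation) into the text so that the HTML is easier to read level by level. It can also be called on a single tag.
	Beautiful Soup converts incoming HTML to Unicode and outputs UTF-8 by default.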
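
The element types listed above can be inspected directly. The following is a small sketch on a made-up one-line document (the URL and class name are placeholders), not output from a real site.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup, invented for this example
html = '<p><a href="http://example.com/" class="link">Example</a><b><!-- a note --></b></p>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a              # Tag: the first <a> in the document
tag_name = a.name       # Name: the string "a"
attrs = a.attrs         # Attributes: a dict; "class" is multi-valued, so it is a list
text = a.string         # NavigableString: the text inside the tag
note = soup.b.string    # Comment: a special kind of NavigableString

pretty = soup.prettify()    # str with one element per line, indented by depth
raw = soup.encode("utf-8")  # bytes, encoded as UTF-8

print(type(a).__name__, tag_name, attrs)
print(type(text).__name__, type(note).__name__)
print(pretty)
```

Because `Comment` is a subclass of `NavigableString`, code that needs only the visible text may want to check for `Comment` explicitly and skip it.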
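4. Traversing HTML content with the bs4 library
	The DOM is a tree of nodes, so the nodes have hierarchical relationships.
	+ Downward, from a node to the nodes below it:
		- .contents: a list of all of the tag's child nodes
		- .children: an iterator over the child nodes
		- .descendants: an iterator over all descendant nodes
	+ Upward, from a node:
		- .parent: the tag's parent node (a single node, not a list)
		- .parents: an iterator over all ancestor nodes
	+ Sideways, among siblings:
		- .next_sibling
		- .previous_sibling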
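
The traversal attributes above can be tried on a tiny tree. This sketch uses an invented snippet written without whitespace between tags, so that no text nodes made of whitespace appear among the children or siblings; real pages usually do contain them.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, invented for this example (no whitespace between tags)
html = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Downward
first_p, second_p = div.contents                 # list of the two direct children
child_names = [c.name for c in div.children]     # iterator over direct children
desc_tags = [d.name for d in div.descendants if isinstance(d, Tag)]  # tags at any depth

# Upward
parent_name = soup.b.parent.name                 # a single node, the enclosing <p>
ancestor_names = [p.name for p in soup.b.parents]

# Sideways
next_of_first = first_p.next_sibling
prev_of_second = second_p.previous_sibling
after_last = second_p.next_sibling               # None: there is no later sibling

print(child_names, desc_tags)
print(parent_name, ancestor_names)
```

The last entry yielded by `.parents` is the `BeautifulSoup` object itself, whose name is `[document]`.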


## Scraping university ranking data

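from bs4 import BeautifulSoup as bs
import requests

url = ""  # left empty in the original post; set it to the ranking page before running
resp = requests.get(url)  # named resp so it does not shadow the standard re module
resp.encoding = "utf-8"
html = resp.text

soup = bs(html, "html.parser")
print(soup.prettify())

# Find every <tr> whose CSS class is "alt"
top_nodes = soup.find_all("tr", "alt")
ranks = []  # one dict per row; updating a single dict would keep only the last row
for node in top_nodes:
	tmp = node.contents
	ranks.append({"Rank": tmp[0], "School name": tmp[1], "Total score": tmp[3]})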
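
Since the URL above is left blank, the row-extraction step can be tried offline. The sketch below runs on an invented table (the school names, provinces, and scores are made up, and the column layout is an assumption rather than the real page's structure). It collects the `<td>` cells with `find_all` instead of indexing `.contents`, which avoids miscounting when whitespace text nodes sit between cells.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; all names and numbers are invented
html = """
<table>
  <tr class="alt"><td>1</td><td>University A</td><td>Province X</td><td>95.0</td></tr>
  <tr><td>2</td><td>University B</td><td>Province Y</td><td>92.3</td></tr>
  <tr class="alt"><td>3</td><td>University C</td><td>Province Z</td><td>90.5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# A string passed as the second argument of find_all() is matched against the CSS class
for tr in soup.find_all("tr", "alt"):
    # Keep only <td> elements and strip surrounding whitespace from their text
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"rank": cells[0], "school": cells[1], "total_score": cells[3]})

for row in rows:
    print(row)
```

Only the rows carrying the `alt` class are collected here, so a page that alternates row classes would need a broader selector to capture every school.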