[Study Notes] Web Scraping (III): BeautifulSoup and CSS Selectors

贺一航【Niki】

Published 2024-04-24 08:26:26

Tags: study notes, web scraping, beautifulsoup

Article link: https://blog.csdn.net/Eddie_hyh/article/details/138107866

BeautifulSoup

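1. Introduction to BeautifulSoup
2. Basics
- 1.1 Ways to load a page
- 1.2 The Tag object
- 1.3 Getting elements: the find family of methods
- 1.4 Getting elements: the select family of methods and CSS selectors
- 1.5 Getting elements: traversing upward
- 1.6 Getting elements: traversing sideways
- 1.7 Getting elements: traversing downward
3. Notice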

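1. Introduction to BeautifulSoup

Install the dependencies: pip install beautifulsoup4 (the bs4 package on PyPI is a thin alias that pulls it in), pip install requests

Beautiful Soup is a Python library for parsing HTML and XML documents. Its main job is extracting data from web pages, which makes it useful for web scraping and data mining. It lets you walk the document tree, search for specific tags or text, and work with the results through a small set of convenient methods.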
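To make the "search and extract" idea concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration; only BeautifulSoup, find_all and get_text come from the library itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for this illustration.
html = """
<ul id="courses">
  <li><a href="/basic">Basic Python</a></li>
  <li><a href="/advanced">Advanced Python</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: collect every <a> tag in the document.
links = soup.find_all("a")

# Extract: pull the href attribute and the visible text out of each tag.
courses = [(a["href"], a.get_text()) for a in links]

for href, title in courses:
    print(href, title)
```

The same pattern (parse once, search, then read attributes or text) is what the rest of these notes build on.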

2. Basics

1.1 Ways to load a page

① From a web page

from bs4 import BeautifulSoup
import requests

html = requests.get('https://python123.io/ws/demo.html')
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

② From a string

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>this is a p tag</p>', 'html.parser')
print(soup.prettify())

<p>
 this is a p tag
</p>

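③ From a file

from bs4 import BeautifulSoup

# Pass the open file object directly; state the encoding so it matches the file's <meta charset>.
with open('test.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
    print(soup.prettify())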

<!-- test.html -->
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Title
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    this is b
   </b>
  </p>
 </body>
</html>
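When the file contains non-ASCII text, the encoding matters. The sketch below is one way to check this without touching a real project file: it writes a hypothetical page (the content and the temporary path are assumptions for this example) and then reads it back both in text mode with an explicit encoding and in binary mode, where Beautiful Soup works out the encoding on its own.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical file content, made up for this illustration.
page = (
    '<html><head><meta charset="utf-8"/><title>Café</title></head>'
    '<body><p class="title"><b>this is b</b></p></body></html>'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

    # Text mode: name the encoding instead of relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        soup_text = BeautifulSoup(f, "html.parser")

    # Binary mode: Beautiful Soup receives bytes and detects the encoding itself.
    with open(path, "rb") as f:
        soup_bytes = BeautifulSoup(f, "html.parser")

title_text = soup_text.title.string
title_bytes = soup_bytes.title.string
print(title_text, title_bytes)
```

Either approach avoids the garbled text that can appear when a UTF-8 file is opened with a different default encoding.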

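1.2 The Tag object

Ways to get text nodes	Description
string	If the tag's only child is a text node, string returns that text node. If its only child is another tag, string returns that child's string. If the tag has more than one child, string is None
strings	Yields the text of the tag itself and of all its descendant tags; returns a generator
stripped_strings	Same as strings, but strips leading and trailing whitespace from each string and skips strings that are whitespace only
get_text()	Collects the text of the tag itself and of all its descendant tags and returns it as one string; an optional argument is used as the separator between the pieces of text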

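Tip: the text inside a tag is itself a node (a NavigableString). It sits at the same level as the tag's direct child tags, and both count as children of the tag.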
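A small sketch of the tip above, using made-up markup: .contents lists a tag's direct children, and text nodes show up there next to child tags. It also shows how .string behaves for a tag with several children, with one text child, and with one tag child.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>bold</b> world</p>", "html.parser")
p_tag = soup.p

# .contents holds the direct children: text nodes and tags side by side.
kinds = [type(child).__name__ for child in p_tag.contents]
print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']

# Several children: .string cannot pick one, so it is None.
print(p_tag.string)

# Exactly one child, and it is a text node: .string returns it.
print(p_tag.b.string)

# Exactly one child, and it is a tag: .string looks inside that tag.
wrapper = BeautifulSoup("<div><span>inner</span></div>", "html.parser").div
print(wrapper.string)
```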

from bs4 import BeautifulSoup

html_content = "<p>string test</p><div>strings and stripped_strings test<span>strings</span>stripped_strings<span></span></div>"
soup = BeautifulSoup(html_content, "html.parser")
p_tag = soup.p
div_tag = soup.div

print("type:", type(p_tag), '\n')

print("string:", p_tag.string, '\n')

print("strings:")
for string in div_tag.strings:
    print(string)

print("\nstripped_strings:")
for string in div_tag.stripped_strings:
    print(string)

type: <class 'bs4.element.Tag'> 

string: string test 

strings:
strings and stripped_strings test
strings
stripped_strings

stripped_strings:
strings and stripped_strings test
strings
stripped_strings
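The two loops above print the same lines because the sample markup has no extra whitespace. The difference only shows up with indented or padded HTML, as in this sketch (the markup is hypothetical, chosen to contain line breaks and padding):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation and padded text, made up for this sketch.
html = """<div>
  <span>  first  </span>
  <span>second</span>
</div>"""
div_tag = BeautifulSoup(html, "html.parser").div

# strings keeps every text node exactly as written, including whitespace-only ones.
raw = list(div_tag.strings)

# stripped_strings trims each string and drops the ones that end up empty.
clean = list(div_tag.stripped_strings)

print(raw)    # ['\n  ', '  first  ', '\n  ', 'second', '\n']
print(clean)  # ['first', 'second']
```

For scraped pages, which are usually indented, stripped_strings is often the more convenient of the two.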

from bs4 import BeautifulSoup

html_content = "<p>string test</p><div>strings and stripped_strings test<span>strings</span>stripped_strings<span></span></div>"
soup = BeautifulSoup(html_content, "html.parser")

print(soup.get_text()) 
print(soup.get_text('|', strip=True))

string teststrings and stripped_strings teststringsstripped_strings
string test|strings and stripped_strings test|strings|stripped_strings
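As a further sketch of the separator and strip arguments, the made-up list below has padded items and one empty item. Without strip=True the padding is kept and the separator is simply placed between text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = "<ul><li> apple </li><li> banana </li><li></li></ul>"
ul_tag = BeautifulSoup(html, "html.parser").ul

# No arguments: text nodes are concatenated as they are.
joined_default = ul_tag.get_text()

# Separator only: padding inside each text node is kept.
joined_sep = ul_tag.get_text(", ")

# Separator plus strip=True: each piece is trimmed before joining.
joined_clean = ul_tag.get_text(", ", strip=True)

print(repr(joined_default))  # ' apple  banana '
print(repr(joined_sep))      # ' apple ,  banana '
print(repr(joined_clean))    # 'apple, banana'
```

The empty li contributes nothing because it has no text node at all.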

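Tag object attributes and related types	Description
name	The tag's name; type: str
attrs	The tag's attributes; type: dict
NavigableString	The text inside the tag, reached through .string; type: NavigableString
Comment	A comment inside the tag, also reached through .string; type: Comment (a subclass of NavigableString)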

from bs4 import BeautifulSoup
import requests

page = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(page.text, "html.parser")
soup1 = BeautifulSoup("<b><!--this is comment--></b><p>this is not comment</p>", "html.parser")

print(f"Tag object {type(soup.a)}: {soup.a}")
print(f"Tag name {type(soup.a.name)}: {soup.a.name}")
print(f"Tag attributes {type(soup.a.attrs)}: {soup.a.attrs}")
print(f"One attribute of the tag: {soup.a.attrs['class']}")
print(f"Tag text {type(soup.a.string)}: {soup.a.string}")
print(f"Comment {type(soup1.b.string)}: {soup1.b.string}")
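The same attributes can be explored without a network request. The sketch below uses a made-up tag (the URL and class names are assumptions for this example) and highlights two details that are easy to miss: class comes back as a list, and a comment read through .string prints like ordinary text, so a type check is needed to tell the two apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup, made up for this illustration.
html = (
    '<a class="py1 external" href="http://example.com/" id="link1">Basic Python</a>'
    "<b><!--this is comment--></b>"
)
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.a

# class is a multi-valued attribute, so it is returned as a list of strings.
classes = a_tag["class"]

# Single-valued attributes are plain strings; tag[...] is shorthand for tag.attrs[...].
href = a_tag["href"]

# .get() returns None for a missing attribute instead of raising KeyError.
missing = a_tag.get("title")

# The comment prints without its <!-- --> markers, so check the type explicitly.
comment = soup.b.string
is_comment = isinstance(comment, Comment)
is_plain_text = isinstance(a_tag.string, NavigableString) and not isinstance(a_tag.string, Comment)

print(a_tag.name, classes, href, missing)
print(comment, is_comment, is_plain_text)
```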