Web Scraping: BeautifulSoup (Part 4)

Latest recommended article published on 2024-08-12 17:13:59

来一块提拉米苏

Article link: https://blog.csdn.net/jklcl/article/details/81738623

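Traversing the document tree

Starting with this post, I will not only show samples but also scrape a little information from a real website, so we can practice while we learn.

Child nodes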

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
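soup = BeautifulSoup(html_doc, features="html.parser")
# 1. To get a single tag, use a dot followed by the tag name, e.g. .title or .head
print(soup.title)
print(soup.head)
# 2. To drill down further, chain step 1; if several tags share the same name, only the first is returned.
#    With html.parser the <p> tags sit under <html>, not <head>, so soup.head.p would be None.
print(soup.html.p)
# 3. To get every tag with a given name, use find_all(""), which returns a list-like ResultSet
print(soup.html.find_all("p"))
# 4. .contents returns a tag's direct children as a list
head_tag = soup.head
title_tag = head_tag.contents[0]
print(title_tag)
print(len(title_tag))  # len() of a tag is the number of its direct children
# 5. The .children generator lets you loop over a tag's direct children:
for child in title_tag.children:
    print(child)
# 6. The .descendants attribute iterates recursively over all of a tag's descendants
for child in head_tag.descendants:
    print(child)
# 7. When a tag has a single string (or a single child tag that does), .string returns it;
#    if the tag has more than one child, .string is None
print(head_tag.string)
# 8. When there are several strings, loop over .strings, or .stripped_strings (which strips extra whitespace)
for string in soup.strings:
    print(repr(string))  # repr() makes whitespace such as '\n' visible
for string in soup.stripped_strings:
    print(repr(string))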
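
To make the difference between .children, .descendants and .string more concrete, here is a small sketch. The HTML snippet is a made-up example (not part of the sample above), written on one line so that no whitespace nodes get in the way.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, kept on one line so there are no whitespace text nodes.
snippet = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

direct_children = list(div.children)      # only the two <p> tags
all_descendants = list(div.descendants)   # tags and strings at every depth

first_p, second_p = div.contents

print(len(direct_children))   # 2
print(len(all_descendants))   # 6: <p>, 'one', <p>, 'two ', <b>, 'bold'
print(first_p.string)         # one
print(second_p.string)        # None, because the second <p> has two children
print(list(second_p.stripped_strings))  # ['two', 'bold']
```

When .string comes back as None, looping over .strings or .stripped_strings as in step 8 is usually the way out.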

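Parent nodes

title_tag = soup.title
print(title_tag)
# 1. .parent is a tag's parent; strings have parents too, and the parent of <html> is the BeautifulSoup object
print(title_tag.parent)
# 2. .parents iterates over all of an element's ancestors, from the nearest outwards
for parent in title_tag.parents:
    print(parent.name)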
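
As a quick check of the points above, the following sketch starts from a string instead of a tag and collects the names of all its ancestors. The one-line snippet is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document for illustration.
snippet = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(snippet, "html.parser")

title_string = soup.title.string
# A string's parent is the tag that contains it.
print(title_string.parent.name)   # title

# Walk upwards until the BeautifulSoup object, whose name is '[document]'.
ancestor_names = [parent.name for parent in title_string.parents]
print(ancestor_names)             # ['title', 'head', 'html', '[document]']

# The BeautifulSoup object itself has no parent.
print(soup.parent)                # None
```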

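Sibling nodes

.next_sibling, .previous_sibling, .next_siblings and .previous_siblings
These attributes all look up sibling nodes, but in practice the node right next to a tag is often the whitespace, newline or punctuation text between tags, so you can call .next_sibling twice to reach the next tag: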

link_tag = soup.a
print(link_tag.next_sibling.next_sibling)
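
Chaining .next_sibling twice only works when exactly one text node sits between the tags. A slightly more robust sketch, assuming a cut-down version of the sample paragraph, is to ask for the next sibling tag by name with find_next_sibling, or to filter .next_siblings down to Tag objects:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down, hypothetical version of the "three sisters" paragraph.
snippet = """<p><a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>;</p>"""
soup = BeautifulSoup(snippet, "html.parser")
first_link = soup.a

# The immediate sibling is the text between the two <a> tags.
print(repr(first_link.next_sibling))   # ',\n'

# find_next_sibling skips text nodes and returns the next matching tag.
next_link = first_link.find_next_sibling("a")
print(next_link["id"])                 # link2

# Keep only the siblings that are tags.
sibling_tags = [s for s in first_link.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in sibling_tags])  # ['link2', 'link3']
```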

Going back and forth

.next_element and .previous_element, .next_elements and .previous_elements
The .next_element attribute points to the object (string or tag) that was parsed immediately afterwards; .previous_element is the opposite.


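last_a_tag = soup.find("a", id="link3")
print(last_a_tag)
# The next sibling is the text that follows the closing </a>
print(repr(last_a_tag.next_sibling))
# The next element is whatever was parsed right after the opening <a>: the string inside it
print(repr(last_a_tag.next_element))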
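
To see how parse order differs from sibling order, here is a short sketch on a made-up snippet. It walks .next_elements from the last link; note that the first element visited is the string inside the tag itself.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
snippet = '<p><a id="link3">Tillie</a>; the end</p><p>...</p>'
soup = BeautifulSoup(snippet, "html.parser")
last_a_tag = soup.find("a", id="link3")

print(repr(last_a_tag.next_sibling))   # '; the end'
print(repr(last_a_tag.next_element))   # 'Tillie'

# Everything parsed after the opening <a>, in document order.
elements = list(last_a_tag.next_elements)
for element in elements:
    print(repr(element))
```

The loop first prints the string inside the link, then the text after it, and only then moves on to the second paragraph and its content.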

Example:

A fairly simple example: scrape the blog title and blog description shown on the first page of the blog.

# coding=utf-8
import requests
from bs4 import BeautifulSoup

html = requests.get("https://blog.csdn.net/jklcl")
soup = BeautifulSoup(html.text, features="html.parser")

# print(soup.prettify())

# Inspecting the page shows that the blog title and blog description are inside the <header> tag
print(soup.header)
# Blog title
print(soup.header.a.string)
# Blog description
print(soup.header.p.string)
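
The page layout may change over time, in which case soup.header can be None and the lines above raise an AttributeError. Below is a minimal sketch of a more defensive version that runs offline. The function name extract_header_info and the sample HTML are my own assumptions for illustration, not the real structure of the blog page.

```python
from bs4 import BeautifulSoup


def extract_header_info(html_text):
    """Return (title, description) from the first <header>; missing parts are None."""
    soup = BeautifulSoup(html_text, "html.parser")
    header = soup.header
    if header is None:
        return None, None
    # get_text copes with nested tags, where .string would return None.
    title = header.a.get_text(" ", strip=True) if header.a is not None else None
    description = header.p.get_text(" ", strip=True) if header.p is not None else None
    return title, description


# Hypothetical page fragment standing in for the downloaded HTML.
sample_page = """
<header>
  <h1><a href="#">My blog</a></h1>
  <p>Notes on <b>web scraping</b></p>
</header>
"""
print(extract_header_info(sample_page))           # ('My blog', 'Notes on web scraping')
print(extract_header_info("<p>no header</p>"))    # (None, None)
```

With a real page you would pass in the text of the requests response (html.text in the example above) instead of sample_page.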