Study Notes -- Python Web Scraping - Data Parsing with bs4

Leer_weini

Article link: https://blog.csdn.net/Leer_weini/article/details/110139800

# These are notes from a video course plus my own understanding; if you spot a mistake, corrections are welcome

bs4.BeautifulSoup()

bs4 parses a page by instantiating its source code as a BeautifulSoup object and then querying that object

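How to instantiate the object

* soup = BeautifulSoup(markup, parser) takes text (or an open file object) and returns a soup object

1. Load a local HTML document into the object
with open("baidu.html", "r", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "lxml")
The file object is passed as the first argument to BeautifulSoup; the second argument, "lxml", selects lxml as the parser (the lxml package must be installed)

2. Load page source fetched from the web into the object
# headers must be passed by keyword: the second positional argument of requests.get is params
response = requests.get(url, headers=headers)
response_txt = response.text
soup = BeautifulSoup(response_txt, "lxml")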
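Both ways of building the object can be tried without any network access. The following is a minimal sketch: the HTML string and the file name demo.html are made up for illustration, and it uses the built-in "html.parser" instead of "lxml" so that nothing besides bs4 needs to be installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page source, standing in for response.text
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1) Build the soup from a string that is already in memory
soup_from_text = BeautifulSoup(html, "html.parser")

# 2) Build the soup from a local file: write the page out, then reopen it
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)
    with open(path, "r", encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_text.title.string)  # Demo
print(soup_from_file.p.string)      # Hello
```

The parsing happens inside the with block, so the soup object stays usable after the file has been closed.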

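Methods and attributes provided for data parsing:

soup.tagname

tagname is not a literal attribute of soup; it is a placeholder for the name of any tag

Example: soup holds the source of the Baidu home page
print(soup.a)  # a is a tag in the source

Result:
<a class="toindex" href="/">百度首页</a>

If the tag does not exist, the result is:
None

!!! Only the first occurrence of the tag is returned !!!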
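A small sketch of this behaviour, using a made-up HTML fragment rather than the real Baidu page and the built-in "html.parser" (an assumption made so the example runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <a> tags and no <table> tag
html = '<div><a class="toindex" href="/">Home</a><a href="/news">News</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # only the first <a> in document order
missing = soup.table  # there is no <table>, so this is None

print(first_link)  # <a class="toindex" href="/">Home</a>
print(missing)     # None
```

Because a missing tag gives None, chaining such as soup.table.text would raise AttributeError, so it is worth checking for None first.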

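soup.find()

soup.find() takes either a tag name alone, or a tag name plus attributes, and returns the first matching tag

1. soup.find("a") takes a tag name and returns the first occurrence of that tag, equivalent to soup.a

# soup holds the source of the Baidu home page
print(soup.find("a"))  # the argument is a string
 
Result: # the first a tag in the source
<a class="toindex" href="/">百度首页</a>


2. soup.find("a", b=" ")  <-- "a" is the tag name (a string), b is an attribute name, and the value after = is the attribute value (a string). Because class is a reserved word in Python, the class attribute is written as class_
    
# soup holds the source of the Baidu home page
print(soup.find("textarea", id="s_is_result_css"))

Result:  # the textarea tag whose id attribute is s_is_result_css
<textarea id="s_is_result_css" style="display:none;">………………

soup.find_all()

soup.find_all() is similar to re.findall: it returns a list of every tag that matches. The older camelCase spelling findAll (capital A) is a legacy alias of find_all; find_all is the preferred name

# soup holds the source of the Baidu home page
print(soup.find_all("a"))

Result: # a list of all a tags
[<a class="toindex" href="/">百度首页</a>, <a class="pf" href="javascript:;"…………………………]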
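The difference between the first-match and all-matches calls is easier to see on a tiny document. This is a minimal sketch with a made-up HTML fragment (the ids, classes and hrefs are all hypothetical), parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration
html = """
<div>
  <a class="nav" href="/a">A</a>
  <a class="nav" id="second" href="/b">B</a>
  <span class="nav">C</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")                    # first <a> only
by_id = soup.find("a", id="second")         # filter by an attribute
by_class = soup.find("span", class_="nav")  # class needs the trailing underscore
no_match = soup.find("a", id="nope")        # nothing matches, so None

all_links = soup.find_all("a")              # every <a>, as a list
hrefs = [a["href"] for a in all_links]

print(first_a["href"], by_id.text, by_class.text, no_match)
print(hrefs)
```

When nothing matches, find returns None while find_all returns an empty list, so a loop over the result of find_all is safe without a None check.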

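soup.select() with CSS selectors

soup.select() takes a CSS selector string and returns a list of matching tags

# soup holds the source of the Baidu home page
    
print(soup.select("#s_top_wrap"))
# select takes a string; a leading # selects by id (here the id value is s_top_wrap), and a leading . selects by class

Result: the tag whose id attribute is s_top_wrap
[<div class="s-top-wrap s-isindex-wrap" id="s_top_wrap"><div class="s-top-nav"></div><div class="s-center-box"></div></div>]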
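As a rough illustration of the three most common selector forms (id, class and tag name), here is a sketch on a simplified fragment. The markup is modelled on the output above but the text inside the divs is invented, and "html.parser" is used so lxml is not required.

```python
from bs4 import BeautifulSoup

# Simplified, partly hypothetical fragment
html = """
<div id="s_top_wrap" class="s-top-wrap s-isindex-wrap">
  <div class="s-top-nav">nav</div>
  <div class="s-center-box">box</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#s_top_wrap")          # "#" selects by id
by_class = soup.select(".s-top-nav")        # "." selects by class
by_tag = soup.select("div")                 # a bare name selects by tag
first = soup.select_one(".s-center-box")    # first match only, or None

print(len(by_id), by_class[0].text, len(by_tag), first.text)
```

select always returns a list, even for an id selector that can match only one tag, so the single result still has to be taken out with [0] or fetched with select_one.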

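soup.select() with hierarchical selectors

# soup holds the source of the Baidu home page

print(soup.select("#wrapper > .s_tab > .s_tab_inner > a"))
# ">" separates one level from its direct children; each step can be a tag name, a class, or an id
    
Result: 
[<a class="s-tab-item s-tab-news" href="https://www.baidu.c…………]

    
# Selecting across several levels
print(soup.select("#wrapper > .s_tab a"))
# " " (a space) matches descendants at any depth, not only direct children
    
Result:
[<a class="s-tab-item s-tab-news" href="https://www.baidu.c…………]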
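On the Baidu page both selectors happen to start with the same tag, which hides the difference between ">" and a space. A sketch on a made-up fragment (the nested span and the hrefs are hypothetical) shows where they diverge:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <a> is a direct child of .s_tab_inner,
# the other is nested one level deeper inside a <span>
html = """
<div id="wrapper">
  <div class="s_tab">
    <div class="s_tab_inner">
      <a href="/news">News</a>
      <span><a href="/more">More</a></span>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" requires each step to be a direct child of the previous one
direct = soup.select("#wrapper > .s_tab > .s_tab_inner > a")
# a space accepts descendants at any depth below .s_tab
nested = soup.select("#wrapper > .s_tab a")

direct_hrefs = [a["href"] for a in direct]
nested_hrefs = [a["href"] for a in nested]
print(direct_hrefs)  # ['/news']
print(nested_hrefs)  # ['/news', '/more']
```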

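Getting the text and attribute values of a tag

To get the text of a tag, first locate the tag, then use .text, .string, or .get_text()

text / get_text(): all of the text inside the tag, including the text of nested tags

string: only the tag's own direct text; it is None when the tag contains more than one child

# soup holds the source of the Baidu home page
    
print(soup.find("div", class_="s_tab_inner").text)
print(soup.find("div", class_="s_tab_inner").string)
print(soup.find("div", class_="s_tab_inner").get_text())
    
Result:
网页资讯视频图片知道文库贴吧地图采购更多
None
网页资讯视频图片知道文库贴吧地图采购更多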
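To get an attribute value, first locate the tag, then read the attribute with [ ]

soup.a["href"]: reads the href attribute of the a tag, using the same syntax as a dictionary lookup

# soup holds the source of the Baidu home page

print(soup.find("a", sync="true")["href"])

Result:
https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news

The key inside [ ] is the attribute name, given as a string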
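A short sketch of attribute access on a made-up tag (the URL and attribute values are hypothetical). It also shows .get(), which avoids an exception when the attribute may be absent:

```python
from bs4 import BeautifulSoup

# Hypothetical tag used only for illustration
html = '<a class="s-tab-item s-tab-news" href="https://example.com/news" sync="true">News</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", sync="true")

href = link["href"]        # raises KeyError if the attribute is missing
title = link.get("title")  # returns None instead of raising
classes = link["class"]    # class can hold several values, so it is a list

print(href)     # https://example.com/news
print(title)    # None
print(classes)  # ['s-tab-item', 's-tab-news']
```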

bs4 (Case 2)

Requirement: scrape Romance of the Three Kingdoms (三国演义) from the shicimingju.com poetry site

from bs4 import BeautifulSoup
import requests
import os

def get_chapter():
    # Table-of-contents page that links to every chapter
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    header = {
        "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"
    }
    source_response = requests.get(url=url, headers=header)
    source_response_txt = source_response.text
    source_soup = BeautifulSoup(source_response_txt, "lxml")
    # Every chapter link inside the .book-mulu list
    source_soup_list = source_soup.select(".book-mulu > ul a")
    head = "https://www.shicimingju.com"

    # Create the output directory; do nothing if it already exists
    os.makedirs("三国演义", exist_ok=True)

    for i in source_soup_list:
        title = i.text
        link = i["href"]
        chapter_url = head + link
        # Fetch one chapter page and extract its body text
        response = requests.get(url=chapter_url, headers=header)
        response_txt = response.text
        soup = BeautifulSoup(response_txt, "lxml")
        chapter_text = soup.find("div", class_="chapter_content").text
        # os.path.join keeps the path portable across operating systems
        with open(os.path.join("三国演义", "%s.txt" % title), "w", encoding="utf-8") as tf:
            tf.write(chapter_text)

def main():
    get_chapter()

if __name__ == "__main__":
    main()
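The link-extraction step of the script can be checked offline before any request is sent. The sketch below is only an illustration: sample_html is a made-up fragment shaped to match the ".book-mulu > ul a" selector, the chapter paths in it are hypothetical, and parse_chapters is a helper name introduced here, not part of the original script. It uses urljoin from the standard library instead of string concatenation, and "html.parser" so that lxml is not needed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.shicimingju.com"

# Hypothetical table-of-contents fragment; the real page may differ
sample_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  </ul>
</div>
"""

def parse_chapters(html):
    """Return (title, absolute_url) pairs for every chapter link."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select(".book-mulu > ul a")
    # urljoin handles both relative and already-absolute href values
    return [(a.get_text(strip=True), urljoin(BASE_URL, a["href"])) for a in links]

chapters = parse_chapters(sample_html)
for title, url in chapters:
    print(title, url)
```

Separating parsing from downloading like this makes it possible to test the selector against a saved copy of the page, which helps when the site layout changes.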
