1. The BeautifulSoup4 class
BeautifulSoup4: commonly abbreviated as bs4
Purpose: find and select the content we need in an HTML or XML document; bs4 is a module implemented in Python.
Creating the object (the resulting object is a bs4 BeautifulSoup object):
BeautifulSoup(arg1, arg2)
arg1: the page source as a string; arg2: the parser to use
RIGHT Example
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html, "lxml")
print(soup, type(soup))
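A side note on the parser argument: "lxml" relies on the third-party lxml package, while "html.parser" ships with the standard library. The fallback below is only a sketch of our own (bs4 does not fall back automatically); it assumes the html string defined above.

try:
    import lxml  # only used to check that the lxml package is installed
    soup = BeautifulSoup(html, "lxml")          # fast third-party parser
except ImportError:
    soup = BeautifulSoup(html, "html.parser")   # built-in parser, no extra install needed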
2. The select method
(1) select: finds content using a CSS selector; select collects every element in the page that matches the CSS selector and returns them in a list.
RIGHT Example
p_list = soup.select('body > p')
print(p_list)
print(type(p_list[-1]))
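select accepts any CSS selector, not just tag paths. A small sketch against the sample html above, selecting by class and by id (variable names are ours):

# select by class: every <a> tag whose class list contains "sister"
sisters = soup.select('a.sister')
print(len(sisters))        # 3

# select by id: the returned list holds at most one element
lacie = soup.select('#link2')
print(lacie[0].text)       # Lacie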
(2) select_one: finds content using a CSS selector; select_one returns the first element of the list that select would return.
RIGHT Example
p = soup.select_one('p')
print(p, type(p))
Tip: prettify formats a bs4 object as indented, readable HTML.
Note: every element of the list returned by select, and the result returned by select_one, is always a bs4 Tag object.
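A quick sketch of both points, continuing from the soup object above:

# prettify: re-serialize the parsed document with indentation for reading/debugging
print(soup.prettify())

# every select/select_one result is a bs4 Tag, so further selects can be chained on it
first_p = soup.select_one('p')
print(type(first_p))                  # <class 'bs4.element.Tag'>
print(first_p.select_one('b').text)   # The Dormouse's story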
3. text and attrs
(1) text: gets the text inside an HTML tag. For example: <b>abcde</b> --> 'abcde'
RIGHT Example
# b. get the content of the b tag inside the first p tag
b = soup.select_one('p.title > b').text
print(b, type(b))  # The Dormouse's story <class 'str'>
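Note that text concatenates the text of every descendant node, not just the tag's own text. A small sketch using the second p tag of the sample html:

# .text joins the text of all descendant nodes into one string
story_text = soup.select_one('p.story').text
print(story_text)   # the paragraph text together with the text of the nested <a> tags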
(2) attrs: gets an attribute value from an HTML tag. For example: <a href="http://www.baidu.com"></a> --> attrs['href'] is 'http://www.baidu.com'
Note: if the attribute accessed through attrs is class, the result is a list.
RIGHT Example
# c. get the href attribute of the third a tag inside the second p tag
a = soup.select_one('body > p:nth-child(2) > a:nth-child(3)').attrs['href']
print(a, type(a))  # http://example.com/tillie <class 'str'>
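To illustrate the note about class above, a short sketch using the first a tag of the sample html: multi-valued attributes such as class come back as a list, while ordinary attributes stay strings.

link = soup.select_one('#link1')
print(link.attrs['class'])   # ['sister'] -- class is multi-valued, so bs4 returns a list
print(link.attrs['id'])      # link1      -- single-valued attributes stay plain strings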
4. Hands-on case study
APPLICATION: using bs4 to scrape data quickly
import requests
import csv
from bs4 import BeautifulSoup
from tqdm import tqdm


def requests_get(href):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39'
    }
    resp = requests.get(url=href, headers=headers)
    if resp.status_code == 200:
        return resp
    else:
        print(resp.status_code)


if __name__ == '__main__':
    f = open('today_news.csv', 'a', encoding='utf-8', newline='')
    f_writer = csv.writer(f)
    # write the header row once, before looping over the pages
    f_writer.writerow(['news type', 'news title', 'news link', 'news time'])
    for page in tqdm(range(1, 11)):
        URL = f'https://www.chinanews.com.cn/scroll-news/news{page}.html'
        response = requests_get(URL)
        response.encoding = 'utf-8'
        # 1. create the bs4 object
        soup = BeautifulSoup(response.text, "lxml")
        # 2. find all li tags under the ul first
        origin_news_list = soup.select('body > div.w1280.mt20 > div.content-left > div.content_list > ul > li')
        # 3. loop over the list items and write each news entry
        for i in origin_news_list:
            if i.text:
                # a. get the news type
                news_type = i.select_one('li a').text
                # b. get the news title
                news_name = i.select_one('li > div.dd_bt > a').text
                # c. get the news link
                news_link = 'https://www.chinanews.com.cn' + i.select_one('li div.dd_bt a').attrs['href']
                # d. get the news time
                news_time = i.select_one('li div.dd_time').text
                this_news = [news_type, news_name, news_link, news_time]
                f_writer.writerow(this_news)
        f_writer.writerow([f'page {page} done', '', '', ''])
    f.close()
# If the selector copied from the browser dev tools returns nothing, write the selector by hand instead.