[Web Scraping] Scraping Web Page Data with requests and BeautifulSoup

Latest recommended article published on 2025-01-29 21:28:31

顽石九变

Column: Python 从入门到深入 (Python: From Beginner to Advanced). Tags: web scraping, beautifulsoup, python, requests

Article link: https://blog.csdn.net/wlddhj/article/details/139686012

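I. Steps for Using BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. The basic steps for parsing HTML and extracting information with it are as follows:

1. Install:

If BeautifulSoup is not installed yet, install it with pip. BeautifulSoup relies on a parser: Python's built-in html.parser needs no extra installation, while lxml must be installed separately but is generally faster and also supports XML parsing.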

pip install beautifulsoup4 lxml

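2. Import the libraries:

In your Python script, import BeautifulSoup. The parser itself is not imported; it is selected by name when the BeautifulSoup object is created.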

from bs4 import BeautifulSoup
import requests

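Note: requests is imported here as well; it is used to fetch HTML content over the network. If you already have the HTML content, you can create the BeautifulSoup object from it directly.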
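
When the HTML is already in memory (read from a file, for instance), it can be handed straight to BeautifulSoup. The following is a minimal sketch: the sample markup and variable names are made up for illustration, and it uses the built-in html.parser so that no extra install is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for content you already have.
html_content = """
<html>
  <head><title>Sample page</title></head>
  <body><p class="intro">Hello, BeautifulSoup!</p></body>
</html>
"""

# No network request is needed; the string is parsed directly.
soup = BeautifulSoup(html_content, "html.parser")

page_title = soup.title.get_text()
intro_text = soup.find("p", class_="intro").get_text()
print(page_title)
print(intro_text)
```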

3. Fetch the HTML content:

Use the requests library to fetch the HTML content of a web page.

url = 'http://example.com'
response = requests.get(url, timeout=10)  # a timeout keeps the call from hanging indefinitely
response.raise_for_status()  # raises an exception for 4xx/5xx responses
html_content = response.text

4. Parse the HTML:

Parse the HTML content with BeautifulSoup and the chosen parser.

soup = BeautifulSoup(html_content, 'lxml')

5. Extract data:

Use BeautifulSoup's methods and selectors to extract the data you are interested in. For example, use .find() or .find_all() to locate tags, and .get_text() to get the text inside a tag.

# Find all paragraph tags <p>
paragraphs = soup.find_all('p')

# Print the text content of each paragraph
for paragraph in paragraphs:
    print(paragraph.get_text())

# Find tags with a specific class name
divs_with_class = soup.find_all('div', class_='some-class')

# Find elements with a CSS selector
links = soup.select('a[href]')  # all <a> tags that have an href attribute
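
The difference between these lookups is easiest to see on a small document. The sketch below uses made-up markup (the class names and links are illustrative only) to show that .find() returns the first match or None, while .find_all() and .select() return lists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to demonstrate the lookup methods.
html = """
<div class="some-class"><p>First</p><a href="/a">A</a></div>
<div class="other"><p>Second</p><a>No link</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")        # first matching tag
all_p = soup.find_all("p")      # list of every matching tag
missing = soup.find("table")    # None, because no <table> exists

divs = soup.find_all("div", class_="some-class")
links = soup.select("a[href]")              # only <a> tags that have href
nested = soup.select("div.some-class > p")  # <p> directly inside that div

print(first_p.get_text(), len(all_p), missing)
print([a["href"] for a in links])
```

Because .find() can return None, chained calls such as soup.find(...).find_all(...) are worth guarding when the page layout is not under your control.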

6. Work with attributes:

You can also read and process the attributes of HTML tags. For example, to get the href attribute of each link:

for link in soup.find_all('a'):
    print(link.get('href'))
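
In practice href values are often relative, and some <a> tags have no href at all. Here is a small sketch of one way to handle both cases with urljoin from the standard library; the base URL and markup are assumptions made for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the HTML came from.
base_url = "http://example.com/docs/index.html"

# Hypothetical markup with relative, absolute, and missing href values.
html = """
<a href="/about">About</a>
<a href="guide.html">Guide</a>
<a href="https://www.python.org/">Python</a>
<a name="top">Anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for link in soup.find_all("a"):
    href = link.get("href")  # .get() returns None when the attribute is missing
    if href is None:
        continue
    # urljoin resolves relative paths against the page address.
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```

Using link.get('href') rather than link['href'] avoids a KeyError on tags that lack the attribute.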

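7. Clean up and close:

requests and BeautifulSoup usually do not need to be closed manually after the HTML has been processed. If your script uses other resources, such as open files or connections, make sure they are closed properly.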
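
When several pages are fetched in one run, one option is to share a requests.Session and let a with block release its connections. The sketch below is only an illustration: fetch_title and main are hypothetical helpers, not part of the original article.

```python
import requests
from bs4 import BeautifulSoup


def fetch_title(session, url, timeout=10):
    """Return the <title> text of a page, or None if the page has no title."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


def main():
    # The with block closes the session (and its pooled connections) on exit,
    # even if an exception is raised inside it.
    with requests.Session() as session:
        session.headers.update({"accept-language": "zh-CN,zh;q=0.9"})
        print(fetch_title(session, "http://example.com"))
```

Calling main() performs a real request to http://example.com; the session is closed automatically when the block ends.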

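8. Notes:

Respect the site's robots.txt file and its copyright terms.
Do not send requests too frequently, so that you do not put undue load on the site.
Consider asynchronous requests or a thread/process pool to speed up handling of multiple requests, while keeping the overall request rate reasonable.
Use error handling and retry logic to deal with common network request problems.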
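
Several of these notes can be combined in a few helper functions. The following is a rough sketch under stated assumptions: build_session, load_robots, and polite_get are hypothetical names, and the retry count, status codes, and delay are arbitrary example values. It relies on urllib.robotparser from the standard library and on the Retry class from urllib3, which requests builds on.

```python
import time
from urllib.robotparser import RobotFileParser

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=0.5):
    """Create a session that retries transient failures with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def load_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_get(session, parser, url, user_agent="*", delay=1.0, timeout=10):
    """Fetch url only if robots.txt allows it, pausing before the request."""
    if not parser.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt, so no request is sent
    time.sleep(delay)  # simple fixed delay to limit the request rate
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response
```

A caller would download the site's robots.txt once, pass its text to load_robots, and then route every page request through polite_get.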

II. Example 1: Scraping Baidu Baike Data

1) Scrape the Baidu Baike data for 《青春有你第三季》 (Youth With You Season 3)

The page to scrape is: https://baike.baidu.com/item/青春有你第三季?fromModule=lemma_search-box#4-3

import json
from bs4 import BeautifulSoup
import requests

# Headers sent with the page request
headers = {
    "accept-language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "cache-control": "max-age=0"
}


def getAllUsers():
    url = "https://baike.baidu.com/item/%E9%9D%92%E6%98%A5%E6%9C%89%E4%BD%A0%E7%AC%AC%E4%B8%89%E5%AD%A3?fromModule=lemma_search-box#4-3"
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises an exception for 4xx/5xx responses
    html_content = response.text
    soup = BeautifulSoup(html_content, 'lxml')

    # The contestant table is located through this data-uuid value, which is
    # tied to the page markup and may change if the page is updated.
    container = soup.find('div', attrs={'data-uuid': "go12lpqgpn"})
    if container is None:
        raise ValueError('contestant table not found; the page layout may have changed')
    trs = container.find_all(name='tr')
    listUser = []
    for tr in trs[1:]:  # skip the header row
        tds = tr.find_all('td')
        name = tds[0].find('a').get_text()
        head_href = tds[0].find('a').attrs['href']
        # Take the fourth path segment of the href and drop any query string
        head_id = head_href.split('/')[3].split('?')[0]
        provice = tds[1].find('span').get_text()
        height = tds[2].find('span').get_text()
        weight = tds[3].find('span').get_text()
        company = tds[4].find('span').get_text()
        user = {'name': name, 'head_id': head_id, 'provice': provice, 'height': height, 'weight': weight,
                'company': company}
        listUser.append(user)

    print(listUser)
    return listUser


if __name__ == '__main__':
    listUser = getAllUsers()
    # ensure_ascii=False keeps the Chinese text readable in the JSON file
    with open('user.json', 'w', encoding='utf-8') as f:
        json.dump(listUser, f, ensure_ascii=False, indent=4)
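
The row-parsing logic can be tried without a network connection by running it on a small inline table. The sketch below is an illustration only: the sample markup, the href format '/item/<name>/<id>?...' (inferred from how the script splits the link), and the helpers extract_head_id and parse_row are all assumptions. It also skips rows that do not have the expected cells instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the table layout the scraper expects.
sample_html = """
<table>
  <tr><th>Name</th><th>Origin</th><th>Height</th><th>Weight</th><th>Company</th></tr>
  <tr>
    <td><a href="/item/ExampleName/12345678?fromModule=example">ExampleName</a></td>
    <td><span>Origin A</span></td><td><span>180cm</span></td>
    <td><span>60kg</span></td><td><span>Company A</span></td>
  </tr>
  <tr><td>malformed row</td></tr>
</table>
"""


def extract_head_id(href):
    """'/item/<name>/<id>?query' splits into ['', 'item', '<name>', '<id>?query']."""
    parts = href.split('/')
    if len(parts) < 4:
        return None
    return parts[3].split('?')[0]


def parse_row(tr):
    """Return a dict for a well-formed row, or None if cells are missing."""
    tds = tr.find_all('td')
    if len(tds) < 5:
        return None
    link = tds[0].find('a')
    if link is None or not link.get('href'):
        return None
    spans = [td.find('span') for td in tds[1:5]]
    if any(span is None for span in spans):
        return None
    provice, height, weight, company = (s.get_text(strip=True) for s in spans)
    return {'name': link.get_text(strip=True),
            'head_id': extract_head_id(link['href']),
            'provice': provice, 'height': height,
            'weight': weight, 'company': company}


soup = BeautifulSoup(sample_html, 'html.parser')
rows = [parse_row(tr) for tr in soup.find_all('tr')[1:]]  # skip the header row
users = [row for row in rows if row is not None]
print(users)
```

Skipping malformed rows is a design choice; raising an error instead makes layout changes on the page more visible.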

The resulting data looks roughly like this:

[
    {
        "name": "爱尔法·金",
        "head_id": "55898952",
        "provice": "中国新疆",
        "height": "184cm",
        "weight": "65kg",
        "company": "快享星合"
    },
    {
        "name": "艾克里里",
        "head_id": "19441668",
        "provice": "中国广东",
        "height": "174cm",
        "weight": "55kg",
        "company": "领优经纪"
    },
    {
        "name": "艾力扎提",
        "head_id": "55898954",
        "provice": "中国新疆",
        "height": "178cm",
        "weight": "56kg",
        "company": "简单快乐"
    },
    // ...    
]

2) Display the scraped data in a chart

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('user.json')

# Count the contestants per province and give the columns explicit names
province_counts = df['provice'].value_counts().reset_index()
province_counts.columns = ['provice', 'count']
print(province_counts)

# Use a font that supports Chinese, such as 'SimHei'; make sure it is installed on your system
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

# Create a 16x10 inch figure at 200 dpi
plt.figure(figsize=(16, 10), dpi=200)
plt.bar(province_counts['provice'], province_counts['count'])
plt.title('每个省份的人数')  # "Number of people per province"
plt.xlabel('省份')  # "Province"
plt.ylabel('人数')  # "Number of people"
plt.xticks(rotation=45)  # rotate the x-axis labels so long province names stay readable
plt.show()

The script prints the following data:

      provice  count
0     中国广东     14
1     中国江苏      9
2     中国山东      8
3     中国浙江      7
4     中国辽宁      7
5     中国湖南      6
6     中国四川      5
7     中国北京      5
8     中国贵州      5
9     中国河南      5
10    中国湖北      4
11    中国河北      4
12    中国重庆      3
13   中国内蒙古      3
14    中国新疆      3
15    中国安徽      3
16   中国黑龙江      2
17    中国江西      2
18    中国上海      2
19     加拿大      2
20    中国台湾      2
21    中国澳门      2
22    中国吉林      2
23    中国天津      2
24      美国      2
25    中国广西      1
26    中国甘肃      1
27      中国      1
28   中国哈尔滨      1
29      日本      1
30    中国宁夏      1
31    马来西亚      1
32    中国福建      1
33    中国云南      1
34    中国山西      1

Output chart:
[Figure: bar chart of the number of people per province]
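
The counting step can be checked on a handful of inline records. The sketch below uses made-up rows with values in the same style as the scraped data. The "region" column, which strips a leading "中国" so that labels become shorter, is an optional suggestion of this example and not something the original script does.

```python
import pandas as pd

# Hypothetical records in the same shape as the entries in user.json.
records = [
    {"name": "A", "provice": "中国新疆"},
    {"name": "B", "provice": "中国广东"},
    {"name": "C", "provice": "中国新疆"},
    {"name": "D", "provice": "加拿大"},
    {"name": "E", "provice": "中国"},
]
df = pd.DataFrame(records)

# Assigning the column names explicitly keeps the result the same across
# pandas versions, whose default names after reset_index() differ.
counts = df["provice"].value_counts().reset_index()
counts.columns = ["provice", "count"]
print(counts)

# Optional normalization: drop a leading "中国" only when more text follows,
# so a bare "中国" value is left unchanged.
df["region"] = df["provice"].str.replace(r"^中国(?=.)", "", regex=True)
region_counts = df["region"].value_counts()
print(region_counts)
```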