How Can a Beginner Write Their First Web Crawler in Python?

Originally published 2025-03-04 20:07:19

License: CC 4.0 BY-SA

Writing your first Python crawler is not hard. The steps below walk you through it from scratch.

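1. Install the required libraries

First, install two libraries: requests and BeautifulSoup (published on PyPI as beautifulsoup4). requests sends HTTP requests, and BeautifulSoup parses HTML content.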

pip install requests beautifulsoup4

2. Import the libraries

Import the required libraries in your Python script.

import requests
from bs4 import BeautifulSoup

3. Send an HTTP request

Use requests.get() to send an HTTP GET request and retrieve the page content.

url = 'https://example.com'
# Set a timeout so the request cannot hang indefinitely
response = requests.get(url, timeout=10)
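
A request can also fail before any response arrives, for example because of a malformed URL, a refused connection, or a timeout. The following is a minimal sketch of wrapping requests.get() in a helper that handles those cases. The function name fetch_html, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original walkthrough.

```python
import requests

# Hypothetical User-Agent string; replace it with one that identifies your crawler
HEADERS = {"User-Agent": "my-first-crawler/0.1"}


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request could not be completed."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # Base class for errors raised by requests (bad URL, timeout, connection failure, ...)
        print("Request error:", exc)
        return None
    return response.text
```

With this helper, fetch_html('https://example.com') would return the HTML as a string, while a malformed URL such as 'not-a-url' yields None instead of an unhandled exception.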

4. Check whether the request succeeded

Check response.status_code to confirm that the request succeeded (a status code of 200 means success).

if response.status_code == 200:
    print('Request succeeded')
else:
    print('Request failed', response.status_code)
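
As an alternative to comparing status_code by hand, requests provides response.raise_for_status(), which raises requests.exceptions.HTTPError for 4xx and 5xx responses. A minimal sketch, where the helper name is_success is an assumption made for illustration:

```python
import requests


def is_success(response):
    """Return True unless the response carries a 4xx or 5xx status code."""
    try:
        # Raises HTTPError for client (4xx) and server (5xx) error codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print("Request failed:", exc)
        return False
    return True
```

This treats any non-error status as success, which is slightly broader than checking for exactly 200.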

5. Parse the HTML

Use BeautifulSoup to parse the HTML content so that you can extract the data you are interested in.

soup = BeautifulSoup(response.text, 'html.parser')
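
To see what the parser produces without touching the network, BeautifulSoup can be given an HTML string directly. The sketch below uses a made-up HTML snippet (sample_html is an assumption for illustration) with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags are reachable as attributes, or through find() with a tag name and filters
heading = soup.h1.get_text()
intro = soup.find("p", class_="intro").get_text()

print(heading)
print(intro)
```

Attribute access such as soup.h1 returns the first matching tag, while find() lets you narrow the match by attributes such as the CSS class.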

6. Extract data

Suppose you want to extract the page title. You can use the following code:

# soup.title is None when the page has no <title> tag
title = soup.title.string if soup.title else None
print('Page title:', title)

To extract all the links, you can do this:

# href=True skips <a> tags that have no href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])
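
Many href values are relative (for example /about), so they usually need to be resolved against the page URL before they can be requested. A minimal sketch using urllib.parse.urljoin from the standard library; the function name extract_links and the sample HTML are assumptions made for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href=...> tags, in document order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", href=True):
        # Resolve relative paths such as /about against the page URL
        absolute = urljoin(base_url, tag["href"])
        if absolute not in links:
            links.append(absolute)
    return links


# Made-up HTML used only for illustration
sample_html = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/">Other</a> '
    '<a href="/about">Again</a> '
    '<a>No href</a>'
)
print(extract_links(sample_html, "https://example.com/index.html"))
```

Links that are already absolute pass through urljoin unchanged, so the same function handles both kinds.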

7. Complete example

Here is a complete example that fetches a page's title and all of its links:

import requests
from bs4 import BeautifulSoup

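# Target URL
url = 'https://example.com'

# Send the HTTP request (the timeout keeps it from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    print('Request succeeded')

    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the page title (soup.title is None if there is no <title> tag)
    title = soup.title.string if soup.title else None
    print('Page title:', title)

    # Extract all links, skipping <a> tags without an href attribute
    print('Page links:')
    for link in soup.find_all('a', href=True):
        print(link['href'])
else:
    print('Request failed', response.status_code)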