Python Web Scraping Basics: Extracting Page Content with Regular Expressions (CSDN Blog)

Article link: https://blog.csdn.net/2510_91865210/article/details/148058487

I. Environment Setup

bash

pip install requests

II. Core Code Examples

1. Send a Request and Fetch the Page Content

python

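import requests

def get_html(url):
    """Fetch a page and return its decoded text, or None on failure."""
    try:
        # A timeout keeps the call from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as errors
        # Guess the encoding from the body to avoid garbled text
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None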

2. Extract Content with Regular Expressions

python

import re

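# Extract absolute http(s) links from <a> tags
def extract_links(html):
    # (?:[^>]*?\s)? allows other attributes to appear before href
    pattern = r'<a\s+(?:[^>]*?\s)?href="(https?://[^"]+)"'
    return re.findall(pattern, html)

# Extract image URLs from the src attribute of <img> tags
def extract_images(html):
    pattern = r'<img\s+[^>]*src="([^"]+)"'
    return re.findall(pattern, html)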
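
The patterns above expect double-quoted attribute values and lowercase tag names, and real pages do not always follow that style. The following is a minimal sketch of a more tolerant link pattern, not a replacement for a real HTML parser. The names `LINK_RE` and `find_links` and the sample HTML are assumptions made up for illustration.

```python
import re

# Hypothetical variant of the link pattern: case-insensitive, accepts single
# or double quotes, and allows other attributes to appear before href.
LINK_RE = re.compile(
    r'<a\s+(?:[^>]*?\s)?href=(["\'])(https?://.*?)\1',
    re.IGNORECASE,
)

def find_links(html):
    """Return absolute http(s) URLs found in <a href=...> attributes."""
    # group(1) is the opening quote, group(2) is the URL itself
    return [match.group(2) for match in LINK_RE.finditer(html)]

# Made-up sample page fragment
sample_html = """
<a href="https://example.com/a">A</a>
<A class='nav' HREF='http://example.com/b'>B</A>
<a href="/relative/path">C</a>
"""

print(find_links(sample_html))
```

The relative link is skipped on purpose, because the pattern only accepts URLs that start with `http://` or `https://`.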

3. Complete Crawler Example

python

url = "https://example.com"
html = get_html(url)

if html:
    links = extract_links(html)
    for link in links[:5]:  # show the first 5 links
        print(link)

III. Key Regular Expression Syntax

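Symbol	Purpose	Example
`.`	Matches any single character except a newline	`a.c` matches "abc"
`*`	Previous character repeated 0 or more times	`ab*` matches "a", "ab", "abb"
`+`	Previous character repeated 1 or more times	`ab+` matches "ab", "abb"
`?`	Previous character appears 0 or 1 times; placed after another quantifier (`*?`, `+?`) it makes that quantifier non-greedy	`ab?` matches "a", "ab"
`[]`	Matches any one character listed in the brackets	`[abc]` matches "a", "b", "c"
`\d`	Matches a digit	`\d+` matches "123"
`()`	Capture group; extracts the text matched inside the parentheses	`(\d{4})` extracts a four-digit year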
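
To make the table concrete, here is a small sketch that runs each symbol through Python's `re` module. The sample strings and variable names are made up for illustration. The last two lines show the difference between a greedy and a non-greedy quantifier.

```python
import re

# One call per row of the table; the sample strings are illustrative only.
dot = re.findall(r'a.c', 'abc a-c ac')        # "ac" has nothing between a and c
star = re.findall(r'ab*', 'a ab abbb')
plus = re.findall(r'ab+', 'a ab abbb')
optional = re.findall(r'ab?', 'a ab abbb')
char_class = re.findall(r'[abc]', 'cab-d')
digits = re.findall(r'\d+', 'order 123, item 45')
year = re.search(r'(\d{4})-\d{2}', 'Published 2024-05').group(1)

# A "?" placed after another quantifier switches it from greedy to non-greedy.
text = '<b>one</b><b>two</b>'
greedy = re.findall(r'<b>(.*)</b>', text)   # runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', text)    # stops at the first </b>

for name, result in [('.', dot), ('*', star), ('+', plus), ('?', optional),
                     ('[]', char_class), (r'\d', digits), ('()', year),
                     ('greedy', greedy), ('lazy', lazy)]:
    print(name, result)
```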

IV. Things to Keep in Mind When Crawling

Follow the site's rules
- Check robots.txt: https://example.com/robots.txt
- Add a delay between requests: time.sleep(1)
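
The two rules above can be combined in code. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper names `build_robot_parser` and `polite_urls`, the sample rules, and the URLs are assumptions for illustration. It parses robots.txt text that is already in memory so that it runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

def build_robot_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def polite_urls(parser, urls, user_agent="*", delay=1.0):
    """Yield only the URLs robots.txt allows, pausing after each one."""
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            print(f"Skipped (disallowed by robots.txt): {url}")
            continue
        yield url
        time.sleep(delay)  # wait before the caller moves on to the next URL

# Made-up rules and URLs for demonstration
sample_rules = """User-agent: *
Disallow: /private/
"""
parser = build_robot_parser(sample_rules)
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
allowed = list(polite_urls(parser, candidates, delay=0))
print(allowed)
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads the rules instead of parsing a string.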
Exception handling

python

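import requests

try:
    # url is the page address defined in the complete example above
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # treat 4xx/5xx responses as failures
except requests.RequestException as e:
    print(f"Request failed: {e}")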

V. Advanced Techniques

1. Extract Structured Data

python

import re

# Extract prices (e.g. ¥99.99); the decimal part is optional, so ¥99 also matches
def extract_prices(html):
    pattern = r'¥(\d+(?:\.\d{1,2})?)'
    return re.findall(pattern, html)
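
The function above returns prices as strings. If the values need to be summed or compared, one option is to convert them to `Decimal`. The sketch below is one possible approach; `PRICE_RE`, `parse_prices`, and the sample HTML are hypothetical, and the pattern assumes prices may contain thousands separators and an optional decimal part.

```python
import re
from decimal import Decimal

# Hypothetical pattern: an optional space after the sign, digits with optional
# thousands separators, then an optional fraction of one or two digits.
PRICE_RE = re.compile(r'¥\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?')

def parse_prices(html):
    """Return every price found in the text as a Decimal."""
    prices = []
    # findall yields (whole, fraction) tuples; fraction is '' when absent
    for whole, fraction in PRICE_RE.findall(html):
        prices.append(Decimal(whole.replace(',', '') + fraction))
    return prices

# Made-up sample fragment
sample_html = '<span>¥99.99</span><span>¥1,299</span><span>¥ 45.5</span>'
prices = parse_prices(sample_html)
print(prices)
print(sum(prices))
```

`Decimal` is used instead of `float` so that amounts such as 99.99 are stored exactly.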

2. Save the Results to a File

python

# links is the list returned by extract_links() above
with open('links.txt', 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')
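
A plain text file works for a handful of links. If the results will be opened in a spreadsheet, a CSV file may be more convenient. The following sketch uses the standard `csv` module and also drops duplicate links; the function name `save_links_csv`, the column names, and the sample data are assumptions for illustration.

```python
import csv
import os
import tempfile

def save_links_csv(links, path):
    """Write unique links to a CSV file with a header row; return the count."""
    unique_links = list(dict.fromkeys(links))  # drop duplicates, keep order
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['index', 'url'])
        for index, link in enumerate(unique_links, start=1):
            writer.writerow([index, link])
    return len(unique_links)

# Made-up sample data; a temporary directory keeps the demo from
# overwriting real files.
sample_links = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/a',
]
output_path = os.path.join(tempfile.mkdtemp(), 'links.csv')
saved = save_links_csv(sample_links, output_path)
print(f"Saved {saved} links to {output_path}")
```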

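With the code and techniques above, you can quickly put together a simple web crawler. For pages with a complex structure, consider combining regular expressions with the BeautifulSoup library, which parses the HTML and is more robust than pattern matching alone.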
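
BeautifulSoup is a third-party package that has to be installed separately (`pip install beautifulsoup4`). To show the parser-based idea without an extra dependency, the sketch below swaps in the standard library's `html.parser` instead; it is not BeautifulSoup's API. The class name `LinkCollector` and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Made-up sample: mixed quoting, uppercase tags, and a commented-out link
sample_html = """
<a class="nav" href='https://example.com/a'>A</a>
<A HREF="/relative/path">B</A>
<!-- <a href="https://example.com/commented-out">hidden</a> -->
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Unlike a regular expression, the parser treats the commented-out link as a comment and does not report it.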