Python Web Scraping in Practice: Scraping CNN English News (with Multilingual Text Preprocessing)

Latest recommended article published 2025-05-23 09:16:47

yansideyucsdn

Category column: Python Web Scraping in Practice. Tags: python, web scraping, cnn

Article link: https://blog.csdn.net/yansideyucsdn/article/details/148146656

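Scraping Goals and Preparation

The goal of this walkthrough is to scrape article content from the CNN English news site and run multilingual preprocessing on the scraped text. The scraper's core task is to collect each article's headline, body text, and publication time, and to store them in a structured format. Before starting, install the following Python libraries:

requests: sends HTTP requests.
BeautifulSoup: parses HTML documents.
pandas: stores and processes the data.
langdetect: detects the language of a text.
nltk: preprocesses text.

Install them with: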

pip install requests beautifulsoup4 pandas langdetect nltk

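Analyzing the Structure of the CNN News Site

CNN article URLs usually follow the pattern https://edition.cnn.com/YYYY/MM/DD/category/article-title/index.html, where YYYY/MM/DD is the date, category is the news category, and article-title is the article's title slug. Inspecting the page structure shows that the headline sits in an <h1> tag, the body text in a <div class="article__content"> tag, and the publication time in a <div class="timestamp"> tag.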
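
The URL pattern above can be checked and taken apart in code. The following is a minimal sketch using only the standard library; the function name parse_article_url and the regular expression are illustrative assumptions based on the pattern described here, not something CNN publishes.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Path layout assumed from the pattern above:
# /YYYY/MM/DD/category/article-title[/index.html]
ARTICLE_PATH = re.compile(
    r"^/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})"
    r"/(?P<category>[^/]+)/(?P<slug>(?!index\.html)[^/]+)(?:/index\.html)?/?$"
)

def parse_article_url(url):
    """Split an article URL into date, category, and slug; return None if it does not match."""
    match = ARTICLE_PATH.match(urlparse(url).path)
    if match is None:
        return None
    try:
        published = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:  # digits that do not form a real calendar date, e.g. month 13
        return None
    return {"date": published, "category": match["category"], "slug": match["slug"]}

info = parse_article_url("https://edition.cnn.com/2023/10/01/world/sample-article/index.html")
print(info)
```

A URL that lacks the article-title segment, or whose date digits are not a valid date, yields None, which makes the function usable as a filter before any request is sent.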

Sending the Request and Parsing the Page

Use the requests library to send an HTTP request for the page, then parse the HTML document with BeautifulSoup. The code is as follows:

import requests
from bs4 import BeautifulSoup

# Target URL
url = "https://edition.cnn.com/2023/10/01/world/sample-article/index.html"

def text_of(node):
    """Return the element's text, or None if the element was not found."""
    return node.get_text(" ", strip=True) if node else None

# Send the request; the timeout keeps a stalled connection from hanging forever
response = requests.get(url, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract the headline
    title = text_of(soup.find('h1'))
    # Extract the body text
    content = text_of(soup.find('div', class_='article__content'))
    # Extract the publication time
    timestamp = text_of(soup.find('div', class_='timestamp'))
    print(f"Title: {title}\nContent: {content}\nTimestamp: {timestamp}")
else:
    print(f"Failed to retrieve the page (HTTP {response.status_code})")

Storing the Data

Save the scraped data to a CSV file for later analysis. Use pandas to convert the data into a DataFrame and write it out:

import pandas as pd

# Build the DataFrame (one row per article)
data = {
    'Title': [title],
    'Content': [content],
    'Timestamp': [timestamp]
}
df = pd.DataFrame(data)

# Save as a CSV file
df.to_csv('cnn_news.csv', index=False)
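
The DataFrame above holds a single article and overwrites cnn_news.csv on every run. When articles arrive one at a time, appending rows can be more convenient. The following is a minimal sketch that swaps pandas for the standard-library csv module; the helper name append_articles and the sample records are made up for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["Title", "Content", "Timestamp"]  # same columns as the DataFrame above

def append_articles(path, articles):
    """Append article dicts to a CSV file, writing the header only for a new or empty file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(articles)

# Demonstration with made-up records, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "cnn_news.csv")
append_articles(demo_path, [{"Title": "First", "Content": "Body one", "Timestamp": "t1"}])
append_articles(demo_path, [{"Title": "Second", "Content": "Body, two", "Timestamp": "t2"}])

# Read the file back to confirm both rows share one header
with open(demo_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
print(len(rows), rows[1]["Content"])
```

The resulting file can still be loaded later with pd.read_csv for analysis, since the column names match.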

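Multilingual Text Preprocessing

The scraped news may contain several languages, so the text needs language detection and preprocessing. Use langdetect to detect the language of the text, and nltk for tokenization and stop-word removal: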

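import nltk
from langdetect import detect, LangDetectException
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK data (newer NLTK releases look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# langdetect returns ISO 639-1 codes such as 'en', while NLTK names its
# resources by language ('english'); extend this mapping as needed
NLTK_LANGUAGES = {'en': 'english', 'es': 'spanish', 'fr': 'french', 'de': 'german'}

# Detect the language
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # Raised for empty text or text without usable features
        return 'unknown'

# Preprocess the text
def preprocess_text(text, language='en'):
    nltk_language = NLTK_LANGUAGES.get(language)
    # Tokenize, falling back to the English tokenizer for unmapped languages
    tokens = word_tokenize(text, language=nltk_language or 'english')
    # Remove stop words when NLTK has a list for this language
    if nltk_language in stopwords.fileids():
        stop_words = set(stopwords.words(nltk_language))
        tokens = [word for word in tokens if word.lower() not in stop_words]
    return tokens

# Example
text = "This is a sample text for preprocessing."
language = detect_language(text)
tokens = preprocess_text(text, language)
print(tokens)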
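
Once many articles have been collected, it can help to bucket them by detected language before preprocessing each bucket. The following is a minimal sketch in which the detector is passed in as a function; group_by_language and toy_detect are hypothetical names, and toy_detect is only a stand-in so the example runs without langdetect installed.

```python
from collections import defaultdict

def group_by_language(texts, detect_fn):
    """Group texts by the code detect_fn returns; detection failures go under 'unknown'."""
    groups = defaultdict(list)
    for text in texts:
        try:
            code = detect_fn(text)
        except Exception:  # a detector may raise on empty or symbol-only input
            code = "unknown"
        groups[code].append(text)
    return dict(groups)

# Stand-in detector for the demonstration (an assumption, not langdetect itself)
def toy_detect(text):
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if " el " in f" {text.lower()} " else "en"

samples = ["The markets rallied today.", "El mercado subió hoy.", "   "]
groups = group_by_language(samples, toy_detect)
print(groups)
```

In the real pipeline, the detect_language function defined above could be passed as detect_fn, and each bucket could then go through preprocess_text with its language code.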

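Batch Scraping and Automation

To scrape multiple articles, write a function that generates a list of URLs from a date range and a category, then loop over that list. Note that the generated URLs omit the article-title segment of the pattern described earlier, so they are date-level candidates rather than article links; confirm that they resolve before relying on them. Example code: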

import datetime

# Generate the URL list, one URL per day in the inclusive date range
def generate_urls(start_date, end_date, category):
    urls = []
    current_date = start_date
    while current_date <= end_date:
        url = f"https://edition.cnn.com/{current_date.strftime('%Y/%m/%d')}/{category}/index.html"
        urls.append(url)
        current_date += datetime.timedelta(days=1)
    return urls

# Example
start_date = datetime.date(2023, 10, 1)
end_date = datetime.date(2023, 10, 5)
category = "world"
urls = generate_urls(start_date, end_date, category)
for url in urls:
    print(url)
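
The example above only prints the URLs. A loop that actually visits each one should keep going when a single page fails. The following is a minimal sketch with the fetching and parsing steps passed in as functions; crawl, fake_fetch, and fake_parse are hypothetical names, and the fake functions stand in for real network and HTML code so the example runs offline.

```python
def crawl(urls, fetch, parse):
    """Fetch and parse each URL, skipping failures; returns (records, failed_urls)."""
    records, failed = [], []
    for url in urls:
        try:
            html = fetch(url)
            record = parse(html)
        except Exception as exc:
            # One bad page should not abort the whole batch
            print(f"Skipping {url}: {exc}")
            failed.append(url)
            continue
        record["Url"] = url  # remember where each record came from
        records.append(record)
    return records, failed

# Offline stand-ins for the demonstration (assumptions, not real CNN pages)
pages = {"https://example.com/a": "<h1>A</h1>"}

def fake_fetch(url):
    if url not in pages:
        raise LookupError("HTTP 404")
    return pages[url]

def fake_parse(html):
    return {"Title": html.replace("<h1>", "").replace("</h1>", "")}

records, failed = crawl(
    ["https://example.com/a", "https://example.com/missing"], fake_fetch, fake_parse
)
print(records, failed)
```

Returning the failed URLs separately makes it possible to retry them later or log them for inspection.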

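Error Handling and Anti-Scraping Countermeasures

Real-world scraping can run into anti-scraping measures or network errors. Setting request headers, using proxies, and adding delays can reduce the risk of being blocked: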

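import random
import time

import requests

# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Wait a random 1 to 3 seconds, then send the request
def fetch_with_delay(url):
    time.sleep(random.uniform(1, 3))
    try:
        return requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as exc:
        # Network errors (timeouts, DNS failures, and so on) are reported, not raised
        print(f"Request failed: {exc}")
        return None

# Example
url = "https://edition.cnn.com/2023/10/01/world/sample-article/index.html"
response = fetch_with_delay(url)
if response is not None and response.status_code == 200:
    print("Page fetched successfully")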
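
A fixed random delay does not help when a request fails outright. One common complement is to retry with exponential backoff. The following is a minimal sketch, not the article's own method; fetch_with_retries, its parameters, and flaky_fetch are hypothetical, and the sleep function is injectable so the demonstration does not actually wait.

```python
import random
import time

def fetch_with_retries(url, fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff plus jitter on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus up to 1 s of jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))

# Stand-in fetcher that fails twice before succeeding (an assumption for the demo)
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

waits = []  # records the requested delays instead of sleeping
result = fetch_with_retries("https://example.com/article", flaky_fetch, sleep=waits.append)
print(result, len(waits))
```

With requests, the fetch argument could be a small function that calls requests.get with the headers above and a timeout, then calls raise_for_status() on the response so that HTTP errors also trigger a retry.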