1.1 Analyzing the page structure
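First, open the Douban Movie Top250 page and inspect its structure with the browser's developer tools (F12). Each movie entry is wrapped in a <div class="item"> element; inside it, the title is in a <span class="title"> and the rating in a <span class="rating_num">. These tags and class names are what the scraper uses to extract the data.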
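To make that structure concrete, here is a minimal sketch that pulls a title and a rating out of a small HTML fragment. The fragment is hand-written for illustration and is only an assumption about what one entry looks like, not a copy of the live page. The SpanCollector class is a hypothetical helper. It uses only the standard library's html.parser so it runs without extra installs; the scraper in 1.2 does the same job with BeautifulSoup.

```python
from html.parser import HTMLParser

# Hand-written fragment that mimics one list entry (an assumption,
# not a copy of the live page).
SAMPLE_HTML = """
<div class="item">
  <div class="hd">
    <span class="title">Sample Movie</span>
    <span class="title">/ Alternate Title</span>
  </div>
  <span class="rating_num">9.0</span>
</div>
"""


class SpanCollector(HTMLParser):
    """Collect the text of every <span>, grouped by its class attribute."""

    def __init__(self):
        super().__init__()
        self.current_class = None  # class of the <span> being read, if any
        self.found = {}            # class name -> list of text contents

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_class = None

    def handle_data(self, data):
        # Only keep text that sits directly inside a <span>.
        if self.current_class is not None:
            self.found.setdefault(self.current_class, []).append(data.strip())


collector = SpanCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# An entry may carry more than one title span; take the first one.
title = collector.found["title"][0]
rating = collector.found["rating_num"][0]
print(title, rating)
```

In this sample the entry has two title spans, which is why only the first match is taken as the movie's title.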
1.2 Writing the scraper
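import requests
from bs4 import BeautifulSoup


def get_movies(url):
    # Send a browser-like User-Agent header with the request.
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail early on an HTTP error status
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(response.text, 'lxml')
    movies = []
    # Each movie entry is a <div class="item">.
    for item in soup.find_all('div', class_='item'):
        # find() returns the first match, i.e. the first title span.
        title = item.find('span', class_='title').text
        rating = item.find('span', class_='rating_num').text
        movies.append({'title': title, 'rating': rating})
    return movies


url = 'https://movie.douban.com/top250'
movies = get_movies(url)
# Print the scraped movie information
for movie in movies:
    print(movie)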
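
The script above requests a single URL, so it only sees the movies on that one page. Below is a minimal sketch of building the URLs for the rest of the list. It assumes the list is paginated 25 entries at a time through a start query parameter; neither detail is stated above, so verify both against the live site. page_urls, PAGE_SIZE and TOTAL are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # assumed number of movies per page
TOTAL = 250     # size of the list, per its name


def page_urls(base_url=BASE_URL, page_size=PAGE_SIZE, total=TOTAL):
    """Return one URL per page, using an assumed `start` offset parameter."""
    return [
        f"{base_url}?{urlencode({'start': offset})}"
        for offset in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))
print(urls[1])
```

Each URL could then be passed to get_movies in a loop, ideally with a short pause (for example time.sleep(1)) between requests to avoid putting load on the server.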
1.3 Results