Analyzing and Saving Poster Images from the Douban Top 250 Pages: Beginner Practice with BeautifulSoup4

Article link: https://blog.csdn.net/Jhonsimon/article/details/130012851

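This article shows how to use Python's requests and BeautifulSoup libraries to parse the Douban Movie Top 250 pages, select the cover images whose width is 100, and download them in a loop into a local jpg folder, naming each file after the value of the image's alt attribute. A User-Agent header is added to the HTTP requests to avoid being identified as a crawler.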


Use web-scraping techniques to parse the Douban Movie Top 250 pages and save the poster images in bulk.

Result screenshot:

Step 1: Import the requests library and the bs4 library.

import requests
from bs4 import BeautifulSoup

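Step 2: Fetch each page with the requests library's get method and parse it with BeautifulSoup. Then use the BeautifulSoup object's find_all method to find the img tags on the page, filtering for the cover images whose width is 100. Finally, loop over the filtered list with a for loop, fetch each image from its src link with requests.get, and write it into the local jpg folder, using the value of the alt attribute as the file name.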

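import os

# Browser-like User-Agent string, written in the usual parenthesized form
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
# Create the output folder up front; open() does not create missing directories
os.makedirs("jpg", exist_ok=True)
# 1. Identify the target site
# 2. Fetch the page content
for x in range(10):
    # Each page lists 25 films, so start takes the values 0, 25, ..., 225
    url = "https://movie.douban.com/top250?start={}&filter=".format(x * 25)
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, features='lxml')
    # The poster thumbnails are the <img> tags with width="100"
    imagesrc = soup.find_all('img', width="100")
    for s in imagesrc:
        # Download the image with the same headers, then save it under its alt text
        image = requests.get(s.get('src'), headers=headers).content
        with open("jpg/{}.jpg".format(s.get('alt')), 'wb') as file:
            file.write(image)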
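
To see the find_all attribute filter in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live site. The markup in SAMPLE_HTML, the example.com URLs, and the extract_posters helper are all made up for illustration and are not taken from Douban's pages. The sketch also assumes the standard-library html.parser backend rather than lxml, so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a listing page (not copied from Douban).
SAMPLE_HTML = """
<div class="item">
  <img width="100" alt="Film A" src="https://example.com/a.jpg">
  <img width="16" alt="icon" src="https://example.com/icon.png">
  <img width="100" alt="Film B" src="https://example.com/b.jpg">
</div>
"""


def extract_posters(html):
    """Return (alt, src) pairs for every <img> whose width attribute is "100"."""
    # html.parser ships with Python; the article itself uses lxml.
    soup = BeautifulSoup(html, "html.parser")
    # Keyword arguments to find_all filter on attribute values.
    matches = soup.find_all("img", width="100")
    return [(img.get("alt"), img.get("src")) for img in matches]


posters = extract_posters(SAMPLE_HTML)
print(posters)
```

The 16-pixel icon is skipped because its width attribute does not equal "100", which is the same mechanism the loop above relies on to separate posters from other images on the page.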

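Note that the HTTP requests include a User-Agent header to mimic a browser visiting the page, which reduces the risk of being identified as a crawler and then blocked or rate-limited.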
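
One possible way to avoid passing headers to every call is a requests.Session that carries the User-Agent for all requests made through it. The sketch below is an illustration under assumptions, not the article's method: make_session, fetch_html, the BROWSER_UA constant, and the 10-second timeout are names and values introduced here.

```python
import requests

# Assumed browser-like User-Agent string; any realistic value could be substituted.
BROWSER_UA = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"


def make_session(user_agent=BROWSER_UA):
    """Build a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and raise requests.HTTPError on a 4xx/5xx response."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


session = make_session()
# No network access happens here; this only inspects the configured header.
print(session.headers["User-Agent"])
```

With this in place, both the page request and the image request could go through the same session, and a blocked request would surface as an exception instead of an error page being silently saved as a .jpg file.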

Complete code:

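import os

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent string, written in the usual parenthesized form
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
# Create the output folder up front; open() does not create missing directories
os.makedirs("jpg", exist_ok=True)
# 1. Identify the target site
# 2. Fetch the page content
for x in range(10):
    # Each page lists 25 films, so start takes the values 0, 25, ..., 225
    url = "https://movie.douban.com/top250?start={}&filter=".format(x * 25)
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, features='lxml')
    # The poster thumbnails are the <img> tags with width="100"
    imagesrc = soup.find_all('img', width="100")
    for s in imagesrc:
        # Download the image with the same headers, then save it under its alt text
        image = requests.get(s.get('src'), headers=headers).content
        with open("jpg/{}.jpg".format(s.get('alt')), 'wb') as file:
            file.write(image)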
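
Because the script uses the alt text directly as the file name, a title containing a character such as "/" or ":" could make open() fail or write to an unexpected path. Whether any Top 250 title actually contains such characters is not checked here. The following is a minimal sketch of one way to guard against it; safe_filename and its fallback value are hypothetical helpers, not part of the original script.

```python
import re


def safe_filename(alt, fallback="untitled"):
    """Turn an alt text into a name that is safe to use as a file name."""
    # Replace characters rejected by Windows or POSIX paths: \ / : * ? " < > |
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", alt or "")
    # Drop surrounding whitespace and trailing dots.
    cleaned = cleaned.strip().rstrip(".")
    # Fall back to a placeholder when alt is missing or nothing usable remains.
    return cleaned or fallback


print(safe_filename("Se7en / Seven"))
print(safe_filename(None))
```

In the download loop, the file path could then be built as "jpg/{}.jpg".format(safe_filename(s.get('alt'))).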