When scraping the Maoyan Top 100 movie chart, the output file ends up empty. How can this be fixed?

飘羽

Category: Python

Article link: https://blog.csdn.net/u011808596/article/details/108808717

My test code:

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}

# Open with an explicit encoding because the output contains Chinese text.
with open('maoyan_top100.txt', 'w', encoding='utf-8') as fw:
    for page in range(10):
        # The chart is paginated by offset (0, 10, 20, ..., 90), 10 movies per page.
        url = 'http://maoyan.com/board/4?offset=' + str(page * 10)
        response = requests.get(url, headers=headers)
        html = BeautifulSoup(response.text, 'html5lib')

        titles = html.find_all('p', attrs={'class': 'name'})
        titles = [tag.get_text() for tag in titles]

        # replace(..., 1) drops the label; lstrip() would strip a set of characters, not a prefix.
        stars = html.find_all('p', attrs={'class': 'star'})
        stars = [tag.get_text().strip().replace('主演：', '', 1) for tag in stars]

        dates = html.find_all('p', attrs={'class': 'releasetime'})
        dates = [tag.get_text().replace('上映时间：', '', 1) for tag in dates]

        scores = html.find_all('p', attrs={'class': 'score'})
        scores = [tag.get_text() for tag in scores]

        imgs = html.find_all('img', attrs={'class': 'board-img'})
        imgs = [tag.get('data-src') for tag in imgs]

        # Write one tab-separated line per movie.
        for row in zip(titles, stars, dates, scores, imgs):
            fw.write('\t'.join(row) + '\n')

After running it, the generated maoyan_top100.txt file turns out to be empty.

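Problem analysis:

The cause is an HTTP 403 error: after too many requests in a short time, the site blocks access from the IP address you are currently using. Many websites deploy this kind of anti-scraping mechanism to protect themselves. Because the script never checks the status code, it parses the error page as if it were the chart. That page contains none of the expected p or img elements, so every find_all call returns an empty list and nothing is written to the file.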
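
To confirm that a 403 is what empties the file, the status code can be checked before the page is parsed. The following is a minimal sketch, not part of the original script: `check_status` and `fetch_page` are hypothetical helper names, and the 10-second timeout is an assumed value.

```python
def check_status(response):
    """Return the response unchanged if its status is 200, otherwise raise."""
    if response.status_code == 403:
        # The case described above: the site refuses requests from this IP.
        raise RuntimeError('403 Forbidden: the site is refusing requests from this IP')
    if response.status_code != 200:
        raise RuntimeError(f'Unexpected HTTP status {response.status_code}')
    return response


def fetch_page(url, headers):
    """Download one chart page and fail loudly instead of returning an error page."""
    # Third-party dependency; imported here so check_status stays dependency-free.
    import requests

    response = requests.get(url, headers=headers, timeout=10)
    return check_status(response).text
```

With this in place, a blocked request stops the script with a clear message instead of silently producing an empty file. requests also provides `response.raise_for_status()`, which raises on any 4xx or 5xx status and could be used in place of the explicit checks.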

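Solutions:

1. Change the IP address you are requesting from, for example by connecting through a phone hotspot.

2. Route the requests through an IP proxy.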
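
The proxy approach can be sketched with the `proxies` argument of `requests.get`. This is only an illustration under stated assumptions: `build_proxies` and `fetch_via_proxy` are hypothetical helper names, and the proxy address in the usage note is a placeholder that must be replaced with a proxy you actually have access to.

```python
def build_proxies(proxy_url):
    """Return the scheme-to-proxy mapping that requests expects for `proxies`."""
    # Use the same proxy for both plain HTTP and HTTPS traffic.
    return {'http': proxy_url, 'https': proxy_url}


def fetch_via_proxy(url, headers, proxy_url):
    """Download a page through the given proxy and raise on an HTTP error status."""
    # Third-party dependency; imported here so build_proxies stays dependency-free.
    import requests

    response = requests.get(
        url,
        headers=headers,
        proxies=build_proxies(proxy_url),
        timeout=10,  # assumed value; avoids hanging on an unresponsive proxy
    )
    response.raise_for_status()
    return response.text
```

In the original loop, `requests.get(url, headers=headers)` would become something like `fetch_via_proxy(url, headers, 'http://127.0.0.1:8888')`, where the address is a placeholder. Whether this gets past the block depends on the proxy's own IP not being restricted by the site.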

Congratulations, your problem should now be solved! Thanks!