Scrapy Crawler in Practice: Scraping Job Listings from Lagou (3)


onesmile5137



Licensed under CC 4.0 BY-SA

Article link: https://blog.csdn.net/onesmile5137/article/details/90768026

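This article describes the 302 redirect problem the author ran into when using a Scrapy crawler to scrape job listings from Lagou. A dynamic user-agent and disabling or using the built-in cookies were all tried without success. Even with crawl_delay set to 15s, at most 15 pages could be crawled. The author shares some of the official advice on avoiding getting banned, including rotating the user-agent, disabling cookies, and using download delays, and provides methods for obtaining a page's cookies and setting a dynamic user-agent.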


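Lagou's 302 problem has worn me out. After all sorts of tests, I have basically confirmed that neither a dynamic user-agent nor disabling cookies or using the built-in cookies can get past Lagou's 302 anti-crawling measure. Setting crawl_delay to 15s does not get past it either: after at most 15 pages the requests were redirected, which is the same problem I ran into earlier when crawling the data analyst job listing URLs…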
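For reference, here is a minimal sketch of what those three measures usually look like in a Scrapy project: a downloader middleware that picks a random User-Agent per request, plus the settings that disable cookies and add a delay. The class name `RandomUserAgentMiddleware`, the module path `myproject.middlewares`, and the User-Agent strings are assumptions made for illustration, not the exact code used in this series. As described above, this combination alone was not enough to get past Lagou's 302.

```python
import random

# Hypothetical pool of browser User-Agent strings; swap in a current list.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on every request."""

    def __init__(self, user_agents=USER_AGENT_POOL):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# settings.py fragment (assumes the class above lives in myproject/middlewares.py)
COOKIES_ENABLED = False   # do not send or store cookies
DOWNLOAD_DELAY = 15       # seconds between requests to the same site
DOWNLOADER_MIDDLEWARES = {
    # Turn off the built-in User-Agent middleware so it does not compete.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}
```

The middleware only touches `request.headers`, so it can be tried out with any object that has a `headers` mapping before wiring it into a real project.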

Below is a summary of what I have learned so far, stepping over my own corpse to keep moving forward.

Official reference on avoiding getting banned:
https://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spo