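The Lagou 302 problem has worn me out. After all kinds of testing, I have basically confirmed that using a dynamic user-agent, and either disabling cookies or using the built-in cookie handling, cannot get past Lagou's 302 anti-crawling measure. Setting crawl_delay to 15s does not get past it either: after at most 15 pages I was redirected, the same problem I ran into earlier when crawling the links for data analyst postings…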
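For reference, the setup described above can be sketched roughly as follows. This is only a minimal sketch under some assumptions, not the exact code I ran: the user-agent strings are illustrative, the module path `myproject.middlewares` is hypothetical, and I am assuming the 15s crawl_delay maps to Scrapy's `DOWNLOAD_DELAY` setting. The middleware is a plain class, since Scrapy only requires a `process_request(request, spider)` method on downloader middlewares.

```python
import random

# Illustrative pool of browser user-agent strings (assumption: any
# reasonably current browser strings would do here).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 "
    "(KHTML, like Gecko) Version/10.1.1 Safari/603.2.4",
    "Mozilla/5.0 (X11; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random entry from the pool.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# Fragment that would go into settings.py (shown as a dict so it can be
# reused as a spider's custom_settings as well).
CUSTOM_SETTINGS = {
    "COOKIES_ENABLED": False,  # the "disable cookies" variant
    "DOWNLOAD_DELAY": 15,      # assumed equivalent of the 15s crawl_delay
    "DOWNLOADER_MIDDLEWARES": {
        # Hypothetical module path for the class above.
        "myproject.middlewares.RandomUserAgentMiddleware": 400,
        # Turn off the built-in middleware that sets a fixed User-Agent.
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    },
}
```

As noted, even with a configuration along these lines the 302 redirect still showed up after at most 15 pages.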
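Below I summarize what I have learned so far, stepping over my own corpse to keep moving forward.
Official reference on avoiding getting banned: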
https://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see
COO