Step 1
Configure the basic spider information in the spider's .py file:
```python
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # httpbin.org/get echoes the request headers back, so the
        # response body shows which User-Agent was actually sent
        print('===============================================')
        print(response.text)
        print('===============================================')
        # Re-request the same URL; dont_filter=True bypasses the
        # duplicate filter so the spider keeps issuing fresh requests
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
```
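The dont_filter=True flag matters because Scrapy's scheduler drops any request whose fingerprint it has already seen. A minimal sketch of that dedup logic (a simplified stand-in, not Scrapy's actual request fingerprinting):

```python
seen = set()

def should_schedule(url, dont_filter=False):
    # Simplified stand-in for Scrapy's duplicate filter:
    # a repeated URL is dropped unless dont_filter is set
    if dont_filter:
        return True
    if url in seen:
        return False
    seen.add(url)
    return True

print(should_schedule('http://httpbin.org/get'))        # → True  (first time: scheduled)
print(should_schedule('http://httpbin.org/get'))        # → False (duplicate: dropped)
print(should_schedule('http://httpbin.org/get', True))  # → True  (dont_filter bypasses)
```

Without the flag, the second request to the same URL would be silently filtered and the spider would stop after one response.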
Step 2
Add the default request headers in settings.py:
```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
}
```
Step 3
Configure the middleware in middlewares.py. This needs the fake_useragent package, which must be installed manually: pip install fake_useragent
```python
from fake_useragent import UserAgent


class HttpuaDownloaderMiddleware:
    def process_request(self, request, spider):
        # Assign a random User-Agent to every outgoing request,
        # overriding the default set in settings.py
        request.headers['User-Agent'] = UserAgent().random
        # Returning None tells Scrapy to continue processing the request
        return None
```
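The middleware only takes effect once it is enabled in settings.py. A minimal sketch, assuming the Scrapy project module is named httpua (an assumption; adjust the dotted path to your own project name):

```python
# settings.py — enable the custom middleware
# (the module path 'httpua.middlewares' is an assumption; use your project's name)
DOWNLOADER_MIDDLEWARES = {
    'httpua.middlewares.HttpuaDownloaderMiddleware': 543,
}
```

The number 543 is the middleware's position in the chain; lower numbers run closer to the engine on the way out.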
Final step: run the spider from a launcher script
```python
from scrapy import cmdline

cmdline.execute('scrapy crawl httpbin'.split(' '))
```
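The rotation logic itself can be sketched without Scrapy or fake_useragent by picking randomly from a hand-maintained list. The strings below are illustrative examples, not fake_useragent's live database:

```python
import random

# Illustrative User-Agent strings; fake_useragent instead draws
# from a regularly updated pool of real browser strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
]

def assign_random_user_agent(headers):
    # Mirrors what process_request does to request.headers
    headers['User-Agent'] = random.choice(USER_AGENTS)
    return headers

headers = assign_random_user_agent({})
print(headers['User-Agent'])
```

This avoids fake_useragent's runtime lookup at the cost of maintaining the list yourself.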