Crawling all posts and comments from Discuz forum modules
Discuz is an open-source forum system written in PHP.
Address of the page to crawl:
Create the project
scrapy startproject discuz
C:\Users\PeiJingbo\Desktop\discuz>scrapy startproject discuz
New Scrapy project 'discuz', using template directory 'c:\program files\python37\lib\site-packages\scrapy\templates\project', created in:
C:\Users\PeiJingbo\Desktop\discuz\discuz
You can start your first spider with:
cd discuz
scrapy genspider example example.com
C:\Users\PeiJingbo\Desktop\discuz>
cd discuz
Create the spider
scrapy genspider discuz_spider discuz.net
C:\Users\PeiJingbo\Desktop\discuz\discuz>scrapy genspider discuz_spider discuz.net
Created spider 'discuz_spider' using template 'basic' in module:
discuz.spiders.discuz_spider
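The domain given to genspider ends up in the generated spider's allowed_domains list, and by default Scrapy drops requests to hosts outside that list, so the spelling of the domain matters. The snippet below is only a rough standard-library approximation of that host check, not Scrapy's own code. The function name is_allowed and the example URLs are hypothetical, and the domain is assumed to be discuz.net.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Simplified illustration only; Scrapy's real offsite filtering lives in
    its own middleware and handles more cases.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Assumed domain and hypothetical URLs, for illustration only.
allowed = ["discuz.net"]
print(is_allowed("https://www.discuz.net/forum.php", allowed))  # True
print(is_allowed("https://example.com/", allowed))              # False
```

A value that is not a real host name would match no URL under this check, which is why it is worth confirming allowed_domains in the generated spider file.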
Open the project
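Open the directory that the startproject command generated. If you open the directory one level further down instead, the project's modules can no longer be imported.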
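The import problem can be reproduced without an IDE. The sketch below assumes the IDE puts the directory you open on Python's import path, as many do. It builds a throwaway layout that resembles a Scrapy project (an outer folder containing a discuz package) and tries to import discuz.items from each level. The helper names can_import and demo and the file contents are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name, root):
    """Try to import module_name with `root` temporarily added to sys.path."""
    top_level = module_name.split(".")[0]
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(str(root))
        # Drop cached modules so the next attempt starts from scratch.
        for name in list(sys.modules):
            if name == top_level or name.startswith(top_level + "."):
                del sys.modules[name]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"   # plays the role of the generated directory
        package = project / "discuz"      # the inner package, one level down
        package.mkdir(parents=True)
        (package / "__init__.py").write_text("")
        (package / "items.py").write_text("FIELDS = ['title']\n")
        return (
            can_import("discuz.items", project),  # generated directory opened
            can_import("discuz.items", package),  # inner directory opened
        )


print(demo())
```

With the generated directory on the path, the discuz package is found. With the inner directory on the path, Python looks for a discuz package inside discuz and finds none.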
Modify the configuration
settings.py
ROBOTSTXT_OBEY = False  # Do not obey the robots.txt rules
DEFAULT_REQUEST_HEADERS = {  # Set the default request headers
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'user-agent': ' Mozilla/