1. Create the project
In the target folder, click the folder path bar, type CMD, and press Enter to open a command prompt there.
Run:
scrapy startproject planningSpider
2. Create the spider
cd into the project directory.
Run:
scrapy genspider demo "demo.cn"
demo is the spider name.
"demo.cn" is the domain to crawl; it can be changed later.
Note: the spider name must not be the same as the project name.
3. Create a main.py file in the project root
Put the following in main.py:

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'demo'])

4. Run main.py in PyCharm to start crawling
Writing the spider in demo.py:

import scrapy

class CsdnSpider(scrapy.Spider):
    name = 'demo'  # must match the name passed to "scrapy crawl" in main.py
    allowed_domains = ['planning.org.cn']  # must match the target site's domain
    start_urls = []
    for i in range(1, 3):
        start_urls.append('http://www.planning.org.cn/news/newslist?cid=11&page={}'.format(i))

    def parse(self, response):
        res = response.xpath('//div[@class="zoom mt20 news_list_boxb pb15 f12 l22 pr15"]')
        for i in res[1:]:
            title1 = i.xpath('h4/a/text()').extract_first()  # first match, or None
            title2 = i.xpath('h4/a/text()').getall()         # list of all matches
            title3 = i.xpath('h4/a/text()').get()            # same as extract_first()
            print('title1:', title1, 'title2:', title2, 'title3:', title3)
Editing settings.py:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
DOWNLOAD_DELAY = 3      # delay between consecutive requests to the same site

# default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0',
}
Rotating the User-Agent randomly

In middlewares.py:

from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Assign directly rather than setdefault(), so the random value also
        # overrides the User-Agent already set via DEFAULT_REQUEST_HEADERS.
        request.headers['User-Agent'] = UserAgent().random
Then enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'planningSpider.middlewares.RandomUserAgentMiddleware': 400,
}
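fake_useragent fetches its browser data at runtime and can fail offline. If that is a concern, a hand-maintained list works as a drop-in replacement; the sketch below uses a made-up two-entry list for illustration:

```python
import random

# Fallback middleware sketch: rotate over a hand-maintained list instead of
# fake_useragent. The two entries are examples only; extend as needed.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Direct assignment, so it also overrides any User-Agent that
        # DEFAULT_REQUEST_HEADERS already set on the request.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```

It is registered in DOWNLOADER_MIDDLEWARES exactly the same way as the fake_useragent version.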