1.1、cd into your working directory
1.2、Create the project: scrapy startproject <project_name>
1.3、cd into the project folder and generate a spider: scrapy genspider blog www.cnblogs.com
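The three setup steps above as one shell session; the project name first_scrapy and the working directory path are examples, not fixed names:

```shell
# 1.1: go to your working directory (path is a placeholder)
cd /path/to/workdir
# 1.2: create a new Scrapy project
scrapy startproject first_scrapy
# 1.3: enter the project and generate a spider named "blog"
# restricted to the domain www.cnblogs.com
cd first_scrapy
scrapy genspider blog www.cnblogs.com
```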
1.4、Configuration:
1.4.1、spider:
1、Set start_urls to the page(s) you want to crawl:
class PedailySpider(scrapy.Spider):
    start_urls = ['http://pe.pedaily.cn/vcpe/']
2、Import the item class and instantiate it inside parse:
from first_scrapy.items import MyspiderItem

class PedailySpider(scrapy.Spider):
    def parse(self, response):
        item = MyspiderItem()
        # populate item fields from the response, then yield item
3、Set cookies by overriding start_requests and passing a cookies dict to scrapy.Request:
class Git1Spider(scrapy.Spider):
    start_urls = ['https://github.com/exile-morganna']

    def start_requests(self):
        url = self.start_urls[0]
        cookies_str = '_ga=GA1.2.1190047373.1543731773; _octo=GH1.1.1199554731.1543731773; user_session=6RCB6AkOT97lY9QXs98 mHgHY6m8IScKjQPsf0i70 K6GmSeeM;'
        # turn the raw cookie string into a {name: value} dict
        cookies = {data.split('=')[0]: data.split('=')[-1] for data in cookies_str.split('; ')}
        yield scrapy.Request(
            url=url,
            cookies=cookies,
        )
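The cookie-parsing comprehension above can be tested on its own as plain Python. A minimal sketch (the helper name and the sample cookie string are made up for illustration); note it splits each pair on the first '=' only, so base64-style values containing '=' are preserved, which the `[-1]` index in the note above would mishandle:

```python
def cookies_from_string(cookies_str: str) -> dict:
    """Parse a raw 'k=v; k2=v2' Cookie header string into a dict."""
    # drop a trailing ';' if present, then split into 'k=v' pairs
    pairs = cookies_str.rstrip(';').split('; ')
    # split each pair on the FIRST '=' so values containing '=' survive
    return {p.split('=', 1)[0]: p.split('=', 1)[1] for p in pairs}

print(cookies_from_string('_ga=GA1.2.1; user_session=abc=='))
# → {'_ga': 'GA1.2.1', 'user_session': 'abc=='}
```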
1.4.2、item:
1、Define the data model: declare each field as name = scrapy.Field()
1.4.3、settings:
1、Around line 67: enable ITEM_PIPELINES;
2、Around line 20: the ROBOTSTXT_OBEY = True setting (the template default; many projects set it to False when robots.txt would block the spider);
3、Around line 19: change the user agent (USER_AGENT);
4、Around line 57: enable DOWNLOADER_MIDDLEWARES.
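The four settings above, as settings.py excerpts; the first_scrapy module paths, middleware/pipeline class names, and the user-agent string are assumptions based on the default project template:

```python
# around line 19: the user agent sent with every request (value is an example)
USER_AGENT = 'Mozilla/5.0 (compatible; first-scrapy-demo)'

# around line 20: True is the template default; set False to ignore robots.txt
ROBOTSTXT_OBEY = True

# around line 57: downloader middlewares; lower number = closer to the engine
DOWNLOADER_MIDDLEWARES = {
    'first_scrapy.middlewares.FirstScrapyDownloaderMiddleware': 543,
}

# around line 67: item pipelines; lower number = runs earlier
ITEM_PIPELINES = {
    'first_scrapy.pipelines.FirstScrapyPipeline': 300,
}
```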