1. Create the project: scrapy startproject Baidu
2. Change into the project directory: cd Baidu
3. Create the spider: scrapy genspider baidu www.baidu.com
   (the first "baidu" is the spider name, which can be changed freely; the second argument is the domain the spider will crawl)
4. Define the data structure to scrape in items.py
import scrapy

class BaiduItem(scrapy.Item):
    xxx = scrapy.Field()
    xxx = scrapy.Field()
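A scrapy.Item behaves like a dict that only accepts the fields declared on the class; assigning an undeclared field raises KeyError. A toy stand-in in plain Python to illustrate that behavior (this is not Scrapy's implementation, and the field names are hypothetical):

```python
class StrictItem(dict):
    """Toy stand-in for scrapy.Item: setting an undeclared field raises KeyError."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key!r}")
        super().__setitem__(key, value)


class BaiduItem(StrictItem):
    fields = ('title', 'url')  # hypothetical field names

item = BaiduItem()
item['title'] = 'hello'   # declared field: allowed
```

This catch-typos-early behavior is the main reason to declare fields in items.py instead of yielding plain dicts.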
5. Parse and extract data in baidu.py
import scrapy
from Baidu.items import BaiduItem

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = []  # first page of the site

    def parse(self, response):
        item = BaiduItem()
        item['name'] = xxx
        # yielding an item hands it to the item pipelines
        yield item
        # yielding a Request hands the follow-up URL back to the scheduler's queue
        yield scrapy.Request(url=url, callback=self.xxx)
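The two yield statements above drive the whole crawl: yielded items flow to the pipelines, while yielded Requests go back to the scheduler to be fetched later. A toy model of that dispatch loop in plain Python (this sketches the idea only, not Scrapy's internals; the Request stand-in and fake fetcher are assumptions):

```python
from collections import deque, namedtuple

# Stand-in for scrapy.Request: just a URL plus the callback to parse it.
Request = namedtuple('Request', 'url callback')

def crawl(start_requests, fetch):
    """Toy engine loop: dicts yielded by a callback are 'items',
    Request objects are put back on the queue."""
    queue = deque(start_requests)
    collected = []
    while queue:
        req = queue.popleft()
        response = fetch(req.url)            # downloader stand-in
        for obj in req.callback(response):   # the spider's parse method
            if isinstance(obj, Request):
                queue.append(obj)            # follow-up URL back to the scheduler
            else:
                collected.append(obj)        # item handed to the pipeline
    return collected

def parse(response):
    yield {'name': response}
    if response == 'page1':
        yield Request('page2', parse)

items = crawl([Request('page1', parse)], fetch=lambda url: url)
# items == [{'name': 'page1'}, {'name': 'page2'}]
```

The real engine adds deduplication, concurrency, and middleware, but the item-vs-Request branching is the same.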
6. Process the extracted items in pipelines.py
class BaiduPipeline(object):
    def process_item(self, item, spider):
        return item
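process_item can do more than pass items through; a common pattern is writing each item out as one line of JSON. A minimal sketch using only the standard library (the output file name is an assumption):

```python
import json

class JsonLinesPipeline:
    """Writes each item as one JSON object per line (output file name assumed)."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # always return the item so later pipelines still see it
```

open_spider and close_spider are hooks Scrapy calls once at the start and end of the crawl, so the file is opened and closed exactly once.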
7. Configure settings.py
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {'Cookie': '', 'User-Agent': ''}
ITEM_PIPELINES = {'Baidu.pipelines.BaiduPipeline': 300}
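The 300 in ITEM_PIPELINES is a priority: when several pipelines are enabled, Scrapy runs them in ascending order of that number. A quick illustration of the ordering (the second pipeline path is hypothetical):

```python
ITEM_PIPELINES = {
    'Baidu.pipelines.BaiduPipeline': 300,
    'Baidu.pipelines.CleanPipeline': 200,  # hypothetical earlier stage
}

# Lower number runs first, so items pass through CleanPipeline before BaiduPipeline.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

By convention the numbers fall in the 0-1000 range, leaving room to slot new pipelines between existing ones.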
8. Run the spider with run.py
from scrapy import cmdline

cmdline.execute('scrapy crawl baidu'.split())