1. Starting from the command line
Press Win+R, open cmd, and run the following from the project root:
scrapy crawl demo1
2. Using a script file
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "demo1"])
Here demo1 is the name defined for the spider.
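As a variant of the script above, the same command can be launched with the standard library's subprocess module instead of scrapy.cmdline (a minimal sketch; the spider name demo1 comes from this note):

```python
import subprocess

# The same argument list that execute() receives in the script above.
cmd = ["scrapy", "crawl", "demo1"]  # spider name taken from this note
print(" ".join(cmd))  # shows the exact command line that would be run

# Inside a Scrapy project you could actually launch it with:
# subprocess.run(cmd, check=True)
```

The subprocess call is commented out because it only works from inside a Scrapy project directory.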
3. Running the spider through the framework API
from demo.spiders.demo1 import Demo1Spider
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()  # create a crawler process object
process.crawl(Demo1Spider)  # register what to crawl: pass the spider class here
process.start()  # start crawling
Note that this script must live in the same directory as the Scrapy project.
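When starting a spider this way, you usually also want the project's settings.py applied; a sketch, assuming the same demo project layout as this note:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from demo.spiders.demo1 import Demo1Spider  # spider class from this note

# Load settings.py so pipelines, middlewares, LOG_LEVEL etc. take effect.
process = CrawlerProcess(get_project_settings())
process.crawl(Demo1Spider)
process.start()  # blocks until the crawl finishes
```

Without get_project_settings(), CrawlerProcess runs with Scrapy's defaults and silently ignores the project configuration.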
4. Initializing a Scrapy project
In cmd, first install the Scrapy framework with pip:
pip install scrapy
Then, still in cmd:
scrapy startproject demo
cd demo
scrapy genspider demo1 www.baidu.com //the domain here limits the spider's crawl scope
scrapy crawl demo1 //run the spider to verify the setup
5. Modifying the related configuration files
Path: demo/demo/spiders/demo1.py
Here you can see your spider's class name; mine is Demo1Spider.
import scrapy

class Demo1Spider(scrapy.Spider):
    name = 'demo1'
    allowed_domains = ['www.baidu.com']  # the scope the spider may crawl
    start_urls = ['https://www.baidu.com/']  # starting URL

    def parse(self, response):
        # Print a test message a dozen times so a successful run is obvious
        for _ in range(12):
            print("The scrapy spider has run")
In this spider file, the http in start_urls is usually changed to https.
The print calls in parse are only for testing: once initialization is done and you run the spider,
you get that line printed a dozen times, confirming the spider works.
The class name here must match the one imported in the framework startup method above; be careful not to import the wrong class.
When starting the spider by the other methods, especially the latter two,
you generally need to create a .py file under this path
and put the startup code in it.
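In practice, parse usually extracts data from the response instead of printing test lines; a minimal sketch (the title XPath and the yielded dict are assumptions for illustration, not part of the original note):

```python
import scrapy

class Demo1Spider(scrapy.Spider):
    name = 'demo1'
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/']

    def parse(self, response):
        # Yield an item instead of printing; Scrapy collects yielded dicts
        # and can export them with e.g. `scrapy crawl demo1 -o out.json`.
        yield {'title': response.xpath('//title/text()').get()}
```

The same replacement works unchanged with any of the three startup methods described above.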
About the settings file:
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'
LOG_LEVEL = "WARNING"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'demo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL sets the minimum severity of messages shown; set to WARNING, only warnings and errors are printed and the usual INFO/DEBUG output is suppressed.
Setting this is generally not recommended, because you lose most of the log output.
ROBOTSTXT_OBEY is normally left as True.
If your spider gets no response back,
set it to False instead:
True makes Scrapy honor the site's robots.txt rules, which may forbid crawling,
while False disables that check so the responses come through.
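Rather than flipping these options globally in settings.py, a single spider can override them through Scrapy's custom_settings class attribute (a sketch reusing the spider and options discussed above):

```python
import scrapy

class Demo1Spider(scrapy.Spider):
    name = 'demo1'
    # Per-spider overrides; these take precedence over settings.py
    # for this spider only, leaving other spiders in the project unaffected.
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'LOG_LEVEL': 'WARNING',
    }
```

This keeps the project-wide settings.py safe defaults while letting one spider ignore robots.txt or quiet its logging.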