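Common commands
scrapy --help
scrapy version -v
startproject
genspider a a.com (run inside a project; can be run repeatedly to generate multiple spiders)
list
view <url>
parse (parses the page with the spider's parse callback and prints the result)
shell <url> (opens the interactive Scrapy shell for testing; no project is required, and most tests are run against the response object)
response.css()
response.xpath()
response.selector.re() (re() is a selector method, not a method of the response itself)
runspider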
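XPath
For XPath usage, see the official documentation, which teaches by example and is easy to follow. An even simpler approach is to copy an element's XPath directly from Firebug.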
Initial test code
import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Follow the link of every question on the listing page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Emit one item (a plain dict) per question page.
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
To run this code, execute:
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
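The spider turns each relative href into an absolute URL with response.urljoin(). As a rough illustration of that step, the standard library's `urljoin` resolves a link against the page URL in a similar way, assuming the page has no `<base>` tag. The question path below is a made-up placeholder, not a real question.

```python
from urllib.parse import urljoin

# Stand-ins for response.url and one extracted href (the href is hypothetical).
page_url = "http://stackoverflow.com/questions?sort=votes"
relative_href = "/questions/12345/example-question"

# An absolute path replaces the path and query of the page URL.
full_url = urljoin(page_url, relative_href)
print(full_url)
```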
Saving whole pages
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the file after the last path segment of the URL.
        filename = response.url.split("/")[-2] + '.html'
        # response.body is bytes, so the file is opened in binary mode.
        with open(filename, 'wb') as f:
            f.write(response.body)
Run it with: scrapy crawl dmoz
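The file name in parse() comes from splitting the URL on "/". Because the start URLs end with a slash, the last element of the split is an empty string, so index -2 picks the final path segment. A small sketch of just that step, using the two start URLs from the spider:

```python
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

# Same expression as in DmozSpider.parse, applied to the URL strings directly.
filenames = [url.split("/")[-2] + ".html" for url in start_urls]
print(filenames)
```

A URL without the trailing slash would yield the second-to-last segment instead, so this naming scheme depends on the URL shape.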
Basic steps for using Scrapy
- Create a project
- Define the items
- Write the spider
- Configure the pipeline
- Run the crawler
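As an illustration of the pipeline step, here is a minimal sketch of an item pipeline. The class name, the output path, and the cleaning rule are assumptions made for this example; the method names open_spider, process_item, and close_spider are the hooks Scrapy calls on a pipeline object.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: trims string fields and writes one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # output path is an assumption of this sketch
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Strip surrounding whitespace from string values, leave others as-is.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in dict(item).items()
        }
        self.file.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
        return cleaned  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# In a project, the pipeline would be enabled in settings.py, for example
# (the module path is hypothetical):
# ITEM_PIPELINES = {"myproject.pipelines.JsonLinesPipeline": 300}
```

The sketch works on plain dict items, as yielded by the spiders in these notes.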
Spider execution flow
- Send a request
- Receive the response
- Extract data with selectors
- Store the items
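The scrapy.Spider class
Attributes
name
allowed_domains
start_urls
custom_settings (class-level dict that overrides the project-wide settings for this spider)
crawler
settings (the settings in effect for the running spider instance)
logger
Methods
from_crawler(): class method Scrapy calls to create the spider
start_requests(): generates the initial requests
make_requests_from_url(url): builds a request from a URL (deprecated in newer Scrapy releases)
parse(response): the default callback that parses a response
log(): legacy logging method
self.logger.info("success")
closed(reason): called when the spider closes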
Subclasses
CrawlSpider (adds the rules member, which defines which pages to follow or skip and which callback parses them, plus the parse_start_url(response) method)
XMLFeedSpider
CSVFeedSpider
SitemapSpider
request & response
class scrapy.http.Request()
class scrapy.http.Response()
Note: stripping file extensions in bulk (Windows cmd)
for %l in (*.txt) do ren "%l" "%~nl"