Scrapy Spider Prelude
- Inspect the page and identify the data to scrape
- Extract the data with XPath
- Run the spider to fetch the site's data, storing it as JSON/XML, or persisting items to a database through an item pipeline
Programmer's daily dose:
Scrapy 0.24 (the version this walkthrough uses)
```shell
scrapy startproject tutorial
cd tutorial
```
Edit the item definition:

```shell
vim tutorial/items.py
```

```python
import scrapy


class TutorialItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
```

Then generate a spider skeleton:

```shell
scrapy genspider dmoz dmoz.org
```

A first version of `tutorial/spiders/dmoz.py` that simply saves the response body to a file:

```python
# -*- coding: utf-8 -*-
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = (
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    )

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```
Unlike the official reference docs, `start_urls` here is a tuple, `()`, and the trailing comma must not be omitted. Running this spider creates a file named `Books` containing the body of the given URL, much like the way Evernote saves a web page. Ha, I could write my own web clipper~
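Why the saved file is named `Books`: the URL ends with a slash, so the last element of the split is an empty string and `[-2]` is the final path segment:

```python
# The URL ends with "/", so split("/") leaves a trailing empty string
# and index [-2] picks the last real path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
parts = url.split("/")   # [..., 'Python', 'Books', '']
filename = parts[-2]
print(filename)          # → Books
```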
```python
# -*- coding: utf-8 -*-
import scrapy

from tutorial.items import TutorialItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = (
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    )

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = TutorialItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
```
Use XPath to extract the data and put it into items.
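You don't need a running crawl to see the extraction idea. The standard library's ElementTree understands a limited XPath subset; it has no `text()` step, so `.text`, `.tail`, and `.get()` stand in for the spider's `a/text()`, `text()`, and `a/@href`. The markup below is made-up sample data, not real dmoz.org content:

```python
import xml.etree.ElementTree as ET

# Made-up sample markup shaped like the <ul>/<li> lists the spider walks.
html = """
<ul>
  <li><a href="http://example.com/book1">Book One</a> - first description</li>
  <li><a href="http://example.com/book2">Book Two</a> - second description</li>
</ul>
"""

root = ET.fromstring(html)
items = []
for li in root.findall('.//li'):          # roughly //ul/li in the spider
    a = li.find('a')
    items.append({
        'title': a.text,                  # like a/text()
        'link': a.get('href'),            # like a/@href
        'desc': (a.tail or '').strip(),   # text following the link
    })
print(items)
```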
```shell
scrapy crawl dmoz -o items.json
```
This writes the extracted items to a JSON file.
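The exported file is one JSON array, and each field is a list because `extract()` returns every XPath match. A quick way to inspect it with the standard library (the sample text below is illustrative, not real crawl output):

```python
import json

# Illustrative stand-in for items.json; real dmoz output would differ.
sample = """[
  {"title": ["Book One"], "link": ["http://example.com/book1"], "desc": [" intro "]},
  {"title": ["Book Two"], "link": ["http://example.com/book2"], "desc": [" intro "]}
]"""

items = json.loads(sample)
for item in items:
    # extract() yields lists, so take the first match (guarding empties).
    title = item["title"][0] if item["title"] else ""
    print(title, "->", item["link"][0])
```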