Over the past couple of days I have been learning Scrapy, a framework built specifically for web crawling.
After installing Scrapy, the first step is to create a project on the command line with scrapy startproject tutorial (the project here is named tutorial, which the spider's import below assumes):
Once the project is created, you can see that a Scrapy project is laid out roughly like this:
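The generated layout looks roughly as follows (exact files vary slightly between Scrapy versions; the project name tutorial matches the one used below):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step one)
        pipelines.py
        settings.py
        spiders/          # spider code goes here (step two)
            __init__.py
```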
Step one: in items.py, define the fields that the scraped content will be split into:
import scrapy

class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
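A scrapy.Item behaves much like a dict, except that only declared fields may be assigned. As a rough standard-library sketch of that behavior (this stand-in class is illustrative only, not Scrapy's actual implementation):

```python
# Illustrative stand-in for scrapy.Item semantics: dict-style access,
# but assigning an undeclared field raises KeyError (as Scrapy does).
class FakeItem(dict):
    fields = {'title', 'link', 'desc'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('unsupported field: %s' % key)
        super().__setitem__(key, value)

item = FakeItem()
item['title'] = ['Sample Book']
print(item['title'])          # ['Sample Book']
try:
    item['author'] = ['X']    # not declared above, so this is rejected
except KeyError as e:
    print('rejected:', e)
```

Declaring the fields up front means a typo in a field name fails loudly in the spider instead of silently producing incomplete records.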
Step two: create a .py file in the spiders folder and write the spider itself:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    # must be allowed_domains (plural), and it must match the site being crawled
    allowed_domains = ['dmoztools.net']
    start_urls = [
        'http://dmoztools.net/Computers/Programming/Languages/Python/Books/',
        'http://dmoztools.net/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        # To simply dump each page to a local file instead, you could do:
        # filename = response.url.split('/')[-2]
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        sel = scrapy.selector.Selector(response)
        # each listing on the page sits in a div with class "title-and-desc"
        sites = sel.xpath('//section/div/div/div/div[@class="title-and-desc"]')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/div/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            # note: the class name on the page really does end with a space
            item['desc'] = site.xpath('div[@class="site-descr "]/text()').extract()
            items.append(item)
        return items
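The key trick in parse() is the relative XPath: first select each title-and-desc node, then query within it (paths without a leading // are evaluated relative to that node). A standalone standard-library sketch of the same idea, run against a tiny hand-written snippet of markup (the sample HTML below is made up for illustration):

```python
# Standard-library illustration of the spider's relative-path extraction.
# ElementTree's limited XPath support is enough for this shape of query.
import xml.etree.ElementTree as ET

html = """<section>
  <div class="title-and-desc">
    <a href="http://example.com/book1"><div>Book One</div></a>
    <div class="site-descr ">A short description.</div>
  </div>
</section>"""

root = ET.fromstring(html)
items = []
for site in root.findall(".//div[@class='title-and-desc']"):
    items.append({
        'title': site.find('a/div').text,   # relative to this site node
        'link': site.find('a').get('href'),
        'desc': site.find("div[@class='site-descr ']").text,
    })
print(items)
```

The same two-phase pattern (outer query for record nodes, inner relative queries for fields) keeps each field lookup scoped to one record, so titles and descriptions can never get paired up across entries.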
Step three: run scrapy crawl dmoz from the project directory to start the spider and watch the scraped items appear in the log output.
To save the results to a local file, run scrapy crawl dmoz -o items.json (in recent Scrapy versions the export format is inferred from the file extension, so the old -t json flag is unnecessary).
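Each exported record carries list values, because .extract() returns every XPath match. A quick sketch of reading the JSON back (the sample record below is made up; real items.json content depends on the crawl):

```python
import json

# Hypothetical sample of one exported record; a real items.json is produced
# by the scrapy crawl ... -o items.json command above.
sample = '[{"title": ["Sample Book"], "link": ["http://example.com"], "desc": ["A description."]}]'
records = json.loads(sample)
for r in records:
    # each value is a list because .extract() returns all matches
    print(r['title'][0], '->', r['link'][0])
```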