Scrapy is an application framework.
Ways to obtain data: web scraping, APIs.
Code walkthrough:
import scrapy


class QuoteSpider(scrapy.Spider):
    # The spider name must be unique. Run it from the project root with
    # `scrapy crawl quote`, or from the spiders directory with
    # `scrapy runspider <spider file>.py`.
    name = 'quote'

    # This attribute is shorthand for defining start_requests(self)
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # Default callback for parsing each downloaded response
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Locate the "next page" element and extract its link
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
.extract_first() has a nice property: if the value does not exist it returns None instead of raising an exception. It performs the extraction step; the usual pattern is to locate the element with a selector first, then extract.
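A minimal sketch of that difference, using Scrapy's Selector on an inline HTML string (the markup is made up for illustration):

from scrapy.selector import Selector

sel = Selector(text='<div class="quote"><span class="text">Hi</span></div>')
sel.css('span.text::text').extract_first()     # 'Hi'
sel.css('small.author::text').extract_first()  # None -- the selector matched nothing
# sel.css('small.author::text').extract()[0]   # this form would raise IndexError instead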
yield {
    'text': quote.css('span.text::text').extract_first(),
    'author': quote.css('small.author::text').extract_first(),
}
What this block does:
If you comment it out and run scrapy crawl quote -o quote.json, you get an empty JSON file; yielding the dict is what actually hands each scraped item over to Scrapy (and thus to the feed export).
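For reference, with the yield left in, the export command used above behaves roughly like this (the output is sketched, not reproduced verbatim):

scrapy crawl quote -o quote.json
# quote.json then holds a JSON array of the yielded dicts:
# [
#   {"text": "...", "author": "..."},
#   ...
# ]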
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.
Scrapy is an asynchronous framework:
Here you notice one of the main advantages about Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
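How aggressively Scrapy schedules those concurrent requests is controlled through ordinary settings; a hedged sketch for a project's settings.py (the setting names are standard Scrapy settings, the values are arbitrary examples):

# settings.py (excerpt)
CONCURRENT_REQUESTS = 16            # max requests in flight at the same time
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server response times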
Regular expressions can also be used:
Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
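A small sketch of those regex helpers: a SelectorList exposes .re() and .re_first(), which apply a pattern to the selected text (the markup and pattern below are made up for illustration):

from scrapy.selector import Selector

sel = Selector(text='<span class="text">“Hello World”</span>')
sel.css('span.text::text').re_first(r'“(.*)”')  # 'Hello World'
sel.css('span.text::text').re(r'\w+')           # ['Hello', 'World']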
Some of the important features it provides:
Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:
Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders (see the shell sketch after this list).
Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
Wide range of built-in extensions and middlewares for handling:
cookies and session handling
HTTP features like compression, authentication, caching
user-agent spoofing
robots.txt
crawl depth restriction
and more
A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
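As a quick illustration of the interactive shell console mentioned in the list above (the URL is the one the spider crawls; the commented output is approximate):

scrapy shell 'http://quotes.toscrape.com/tag/humor/'
# inside the shell a `response` object is already available:
>>> response.css('div.quote span.text::text').extract_first()
# the text of the first quote on the page
>>> response.css('li.next a::attr("href")').extract_first()
# the relative link to the next page, e.g. '/tag/humor/page/2/'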
Notes from a close reading of "Scrapy at a glance"