Python Crawler(1) - Scrappy Introduce

Python Crawler(1) - Scrappy Introduce

>python --version
Python 2.7.13

>pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (python 2.7)

>pip install scrapy

https://docs.scrapy.org/en/latest/intro/overview.html
First example here quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
name="quotes"
start_urls = [
'http://quotes.toscrape.com/tag/humor/',
]

def parse(self, response):
for quote in response.css('div.quote'):
yield {
'text': quote.css('span.text::text').extract_first(),
'author': quote.xpath('span/small/text()').extract_first(),
}

next_page = response.css('li.next a::attr("href")').extract_first()
if next_page is not None:
yield response.follow(next_page, self.parse)

Command to check
>scrapy runspider quotes_spider.py -o quotes.json


https://docs.scrapy.org/en/latest/intro/tutorial.html
Start a New Project
>scrapy startproject tutorial

First Spider under spiders, quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
name = "quotes"
def start_requests(self):
urls = [
'http://quotes.toscrape.com/page/1/',
'http://quotes.toscrape.com/page/2/',
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
page = response.url.split("/")[-2]
filename = 'quotes-%s.html' % page
with open(filename, 'wb') as f:
f.write(response.body)
self.log('Save file %s' % filename)

Run the Project
>scrape crawl quotes

A shortcut to the start_requests
start_urls = [
'http://quotes.toscrape.com/page/1/',
'http://quotes.toscrape.com/page/2/',
]

This shell command will open all the DOM elements on the page
>scrapy shell 'http://quotes.toscrape.com/page/1’
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x104c3db90>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x104c3d110>
[s] spider <DefaultSpider 'default' at 0x10582e550>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser

>response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]

>response.css('title::text').extract()
[u'Quotes to Scrape’]

>response.css('title::text').extract_first()
u'Quotes to Scrape’

>response.xpath('//title/text()').extract_first()
u'Quotes to Scrape’

>quote = response.css("div.quote")[0]
>title = quote.css("span.text::text").extract_first()
>title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'

>for quote in response.css("div.quote"):
... text = quote.css("span.text::text").extract_first()
... author = quote.css("small.author::text").extract_first()
... tags = quote.css("div.tags a.tag::text").extract()
... print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}

Change the Python Script to Parse the data in Spider
def parse(self, response):
for quote in response.css('div.quote'):
yield {
'text': quote.css('span.text::text').extract_first(),
'author': quote.css('small.author::text').extract_first(),
'tags': quote.css('div.tags a.tag::text').extract(),
}

Output the JSON in somewhere
>scrapy crawl quotes -o quotes.json

>response.css('li.next a::attr(href)').extract_first()
u'/page/2/‘

Find Next Page
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)

Or alternatively

if next_page is not None:
yield response.follow(next_page, callback=self.parse)

Author Spider
import scrapy

class AuthorSpider(scrapy.Spider):
name = 'author'
start_urls = [ 'http://quotes.toscrape.com/' ]

def parse(self, response):
for href in response.css('.author + a::attr(href)'):
yield response.follow(href, self.parse_author)

for href in response.css('li.next a::attr(href)'):
yield response.follow(href, self.parse)

def parse_author(self, response):
def extract_with_css(query):
return response.css(query).extract_first().strip()

yield {
'name': extract_with_css('h3.author-title::text'),
'birthdate': extract_with_css('.author-born-date::text'),
'bio': extract_with_css('.author-description::text'),
}

>scrapy crawl author -o authors.json

Receive Parameters
>scrapy crawl quotes -o quotes-humor.json -a tag=humor

def start_requests(self):
url = 'http://quotes.toscrape.com/'
tag = getattr(self, 'tag', None)
if tag is not None:
url = url + 'tag/' + tag
yield scrapy.Request(url, self.parse)


References:
https://www.debrice.com/building-a-simple-crawler/
https://gist.github.com/debrice/a34563fb078d9d2d15e8
https://scrapy.org/
https://medium.com/python-pandemonium/develop-your-first-web-crawler-in-python-scrapy-6b2ee4baf954
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值