Scrapy Beginner Self-Study Notes

  1. Setting up the Scrapy environment
    Install Scrapy:
    pip install scrapy
    Install pywin32 (required on Windows):
    D:\>pip install pywin32
    Collecting pywin32
    Using cached pywin32-223-cp35-cp35m-win32.whl
    Installing collected packages: pywin32
    Successfully installed pywin32-223
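    You can verify the installation by printing the version; the version shown below is the one that appears in the shell logs later in these notes, yours may differ:
    D:\>scrapy version
    Scrapy 1.4.0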
  2. Creating a Scrapy project
    2.1. Create the project
    D:\tmp>scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory 'D:\ProgramFiles\Python35\lib\site-packages\scrapy\templates\project', created in:
    D:\tmp\tutorial

You can start your first spider with:
cd tutorial
scrapy genspider example example.com

D:\tmp\tutorial>tree /F
Folder PATH listing for volume NewDisk
Volume serial number is CC68-7CC0
D:.
│  scrapy.cfg              # deploy configuration file
│
└─tutorial                 # project's module, you'll import your code from here
    │  items.py            # project items definition file
    │  middlewares.py      # project middlewares file
    │  pipelines.py        # project pipelines file
    │  settings.py         # project settings file
    │  __init__.py
    │
    ├─spiders              # a directory where you'll later put your spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__

2.2. Add a spider
Add a file named quotes_spider.py under the spiders directory:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name; must not be repeated within the same project

    def start_requests(self):  # must return an iterable of Request objects
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
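As an aside, when the initial requests need no special logic, Scrapy lets you drop start_requests() and declare a start_urls class attribute instead; the default implementation builds the Requests and routes each response to parse(). A minimal equivalent sketch:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy generates the initial Requests from start_urls and
    # calls parse() on each response by default
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)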

2.3. Run
Run: scrapy crawl quotes (from the project directory, D:\tmp\tutorial)
Result: the spider crawls both pages and saves quotes-1.html and quotes-2.html in the current directory.
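If you would rather launch the crawl from a Python script than from the command line, Scrapy's CrawlerProcess supports this; a minimal sketch, assuming the script sits in the project root so the tutorial package is importable:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

# load settings.py so the run behaves like "scrapy crawl quotes"
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)  # schedule the spider
process.start()              # start the reactor; blocks until the crawl finishes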

  3. What happens behind the scenes?
    3.1. Debugging
    Enter at the command line:
    scrapy shell 'http://quotes.toscrape.com/page/1/'

Output:
D:\>scrapy shell http://quotes.toscrape.com/page/1/
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_
2018-04-06 09:55:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-06 09:56:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-06 09:56:00 [scrapy.core.engine] INFO: Spider opened
2018-04-06 09:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000000002AE83C8>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x00000000054A4550>
[s]   spider     <DefaultSpider 'default' at 0x6682e10>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
In [1]:

In [2]: request
Out[2]: <GET http://quotes.toscrape.com/page/1/>
In [3]: response
Out[3]: <200 http://quotes.toscrape.com/page/1/>
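The fetch() shortcut listed above loads another page into the same session and rebinds request and response; an illustrative exchange (assuming no redirect occurs):

In [4]: fetch('http://quotes.toscrape.com/page/2/')
In [5]: response.url
Out[5]: 'http://quotes.toscrape.com/page/2/'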

3.2. XPath
XPath syntax
Open http://quotes.toscrape.com/page/1/ in Chrome, then right-click the first quote and choose Inspect.

The XPath obtained this way (an absolute path):
/html/body/div/div[2]/div[1]/div[1]/span[1]
Run the commands:
In [7]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]')
Out[7]: [<Selector xpath='/html/body/div/div[2]/div[1]/div[1]/span[1]' data='<span class="text" itemprop="text">“The'>]

In [8]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']

In [9]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract()
Out[9]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
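Note that extract() always returns a list, even when only one node matches. Scrapy selectors also provide extract_first(), which returns the first match as a string (or None when nothing matches):

response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract_first()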
Using a relative path
Analyze the markup around the first quote:

<div class="quote" ...>
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    ...
</div>

Its content is:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
From this we derive a relative XPath:
//div[@class="quote"]/span[@itemprop="text"]/text()

In [10]: response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract()
Out[10]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
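Putting the pieces together, this relative XPath can replace the file-saving parse() from section 2.2 so that the spider yields structured items instead of raw HTML. A minimal sketch; the author selector is an assumption based on the page's markup, where the author name sits in a <small class="author"> element:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # iterate over each quote block, then select relative to it
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@itemprop="text"]/text()').extract_first(),
                # assumed selector for the author element
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
            }

Running scrapy crawl quotes -o quotes.json would then write the yielded items to a JSON file.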
