Scrapy Beginner Self-Study Notes

  1. Setting up the Scrapy environment
    Install Scrapy:
    pip install scrapy
    Install pywin32 (required on Windows):
    D:\>pip install pywin32
    Collecting pywin32
    Using cached pywin32-223-cp35-cp35m-win32.whl
    Installing collected packages: pywin32
    Successfully installed pywin32-223
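    You can verify the installation by printing the version; the version shown below is the one that appears in the shell logs later in these notes, yours may differ:
    D:\>scrapy version
    Scrapy 1.4.0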
  2. Creating a Scrapy project
    2.1. Create the project
    D:\tmp>scrapy startproject tutorial
    New Scrapy project 'tutorial', using template directory 'D:\ProgramFiles\Python35\lib\site-packages\scrapy\templates\project', created in:
    D:\tmp\tutorial

You can start your first spider with:
cd tutorial
scrapy genspider example example.com

D:\tmp\tutorial>tree /F
Folder PATH listing for volume NewDisk
Volume serial number is CC68-7CC0
D:.
│  scrapy.cfg              # deploy configuration file
│
└─tutorial                 # project's module, you'll import your code from here
    │  items.py            # project items definition file
    │  middlewares.py      # project middlewares file
    │  pipelines.py        # project pipelines file
    │  settings.py         # project settings file
    │  __init__.py
    │
    ├─spiders              # a directory where you'll later put your spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__

2.2. Add a spider
Add a file named quotes_spider.py under the spiders directory:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name; must not be repeated within the same project

    def start_requests(self):  # must return an iterable of Request objects
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
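As an aside, when the initial requests need no special logic, Scrapy lets you drop start_requests() and declare a start_urls class attribute instead; the default implementation builds the Requests and routes each response to parse(). A minimal equivalent sketch:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Scrapy generates the initial Requests from start_urls and
    # calls parse() on each response by default
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)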

2.3. Run
Run: scrapy crawl quotes (from the project directory, D:\tmp\tutorial)
Result: the spider crawls both pages and saves quotes-1.html and quotes-2.html in the current directory.
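If you would rather launch the crawl from a Python script than from the command line, Scrapy's CrawlerProcess supports this; a minimal sketch, assuming the script sits in the project root so the tutorial package is importable:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

# load settings.py so the run behaves like "scrapy crawl quotes"
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)  # schedule the spider
process.start()              # start the reactor; blocks until the crawl finishes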

  3. What happens behind the scenes?
    3.1. Debugging
    Enter at the command line:
    scrapy shell 'http://quotes.toscrape.com/page/1/'

Output:
D:\>scrapy shell http://quotes.toscrape.com/page/1/
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-04-06 09:55:59 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_
2018-04-06 09:55:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-06 09:56:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-06 09:56:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-06 09:56:00 [scrapy.core.engine] INFO: Spider opened
2018-04-06 09:56:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000000002AE83C8>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x00000000054A4550>
[s]   spider     <DefaultSpider 'default' at 0x6682e10>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
In [1]:

In [2]: request
Out[2]: <GET http://quotes.toscrape.com/page/1/>
In [3]: response
Out[3]: <200 http://quotes.toscrape.com/page/1/>
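The fetch() shortcut listed above loads another page into the same session and rebinds request and response; an illustrative exchange (assuming no redirect occurs):

In [4]: fetch('http://quotes.toscrape.com/page/2/')
In [5]: response.url
Out[5]: 'http://quotes.toscrape.com/page/2/'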

3.2. XPath
XPath syntax
Open http://quotes.toscrape.com/page/1/ in Chrome, then right-click the first quote and choose Inspect.

The XPath obtained this way (an absolute path):
/html/body/div/div[2]/div[1]/div[1]/span[1]
Run the commands:
In [7]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]')
Out[7]: [<Selector xpath='/html/body/div/div[2]/div[1]/div[1]/span[1]' data='<span class="text" itemprop="text">“The'>]

In [8]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']

In [9]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract()
Out[9]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
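Note that extract() always returns a list, even when only one node matches. Scrapy selectors also provide extract_first(), which returns the first match as a string (or None when nothing matches):

response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract_first()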
Using a relative path
Analyze the markup around the first quote:

<div class="quote" ...>
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    ...
</div>

Its content is:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
From this we derive a relative XPath:
//div[@class="quote"]/span[@itemprop="text"]/text()

In [10]: response.xpath('//div[@class="quote"]/span[@itemprop="text"]/text()').extract()
Out[10]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
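Putting the pieces together, this relative XPath can replace the file-saving parse() from section 2.2 so that the spider yields structured items instead of raw HTML. A minimal sketch; the author selector is an assumption based on the page's markup, where the author name sits in a <small class="author"> element:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # iterate over each quote block, then select relative to it
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@itemprop="text"]/text()').extract_first(),
                # assumed selector for the author element
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
            }

Running scrapy crawl quotes -o quotes.json would then write the yielded items to a JSON file.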
