www.lynda.com is a leading online training site known for high-quality video courses that keep pace with industry trends. I recently took one of its courses, Learning Python and Django, and am sharing my notes here.
Course link: https://www.lynda.com/Django-tutorials/Up-Running-Python-Django/386287-2.html
1. Introduction to Scrapy
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
Part of Scrapy's appeal is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and recent versions add support for crawling web 2.0 sites.
Modules commonly used alongside it:
--urllib2 (Python 2; merged into urllib in Python 3)
--requests
--beautiful soup
--lxml
--selenium
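Outside Scrapy, the libraries above split the job into fetching (urllib2/requests) and parsing (Beautiful Soup/lxml). A minimal stdlib-only sketch of that fetch-then-parse pattern, with the standard html.parser standing in for Beautiful Soup and an inline string standing in for a downloaded page:

```python
from html.parser import HTMLParser

# Inline HTML standing in for a page fetched with urllib2/requests
page = '<html><body><h1><a href="/">Quotes to Scrape</a></h1></body></html>'

class LinkTextGrabber(HTMLParser):
    """Collect the text inside every <a> tag, the way a parser library would."""
    def __init__(self):
        super().__init__()
        self.in_a = False
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_a = False
    def handle_data(self, data):
        if self.in_a:
            self.links.append(data)

parser = LinkTextGrabber()
parser.feed(page)
print(parser.links)   # → ['Quotes to Scrape']
```

Libraries like Beautiful Soup and lxml wrap this kind of event-driven parsing in a much friendlier API; Selenium instead drives a real browser, which is needed when pages are rendered by JavaScript.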
----------------------------------------------------------------------------------------------------------
2. Installing Scrapy
www.scrapy.org
Installation steps:
1. Create a virtual environment
mkvirtualenv scrapy_py2
2. Enter the virtual environment
workon scrapy_py2
3. Install system dependencies and Scrapy
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
pip install scrapy
Run scrapy version to confirm the install.
----------------------------------------------------------------------------------------------------------
3. A simple scrapy spider
Scrapy 1.5.0 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
scrapy version
1. Start a project
scrapy startproject quotes_spider
├── quotes_spider
│ ├── quotes_spider
│ │ ├── __init__.py
│ │ ├── items.py
│ │ ├── middlewares.py
│ │ ├── pipelines.py
│ │ ├── settings.py
│ │ └── spiders
│ │ └── __init__.py
│ └── scrapy.cfg
└── scrapy_01.txt
2. Create a spider
scrapy genspider [options] <name> <domain>
scrapy genspider quotes quotes.toscrape.com
scrapy genspider example example.com
scrapy list
3. Explore in the interactive shell (run scrapy shell first)
scrapy shell
fetch('http://quotes.toscrape.com')
response
response.css('h1')
response.xpath('h1')
response.xpath('//h1')
response.xpath('//h1/a')
response.xpath('//h1/a/text()')
response.xpath('//h1/a/text()').extract()
response.xpath('//h1/a/text()').extract_first()
response.xpath("//*[@class='tag']").extract_first()
response.xpath("//*[@class='tag-item']/a/text()").extract()
4. Set up spider.py and crawl
scrapy crawl quotes
robots.txt: webmasters use this file to give instructions to robots
about which pages of the website they should not visit.
It is also called the Robots Exclusion Protocol.
Scrapy obeys robots.txt by default; to ignore it, change this line in settings.py:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
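When ROBOTSTXT_OBEY is left on, Scrapy checks each URL against the site's robots.txt before downloading. The standard library can illustrate that check; a minimal sketch using urllib.robotparser with a made-up robots.txt (example.com and the paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

# A tiny robots.txt, as a string instead of a downloaded file
robots_txt = "User-agent: *\nDisallow: /private/"

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Disallowed path is rejected, everything else is allowed
print(rp.can_fetch("*", "http://example.com/private/page"))  # → False
print(rp.can_fetch("*", "http://example.com/public/"))       # → True
```

This is the kind of decision Scrapy's robots.txt middleware makes for every request when ROBOTSTXT_OBEY = True.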
----------------------------------------------------------------------------------------------------------
4. XPath syntax basics
from scrapy.selector import Selector
# Selector needs markup to work on, e.g. sel = Selector(text=html)
# where html is a string of page source
sel = Selector(text=html)
# Walk the tree to the title tag
sel.xpath('/html/head/title')
# Find all title tags anywhere in the document
sel.xpath('//title')
# Find all text nodes
sel.xpath('//text()')
# Find paragraphs directly under body
sel.xpath('/html/body/p')
# Find all paragraphs
sel.xpath('//p').extract()
sel.xpath('//p[1]').extract()  # XPath positions start at 1
sel.xpath('//p')[0].extract()
sel.xpath('//p/text()')[0].extract()
sel.xpath('//p/text()').extract_first()  # extract_first works on the selector list, not on an indexed element
# CSS selectors also work
sel.css('h1')
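You can try the same style of path expressions without Scrapy: the standard library's xml.etree.ElementTree supports a limited XPath dialect. A small sketch on an inline, well-formed snippet (ElementTree needs valid XML, unlike Scrapy's HTML-tolerant selectors):

```python
import xml.etree.ElementTree as ET

html = """<html><body>
<h1><a href="/">Quotes to Scrape</a></h1>
<p>first paragraph</p>
<p>second paragraph</p>
</body></html>"""

root = ET.fromstring(html)
# './/h1/a' plays the role of '//h1/a' in ElementTree's dialect
link = root.find('.//h1/a')
print(link.text)                               # → Quotes to Scrape
paras = [p.text for p in root.findall('.//p')] # all matches, like .extract()
print(paras)                                   # → ['first paragraph', 'second paragraph']
```

For real pages you would still use Scrapy selectors or lxml, which handle broken HTML and full XPath 1.0.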
----------------------------------------------------------------------------------------------------------
5. A quick exercise
----------------------------------------------------------------------------------------------------------
6. A more advanced scrapy spider
"""python
def parse(self, response):
quotes = response.xpath('//*[@class="quote"]')
for quote in quotes:
text = quote.xpath('.//*[@class="text"]/text()').extract_first()
author = quote.xpath('.//*[@itemprop="author"]/text()').extract_first()
tags = quote.xpath('.//*[@class="tag"]/text()').extract()
yield {
'text': text,
'author': author,
'tags': tags
}
next_page_url = response.xpath('//*[@class="next"]/a/@href').extract_first()
abosolute_next_page_url = response.urljoin(next_page_url)
yield scrapy.Request(abosolute_next_page_url)
"""