Python: Scrapy Module Basics (Part 1)

  • Ubuntu 18.04
  • Python 3.6
  • Scrapy 1.6.0

1 Introduction

Scrapy is a framework for crawling website data; it crawls pages automatically according to rules you define.
Installation

pip3 install Scrapy 
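
To verify the installation, print the installed version:

scrapy version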

2 Worked Example

Create a new project:

cd scrapy_html
scrapy startproject firstscrapy
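
Alternatively, Scrapy can generate a spider skeleton for you inside the project (the spider name quotes and the start domain below are just examples):

cd firstscrapy
scrapy genspider quotes quotes.toscrape.com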

Directory structure:

|-- scrapy_html
|   `-- firstscrapy
|       |-- firstscrapy
|       |   |-- __init__.py
|       |   |-- __pycache__
|       |   |   |-- __init__.cpython-36.pyc
|       |   |   `-- settings.cpython-36.pyc
|       |   |-- items.py
|       |   |-- middlewares.py
|       |   |-- pipelines.py
|       |   |-- settings.py
|       |   `-- spiders
|       |       |-- __init__.py
|       |       `-- __pycache__
|       `-- scrapy.cfg

Create a new file scrapy_test.py under the spiders folder.

  • Demo
    scrapy_test.py
import scrapy


class QuotesSpider(scrapy.Spider):
	name = "quotes"
	start_urls = [
		'http://quotes.toscrape.com/tag/humor',
	]

	def parse(self, response):
		# extract the text and author of every quote on the page
		for quote in response.css('div.quote'):
			yield {
				'text': quote.css('span.text::text').get(),
				'author': quote.xpath('span/small/text()').get(),
			}

		# follow the "Next" pagination link, if any, and re-run parse on it
		next_page = response.css('li.next a::attr("href")').get()
		if next_page is not None:
			yield response.follow(next_page, self.parse)
  • Run
scrapy runspider scrapy_test.py -o json_data.json   # run the standalone file
scrapy crawl quotes                                 # or run the spider by name from the project root
  • Result
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
  • Analysis
    (1) Define the spider class QuotesSpider, inheriting from scrapy.Spider;
    (2) Class attributes:

Attribute    Description
name         Identifies the spider; the name must be unique, so two different spiders cannot be given the same name
start_urls   List of URLs the spider crawls when it starts; the first pages fetched come from this list, and subsequent URLs are extracted from the data fetched from the initial URLs
parse()      A spider method; when each initial URL finishes downloading, the resulting Response object is passed to it as the argument. It parses the returned data, extracts the data (producing items), and generates Request objects for further URLs to process
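
As a minimal sketch of how these attributes are consumed (this is the default behavior described in point (4) below), the implicit start_requests() can be written out explicitly; the spider below is equivalent to using start_urls directly:

import scrapy

class QuotesSpider(scrapy.Spider):
	name = "quotes"

	def start_requests(self):
		# Scrapy's default start_requests() does the equivalent of this:
		# one Request per entry in start_urls, with parse as the callback
		urls = ['http://quotes.toscrape.com/tag/humor']
		for url in urls:
			yield scrapy.Request(url=url, callback=self.parse)

	def parse(self, response):
		# each downloaded Response arrives here via the callback
		self.log('Fetched %s' % response.url)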

(3) The command scrapy runspider scrapy_test.py -o json_data.json runs the spider and stores the scraped data in the file json_data.json (the feed format is inferred from the file extension);
(4) Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the parse method to each Request as its callback; once scheduled and executed, a Request produces a scrapy.http.Response object that is passed to the spider's parse method, as sketched after point (2) above;
(5) To extract items, Scrapy uses an expression mechanism based on XPath and CSS: Scrapy Selectors. Some XPath examples:

Expression               Description
/html/head/title         Selects the <title> element inside the <head> tag of the HTML document
/html/head/title/text()  Selects the text of the <title> element above
//td                     Selects all <td> elements
//div[@class="mine"]     Selects all <div> elements that have a class="mine" attribute
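
These expressions can be tried against Scrapy's Selector directly; a minimal sketch (the HTML snippet here is made up for illustration):

from scrapy.selector import Selector

html = '''<html><head><title>Demo</title></head>
<body><table><tr><td>cell</td></tr></table>
<div class="mine">hello</div></body></html>'''

sel = Selector(text=html)
print(sel.xpath('/html/head/title').get())              # <title>Demo</title>
print(sel.xpath('/html/head/title/text()').get())       # Demo
print(sel.xpath('//td/text()').get())                   # cell
print(sel.xpath('//div[@class="mine"]/text()').get())   # hello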

(6) Selector methods:

Method     Description
xpath()    Takes an XPath expression and returns a selector list of all nodes matching the expression
css()      Takes a CSS expression and returns a selector list of all nodes matching the expression
extract()  Serializes the matched nodes to unicode strings and returns them as a list
re()       Extracts data with the given regular expression and returns a list of unicode strings
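
A short sketch exercising these methods on a made-up snippet:

from scrapy.selector import Selector

sel = Selector(text='<div class="mine"><a href="/item/42">Item 42</a></div>')

print(sel.xpath('//a/text()').extract())      # ['Item 42']
print(sel.css('a::attr(href)').extract())     # ['/item/42']
print(sel.css('a::text').re(r'Item (\d+)'))   # ['42']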


