1. Scrapy crawler tutorial (getting started):
    1. Create a project:
        scrapy startproject tutorial
        tutorial is the project directory name; you can name it anything you like.
    2. Define the Item:
        An Item is used to hold the scraped data and works much like a dict.
        The one difference: an Item catches coding mistakes for you (e.g., assigning to an undeclared field raises an error).
        Code:
            import scrapy

            class DoubanMovieItem(scrapy.Item):
                # ranking
                ranking = scrapy.Field()
                # movie title
                movie_name = scrapy.Field()
                # rating
                score = scrapy.Field()
                # number of ratings
                score_num = scrapy.Field()
        Notes:
            Create a class that inherits from scrapy.Item; the class name is up to you.
            Inside it, declare the fields as attributes of type scrapy.Field().
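        A minimal sketch of what "catches coding mistakes" means in practice (the director field below is a made-up example of an undeclared field):
            item = DoubanMovieItem()
            item['ranking'] = '1'        # fine: 'ranking' is a declared Field
            print(item['ranking'])       # dict-style access -> '1'
            try:
                item['director'] = 'x'   # 'director' was never declared...
            except KeyError as e:
                print(e)                 # ...so the Item raises KeyError at once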
    3. Write the spider:
        Like the Item, a spider is also a class; it must inherit from scrapy.Spider.
        Attributes of this class:
        1. name: required; the spider's unique ID.
        2. start_urls:
            The first pages the spider should crawl, as a list.
            Code:
                start_urls = ['https://woodenrobot.me']
            Or, as an alternative, override start_requests():
                from scrapy import Request

                def start_requests(self):
                    url = 'https://movie.douban.com/top250'
                    yield Request(url, headers=self.headers)
            Notes:
                Writing it with Request objects lets you set custom headers.
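            Putting these attributes together, a minimal runnable spider might look like this (the class name and User-Agent are illustrative; the parsing logic comes in the next step):
                import scrapy
                from scrapy import Request

                class DoubanSpider(scrapy.Spider):
                    name = 'douban'                        # required unique ID
                    headers = {'User-Agent': 'Mozilla/5.0'}

                    def start_requests(self):
                        # replaces start_urls so we can attach custom headers
                        yield Request('https://movie.douban.com/top250',
                                      headers=self.headers)

                    def parse(self, response):
                        pass  # data extraction goes here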
        3. parse():
            def parse(self, response):
                ...
            The downloaded page data is wrapped in a Response object and passed into parse().
            Inside this method you:
                process (parse) the data,
                yield Items,
                yield Request objects for the next links to crawl.
            Code:
                def parse(self, response):
                    # parse the data
                    movies = response.xpath('//ol[@class="grid_view"]/li')
                    for movie in movies:
                        # build one Item per movie (inside the loop, so every
                        # yielded item is a separate object)
                        item = DoubanMovieItem()
                        item['ranking'] = movie.xpath(
                            './/div[@class="pic"]/em/text()').extract()[0]
                        item['movie_name'] = movie.xpath(
                            './/div[@class="hd"]/a/span[1]/text()').extract()[0]
                        item['score'] = movie.xpath(
                            './/div[@class="star"]/span[@class="rating_num"]/text()'
                        ).extract()[0]
                        item['score_num'] = movie.xpath(
                            './/div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
                        yield item
                    # yield a Request object for the next URL to process
                    next_url = response.xpath('//span[@class="next"]/a/@href').extract()
                    if next_url:
                        next_url = 'https://movie.douban.com/top250' + next_url[0]
                        yield Request(next_url, headers=self.headers)
    4. Run the crawl:
        scrapy crawl <spider name>
        Sample run log:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Spider opened
2014-01-23 18:13:08-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [scrapy] INFO: Closing spider (finished)
        How the run works:
        1. Scrapy creates a scrapy.Request object for each URL in the spider's start_urls list.
        2. The parse() method is attached to each Request as its callback.
        3. The Requests are scheduled and executed, producing scrapy.http.Response objects,
           which are then passed as the argument to parse().
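        Conceptually, steps 1-2 are roughly what the default start_requests() does for you; a sketch of the equivalent behavior (approximate, not the exact library source):
            # approximately what scrapy.Spider does when you only set start_urls
            def start_requests(self):
                for url in self.start_urls:
                    yield scrapy.Request(url, callback=self.parse)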
    5. Extract items (get the data you need):
        Using Scrapy Selectors.
        Basic Selector methods:
        1. xpath():
            returns a list of selectors matching the XPath expression
        2. css():
            returns a list of selectors matching the CSS expression
        3. extract():
            serializes the selected nodes to unicode strings; returns a list
        4. re():
            extracts data matching the given regular expression; returns a list of unicode strings
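        These four methods can be tried offline by building a Selector over an inline HTML string (the HTML below is made up for illustration):
            from scrapy.selector import Selector

            html = '<ul><li><a href="/a">Movie A</a> 1234人评价</li></ul>'
            sel = Selector(text=html)
            print(sel.xpath('//li/a/text()').extract())         # ['Movie A']
            print(sel.css('li a::attr(href)').extract())        # ['/a']
            print(sel.xpath('//li/text()').re(r'(\d+)人评价'))  # ['1234']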
        Structure of the response object:
            response.body:
                the response body
            response.headers:
                the response headers
            response.selector:
                a Selector built over the body
                response.selector.css()
                response.selector.xpath()
        Notes:
            Shortcut methods are provided: response.xpath() and response.css() are equivalent to calling them on response.selector.
        Nesting xpath() calls:
        Code:
            for sel in response.xpath('//ul/li'):
                title = sel.xpath('a/text()').extract()
                link = sel.xpath('a/@href').extract()
                desc = sel.xpath('text()').extract()
                print(title, link, desc)
        Why this works:
            Every .xpath() call returns a list of selectors.
            response.xpath() returns a selector list; each entry in the list is itself a selector,
            so you can call .xpath() on it again, and so on recursively.
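        A small offline sketch of the nesting (again with made-up HTML):
            from scrapy.selector import Selector

            html = '<ul><li><a href="/x">X</a></li><li><a href="/y">Y</a></li></ul>'
            for sel in Selector(text=html).xpath('//ul/li'):  # a list of selectors
                # nested .xpath() calls are relative to each <li>
                print(sel.xpath('a/text()').extract(), sel.xpath('a/@href').extract())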
    6. Using an Item just means assigning values to it:
        Code:
            for data in datas:
                # one fresh Item per record
                item = DoubanMovieItem()
                item['ranking'] = data['rank']
                item['movie_name'] = data['title']
                item['score'] = data['score']
                item['score_num'] = data['vote_count']
                yield item
        First import the module that contains the Item:
            from scrapyspider.items import DoubanMovieItem
        Instantiate the Item class:
            item = DoubanMovieItem()
        Then assign values to the instance and work with it.
        ...
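        Since an Item behaves like a dict, it can also be inspected or converted like one; a small sketch:
            item = DoubanMovieItem()
            item['movie_name'] = 'Movie A'
            print(item.get('score', 'N/A'))  # dict-style get with a default
            print(dict(item))                # plain dict, handy for JSON dumps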
    7. Following links:
        That is, collect the links on the current page that point to other pages, then follow them:
            def parse(self, response):
                for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
                    # response.urljoin() resolves the href against response.url
                    url = response.urljoin(href.extract())
                    yield scrapy.Request(url, callback=self.parse_dir_contents)

            def parse_dir_contents(self, response):
                for sel in response.xpath('//ul/li'):
                    item = DmozItem()
                    item['title'] = sel.xpath('a/text()').extract()
                    item['link'] = sel.xpath('a/@href').extract()
                    item['desc'] = sel.xpath('text()').extract()
                    yield item
        My code:
            yield Request(next_url, headers=self.headers)
            yield scrapy.Request(url, callback=self.parse_dir_contents)
        Each Request, once downloaded, triggers one call to its callback (parse() by default).
    8. Common spider patterns:
        1. Starting from an entry URL, collect a list of URLs to follow:
            import scrapy
            from tutorial.items import DmozItem

            class DmozSpider(scrapy.Spider):
                name = "dmoz"
                allowed_domains = ["dmoz.org"]
                start_urls = [
                    "http://www.dmoz.org/Computers/Programming/Languages/Python/",
                ]

                def parse(self, response):
                    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
                        url = response.urljoin(href.extract())
                        yield scrapy.Request(url, callback=self.parse_dir_contents)

                def parse_dir_contents(self, response):
                    for sel in response.xpath('//ul/li'):
                        item = DmozItem()
                        item['title'] = sel.xpath('a/text()').extract()
                        item['link'] = sel.xpath('a/@href').extract()
                        item['desc'] = sel.xpath('text()').extract()
                        yield item
        2. Pages share the same structure; flip through them page by page:
            Code pattern:
                def parse_articles_follow_next_page(self, response):
                    for article in response.xpath("//article"):
                        item = ArticleItem()
                        # ... extract article data here
                        yield item
                    next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
                    if next_page:
                        url = response.urljoin(next_page[0].extract())
                        yield scrapy.Request(url, self.parse_articles_follow_next_page)
            Pagination example:
                from scrapy.spiders import Spider
                from scrapyspider.items import DoubanMovieItem
                from scrapy import Request

                class DoubanMovieTop250Spider(Spider):
                    name = 'douban_movie_top250'
                    start_urls = ['https://movie.douban.com/top250']
                    # alternative: start_urls = ['http://woodenrobot.me']
                    headers = {
                        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
                    }

                    def start_requests(self):
                        # start_requests() must return an iterable containing the
                        # first Request(s) the spider will crawl.
                        url = 'https://movie.douban.com/top250'
                        # When execution reaches a yield, the generator hands back
                        # one value; on the next iteration it resumes right after
                        # the yield, with all local variables intact, and runs
                        # until it hits the next yield.
                        yield Request(url, headers=self.headers)

                    def parse(self, response):
                        # dump the raw page for debugging
                        with open('./data.html', 'wb') as f:
                            f.write(response.body)
                        movies = response.xpath('//ol[@class="grid_view"]/li')
                        for movie in movies:
                            # one fresh Item per movie
                            item = DoubanMovieItem()
                            item['ranking'] = movie.xpath(
                                './/div[@class="pic"]/em/text()').extract()[0]
                            item['movie_name'] = movie.xpath(
                                './/div[@class="hd"]/a/span[1]/text()').extract()[0]
                            item['score'] = movie.xpath(
                                './/div[@class="star"]/span[@class="rating_num"]/text()'
                            ).extract()[0]
                            yield item
                        # extract() returns a list; extract_first() returns a string
                        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
                        print(next_url, 'next_url')
                        if next_url:
                            next_url = 'https://movie.douban.com/top250' + next_url[0]
                            # the generator keeps looping: each yielded Request
                            # eventually triggers another call to parse()
                            yield Request(next_url, headers=self.headers)
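            In newer Scrapy versions (1.4+), response.follow() can replace the manual URL join; a sketch of the same next-page logic under that assumption:
                next_page = response.xpath('//span[@class="next"]/a/@href').extract_first()
                if next_page:
                    # follow() resolves the relative href against response.url
                    yield response.follow(next_page, callback=self.parse,
                                          headers=self.headers)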
    9. Save the scraped data:
        scrapy crawl dmoz -o items.json
        dmoz: the spider's name (its name attribute)
        items.json: the output file items.json that gets generated
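        Other feed formats work the same way; the exporter is picked from the file extension (the spider name here assumes the Douban example above):
            scrapy crawl douban_movie_top250 -o movies.csv   # CSV export
            scrapy crawl douban_movie_top250 -o movies.xml   # XML export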