A Simple Scrapy Tutorial -- Study Notes on Python's Scrapy Crawler Framework

# coding=utf-8
from scrapy.spider import Spider
from getblog.items import BlogItem
from scrapy.selector import Selector


class BlogSpider(Spider):
    # identifying name of the spider
    name = 'blog'
    # starting URL(s)
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)  # XPath selector
        # Select every div whose class attribute is "post_item",
        # then take everything inside its second child div.
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # the text ('text()') of the a tag under the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # likewise, the text of the p tag with class "post_item_summary"
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
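For reference, the BlogItem imported above is declared in getblog/items.py. A minimal sketch, with the two fields assumed from what parse() fills in:

# getblog/items.py -- minimal sketch, field names taken from the spider above
from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()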

(4) Run the spider:

scrapy crawl blog  # that is all it takes

(5) Output to a file.

Configure the feed export in settings.py:

# output file location
FEED_URI = 'blog.xml'
# output format: can be json, xml or csv
FEED_FORMAT = 'xml'

The output file is written to the project root directory.
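For a different format you only need to change those two settings, for example (a sketch; the file name is arbitrary):

# settings.py -- JSON output instead of XML
FEED_URI = 'blog.json'
FEED_FORMAT = 'json'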

II. The basics -- scrapy.spider.Spider

(1) Using the interactive shell

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"

2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)

2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django

2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}

2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState

2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats

2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware

2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:

2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024

2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081

2014-08-21 04:09:11+0800 [default] INFO: Spider opened

2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) (referer: None)

[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 http://www.baidu.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()            Shell help (print this help)
[s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
[s]   view(response)     View response in a browser

>>>

# response.body              the full body of the fetched page
# response.xpath('//ul/li')  test any XPath expression against the response

More importantly, as the Scrapy documentation puts it: if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() map to response.selector.xpath() and response.selector.css().

In other words, you can interactively check whether an XPath expression selects what you expect. I used to pick selectors with Firefox's F12 developer tools, but that did not always produce an expression that actually matched the content.

You can also run:

scrapy shell 'http://scrapy.org' --nolog

# the --nolog flag suppresses the log output
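For example, a quick interactive check might look like this (a sketch; the actual return values depend on the page you fetched):

>>> # test an XPath expression and inspect the list it returns
>>> response.xpath('//title/text()').extract()
>>> # the equivalent CSS selector
>>> response.css('title::text').extract()
>>> # fetch another page without leaving the shell
>>> fetch('http://scrapy.org/')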

(2) Example

from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
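The DmozItem imported above would be declared in scrapy_test/items.py. A minimal sketch, with the field names taken from what parse() assigns:

# scrapy_test/items.py -- minimal sketch
from scrapy.item import Item, Field


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()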

(3) Saving to a file

Scraped items can be saved to a file; the format can be json, xml or csv:

scrapy crawl dmoz -o a.json -t json

(4) Creating a spider from a template

scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
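The generated parse() is only a stub. A minimal way to fill it in, just to confirm the spider actually reaches the page (a sketch; it only logs the page title):

    def parse(self, response):
        # log the page title to verify the spider reached the page
        title = response.xpath('//title/text()').extract()
        self.log('page title: %s' % title)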

That is it for this part. I remember there were five points originally, but now I can only recall four. :-(

And always remember to hit the save button as you go, otherwise losing your work really ruins your mood (⊙o⊙)!

III. Advanced -- scrapy.contrib.spiders.CrawlSpider

Example:

# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import scrapy


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # rules is a tuple of Rule objects
    rules = (
        # follow links matching 'category.php' (but not 'subsection.php');
        # no callback, so these pages are only crawled for further links
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
        # parse links matching 'item.php' with the parse_item callback
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page : %s' % response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re('ID:(\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
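Note that a bare scrapy.Item() has no declared fields, so the assignments above would raise a KeyError when actually run; in a real project the item class is declared first and instantiated in parse_item(). A sketch of such a class (the name TestItem is an assumption for this example):

# items.py -- hypothetical item class for the CrawlSpider example
from scrapy.item import Item, Field


class TestItem(Item):
    id = Field()
    name = Field()
    description = Field()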

Other generic spiders include the following (a minimal XMLFeedSpider sketch follows the list):

class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider
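A minimal XMLFeedSpider sketch, assuming a feed at a made-up URL whose <entry> nodes contain a <title> element:

from scrapy.contrib.spiders import XMLFeedSpider


class FeedSpider(XMLFeedSpider):
    name = 'feed'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']  # hypothetical feed URL
    itertag = 'entry'  # parse_node() is called once per <entry> node

    def parse_node(self, response, node):
        # node is a Selector positioned on a single <entry> element
        self.log('entry title: %s' % node.xpath('title/text()').extract())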

IV. Selectors

>>> from scrapy.selector import Selector

>>> from scrapy.http import HtmlResponse

You can mix .css() and .xpath() freely to pick out target data quickly.

Selectors deserve a closer look: xpath(), css(), and regular expressions via .re() all need more practice.

When selecting by class, prefer css() for the class match, then chain xpath() to pull out the element's text or attributes (see the sketch below).
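For instance, reusing the cnblogs markup from the first spider (class names assumed from that example):

>>> # match the block by class with css(), then drill down with xpath()
>>> for post in response.css('div.post_item'):
...     title = post.xpath('.//h3/a/text()').extract()
...     link = post.xpath('.//h3/a/@href').extract()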

V. Item Pipeline

Typical uses for item pipelines are:

• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database

(1) Validating data

from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            return item
        else:
            raise DropItem('Missing price in %s' % item)

(2) Writing a JSON file

import json


class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('json.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

(3) Checking for duplicates

from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found: %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
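For any of these pipelines to actually run, it also has to be enabled in settings.py via ITEM_PIPELINES. A sketch, assuming the classes above live in a module named myproject.pipelines (the project name is made up here); lower numbers run earlier:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}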

As for writing the data to a database, that should be just as easy: do the insert inside process_item (a rough sketch follows).
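A minimal sketch using the standard-library sqlite3 module; the database file, table, and column names are invented for illustration:

import sqlite3


class SqlitePipeline(object):
    def __init__(self):
        self.conn = sqlite3.connect('items.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS blog (title TEXT, summary TEXT)')

    def process_item(self, item, spider):
        # xpath().extract() returns lists, so join them into plain strings first
        title = ''.join(item.get('title', []))
        summary = ''.join(item.get('desc', []))
        self.conn.execute('INSERT INTO blog VALUES (?, ?)', (title, summary))
        self.conn.commit()
        return item

Remember to register it in ITEM_PIPELINES as shown above.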
