html dom game, html dom

Latest recommended article published on 2022-05-18 08:28:43

weixin_39844901


Scrapy Beginner Study Notes, Part 2

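The data a spider brings back has to be parsed. That starts with XPath: XPath is a language for locating information in XML documents, navigating through them by element and attribute. XPath is a W3C standard; I will study it properly when I have time. For now, a quick look at the HTML DOM (Document Object Model) is enough.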
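As a rough illustration of navigating a document by element and attribute, here is a minimal sketch that uses Python's standard-library ElementTree rather than Scrapy. ElementTree supports only a subset of XPath, so attribute values are read with `get()` instead of an `@href` step. The sample document and the function names `head_links` and `page_title` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the crawled site.
SAMPLE = """<html>
  <head>
    <title>Example page</title>
    <link rel="stylesheet" href="/style.css"/>
    <link rel="alternate" href="/feed"/>
  </head>
  <body><p>Hello</p></body>
</html>"""


def head_links(xml_text):
    """Return the href attribute of every <link> directly under <head>."""
    root = ET.fromstring(xml_text)  # root is the <html> element
    # "head/link" walks element by element: html -> head -> link.
    return [link.get("href") for link in root.findall("head/link")]


def page_title(xml_text):
    """Return the text of <head>/<title>, or None if it is missing."""
    root = ET.fromstring(xml_text)
    return root.findtext("head/title")


if __name__ == "__main__":
    print(head_links(SAMPLE))
    print(page_title(SAMPLE))
```

This only works on well-formed XML. Real-world HTML is usually not well formed, which is one reason to rely on Scrapy's own selectors for actual crawling.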

First, experiment with interactive parsing in the Scrapy shell, using the following command:

# scrapy shell http://www.simonzhang.net/

Scrapy fetches the page and then enters the shell:

DEBUG: Crawled (200) (referer: None)

[s] Available Scrapy objects:

[s] hxs

Modify the spider to extract the links in the head section of the page. The code is as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "simonzhang"
    allowed_domains = ["simonzhang.net"]
    start_urls = [
        "http://www.simonzhang.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        site_code = hxs.select('//html/head')
        for l in site_code:
            _link = l.select('link/@href').extract()
            print "================"
            print _link
            print "================"
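To make it clearer what selecting `link/@href` under the head element amounts to, here is a minimal sketch of the same extraction written with Python 3's standard-library `html.parser` instead of Scrapy. It is not how Scrapy works internally. The class `HeadLinkParser`, the function `extract_head_links` and the sample markup are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, not the real content of the crawled site.
SAMPLE_HTML = """<html><head>
<title>Demo</title>
<link rel="stylesheet" href="/a.css">
<link rel="icon" href="/favicon.ico" />
</head><body><link href="/ignored.css"></body></html>"""


class HeadLinkParser(HTMLParser):
    """Collect href values of <link> tags that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_links(html_text):
    """Return the list of hrefs found on <link> tags within <head>."""
    parser = HeadLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_head_links(SAMPLE_HTML))
```

As with `extract()` in the spider, the result is a list of strings, one per matching link.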

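Run the spider to see the scraped results. The results need to be saved, which is where items come in. An Item object works like a Python dictionary, associating field names with values. Edit the items.py file in the directory above spiders.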

from scrapy.item import Item, Field

class ScrapytestItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    head_link = Field()
    head_meta = Field()
    pass
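The dictionary-like behaviour of an item can be imitated in plain Python. The sketch below is a hypothetical stand-in, not Scrapy's implementation: `SimpleItem`, `HeadItem` and `build_item` are invented names, and the only behaviour modelled is that values are stored by key and that assigning to an undeclared field raises `KeyError`.

```python
class SimpleItem(dict):
    """Hypothetical stand-in for an item: a dict limited to declared fields."""

    fields = ()  # subclasses list the field names they accept

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        super().__setitem__(key, value)


class HeadItem(SimpleItem):
    # Same field names as ScrapytestItem in items.py.
    fields = ("title", "head_link", "head_meta")


def build_item():
    """Fill an item with made-up values, as a spider's parse() might."""
    item = HeadItem()
    item["title"] = ["Demo"]
    item["head_link"] = ["/a.css"]
    return item


if __name__ == "__main__":
    item = build_item()
    print(dict(item))
    try:
        item["author"] = "nobody"  # not a declared field
    except KeyError as exc:
        print("rejected:", exc)
```

Declaring the fields up front means a typo in a field name fails immediately instead of silently creating a new key.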

Then the spider file needs to be modified as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from simonzhang.items import ScrapytestItem  # ScrapytestItem is the class defined in items.py, one directory up

class DmozSpider(BaseSpider):
    name = "simonzhang"
    allowed_domains = ["simonzhang.net"]
    start_urls = [
        "http://www.simonzhang.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        site_code = hxs.select('//html/head')
        items = []
        for l in site_code:
            item = ScrapytestItem()
            item['title'] = l.select('title/text()').extract()
            item['head_link'] = l.select('link/@href').extract()
            item['head_meta'] = l.select('meta').extract()
            items.append(item)
        return items

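Run the following command to start the crawl. On success, a JSON file containing the scraped content is created in the same directory. This is sufficient for small scraping projects.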

scrapy crawl simonzhang -o simonzhang.json -t json

In the command above, "-o" specifies the output file and "-t" specifies the output format. For more options, see "scrapy crawl --help".
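Once the crawl has written its output, the file can be read back with Python's `json` module. The sketch below assumes the file holds a JSON array with one object per item, and that each field is a list of strings, since `extract()` returns lists. The function names `load_items` and `summarize` are hypothetical.

```python
import json


def load_items(path):
    """Load the exported items; assumes the file holds a JSON array."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(items):
    """Return a (first title, number of head links) pair for each item."""
    summary = []
    for item in items:
        titles = item.get("title") or [""]  # fall back when no title was found
        summary.append((titles[0], len(item.get("head_link", []))))
    return summary


if __name__ == "__main__":
    # Assumes simonzhang.json was produced by the crawl command above.
    for title, link_count in summarize(load_items("simonzhang.json")):
        print(title, link_count)
```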

Scrapy Beginner Study Notes, Part 1

http://www.simonzhang.net/wp-admin/post.php?post=1108&action=edit
