Python Scrapy: passing parameters and identifiers to parse

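While writing a crawler for foursquare, I needed the parse function to save each page to a file named after the userid. The simplest way to do this is to append id=[userid] as a query parameter when building the initial URLs: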

# One URL per user id from 99660 to 99664; the id is repeated as a query parameter
start_urls = [
    'http://foursquare.com/user/%d?id=%d' % (n, n) for n in range(99660, 99665)
]
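The foursquare server ignores this extra parameter, so the request still returns the correct page. Because the parameter stays in the URL, parse can recover the userid from response.url.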

def parse(self, response):
    # The userid is whatever follows the last "id=" in the final URL
    user_id = response.url.strip().split("id=")[-1]
    with open(user_id + ".txt", "w") as fw:
        ...
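Splitting on "id=" works for the URLs above, but it would also pick up anything that follows the value if another query parameter were ever appended. A slightly more defensive variant is sketched below, assuming Python 3 and only the standard library; the helper name extract_user_id is hypothetical and not part of the original spider.

```python
from urllib.parse import urlsplit, parse_qs


def extract_user_id(url):
    """Return the value of the ``id`` query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("id")
    return values[0] if values else None


# Example with one of the URLs from this post
print(extract_user_id("https://foursquare.com/kluoma?id=99660"))
```

Inside parse, the same helper could be called as extract_user_id(response.url) in place of the split expression.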
The program produces the following output:

...
2016-09-21 14:05:22 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/kluoma?id=99660> from <GET https://foursquare.com/user/99660?id=99660>
2016-09-21 14:05:23 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/kluoma?id=99660> (referer: None)
https://foursquare.com/kluoma?id=99660
2016-09-21 14:05:23 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99661?id=99661> (referer: None)
https://foursquare.com/user/99661?id=99661
2016-09-21 14:05:24 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/user/99664?id=99664> from <GET http://foursquare.com/user/99664?id=99664>
2016-09-21 14:05:24 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/lucasb?id=99663> from <GET https://foursquare.com/user/99663?id=99663>
2016-09-21 14:05:25 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99664?id=99664> (referer: None)
https://foursquare.com/user/99664?id=99664
2016-09-21 14:05:26 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99662?id=99662> (referer: None)
https://foursquare.com/user/99662?id=99662
2016-09-21 14:05:28 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/lucasb?id=99663> (referer: None)
https://foursquare.com/lucasb?id=99663
...
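The log also shows why the query parameter is useful: some profiles are redirected (301) from /user/<id> to a username path such as /kluoma, so the numeric id is no longer in the path of the final URL, while the id= parameter survives the redirect. The sketch below compares the two ways of reading the id on URLs copied from the log. It assumes Python 3; the helper names id_from_path and id_from_query are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Final URLs as they appear in the crawl log above
crawled = [
    "https://foursquare.com/kluoma?id=99660",
    "https://foursquare.com/user/99661?id=99661",
    "https://foursquare.com/lucasb?id=99663",
]


def id_from_path(url):
    """Return the last path segment if it is numeric, otherwise None."""
    last = urlsplit(url).path.rstrip("/").split("/")[-1]
    return last if last.isdigit() else None


def id_from_query(url):
    """Return the value of the ``id`` query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(url).query).get("id")
    return values[0] if values else None


for url in crawled:
    # The path-based lookup fails for redirected profiles; the query-based one does not
    print(url, id_from_path(url), id_from_query(url))
```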

