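While writing a crawler for foursquare, I needed the parse function to save each page to a file named after the userid. The simplest way to do this is to add id=[userid] as a query parameter when constructing the start URLs: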
start_urls = [
    'http://foursquare.com/user/%d?id=%d' % (n, n) for n in range(99660, 99665)
]
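The foursquare server ignores this extra parameter, so the request still returns the correct page. Because the parameter stays in the URL, parse can recover the userid from response.url: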
def parse(self, response):
    # The id=<userid> parameter added in start_urls is still present in response.url
    user_id = response.url.strip().split("id=")[-1]
    with open(user_id + ".txt", "w") as fw:
        ...
The program output is as follows:
...
2016-09-21 14:05:22 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/kluoma?id=99660> from <GET https://foursquare.com/user/99660?id=99660>
2016-09-21 14:05:23 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/kluoma?id=99660> (referer: None)
https://foursquare.com/kluoma?id=99660
2016-09-21 14:05:23 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99661?id=99661> (referer: None)
https://foursquare.com/user/99661?id=99661
2016-09-21 14:05:24 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/user/99664?id=99664> from <GET http://foursquare.com/user/99664?id=99664>
2016-09-21 14:05:24 [scrapy] DEBUG: Redirecting (301) to <GET https://foursquare.com/lucasb?id=99663> from <GET https://foursquare.com/user/99663?id=99663>
2016-09-21 14:05:25 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99664?id=99664> (referer: None)
https://foursquare.com/user/99664?id=99664
2016-09-21 14:05:26 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/user/99662?id=99662> (referer: None)
https://foursquare.com/user/99662?id=99662
2016-09-21 14:05:28 [scrapy] DEBUG: Crawled (200) <GET https://foursquare.com/lucasb?id=99663> (referer: None)
https://foursquare.com/lucasb?id=99663
...
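
As the log shows, the id parameter survives the 301 redirects to the username URLs, so the userid can be read from response.url in either case. Splitting on "id=" works for these URLs, but it would return the whole URL if the parameter were ever missing. A slightly more defensive variant, sketched here with the standard library only, parses the query string instead. The helper name extract_user_id is hypothetical and not part of the original spider, and the snippet assumes Python 3 (urllib.parse).

```python
from urllib.parse import urlparse, parse_qs


def extract_user_id(url):
    """Return the value of the `id` query parameter, or None if it is absent.

    Hypothetical helper for illustration; not part of the original spider.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    # parse_qs maps each key to a list of values; take the last one if present
    return values[-1] if values else None


# URLs copied from the log output above
print(extract_user_id("https://foursquare.com/user/99661?id=99661"))
print(extract_user_id("https://foursquare.com/kluoma?id=99660"))
```

Inside parse, this could be called as extract_user_id(response.url), skipping the file write when it returns None.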