Create a Scrapy project:
scrapy startproject get_url
Create a spider inside it:
/scrapy_project/get_url/get_url/spiders$ scrapy genspider myspider weather.com.cn
Edit myspider.py so that it contains the following:
# -*- coding: utf-8 -*-
import os

import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['weather.com.cn']
    start_urls = []
    base_url = 'http://www.weather.com.cn/weather/'

    # Generate the list of URLs to crawl. There are 34 provincial-level
    # regions in total (municipalities, provinces and special
    # administrative regions).
    for i in range(1, 35):
        if i < 5:
            # Municipalities and special administrative regions:
            # iterate over their districts.
            for j in range(1, 20):
                num_str = str(101000000 + i * 10000 + j * 100)
                start_urls.append(base_url + num_str + '.shtml')
        else:
            # j is the number of cities under a province; Sichuan and
            # Guangdong have the most, 21 each.
            for j in range(1, 23):
                # k is the number of counties under a city; Baoding in
                # Hebei has the most, 26.
                for k in range(1, 27):
                    num_str = str(101000000 + i * 10000 + j * 100 + k)
                    start_urls.append(base_url + num_str + '.shtml')

    def parse(self, response):
        if response.status == 200 and len(response.body) > 1000:
            url = response.url
            city_code = url[url.rfind('/') + 1:url.rfind('.')]
            # Codes below 101050100 are municipalities or special
            # administrative regions; 101320000 is Hong Kong and
            # 101330000 is Macau.
            if (int(city_code) < 101050100) or (101320000 < int(city_code) < 101340000):
                city = response.xpath('//div[@class="crumbs fl"]/a/text()').extract()[0]
                region = response.xpath('//div[@class="crumbs fl"]/span/text()').extract()[1]
                city_name = city + '>' + region
            else:
                province = response.xpath('//div[@class="crumbs fl"]/a/text()').extract()[0]
                city = response.xpath('//div[@class="crumbs fl"]/a/text()').extract()[1]
                region = response.xpath('//div[@class="crumbs fl"]/span/text()').extract()[2]
                city_name = province + '>' + city + '>' + region
            with open(r'out.txt', 'a+') as write_file:
                write_file.write('\'' + city_name + '\':\'' + url + '\',' + os.linesep)
        return
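The URL generation above relies on weather.com.cn's nine-digit city codes: a fixed base of 101000000, plus a province index times 10000, a city index times 100, and a county index. The helper below is purely illustrative (`make_code` is not part of the spider or the site's API) and just demonstrates how the components combine:

```python
def make_code(province, city, county=0):
    """Build a nine-digit weather.com.cn-style city code from its
    province, city and county index components (hypothetical helper)."""
    return str(101000000 + province * 10000 + city * 100 + county)

# Beijing is province index 1; its first district uses city index 1:
print(make_code(1, 1))      # -> 101010100
# A county-level code adds a nonzero county index:
print(make_code(5, 2, 3))   # -> 101050203
```

This matches the first line of the output below: Beijing's urban district maps to 101010100.shtml.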
Run the spider:
/scrapy_project/get_url/get_url/spiders$ scrapy runspider myspider.py
When the crawl finishes, an out.txt file appears next to myspider.py, with contents like:
'北京>城区':'http://www.weather.com.cn/weather/101010100.shtml',
'北京>通州':'http://www.weather.com.cn/weather/101010600.shtml',
'北京>顺义':'http://www.weather.com.cn/weather/101010400.shtml',
'北京>朝阳':'http://www.weather.com.cn/weather/101010300.shtml',
'北京>怀柔':'http://www.weather.com.cn/weather/101010500.shtml',
......
These region-to-URL mappings can now be used to crawl the actual weather data.
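Since each out.txt line is already a `'name':'url',` fragment of a Python dict literal, one simple way to reuse the mapping is to wrap the file's contents in braces and parse it with `ast.literal_eval` (trailing commas are legal in dict literals). This is a sketch under that assumption about the file format, not part of the original spider:

```python
import ast

def load_city_urls(path='out.txt'):
    """Parse the out.txt produced by the spider into a dict mapping
    region names to forecast URLs (illustrative helper)."""
    with open(path, encoding='utf-8') as f:
        return ast.literal_eval('{' + f.read() + '}')

# Usage sketch (assumes the crawl has already produced out.txt):
# city_urls = load_city_urls()
# print(city_urls['北京>城区'])
```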