Tianqihoubao Weather History (天气后报网): Data Scraping with the Scrapy Framework
1. Creating the Tianqihoubao spider
Before writing any code, we analyze the Tianqihoubao site against the project requirements. The goal is to extract each city's daily temperature, weather conditions, and wind force/direction for 2016-2020. We start from the history index at http://www.tianqihoubao.com/lishi/, shown in Figure 1.
Figure 1
The index lists the cities under each province. Taking Beijing as an example, clicking the city link opens the second-level page.
Figure 2
Taking Beijing's weather for January 2011 as an example, we reach the third-level (detail) page, which contains the date, weather conditions, temperature, and wind force/direction we need.
Figure 3
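To summarize, the spider walks a three-level page hierarchy. The sample paths below are illustrative only (they reflect the link patterns observed on the site and should be checked against the live page source):

# Assumed URL hierarchy (illustrative; verify against the live site)
LEVEL_1 = "http://www.tianqihoubao.com/lishi/"                           # province/city index (Figure 1)
LEVEL_2 = "http://www.tianqihoubao.com/lishi/beijing.html"               # per-city month list (Figure 2)
LEVEL_3 = "http://www.tianqihoubao.com/lishi/beijing/month/201101.html"  # daily records for one month (Figure 3)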
With the crawl flow analyzed, coding can begin. First, switch to the directory that will hold the project on the command line, then run the following commands to create the project and the spider module:
scrapy startproject tqhbCrawl
cd tqhbCrawl
scrapy genspider -t crawl tqhb_spider "tianqihoubao.com/lishi/"
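These commands generate the standard Scrapy project layout; the files edited in the following sections live in these locations (layout of a stock scrapy startproject, shown for orientation):

tqhbCrawl/
    scrapy.cfg
    tqhbCrawl/
        __init__.py
        items.py           # Item definition (section 2)
        middlewares.py     # downloader middleware (section 5)
        pipelines.py       # CSV pipeline (section 4)
        settings.py        # project settings
        spiders/
            __init__.py
            tqhb_spider.py # spider created by genspider (section 3)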
2. Defining the Item
With the project created, the first step is to define the Item, which describes the structured data we want to extract. The code is as follows:
import scrapy


class TqhbItem(scrapy.Item):
    # city name
    city_name = scrapy.Field()
    # date
    date = scrapy.Field()
    # weather conditions
    state = scrapy.Field()
    # wind force and direction
    wind = scrapy.Field()
    # temperature
    temp = scrapy.Field()
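Scrapy Items behave like dictionaries, so the class above can be sanity-checked in a few lines (the field values here are made-up placeholders, not real scraped data):

from tqhbCrawl.items import TqhbItem

# placeholder values, used only to illustrate field access
item = TqhbItem(city_name="北京", date="2016年01月01日", state="晴/晴",
                temp="5℃/-5℃", wind="北风3-4级/北风3-4级")
print(item["city_name"])  # fields are read like dict keys
print(dict(item))         # convert to a plain dict for inspection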
3. Writing the spider module
The genspider command has already created a spider template based on the CrawlSpider class, with the class name TqhbSpiderSpider. Page parsing uses two methods. The detail_url method parses the listing shown in Figure 2 and extracts the links to the third-level pages. The parse_detail method extracts the fields shown in Figure 3. Links to the second-level pages are extracted by the rule defined in rules; since the rule only extracts matching links from the pages reached via start_urls, detail_url has to build the third-level requests itself. Note that the detail callback is named parse_detail rather than parse, because CrawlSpider uses parse internally for its rule handling and overriding it is discouraged. The complete TqhbSpiderSpider code is as follows:
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tqhbCrawl.items import TqhbItem


class TqhbSpiderSpider(CrawlSpider):
    name = 'tqhb_spider'
    allowed_domains = ['tianqihoubao.com']
    start_urls = ['http://tianqihoubao.com/lishi']

    rules = (
        # follow links matching .../lishi/....html from the index page (second-level city pages)
        Rule(LinkExtractor(allow='.+lishi.+html'), callback="detail_url", follow=False),
    )

    def detail_url(self, response):
        # collect links to the third-level (monthly detail) pages
        base_url = "http://tianqihoubao.com"
        # keep only the year blocks covering the 2016-2020 range (the exact slice depends on the page layout)
        divs = response.xpath("//div[@class='box pcity']")[5:9]
        detail_urls = divs.xpath(".//a/@href").getall()
        for detail_url in detail_urls:
            yield scrapy.Request(base_url + detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        # extract city, date, weather conditions, temperature and wind for every table row
        city_name = response.xpath('//div[@id="s-calder"]/h2/text()').get()
        city_name = ''.join(re.findall(r'[^0-9]', city_name))[:-9]
        trs = response.xpath("//tr")[1:]
        for tr in trs:
            tds = tr.xpath(".//td")
            date = tds[0].xpath(".//text()").getall()
            date = "".join(''.join(date).split())
            state = tds[1].xpath(".//text()").getall()
            state = "".join(''.join(state).split())
            temp = tds[2].xpath(".//text()").getall()
            temp = "".join(''.join(temp).split())
            wind = tds[3].xpath(".//text()").getall()
            wind = "".join(''.join(wind).split())
            item = TqhbItem(
                city_name=city_name,
                date=date,
                state=state,
                temp=temp,
                wind=wind)
            yield item
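Before running the full crawl, the XPath expressions can be checked interactively with scrapy shell against a single detail page (the URL below is only an example of a third-level page; substitute one copied from the page in Figure 2):

scrapy shell "http://www.tianqihoubao.com/lishi/beijing/month/201101.html"
>>> response.xpath('//div[@id="s-calder"]/h2/text()').get()
>>> response.xpath("//tr")[1].xpath(".//td//text()").getall()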
4. Pipeline
Next we write the Pipeline, which stores the Items in a CSV file.
from scrapy.exporters import CsvItemExporter


class TqhbPipeline(object):
    def __init__(self):
        # all items are written to a single CSV file
        self.fp = open("tqhb.csv", 'wb')
        self.exporter = CsvItemExporter(
            self.fp, encoding='utf-8')

    def open_spider(self, spider):
        print("Crawl started....")
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("Crawl finished....")
Finally, uncomment the following lines in settings.py:
ITEM_PIPELINES = {
    'tqhbCrawl.pipelines.TqhbPipeline': 300,
}
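After a crawl finishes, the output can be spot-checked with a few lines of Python (reading the tqhb.csv file written by the pipeline above):

import csv

# print the header row and the first few data rows of the exported file
with open("tqhb.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 5:
            break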
5. Coping with anti-scraping measures
To avoid being blocked by anti-scraping mechanisms, the project mainly uses a randomized User-Agent, automatic throttling, and disabled robots.txt compliance.
1. Randomized User-Agent: edit middlewares.py
import random


class TqhbDownloaderMiddleware(object):
    # pool of real browser User-Agent strings; one is chosen at random for each request
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.23 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.27 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4280.87 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.57",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.54",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.50",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.6",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.8",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.14",
    ]

    def process_request(self, request, spider):
        # attach a randomly chosen User-Agent to every outgoing request
        user_agent = random.choice(self.user_agents)
        request.headers["User-Agent"] = user_agent
Then enable the middleware and set DEFAULT_REQUEST_HEADERS in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'tqhbCrawl.middlewares.TqhbDownloaderMiddleware': 543,
}

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
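To confirm the middleware really swaps the header, it can be exercised directly outside a crawl (a minimal sketch, assuming the module path configured above):

from scrapy.http import Request

from tqhbCrawl.middlewares import TqhbDownloaderMiddleware

mw = TqhbDownloaderMiddleware()
req = Request("http://www.tianqihoubao.com/lishi/")
mw.process_request(req, spider=None)
print(req.headers.get("User-Agent"))  # one of the strings from user_agents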
2. AutoThrottle settings:
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
3. Disable robots.txt compliance
ROBOTSTXT_OBEY = False
6. Running the project
Create start.py in the project directory with the following code:
from scrapy import cmdline

cmdline.execute("scrapy crawl tqhb_spider".split())
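start.py simply wraps the command line call, so the spider can equally well be started from a terminal in the project root:

scrapy crawl tqhb_spider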
Stored result:
All of the code above can be downloaded from my GitHub account: https://github.com/chyhoo/2016-2020Chinese-Weather-Analysis