Scrapy in Practice, Part 2: Fetching Weather Information

Latest recommended article published on 2024-07-24 14:36:42

贼贼弟

Article link: https://blog.csdn.net/m0_37728157/article/details/72850335

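This project is written in Python 3.6 and uses the Scrapy framework for crawling.

It scrapes the current weather for a given city plus the forecast for the following week, then saves the results to a .txt file and writes them to a MySQL database. Writing a Scrapy spider is a bit like filling in the blanks: you only need to put the right content into the right files, and you don't even have to change the file names. The project's directory structure is shown below: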

----weather
    ----spiders
        __init__.py
        wuhanSpider.py
    __init__.py
    items.py
    pipelines.py
    settings.py
scrapy.cfg

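In the structure above, entries without an extension are folders and entries with an extension are files. Only four files need to be edited: wuhanSpider.py, items.py, pipelines.py, and settings.py.

items.py defines which fields are scraped, wuhanSpider.py defines how they are scraped, settings.py decides which pipeline handles the scraped content, and pipelines.py defines how that content is processed. The pipelines.py shown here saves the scraped data to a .txt file. A second file, pipelines2mysql.py, is provided later and saves the data to a MySQL database. You can either replace the contents of pipelines.py with those of pipelines2mysql.py, or keep both files side by side and reference each one by its module name in settings.py.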

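1. Choosing the fields to scrape: items.py

# Declares which fields are scraped

import scrapy

class WeatherItem(scrapy.Item):
    cityDate = scrapy.Field()      # city name and date
    week = scrapy.Field()          # day of the week
    img = scrapy.Field()           # URL of the weather icon
    temperature = scrapy.Field()   # temperature range
    weather = scrapy.Field()       # weather description
    wind = scrapy.Field()          # wind description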

2. Defining how to scrape: wuhanSpider.py

# Defines how the pages are scraped
import scrapy
from weather.items import WeatherItem

class WuHanSpider(scrapy.Spider):
    name = "wuHanSpider"
    allowed_domains = ['tianqi.com']
    citys = ['wuhan', 'shanghai']
    # One start URL per city, e.g. http://wuhan.tianqi.com
    start_urls = []
    for city in citys:
        start_urls.append('http://' + city + '.tianqi.com')

    def parse(self, response):
        # Each forecast day sits in its own <div class="tqshow1">
        for sub in response.xpath('//div[@class="tqshow1"]'):
            item = WeatherItem()
            # The <h3> text is split across child nodes, so join the pieces
            item['cityDate'] = ''.join(sub.xpath('./h3//text()').extract())
            # extract_first() avoids an IndexError when a node is missing
            item['week'] = sub.xpath('./p//text()').extract_first(default='')
            item['img'] = sub.xpath('./ul/li[1]/img/@src').extract_first(default='')
            item['temperature'] = ''.join(sub.xpath('./ul/li[2]//text()').extract())
            item['weather'] = sub.xpath('./ul/li[3]//text()').extract_first(default='')
            item['wind'] = sub.xpath('./ul/li[4]//text()').extract_first(default='')
            yield item

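This part is the core of the project. The site being scraped is http://wuhan.tianqi.com/ , and the data is extracted with XPath selectors. The information we need often comes from several URLs, in which case we have to find the pattern behind them. A quick test shows that the weather URL for Shanghai is http://shanghai.tianqi.com/ . This article only scrapes the weather for Wuhan and Shanghai, but you can add more cities to the citys list above.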
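The URL pattern described above can be captured in a small helper. This is a minimal sketch rather than part of the original project: build_start_urls is a hypothetical name, 'beijing' is only an illustrative extra city, and the code assumes every city page follows the http://<city>.tianqi.com pattern observed for Wuhan and Shanghai.

```python
def build_start_urls(cities):
    """Return one start URL per city slug (hypothetical helper)."""
    # Assumption: each city lives on its own subdomain of tianqi.com,
    # as observed for 'wuhan' and 'shanghai'.
    return ['http://' + city + '.tianqi.com' for city in cities]


# 'beijing' is an illustrative addition, not verified against the site.
urls = build_start_urls(['wuhan', 'shanghai', 'beijing'])
for url in urls:
    print(url)
```

In the spider, the returned list could be assigned to start_urls in place of the explicit loop.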

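Open the page source, as shown in the figure below:

[Figure: screenshot of the page source]

All the weather information sits under <div class="tqshow1"> tags. Pay particular attention to how the code drills down layer by layer to reach the fields we want to scrape.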
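To make the layer-by-layer lookup easier to follow, here is a minimal sketch that walks the same paths on a tiny hand-written fragment. Two assumptions apply: the markup in SAMPLE is hypothetical and much simpler than the real page, and the standard library's xml.etree.ElementTree is swapped in for Scrapy's selectors so the example runs without Scrapy installed. parse_days is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<html><body>
  <div class="tqshow1">
    <h3>Wuhan <span>06-01</span></h3>
    <p>Thursday</p>
    <ul>
      <li><img src="http://example.com/img/b1.png"/></li>
      <li><b>22</b>~<b>30</b></li>
      <li>Cloudy</li>
      <li>South wind</li>
    </ul>
  </div>
</body></html>
"""


def parse_days(markup):
    """Collect one dict per <div class="tqshow1"> block."""
    root = ET.fromstring(markup.strip())
    days = []
    for sub in root.findall(".//div[@class='tqshow1']"):
        days.append({
            # itertext() joins text that is split across child nodes,
            # like './h3//text()' followed by string concatenation.
            'cityDate': ''.join(sub.find('h3').itertext()),
            'week': sub.find('p').text,
            'img': sub.find('./ul/li[1]/img').get('src'),
            'temperature': ''.join(sub.find('./ul/li[2]').itertext()),
            'weather': sub.find('./ul/li[3]').text,
            'wind': sub.find('./ul/li[4]').text,
        })
    return days


days = parse_days(SAMPLE)
print(days[0])
```

Each step narrows the search: first the day's div, then its ul, then the numbered li that holds the wanted field.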

3.1. Saving the results to a .txt file: pipelines.py

# Saves the scraped results
import time
import os.path
from urllib import request

class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y-%m-%d', time.localtime())
        fileName = today + '.txt'
        imgName = os.path.basename(item['img'])
        # Download the weather icon once; skip it if it is already on disk
        if imgName and not os.path.exists(imgName):
            response = request.urlopen(item['img'])
            # Use a separate handle name so the text file handle is not shadowed
            with open(imgName, 'wb') as imgFp:
                imgFp.write(response.read())
        # Append one tab-separated line per forecast day
        with open(fileName, 'a', encoding='utf-8') as fp:
            fp.write(item['cityDate'] + '\t')
            fp.write(item['week'] + '\t')
            fp.write(imgName + '\t')
            fp.write(item['temperature'] + '\t')
            fp.write(item['weather'] + '\t')
            fp.write(item['wind'] + '\t\n')
        # Pause for one second between items
        time.sleep(1)
        return item
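The line layout that the pipeline writes can be checked in isolation. The following is a minimal sketch using only the standard library: format_record is a hypothetical helper, and the values in sample are made up for illustration rather than taken from the site.

```python
import os.path


def format_record(item):
    """Build the tab-separated line that is appended for each item."""
    fields = [
        item['cityDate'],
        item['week'],
        os.path.basename(item['img']),  # keep only the icon's file name
        item['temperature'],
        item['weather'],
        item['wind'],
    ]
    # Every field is followed by a tab, and the line ends with a newline.
    return '\t'.join(fields) + '\t\n'


# Hypothetical item values, for illustration only.
sample = {
    'cityDate': 'Wuhan 06-01',
    'week': 'Thursday',
    'img': 'http://example.com/img/b1.png',
    'temperature': '22~30',
    'weather': 'Cloudy',
    'wind': 'South wind',
}
line = format_record(sample)
print(repr(line))
```

Note that os.path.basename only keeps the part after the last slash, so the .txt file records the icon's file name rather than its full URL.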

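3.2. Saving the results to a MySQL database

The results are stored in a MySQL database named scrapyDB. The table is created with the following statement: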

create table weather(
	id  int auto_increment,
	cityDate char(24),
	week char(6),
	img char(20),
	temperature char(12),
	weather char(20),
	wind char(20),
	PRIMARY KEY(id)
);

The code for pipelines2mysql.py is:

import pymysql
import os.path

class WeatherPipeline(object):
    def process_item(self, item, spider):
        cityDate = item['cityDate']
        week = item['week']
        # Store only the icon's file name, not its full URL
        img = os.path.basename(item['img'])
        temperature = item['temperature']
        weather = item['weather']
        wind = item['wind']

        conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password='yourPassword',   # replace with your own password
            database='scrapyDB',
            charset='utf8'
        )
        try:
            with conn.cursor() as cur:
                # Parameterized query: pymysql escapes the values itself
                cur.execute("insert into weather(cityDate,week,img,temperature,weather,wind) values (%s,%s,%s,%s,%s,%s)", (cityDate, week, img, temperature, weather, wind))
            conn.commit()
        finally:
            # Close the connection even if the insert fails
            conn.close()

        return item
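The pipeline above opens and closes a database connection for every item. Scrapy pipelines may also define open_spider and close_spider methods, so one alternative is to open the connection once per crawl. The sketch below illustrates that structure under clearly stated assumptions: it uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL so that it runs without a server, SqliteWeatherPipeline is a hypothetical name, and the sample item is made up. With pymysql the placeholders would be %s instead of ?.

```python
import os.path
import sqlite3


class SqliteWeatherPipeline(object):
    """Stand-in pipeline: one connection for the whole crawl."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute(
            "create table weather("
            "id integer primary key autoincrement, "
            "cityDate text, week text, img text, "
            "temperature text, weather text, wind text)"
        )

    def process_item(self, item, spider):
        # Called for every item; reuses the open connection.
        self.conn.execute(
            "insert into weather(cityDate,week,img,temperature,weather,wind) "
            "values (?,?,?,?,?,?)",
            (item['cityDate'], item['week'], os.path.basename(item['img']),
             item['temperature'], item['weather'], item['wind']),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Drive the pipeline by hand with a made-up item (no Scrapy needed).
pipeline = SqliteWeatherPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({
    'cityDate': 'Wuhan 06-01', 'week': 'Thursday',
    'img': 'http://example.com/img/b1.png',
    'temperature': '22~30', 'weather': 'Cloudy', 'wind': 'South wind',
}, spider=None)
rows = pipeline.conn.execute("select cityDate, img from weather").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

Opening the connection once avoids the connection setup cost for each item, at the price of holding the connection for the whole crawl.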

4. Dispatching the work: settings.py

BOT_NAME='weather'

SPIDER_MODULES=['weather.spiders']

NEWSPIDER_MODULE='weather.spiders'

ITEM_PIPELINES={'weather.pipelines.WeatherPipeline':1,
                'weather.pipelines2mysql.WeatherPipeline':2}

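A note on ITEM_PIPELINES: the numbers only determine the order in which the pipelines run. Any integer works (by convention Scrapy uses the 0-1000 range), and a pipeline with a smaller number runs earlier.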
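The ordering rule can be illustrated with a plain dictionary sort. This is only a sketch of the rule itself, not how Scrapy builds its pipeline chain internally; run_order is a hypothetical name.

```python
# Same mapping as in settings.py: pipeline path -> order value.
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2mysql.WeatherPipeline': 2,
}

# Sort pipeline paths by their order value, smallest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With the values above, each item goes through the .txt pipeline first and the MySQL pipeline second. Swapping the two numbers would reverse that order.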

5. The configuration file: scrapy.cfg

[settings]
default=weather.settings

[deploy]
project=weather

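The configuration file names the project and points to the default settings module. The two __init__.py files in the project are both empty. They are kept so that the folders containing them can be used as Python packages.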
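Because scrapy.cfg uses INI syntax, its contents can be inspected with the standard library's configparser. The sketch below is for checking the file by hand and is not Scrapy's own loading code. CFG_TEXT simply repeats the file shown above, and the variable names are hypothetical.

```python
import configparser

# The same content as the scrapy.cfg shown above.
CFG_TEXT = """
[settings]
default=weather.settings

[deploy]
project=weather
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# Dotted path of the settings module, and the project name.
settings_module = parser.get('settings', 'default')
project_name = parser.get('deploy', 'project')
print(settings_module, project_name)
```

The value weather.settings is a dotted module path, which is why the weather folder needs its __init__.py file to be importable as a package.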

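6. How to run it

Open cmd and cd into the project's top-level folder, that is, the folder that contains scrapy.cfg in the directory structure above, then run: scrapy crawl wuHanSpider

When the crawl finishes, a file such as "2017-06-01.txt" (named after the current date) appears in the project root. It holds the weather forecast for the coming week. The matching weather icons are downloaded as well, and the data is also saved to the MySQL database. The wuHanSpider in the command is the value of name="wuHanSpider" in the WuHanSpider class. If you change name, the command changes accordingly.

This post draws on the book 《Python网络爬虫实战》 (Python Web Crawler in Practice), which uses Python 2.x on Linux. Readers running Python 3.x on Windows can follow this post instead.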
