
Scrapy Spider: Crawling Weather Data and Storing It as txt, json, and Other Formats

1. Create the Scrapy Project

scrapy startproject weather
  

2. Create the Spider File

scrapy genspider wuHanSpider wuhan.tianqi.com
 

3. The Files in the Scrapy Project

   (1) items.py

import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()
  

 (2) wuHanSpider.py

# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem

class WuhanspiderSpider(scrapy.Spider):
    name = "wuHanSpider"
    allowed_domains = ["tianqi.com"]
    citys = ['wuhan','shanghai']
    start_urls = []
    for city in citys:
        start_urls.append('http://' + city + '.tianqi.com/')

    def parse(self, response):
        subSelector = response.xpath('//div[@class="tqshow1"]')
        items = []
        for sub in subSelector:
            item = WeatherItem()
            cityDates = ''
            for cityDate in sub.xpath('./h3//text()').extract():
                cityDates += cityDate
            item['cityDate'] = cityDates
            item['week'] = sub.xpath('./p//text()').extract()[0]
            item['img'] = sub.xpath('./ul/li[1]/img/@src').extract()[0]
            temps = ''
            for temp in sub.xpath('./ul/li[2]//text()').extract():
                temps += temp
            item['temperature'] = temps
            item['weather'] = sub.xpath('./ul/li[3]//text()').extract()[0]
            item['wind'] = sub.xpath('./ul/li[4]//text()').extract()[0]
            items.append(item)
        return items
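Two small idioms from the spider above can be tried on their own: the list-comprehension equivalent of the `start_urls` loop, and `''.join()` as a tidier equivalent of the `+=` loop used to concatenate the text fragments that `//text()` XPath queries return. The fragments below are invented sample data, not real page content:

```python
# -*- coding: utf-8 -*-

# Equivalent of the start_urls loop in the spider (same city list).
citys = ['wuhan', 'shanghai']
start_urls = ['http://' + city + '.tianqi.com/' for city in citys]
print(start_urls[0])  # http://wuhan.tianqi.com/

# //text() returns a list of text fragments; the spider concatenates
# them with +=, which gives the same result as ''.join().
fragments = [u'22', u'~', u'30', u'\u2103']  # invented sample fragments
temperature = ''.join(fragments)
print(temperature)  # 22~30℃
```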
  (3) The pipeline files, which process the spider's results. Three variants follow; as registered in settings.py below, they live in pipelines.py, pipelines2json.py, and pipelines2mysql.py.

import time
import os.path
import urllib2

# pipelines.py: store the scraped data in a txt file
class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        with open(fileName, 'a') as fp:
            fp.write(item['cityDate'].encode('utf8') + '\t')
            fp.write(item['week'].encode('utf8') + '\t')
            imgName = os.path.basename(item['img'])
            fp.write(imgName + '\t')
            # download the weather icon once; skip it if already on disk
            if not os.path.exists(imgName):
                with open(imgName, 'wb') as imgFile:  # separate name, do not shadow fp
                    response = urllib2.urlopen(item['img'])
                    imgFile.write(response.read())
            fp.write(item['temperature'].encode('utf8') + '\t')
            fp.write(item['weather'].encode('utf8') + '\t')
            fp.write(item['wind'].encode('utf8') + '\n\n')
            time.sleep(1)
        return item
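The image filename the pipeline writes comes from `os.path.basename()`, which keeps only the last path component of the URL. A quick sketch (the URL here is a made-up example in the style of tianqi.com's icon links, not a real one):

```python
import os.path

# basename() on a URL string keeps everything after the last '/'
img_url = 'http://www.tianqi.com/images/b1.png'  # hypothetical icon URL
img_name = os.path.basename(img_url)
print(img_name)  # b1.png
```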

import time
import json
import codecs

# pipelines2json.py: store the scraped data in a json file
class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item
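The key detail in the JSON pipeline is `ensure_ascii=False`: by default `json.dumps` escapes every non-ASCII character as `\uXXXX`, which makes the Chinese weather strings unreadable in the output file. A self-contained comparison (the item dict is invented sample data):

```python
# -*- coding: utf-8 -*-
import json

item = {'weather': u'多云', 'wind': u'微风'}  # invented sample data

print(json.dumps(item))                      # escaped: {"weather": "\u591a\u4e91", ...}
print(json.dumps(item, ensure_ascii=False))  # keeps the Chinese readable
```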
import MySQLdb
import os.path

# pipelines2mysql.py: store the scraped data in a MySQL database
class WeatherPipeline(object):
    def process_item(self, item, spider):
        cityDate = item['cityDate'].encode('utf8')
        week = item['week'].encode('utf8')
        img = os.path.basename(item['img'])
        temperature = item['temperature'].encode('utf8')
        weather = item['weather'].encode('utf8')
        wind = item['wind'].encode('utf8')

        conn = MySQLdb.connect(
            host='localhost',
            port=3306,
            user='crawlUSER',
            passwd='crawl123',
            db='scrapyDB',
            charset='utf8')
        cur = conn.cursor()
        # parameterized query: MySQLdb fills the %s placeholders safely
        cur.execute(
            "INSERT INTO weather(cityDate, week, img, temperature, weather, wind) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (cityDate, week, img, temperature, weather, wind))
        conn.commit()
        cur.close()
        conn.close()

        return item
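The same parameterized-insert pattern can be tried without a MySQL server by using Python's built-in sqlite3 module as a stand-in (note sqlite3 uses `?` placeholders where MySQLdb uses `%s`; the table layout matches the pipeline above, and the row values are invented sample data):

```python
# -*- coding: utf-8 -*-
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory DB, no server needed
cur = conn.cursor()
cur.execute('CREATE TABLE weather '
            '(cityDate TEXT, week TEXT, img TEXT, '
            'temperature TEXT, weather TEXT, wind TEXT)')

row = (u'武汉 9月2日', u'星期六', u'b1.png', u'22~30℃', u'多云', u'微风')
cur.execute('INSERT INTO weather (cityDate, week, img, temperature, weather, wind) '
            'VALUES (?, ?, ?, ?, ?, ?)', row)  # sqlite3 placeholder is ?
conn.commit()

cur.execute('SELECT weather FROM weather')
print(cur.fetchone()[0])  # 多云
conn.close()
```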

  (4) settings.py, which decides which pipelines handle the scraped data

BOT_NAME = 'weather'

SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'

# user-added: enable the three pipelines; lower numbers run first
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
    'weather.pipelines2mysql.WeatherPipeline': 3,
}
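The numbers in ITEM_PIPELINES are priorities: Scrapy runs the enabled pipelines in ascending order (here txt first, then json, then MySQL). The ordering can be checked with a plain sort over the dict:

```python
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
    'weather.pipelines2mysql.WeatherPipeline': 3,
}

# lower priority value = runs earlier
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order[0])  # weather.pipelines.WeatherPipeline
```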


  (5) Run the Crawl

scrapy crawl wuHanSpider

  (6) Results

  1. The txt output

  2. The json output

  3. The data stored in the MySQL database

Copyright notice: this is the author's original work; reproduction without the author's permission is prohibited. https://blog.csdn.net/u012017783/article/details/77801261
Category: Python web crawling