Python Web from Beginner to Mastery (Part 1): Scraping a Weather Site with the Scrapy Framework and Storing the Data in a Database

  1. Create the project

    scrapy startproject <project_name>
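
    For example, with the project name fuanWeather (the name assumed by the import paths later in this post), Scrapy generates a layout like this:

    fuanWeather/
        scrapy.cfg
        fuanWeather/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py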
    
  2. I personally prefer VS Code for coding; compared with PyCharm, it is a lightweight editor. Open a terminal and enter the following command:

    1. scrapy genspider <spider_name> <site_to_scrape>
    2. For example: scrapy genspider weather https://www.tianqi.com/fuan/
    
  3. This generates weather.py under the project's spiders folder.
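
    The generated file is only a skeleton, roughly like the following (templates vary slightly across Scrapy versions, and passing a full URL to genspider can leave odd allowed_domains/start_urls values, so tidy them up as shown here):

    import scrapy

    class WeatherSpider(scrapy.Spider):
        name = 'weather'
        allowed_domains = ['tianqi.com']
        start_urls = ['https://www.tianqi.com/fuan/']

        def parse(self, response):
            pass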

  4. The weather site https://www.tianqi.com/fuan/ blocks right-click → View Page Source, so I first saved the HTML page to my desktop with Ctrl+S and opened the saved copy, where the page source can be viewed normally.

  5. With that preparation done, we can start writing the core code.

  6. First write items.py, defining the names of the fields we want to capture:

    import scrapy

    class FuanweatherItem(scrapy.Item):
        city = scrapy.Field()          # city name
        date = scrapy.Field()          # date
        week = scrapy.Field()          # day of the week
        Weather = scrapy.Field()       # weather description
        minTempeture = scrapy.Field()  # minimum temperature
        maxTempeture = scrapy.Field()  # maximum temperature
        wind = scrapy.Field()          # wind direction
    
  7. Write weather.py, which scrapes the data from the target site and saves it into items:

    import scrapy
    from fuanWeather.items import FuanweatherItem
    
    class WeatherSpider(scrapy.Spider):
        name = 'weather'
        allowed_domains = ['tianqi.com']
        start_urls = ['https://www.tianqi.com/fuan/']
    
        def parse(self, response):
            items = []
            # Extract the 7-day Fu'an weather details, one list per field
            citys = response.xpath('//dd[@class="name"]/h2/text()').extract()
            container = response.xpath('//div[@class="day7"]')
            dates = container.xpath('ul[@class="week"]/li/b/text()').extract()
            weeks = container.xpath('ul[@class="week"]/li/span/text()').extract()
            weathers = container.xpath('ul[@class="txt txt2"]/li/text()').extract()
            minTempetures = container.xpath('div[@class="zxt_shuju"]/ul/li/b/text()').extract()
            maxTempetures = container.xpath('div[@class="zxt_shuju"]/ul/li/span/text()').extract()
            winds = container.xpath('ul[@class="txt"]/li/text()').extract()
    
            # Build the items in a loop
            for t in range(7):
                try:
                    item = FuanweatherItem()
                    item['city'] = citys[0]
                    item['date'] = dates[t]
                    item['week'] = weeks[t]
                    item['Weather'] = weathers[t]
                    item['minTempeture'] = minTempetures[t]
                    item['maxTempeture'] = maxTempetures[t]
                    item['wind'] = winds[t]
                    items.append(item)
                except IndexError:
                    # fewer than 7 rows were extracted; stop building items
                    break
            return items
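
    The class names used in the XPath expressions above (name, day7, week, txt, zxt_shuju) reflect the site's markup at the time of writing and will need updating if the page layout changes. scrapy shell is handy for testing the selectors interactively (if the fetch returns 403, see the User-Agent fix in step 12):

    scrapy shell https://www.tianqi.com/fuan/
    >>> response.xpath('//dd[@class="name"]/h2/text()').extract()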
    
  8. Next we write pipelines.py, which processes the saved data and stores it in the database:

    from itemadapter import ItemAdapter
    import pymysql
    
    class FuanweatherPipeline:
        def process_item(self, item, spider):
            city = item['city']
            date = item['date']
            week = item['week']
            Weather = item['Weather']
            minTempeture = item['minTempeture']
            maxTempeture = item['maxTempeture']
            wind = item['wind']
    
            # Connect to the local database
            connection = pymysql.connect(
                host='localhost',
                user='root',
                passwd='123',
                db='scrapy',
                charset='utf8mb4',
                cursorclass=pymysql.cursors.DictCursor
            )
            # Insert into the database
            try:
                with connection.cursor() as cursor:
                    # SQL statement for the insert
                    sql = """INSERT INTO WEATHER(city,date,week,Weather,minTempeture,maxTempeture,wind) 
                        VALUES (%s, %s, %s, %s, %s, %s, %s)"""
                    # Execute the SQL
                    cursor.execute(sql, (city, date, week, Weather, minTempeture, maxTempeture, wind))
                # Commit the insert
                connection.commit()
            finally:
                connection.close()
            return item
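
    Note that this pipeline opens a fresh MySQL connection for every item, which works but is wasteful. A common refinement (a sketch of my own, not the original code) is to open the connection once in the pipeline's open_spider hook and close it in close_spider; the version below also creates the WEATHER table on startup if it does not exist, with column types that are assumptions to adjust as needed:

    import pymysql

    class FuanweatherPipeline:
        def open_spider(self, spider):
            # One connection for the whole crawl instead of one per item
            self.connection = pymysql.connect(
                host='localhost',
                user='root',
                passwd='123',
                db='scrapy',
                charset='utf8mb4',
                cursorclass=pymysql.cursors.DictCursor
            )
            # Create the target table on first run (schema is an assumption)
            with self.connection.cursor() as cursor:
                cursor.execute("""CREATE TABLE IF NOT EXISTS WEATHER (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    city VARCHAR(64),
                    date VARCHAR(32),
                    week VARCHAR(16),
                    Weather VARCHAR(64),
                    minTempeture VARCHAR(16),
                    maxTempeture VARCHAR(16),
                    wind VARCHAR(32)
                ) DEFAULT CHARSET=utf8mb4""")
            self.connection.commit()

        def close_spider(self, spider):
            self.connection.close()

        def process_item(self, item, spider):
            sql = """INSERT INTO WEATHER(city,date,week,Weather,minTempeture,maxTempeture,wind)
                VALUES (%s, %s, %s, %s, %s, %s, %s)"""
            with self.connection.cursor() as cursor:
                cursor.execute(sql, (item['city'], item['date'], item['week'], item['Weather'],
                                     item['minTempeture'], item['maxTempeture'], item['wind']))
            self.connection.commit()
            return item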
    
  9. Enable the pipeline in settings.py (the number is a priority from 0 to 1000; lower values run first when several pipelines are active):

    ITEM_PIPELINES = {
        'fuanWeather.pipelines.FuanweatherPipeline': 300
    }
    
  10. Run the project

    scrapy crawl weather
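
    To sanity-check the scraped items without touching the database, you can also dump them to a file through Scrapy's feed exports, e.g.:

    scrapy crawl weather -o weather.json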
    
  11. After the run completes, check the database table; the result looks like this:
    (screenshot: rows stored in the WEATHER table)

  12. Problems you may encounter while scraping the weather site
    (1) The crawl fails with a 403 error like this:
    (screenshot: HTTP 403 error in the crawl log)
    Fix: configure a browser User-Agent in settings.py (use one of the two below; if both assignments are kept, the later one takes effect):

    # settings.py

    # Chrome
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"

    # Firefox
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1"
    

    (2) Chinese characters in the scraped data are saved as Unicode escape sequences.
    Fix: add the setting below to settings.py. (Note: this setting governs Scrapy's feed exports, such as the weather.json file mentioned in step 10; for the MySQL inserts themselves, the charset='utf8mb4' passed to pymysql.connect is what keeps Chinese text intact.)

    FEED_EXPORT_ENCODING = 'utf-8'
    

    Feel free to leave questions and comments below.
