Python for Beginners: A First Crawler, Using a Weather Site as the Example

Latest recommended article published 2025-01-16 00:25:03

Johnsonzhongyf

Category: python. Tags: python, scrapy

Article link: https://blog.csdn.net/Johnsonzhongyf/article/details/82223738

A small Python crawler

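This is the first crawler I wrote after teaching myself from videos, with thanks to the guides on CSDN and the video tutorials.
Workflow:
First, download the Anaconda distribution that matches Python 3.6 and use it to set up the environment.
Download PyCharm and configure it.
Then install the scrapy module from the command line (CMD):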

pip install scrapy

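Once the installation succeeds, create a Scrapy project from the command line. The code in this post imports from WEA, so the project here is named WEA.
Go to the folder where the project should live, type cmd in the Explorer address bar to open a command prompt at that path, and run: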

scrapy startproject <project_name>

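Open the project in PyCharm. The spiders directory contains an __init__.py file; create your own spider in that same directory. In this example it is tianqi.py.
Open the spider file and write the spider: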

import scrapy
from WEA.items import TQItem


class TQSpider(scrapy.Spider):
    name = 'TQ'
    start_urls = ['http://lishi.tianqi.com/']

    def parse(self, response):
        # Follow every city link listed on the index page.
        for url in response.xpath('//*[@id="tool_site"]/div[2]/ul/li/a/@href').extract():
            # Skip empty or one-character placeholder hrefs.
            if len(url) > 1:
                yield scrapy.Request(response.urljoin(url), callback=self.parse_city)

    def parse_city(self, response):
        # Follow every link on the city page to the per-period weather pages.
        for href in response.xpath('//*[@id="tool_site"]/div[2]/ul/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_weather)

    def parse_weather(self, response):
        item = TQItem()
        # extract_first() returns a single string instead of a list.
        city_name = response.xpath('//*[@class="city_title clearfix"]/table/tbody/tr/td/h1/text()').extract_first()
        # The first extracted value in each column is not needed, so skip it with [1:].
        lows = [int(t) for t in response.xpath('//*[@class="tqtongji2"]/ul/li[3]/text()').extract()[1:]]
        highs = [int(t) for t in response.xpath('//*[@class="tqtongji2"]/ul/li[2]/text()').extract()[1:]]
        # A page without data rows would otherwise cause a division by zero.
        if not lows or not highs:
            return
        item['city_name'] = city_name
        item['low_weather'] = sum(lows) / len(lows)
        item['high_weather'] = sum(highs) / len(highs)
        yield item
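
The spider calls response.urljoin to turn each extracted href into an absolute URL before requesting it. The sketch below illustrates the same idea with the standard library's urllib.parse.urljoin, which resolves a link against a base URL in the usual way. The href values are hypothetical, invented only to show relative, root-relative and absolute links; the function name absolute_links is also an assumption made for this illustration.

```python
from urllib.parse import urljoin

# Base URL taken from the spider's start_urls.
BASE_URL = 'http://lishi.tianqi.com/'

# Hypothetical href values, invented for illustration only.
SAMPLE_HREFS = [
    '#',
    'beijing/index.html',
    '/shanghai/201808.html',
    'http://lishi.tianqi.com/guangzhou/index.html',
]


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping one-character placeholders."""
    return [urljoin(base_url, href) for href in hrefs if len(href) > 1]


if __name__ == '__main__':
    for link in absolute_links(BASE_URL, SAMPLE_HREFS):
        print(link)
```

The len(href) > 1 check mirrors the filter in parse, which drops placeholder links before any request is made.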

Edit items.py and declare a field for each value to be exported:

import scrapy


class TQItem(scrapy.Item):
    # One field per value the spider exports.
    city_name = scrapy.Field()
    low_weather = scrapy.Field()
    high_weather = scrapy.Field()

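Edit pipelines.py: add a pipeline of your own and configure the output in it.
This covers the output format of the item fields as well as the path and folder where the file is saved.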

class WeaPipeline(object):
    def process_item(self, item, spider):
        return item


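class WeaFilePipeLine(object):
    def open_spider(self, spider):
        # Open the output file once when the crawl starts.
        self.file = open('d:/weather.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Close the file so buffered records are flushed to disk.
        self.file.close()

    def process_item(self, item, spider):
        # One record per line: city, average low, average high (%d truncates to an integer).
        line = "%s\t%d\t%d\n" % (item['city_name'], item['low_weather'], item['high_weather'])
        self.file.write(line)
        return item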
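
To check the record format without running a crawl, the formatting step can be tried on its own. This is a minimal sketch, assuming a variant of the format that ends each record with a newline; format_record and write_records are hypothetical helper names, the values are made up, and a temporary file stands in for d:/weather.txt.

```python
import os
import tempfile


def format_record(city_name, low, high):
    """Build one tab-separated line; %d truncates the averages to integers."""
    return "%s\t%d\t%d\n" % (city_name, low, high)


def write_records(path, records):
    """Write all records to path as UTF-8 text and close the file afterwards."""
    with open(path, 'w', encoding='utf-8') as f:
        for city_name, low, high in records:
            f.write(format_record(city_name, low, high))


if __name__ == '__main__':
    # Made-up values for illustration only.
    path = os.path.join(tempfile.mkdtemp(), 'weather.txt')
    write_records(path, [('Beijing', 21.6, 30.2)])
    with open(path, encoding='utf-8') as f:
        print(repr(f.read()))
```

Because %d drops the fractional part, an average of 21.6 is written as 21. Switching to %.1f would keep one decimal place if that matters for the output.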

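Edit settings.py, mainly the pipeline settings: disable the original pipeline and enable the new one. (The number sets the order: items pass through the pipelines from the lowest value to the highest, so a smaller number runs earlier. None means the pipeline is disabled.)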

ITEM_PIPELINES = {
    'WEA.pipelines.WeaPipeline': None,
    'WEA.pipelines.WeaFilePipeLine': 300,
}
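
How these values are read can be sketched in plain Python. This is a simplified illustration of the ordering rule, not Scrapy's own code: entries set to None are dropped and the rest are sorted by ascending number. The function name active_pipelines and the extra entry WEA.pipelines.CleanupPipeline are hypothetical, added only so the ordering is visible.

```python
def active_pipelines(item_pipelines):
    """Return enabled pipeline paths in the order items would pass through them."""
    enabled = {path: order for path, order in item_pipelines.items() if order is not None}
    return sorted(enabled, key=enabled.get)


if __name__ == '__main__':
    settings = {
        'WEA.pipelines.WeaPipeline': None,        # disabled
        'WEA.pipelines.WeaFilePipeLine': 300,
        'WEA.pipelines.CleanupPipeline': 100,     # hypothetical, for illustration
    }
    for path in active_pipelines(settings):
        print(path)
```

In this sketch the hypothetical pipeline with value 100 comes before WeaFilePipeLine with value 300, and WeaPipeline does not appear at all.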

Finally, open CMD in the project directory and run the following command to start the crawler:

scrapy crawl TQ

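A few code-style points:

Leave two blank lines before each class definition
Separate function arguments with a comma followed by a space (", ")
End the file with an empty line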

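A few points about the spider methods:
1. The spider must define name, and it needs start_urls unless you generate the initial requests yourself
2. Every parse callback must yield or return its results (items or further requests); otherwise nothing is passed on to the next step
3. The first extracted value is usually not wanted. In this crawler I filtered it out with a condition, but you can also use append to collect only the values you need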
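
Point 3 can be tried out on plain lists of strings, without Scrapy. The sketch below shows two ways to drop the unwanted leading value before averaging: slicing it off, or appending only the entries that look like integers. The function names and the sample lists are assumptions made up for illustration.

```python
def average_temperature(texts):
    """Average the integer readings in texts, ignoring the first entry."""
    readings = [int(t) for t in texts[1:]]
    if not readings:
        # No data rows: return None rather than dividing by zero.
        return None
    return sum(readings) / len(readings)


def numeric_values(texts):
    """Collect only the entries that look like integers, using append."""
    values = []
    for t in texts:
        t = t.strip()
        if t.lstrip('-').isdigit():
            values.append(int(t))
    return values


if __name__ == '__main__':
    # Made-up column: a header followed by three readings.
    column = ['low', '26', '27', '25']
    print(average_temperature(column))
    print(numeric_values(column))
```

Slicing is shorter when the unwanted value is always first. The append-based filter is more forgiving when stray non-numeric text can appear anywhere in the column.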

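One last tip:
In the CMD command line, run:

scrapy shell <url>

This lets you query XPath expressions against that page and check that they work, so you do not have to run the whole crawler every time.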