The Scrapy Framework

不要一直敲门

Tags: python, xpath

Article link: https://blog.csdn.net/qq_40600409/article/details/106885854

Getting Started with the Scrapy Crawler Framework

A small demo that scrapes the China Weather site (weather.com.cn).

Create the project

The argument to startproject is the project name. The rest of this post uses hellospider:
```
scrapy startproject hellospider
```
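Take a look at the project's directory structure:

scrapy.cfg: the project's configuration file.

hellospider/: the project's Python module. This is where you will add your code.

hellospider/items.py: defines the data structures (items) you want to extract.

hellospider/middlewares.py: middlewares that hook into Scrapy's request/response processing.

hellospider/pipelines.py: post-processes the data extracted into items, for example by saving it.

hellospider/settings.py: the project's settings file.

hellospider/spiders/: the directory that holds the spider code.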

Define the data you want to scrape in items.py

import scrapy


class HellospiderItem(scrapy.Item):
    # Weather description
    weather = scrapy.Field()
    # Highest temperature
    temperature_up = scrapy.Field()
    # Lowest temperature
    temperature_low = scrapy.Field()
    # Wind force
    wind = scrapy.Field()

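[Note] The fields in the class above (weather, temperature_up, temperature_low, wind) act like the "keys" of a dictionary, and the scraped data are the corresponding "values".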
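
To make the dictionary analogy concrete, here is a minimal plain-Python sketch. FieldCheckedDict is a hypothetical stand-in written for this illustration, not Scrapy's actual Item implementation; it only mimics the one behaviour that matters here, namely that a value can be stored under a declared field name while an undeclared name is rejected with a KeyError.

```python
class FieldCheckedDict(dict):
    """Hypothetical stand-in for an Item: a dict that only accepts declared keys."""

    # Same field names as HellospiderItem above.
    fields = ("weather", "temperature_up", "temperature_low", "wind")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"unsupported field: {key}")
        super().__setitem__(key, value)


item = FieldCheckedDict()
item["weather"] = ["晴"]          # declared field: accepted
item["wind"] = ["3-4级转<3级"]    # declared field: accepted

try:
    item["humidity"] = ["40%"]    # not declared: rejected
    rejected = False
except KeyError:
    rejected = True

# Converting to a plain dict is what the pipeline later does with dict(item).
as_dict = dict(item)
print(rejected, as_dict)
```

Declaring the fields up front means a typo in a field name fails immediately instead of silently creating a new key.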

Create the spider file myspider.py in the spiders directory

import scrapy
from ..items import HellospiderItem


class MySpider(scrapy.Spider):
    # Name used to launch the spider with "scrapy crawl"
    name = 'myspider'
    start_urls = [
        'http://www.weather.com.cn/weather/101010100.shtml',
    ]

    # Callback that parses the downloaded page
    def parse(self, response):
        # li[1] selects only the first entry of the list;
        # drop the [1] to loop over every li in it
        for day in response.xpath('//*[@id="7d"]/ul/li[1]'):
            item = HellospiderItem()
            # Paths starting with "./" are relative to the current li
            item['weather'] = day.xpath('./p[1]/text()').extract()
            item['temperature_up'] = day.xpath('./p[2]/span/text()').extract()
            item['temperature_low'] = day.xpath('./p[2]/i/text()').extract()
            item['wind'] = day.xpath('./p[3]/i/text()').extract()
            yield item

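One thing to note: Scrapy looks a spider up by the name attribute set in its class, not by the file name. The name must be unique within the project and must match what you pass to scrapy crawl, otherwise the spider will not be found at startup. Naming the file after the spider is only a convenient convention.

For XPath syntax, see: http://www.w3school.com.cn/xpath/xpath_syntax.asp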
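
The difference between an absolute path (starting with //) and a path relative to the current node matters as soon as you loop over several list entries. The sketch below illustrates this with the standard library's xml.etree.ElementTree, which supports only a small XPath subset and is used here purely as a stand-in for Scrapy's selectors. The SAMPLE markup is made up for the illustration and only imitates the nesting that the spider's XPath expressions assume; the real page may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating the structure the spider's XPaths expect.
SAMPLE = """
<div>
  <div id="7d">
    <ul>
      <li><p>Sunny</p><p><span>35</span><i>25C</i></p><p><i>3-4</i></p></li>
      <li><p>Cloudy</p><p><span>33</span><i>24C</i></p><p><i>&lt;3</i></p></li>
    </ul>
  </div>
</div>
"""


def parse_forecast(markup):
    """Return one dict per <li>, using paths relative to each <li>."""
    root = ET.fromstring(markup)
    rows = []
    # Select every li under the element whose id is "7d".
    for li in root.findall(".//*[@id='7d']/ul/li"):
        rows.append({
            "weather": li.findtext("p[1]"),
            "temperature_up": li.findtext("p[2]/span"),
            "temperature_low": li.findtext("p[2]/i"),
            "wind": li.findtext("p[3]/i"),
        })
    return rows


rows = parse_forecast(SAMPLE)
for row in rows:
    print(row)
```

Because every inner path is evaluated relative to the current li, each row gets its own values. An absolute path inside the loop would search the whole document again on every iteration and return the same result each time.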

Save the data in pipelines.py

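import json


class HellospiderPipeline:
    def __init__(self):
        # Open the output file (the path is relative to the directory
        # from which the crawl is started)
        self.file = open('../jtianqi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item is the data yielded by the spider; write it as one JSON line
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()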
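
The pipeline above writes one JSON object per line, so the output is read back line by line rather than with a single json.load call. The following sketch shows that round trip, along with the effect of ensure_ascii=False, using only the standard library. An in-memory buffer stands in for the file, and the second record is invented for the illustration.

```python
import io
import json

records = [
    {"weather": ["晴"], "temperature_up": ["35"]},
    {"weather": ["多云"], "temperature_up": ["33"]},  # made-up second record
]

# Write one JSON document per line, as process_item does.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

text = buffer.getvalue()

# Read it back: parse each non-empty line on its own.
loaded = [json.loads(line) for line in text.splitlines() if line.strip()]

# With the default ensure_ascii=True, non-ASCII characters are escaped instead.
escaped = json.dumps(records[0])

print(text)
print(escaped)
```

Keeping ensure_ascii=False makes the file readable in a text editor, which is why the file is opened with encoding='utf-8'.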

settings.py

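Uncomment this setting to enable the pipeline. When several pipelines are enabled, the number after each one sets its order: the lower the number, the earlier that pipeline runs.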
```
ITEM_PIPELINES = {
   'hellospider.pipelines.HellospiderPipeline': 300,
}
```
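
As a rough illustration of how these numbers determine the order, the sketch below sorts a pipeline mapping by its values. This is not Scrapy's internal code, only the ordering rule expressed in plain Python, and CleanupPipeline is a hypothetical second pipeline added for the example.

```python
ITEM_PIPELINES = {
    "hellospider.pipelines.HellospiderPipeline": 300,
    "hellospider.pipelines.CleanupPipeline": 100,  # hypothetical, for illustration only
}

# Lower numbers run first, so sort by the value.
run_order = [
    path
    for path, order in sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]

for position, path in enumerate(run_order, start=1):
    print(position, path)
```

Here the hypothetical cleanup step would see each item before HellospiderPipeline writes it to disk.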

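Run the spider

cd into the project directory and run scrapy crawl [spider_name], where the spider name is the value of the name attribute in the spider class: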

D:\python\python8\scrapy>scrapy crawl myspider

The scraped data

{'temperature_low': ['25℃'],
 'temperature_up': ['35'],
 'weather': ['晴'],
 'wind': ['3-4级转<3级']}
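
Every value is a list because extract() returns all matches of an expression. If single values are more convenient downstream, a small post-processing step can flatten them. The sketch below is one possible approach, not part of the original project; the helper names first_number and flatten are made up for this example.

```python
import re


def first_number(values):
    """Return the first integer found in a list of strings, or None."""
    for value in values:
        match = re.search(r"-?\d+", value)
        if match:
            return int(match.group())
    return None


def flatten(item):
    """Turn the list-valued item into single values."""
    return {
        "weather": item["weather"][0] if item["weather"] else None,
        "temperature_up": first_number(item["temperature_up"]),
        "temperature_low": first_number(item["temperature_low"]),
        "wind": item["wind"][0] if item["wind"] else None,
    }


# The item shown above.
scraped = {
    "temperature_low": ["25℃"],
    "temperature_up": ["35"],
    "weather": ["晴"],
    "wind": ["3-4级转<3级"],
}

flat = flatten(scraped)
print(flat)
```

The empty-list checks matter because an XPath expression that matches nothing yields an empty list rather than raising an error.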
