Getting Started with Python (Installation): A First Web Scraper (Weather for Every City in Shandong)

Original post: https://blog.csdn.net/qq_36135928/article/details/91358913

References:

Installing Python

Installing Scrapy

  • Update pip

python -m pip install --upgrade pip


  • Install wheel

pip install wheel


  • Install lxml
    Download the lxml wheel file, making sure it matches your installed Python version and architecture. Once downloaded, right-click the file, choose Properties, then the Security tab, and copy the full file path shown there. In cmd, run: pip install <file path>. A hypothetical example follows.
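    For illustration only, the install command would look like the following (this filename and path are made up; use the wheel you actually downloaded):

pip install C:\Users\me\Downloads\lxml-4.3.4-cp37-cp37m-win_amd64.whl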

  • Install pyOpenSSL
    Download the pyOpenSSL wheel file. Once downloaded, right-click the file, choose Properties, then the Security tab, and copy the file path. In cmd, run: pip install <file path>.

  • Install Twisted
    Download the Twisted wheel file (again matching your Python version) and install it from its file path in the same way.

  • Install pywin32
    Download pywin32 and simply run the installer.

  • Install Scrapy
    In cmd, run: pip install scrapy.
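    To confirm the whole toolchain installed correctly, print Scrapy's version from cmd:

scrapy version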

The First Spider Project

  • Create the project: scrapy startproject sdWeatherSpider
  • Enter the project folder and create the spider with scrapy genspider; a sketch of the command is shown below.
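    The exact command appears only in a screenshot in the original post. Assuming the spider is named everyCityinSD (a hypothetical but typical choice; any valid name works) and is restricted to www.weather.com.cn, it would be:

cd sdWeatherSpider
scrapy genspider everyCityinSD www.weather.com.cn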
  • Directory structure: the generated layout is sketched below.
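    A project generated by scrapy startproject follows Scrapy's standard layout; after the genspider step the tree looks roughly like this (the file under spiders/ takes whatever name was passed to genspider):

sdWeatherSpider/
├── scrapy.cfg                 # deployment configuration
└── sdWeatherSpider/
    ├── __init__.py
    ├── items.py               # item definitions (edited below)
    ├── middlewares.py
    ├── pipelines.py           # item pipelines (edited below)
    ├── settings.py            # project settings (edited below)
    └── spiders/
        ├── __init__.py
        └── everyCityinSD.py   # the spider created by genspider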
  • Open http://www.weather.com.cn/shandong/index.shtml, right-click and view the page source, and find the block that links to each city's own weather page.
  • Open http://www.weather.com.cn/weather/101120101.shtml (a single city's forecast page), view its page source, and work out where the city name and the multi-day forecast appear; these are the elements the spider will extract.
  • Edit items.py to define the fields to scrape; a sketch is given right after this item.
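    The post does not reproduce items.py, but the pipeline below reads item['city'] and item['weather'], so the item class needs at least those two fields. A minimal sketch:

# -*- coding: utf-8 -*-
import scrapy


class SdweatherspiderItem(scrapy.Item):
    # name of the city the forecast belongs to
    city = scrapy.Field()
    # the multi-day forecast, assembled into one text block by the spider
    weather = scrapy.Field()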
  • Edit pipelines.py to append each scraped item to the file weather.txt:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SdweatherspiderPipeline(object):
    def process_item(self, item, spider):
        # append each city's name and forecast to a plain-text file
        with open('weather.txt', 'a', encoding='utf8') as fp:
            fp.write(item['city'] + '\n')
            fp.write(item['weather'] + '\n\n')
        return item
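    Note that weather.txt is opened in append mode, so repeated crawls keep adding to the same file; delete it between runs if you want a fresh snapshot.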

  • Edit settings.py to register the pipeline, so Scrapy knows which class processes the scraped items:
# -*- coding: utf-8 -*-

# Scrapy settings for sdWeatherSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sdWeatherSpider'

SPIDER_MODULES = ['sdWeatherSpider.spiders']
NEWSPIDER_MODULE = 'sdWeatherSpider.spiders'

ITEM_PIPELINES = {
    # the integer (0-1000) sets the pipeline's order; lower values run first
    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 1,
}


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sdWeatherSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sdWeatherSpider.middlewares.SdweatherspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sdWeatherSpider.middlewares.SdweatherspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
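  • Finally, fill in the spider itself. The original post stops at the settings file, so what follows is a minimal sketch of sdWeatherSpider/spiders/everyCityinSD.py, assuming the markup visible in the screenshots; the link-extraction regex and the XPath expressions are assumptions about weather.com.cn's HTML at the time and may need adjusting.

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen

import scrapy

from sdWeatherSpider.items import SdweatherspiderItem


class EverycityinsdSpider(scrapy.Spider):
    name = 'everyCityinSD'
    allowed_domains = ['www.weather.com.cn']

    # Build start_urls up front by scanning the Shandong index page for
    # links to every city's forecast page (runs once, at import time).
    start_urls = []
    with urlopen('http://www.weather.com.cn/shandong/index.shtml') as fp:
        contents = fp.read().decode()
    # Assumption: each city link looks like <a title="..." href="..." target="_blank">
    for href in re.findall(r'<a title=".*?" href="(.+?)" target="_blank">', contents):
        start_urls.append(href)

    def parse(self, response):
        item = SdweatherspiderItem()
        # Assumption: the breadcrumb trail holds the city name as its second link.
        item['city'] = response.xpath('//div[@class="crumbs fl"]//a[2]/text()').get()
        # Assumption: the 7-day forecast is a <ul class="t clearfix"> with one <li> per day.
        weather = ''
        for li in response.xpath('//ul[@class="t clearfix"]/li'):
            date = li.xpath('./h1/text()').get('')
            sky = li.xpath('./p[@title]/text()').get('')
            temp = ''.join(li.xpath('./p[@class="tem"]//text()').getall()).strip()
            weather += '{}, {}, {}\n'.format(date, sky, temp)
        item['weather'] = weather
        return [item]

    Run the crawl from the project root with scrapy crawl everyCityinSD; when it finishes, weather.txt holds one block per city.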
