Original post: https://blog.csdn.net/qq_36135928/article/details/91358913
References:
- Step-by-step guide to scraping weather forecasts for each Shandong city with Python + Scrapy
- Python 3 getting-started notes (1): installing and running on Windows
- "pip install scrapy" fails: how to install Scrapy correctly
- NameError: name 'urlopen' is not defined
Install Python
Install Scrapy
- Update pip
python -m pip install --upgrade pip
- Install wheel
pip install wheel
- Install lxml
Download the lxml wheel file shown in the figure, making sure it matches your installed Python version.
After downloading, right-click the file, choose Properties, then Security, and copy the file path. In cmd run: pip install <file path>.
- Install pyOpenSSL
Download the pyOpenSSL file.
After downloading, right-click the file, choose Properties, then Security, and copy the file path. In cmd run: pip install <file path>.
- Install Twisted
Download the Twisted wheel file and install it the same way.
- Install pywin32
Download the pywin32 installer and run it.
- Install Scrapy
In cmd run: pip install scrapy
First spider project
- Create the project: scrapy startproject sdWeatherSpider
- cd into the project folder and run the following command to create the spider
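The command itself only appears as a screenshot in the original post. Based on the spider name used later (scrapy crawl everyCityinSD) and the site being scraped, it was presumably along these lines:

```shell
# Run inside the folder created by "scrapy startproject sdWeatherSpider".
cd sdWeatherSpider
# Spider name and domain are inferred from the rest of the post.
scrapy genspider everyCityinSD www.weather.com.cn
```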
- Directory structure
- Open http://www.weather.com.cn/shandong/index.shtml, right-click to view the page source, and locate the section shown in the figure.
- Open http://www.weather.com.cn/weather/101120101.shtml, right-click to view the page source, and locate the section shown in the figure.
- Modify items.py to define the fields to be scraped
- Modify the spider file everyCityinSD.py to define how the content is scraped; the extraction rules come from the page analysis above
- Modify pipelines.py to write the scraped data to the file weather.txt
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SdweatherspiderPipeline(object):
    def process_item(self, item, spider):
        with open('weather.txt', 'a', encoding='utf8') as fp:
            fp.write(item['city'] + '\n')
            fp.write(item['weather'] + '\n\n')
        return item
```
- Modify settings.py to register the pipeline that will process the scraped data
```python
# -*- coding: utf-8 -*-

# Scrapy settings for sdWeatherSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sdWeatherSpider'

SPIDER_MODULES = ['sdWeatherSpider.spiders']
NEWSPIDER_MODULE = 'sdWeatherSpider.spiders'

# Register the pipeline that writes the scraped data to weather.txt
ITEM_PIPELINES = {
    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 1,
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sdWeatherSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sdWeatherSpider.middlewares.SdweatherspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sdWeatherSpider.middlewares.SdweatherspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
- Run the spider with the command: scrapy crawl everyCityinSD
- Full project code: https://download.csdn.net/download/qq_36135928/11232643