Original post: https://blog.csdn.net/qq_36135928/article/details/91358913
References:
- Step-by-step guide to scraping weather forecasts for each Shandong city with Python + Scrapy
- Python 3 getting-started notes (1): installing and running on Windows
- "pip install scrapy" fails: how to install Scrapy correctly
- NameError: name 'urlopen' is not defined
Install Python
Install Scrapy
- Update pip
python -m pip install --upgrade pip
- Install wheel
pip install wheel
- Install lxml
Download the lxml wheel file shown in the figure, making sure it matches your installed Python version.
After downloading, right-click the file, choose Properties, then Security, and copy the file path. In cmd run: pip install <file path>.
- Install pyOpenSSL
Download the pyOpenSSL file.
After downloading, right-click the file, choose Properties, then Security, and copy the file path. In cmd run: pip install <file path>.
- Install Twisted
Download the Twisted wheel file and install it the same way.
- Install pywin32
Download the pywin32 installer and run it.
- Install Scrapy
In cmd run: pip install scrapy
First spider project
- Create the project: scrapy startproject sdWeatherSpider
- cd into the project folder and run the following command to create the spider
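The command itself only appears as a screenshot in the original post. Based on the spider name used later (scrapy crawl everyCityinSD) and the site being scraped, it was presumably along these lines:

```shell
# Run inside the folder created by "scrapy startproject sdWeatherSpider".
cd sdWeatherSpider
# Spider name and domain are inferred from the rest of the post.
scrapy genspider everyCityinSD www.weather.com.cn
```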
- Directory structure
- Open http://www.weather.com.cn/shandong/index.shtml, right-click to view the page source, and locate the section shown in the figure.
- Open http://www.weather.com.cn/weather/101120101.shtml, right-click to view the page source, and locate the section shown in the figure.
- Modify items.py to define the fields to be scraped
- Modify the spider file everyCityinSD.py to define how the content is scraped; the extraction rules come from the page analysis above
- Modify pipelines.py to write the scraped data to the file weather.txt
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SdweatherspiderPipeline(object):
    def process_item(self, item, spider):
        with open('weather.txt', 'a', encoding='utf8') as fp:
            fp.write(item['city'] + '\n')
            fp.write(item['weather'] + '\n\n')
        return item
```
- Modify settings.py to register the pipeline that will process the scraped data
```python
# -*- coding: utf-8 -*-

# Scrapy settings for sdWeatherSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sdWeatherSpider'

SPIDER_MODULES = ['sdWeatherSpider.spiders']
NEWSPIDER_MODULE = 'sdWeatherSpider.spiders'

# Register the pipeline that writes the scraped data to weather.txt
ITEM_PIPELINES = {
    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 1,
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sdWeatherSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sdWeatherSpider.middlewares.SdweatherspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sdWeatherSpider.middlewares.SdweatherspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
- Run the spider with the command: scrapy crawl everyCityinSD
- Full project code: https://download.csdn.net/download/qq_36135928/11232643