Scrapy Crawler
Full project: https://gitee.com/Hardman233/my-student-base/tree/master/esc
This project only implements crawling simple static pages and storing the data in an Excel file.
If you just want to get the project running quickly, jump straight to the Project Startup section.
This project is for learning purposes only. Please credit the source when reposting; corrections are welcome.
Environment Setup
1. lxml: pip install lxml
2. pyOpenSSL: pip install pyOpenSSL
3. Twisted (Windows, if no prebuilt wheel is found):
   1. Download a wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
   2. pip install Twisted-xxx-cpxx-cpxxm-win_amdxx.whl
   Note: cpxx is the Python version, e.g. cp37 == Python 3.7
4. pywin32: pip install pywin32
5. scrapy: pip install scrapy
On recent pip versions, pip install scrapy usually pulls in lxml, pyOpenSSL, Twisted, and pywin32 as prebuilt wheels automatically; the manual steps above are mainly needed on older Windows setups where the Twisted build fails.
Verify the installation
Run import scrapy in Python and check that no error is raised.
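A quick sanity check (prints the installed version; no ImportError means the install succeeded):

import scrapy
print(scrapy.__version__)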
Create a Project
scrapy startproject <projectName>       # create a Scrapy project; projectName = project name
cd <projectName>                        # enter the project
scrapy genspider <spiderName> <domain>  # create a spider file; spiderName = spider name, domain = allowed domain
For example:
scrapy startproject esc
cd esc
scrapy genspider ershouche che168.com   # generates an ershouche.py file under your spiders/ directory
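The generated ershouche.py is roughly the following skeleton (the exact template varies slightly across Scrapy versions):

import scrapy

class ErshoucheSpider(scrapy.Spider):
    name = 'ershouche'
    allowed_domains = ['che168.com']
    start_urls = ['http://che168.com/']

    def parse(self, response):
        pass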
Project Startup
cd esc                  # enter the project
scrapy crawl ershouche  # start the spider
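Optionally, Scrapy's built-in feed exports can also dump the items to a file alongside the Excel pipeline (the FEED_EXPORT_ENCODING setting shown later applies to this output; the file name is just an example):

scrapy crawl ershouche -o cars.csv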
Project Structure
spiders/ershouche.py  (spider) parses the data
items.py              defines the structured data
middlewares.py        middleware integration
pipelines.py          (pipeline) stores the data
settings.py           global configuration file
Framework Structure
Scrapy's engine coordinates a scheduler (queues requests), a downloader (fetches responses), spiders (parse responses into items and follow-up requests), and item pipelines (persist items), with downloader and spider middlewares hooked in between.
File Descriptions
Only the key code in each of this project's files is explained.
ershouche.py
import scrapy

from esc.items import EscItem


def checkdata(mylist):
    # helper that appends '万' to every entry; currently unused by the spider
    i = 0
    while i < len(mylist):
        mylist[i] += "万"  # bug fix: the original wrote `list[i]`, which is the builtin type, not the parameter
        i += 1


class ErshoucheSpider(scrapy.Spider):
    name = 'ershouche'
    allowed_domains = ['che168.com']
    start_urls = ['https://www.che168.com/china/bieke/a0_0msdgscncgpi1lto8csp1exx0/']
    # build the remaining page urls by hand from the pagination pattern
    for link in range(2, 101):
        start_urls.append('https://www.che168.com/china/bieke/a0_0msdgscncgpi1lto8csp{lin}exx0/'.format(lin=link))

    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        li = response.xpath('/html/body/div[12]/div[1]/ul')
        flag = 0
        for get_li in li:
            title = get_li.xpath("./li/a/div/h4/text()").extract()
            msg = get_li.xpath("./li/a/div/p/text()").extract()
            nowpri = get_li.xpath("./li/a/div/div/span/em/text()").extract()
            newpri = get_li.xpath("./li/a/div/div/s/text()").extract()
            # clean the data
            i = 0  # bug fix: reset the counter for every <ul>, not once per page
            while i < len(nowpri):
                nowpri[i] += '万'
                i += 1
            for j in range(len(title)):
                try:
                    # bug fix: create a fresh item per record; reusing a single item
                    # would overwrite the fields of items already yielded
                    car = EscItem()
                    car["title"] = title[j]
                    car["msg"] = msg[j]
                    car["nowpri"] = nowpri[j]
                    car["newpri"] = newpri[j]
                    flag += 1
                    print(title[j], msg[j], nowpri[j], newpri[j], "count", flag)
                    yield car
                except Exception as e:  # bug fix: the original printed `Exception.args` instead of the instance's args
                    print("exception:", e.args)

    def parse_detail(self, resp, **kwargs):
        pass
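The absolute XPath /html/body/div[12]/div[1]/ul is brittle and will break if the page layout shifts; it can be tested interactively with Scrapy's shell before running the spider:

scrapy shell 'https://www.che168.com/china/bieke/a0_0msdgscncgpi1lto8csp1exx0/'
>>> response.xpath('/html/body/div[12]/div[1]/ul')  # should return a non-empty SelectorList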
Attributes:
title   holds the car model and trim
msg     holds the car info: mileage, sale region, purchase date, etc.
nowpri  holds the current second-hand asking price
newpri  holds the new-car price for the same model
Statements:
start_urls = ['https://www.che168.com/china/bieke/a0_0msdgscncgpi1lto8csp1exx0/']
# build the remaining page urls by hand from the pagination pattern
for link in range(2, 101):
    start_urls.append('https://www.che168.com/china/bieke/a0_0msdgscncgpi1lto8csp{lin}exx0/'.format(lin=link))
# To crawl a different brand, replace `bieke` in both urls with the pinyin of the brand you want
# To crawl a different region, replace `china` in both urls with the pinyin of the region you want
# The project defaults to nationwide (china) Buick (bieke) listings
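For example (a sketch following the stated substitution rule; the `beijing` and `aodi` path segments are assumptions, not verified against the site):

# hypothetical: Audi listings in Beijing
start_urls = ['https://www.che168.com/beijing/aodi/a0_0msdgscncgpi1lto8csp{}exx0/'.format(p)
              for p in range(1, 101)]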
# clean the data
i = 0
while i < len(nowpri):
    nowpri[i] += '万'
    i += 1
# Purpose: the scraped current price is a bare number without the '万' (ten-thousand yuan) suffix, so it is appended here
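The same cleanup can be written more idiomatically as a list comprehension (equivalent to the loop above):

nowpri = [price + '万' for price in nowpri]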
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class EscItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    msg = scrapy.Field()
    nowpri = scrapy.Field()
    newpri = scrapy.Field()
Structured data; it declares and standardizes the fields a scraped record may carry.
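Items behave like dicts with a fixed set of allowed keys, which catches field-name typos early (a small sketch; the assignments are hypothetical):

car = EscItem()
car['title'] = '...'    # allowed: declared in EscItem
car['color'] = '...'    # raises KeyError: field not declared in EscItem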
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook


class EscPipeline:
    def __init__(self):
        self.wb = Workbook()
        self.ws = self.wb.active
        # header row: model, info, new-car price, current price
        self.ws.append(['型号', '信息', '新车价格', '当前价格'])

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass

    def process_item(self, item, spider):
        line = [item['title'], item['msg'], item['newpri'], item['nowpri']]
        self.ws.append(line)
        self.wb.save('别克二手车.xlsx')
        return item
Data persistence; this is where the scraped items get stored.
Function:
def process_item(self, item, spider):
    line = [item['title'], item['msg'], item['newpri'], item['nowpri']]
    self.ws.append(line)
    # change the output file name here
    self.wb.save('别克二手车.xlsx')
    return item
Here the data is simply stored in an Excel file; a database such as MySQL would work just as well. Note that the workbook is re-saved after every item, which is simple but slow; saving once in close_spider would be more efficient. Since the logic is straightforward, it is not elaborated further.
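As a reference, a minimal sketch of an equivalent MySQL pipeline (everything here is an assumption, not part of this project: it uses the pymysql package and presumes a local database `esc` with a `cars` table whose columns match the item fields):

import pymysql


class MysqlPipeline:
    def open_spider(self, spider):
        # connection parameters are placeholders; adjust to your setup
        self.conn = pymysql.connect(host='localhost', user='root', password='root',
                                    database='esc', charset='utf8mb4')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = 'INSERT INTO cars (title, msg, newpri, nowpri) VALUES (%s, %s, %s, %s)'
        self.cur.execute(sql, (item['title'], item['msg'], item['newpri'], item['nowpri']))
        self.conn.commit()
        return item

To enable it, register it in settings.py via ITEM_PIPELINES (e.g. 'esc.pipelines.MysqlPipeline': 301), alongside or instead of the Excel pipeline.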
settings.py
# Scrapy settings for esc project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'esc'
SPIDER_MODULES = ['esc.spiders']
NEWSPIDER_MODULE = 'esc.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'esc (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "WARNING"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# fixed 2-second delay; the interval never varies, which is easy to detect and can get the IP banned
DOWNLOAD_DELAY = 2
# when enabled, Scrapy waits DOWNLOAD_DELAY multiplied by a random factor between 0.5 and 1.5
# between requests to the same site, i.e. an effective delay of 1 to 3 seconds here
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'esc.middlewares.EscSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'esc.middlewares.EscDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'esc.pipelines.EscPipeline': 300,
}
FEED_EXPORT_ENCODING = 'gb18030'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 2
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Global configuration file
# whether to obey the robots.txt rules
ROBOTSTXT_OBEY = False
# minimum log level to display
LOG_LEVEL = "WARNING"
# fixed 2-second delay; the interval never varies, which is easy to detect and can get the IP banned
DOWNLOAD_DELAY = 2
# when enabled, Scrapy waits DOWNLOAD_DELAY multiplied by a random factor between 0.5 and 1.5 between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True
# enable the pipeline
ITEM_PIPELINES = {
    'esc.pipelines.EscPipeline': 300,
}
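These settings apply project-wide; they can also be overridden per spider through the custom_settings class attribute (a sketch):

class ErshoucheSpider(scrapy.Spider):
    name = 'ershouche'
    # per-spider overrides take precedence over settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
    }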