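I. Scrapy hands-on project requirements: scrape men's clothing product information from JD.com
1. Tool: Scrapy's crawl spider template (CrawlSpider)
2. Content: product name, shop name, rating, price (for each color and size combination, at quantity = 1), and multiple images
3. Hint: the crawler's IP is easily banned, so take precautions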
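II. Planning the project structure
1. Create the Scrapy project: scrapy startproject jingdong
2. Create the spider file (JD product pages contain many links, so the crawl template, which makes extracting and following links more convenient, is used):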
scrapy genspider -t crawl jdspider "https://search.jd.com/Search?keyword=男装&enc=utf-8&suggest=1.his.0.0&wq=&pvid=c02b7f8cf5b3446aa601a21c61c5db8b"
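3. Edit the settings file: ① Set ROBOTSTXT_OBEY = False so that Scrapy does not follow the target site's robots.txt rules; otherwise Scrapy skips the URLs that robots.txt disallows and no useful information is scraped.
② Set the download delay, browser request headers, and proxy IPs
③ Enable the item pipelines (two are needed, because the project scrapes both text and images)
④ Set the image download path
4. Define the items file: declare the fields to be scraped in items.py
5. Write the jdspider spider file
6. Configure the pipelines file: set up the pipelines that store text and images
Core idea: obtain, by whatever means are available, every field declared in the items file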
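Step 3② mentions rotating browser headers and proxy IPs but does not show how. The following is a minimal sketch of one way to do it with a downloader middleware, not the original project's code. The class name `RandomUserAgentProxyMiddleware`, the `USER_AGENTS` and `PROXIES` lists, and the proxy addresses are all hypothetical placeholders; real proxy endpoints would have to be supplied.

```python
import random

# Hypothetical pools, for illustration only (not from the original project).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
PROXIES = [
    "http://127.0.0.1:8888",  # placeholder address, replace with a real proxy
    "http://127.0.0.1:8889",  # placeholder address, replace with a real proxy
]


class RandomUserAgentProxyMiddleware:
    """Downloader middleware that picks a random User-Agent and proxy per request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a randomly chosen one.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None tells Scrapy to continue handling the request normally.
        return None
```

Assuming the class is saved in the project's middlewares.py, it could be enabled by adding 'jingdong.middlewares.RandomUserAgentProxyMiddleware': 543 to DOWNLOADER_MIDDLEWARES in settings.py. Combined with DOWNLOAD_DELAY, this spreads requests across several identities, although it does not guarantee that the site will not block the crawler.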
III. Project source code and design notes
1. settings
# -*- coding: utf-8 -*-
# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jingdong'
SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jingdong (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey the target site's robots.txt
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3  # 3-second download delay, so requests are not fast enough to be flagged as a malicious bot
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
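DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}  # Browser request headers that imitate a normal browser visit; defining headers for several different browsers and picking one at random can lower the chance of being blocked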
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}
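IMAGES_STORE = 'C:\\Users\\Administrator\\PycharmProjects\\jingdong\\jingdong\\pic'
# Image download path. The backslashes are doubled (\\) because in an ordinary Python string literal \U starts a Unicode escape sequence; a raw string (r'...') would also work.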
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.