Scrapy in Practice: Scraping Menswear Product Information from JD.com

Latest recommended article published on 2021-06-09 13:51:04

weixin_44516568

Article tags: Python, hands-on Scrapy crawler project

Article link: https://blog.csdn.net/weixin_44516568/article/details/97611818

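I. Project requirements: scraping menswear product information from JD.com

1. Tool: Scrapy's crawl spider template (CrawlSpider).

2. Content: scrape the product name, shop name, rating, price (for each color and size combination, at quantity 1), and multiple images per product.

3. Note: the crawler's IP address is easily banned, so take precautions.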
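
Requirement 2 can be read as one price record per color and size pair. A minimal sketch of enumerating those pairs is shown here; the color and size values are made-up placeholders rather than data from JD.com, and the price lookup for each pair is deliberately left out.

```python
from itertools import product

# Hypothetical option lists for one product; real values would come from the page.
colors = ["black", "navy"]
sizes = ["M", "L", "XL"]


def variant_keys(colors, sizes):
    """Return every (color, size) pair that needs its own price lookup."""
    return list(product(colors, sizes))


variants = variant_keys(colors, sizes)
print(len(variants))  # 2 colors x 3 sizes
```

Each pair would then drive one price request with the quantity fixed at 1.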

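II. Planning the crawler project

1. Create the project: scrapy startproject jingdong

2. Create the spider file. JD.com product pages contain many links, so the crawl template, which makes following links more convenient, is used: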

scrapy genspider -t crawl jdspider "https://search.jd.com/Search?keyword=男装&enc=utf-8&suggest=1.his.0.0&wq=&pvid=c02b7f8cf5b3446aa601a21c61c5db8b"

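3. Edit the settings file: ① Set ROBOTSTXT_OBEY = False so that Scrapy does not obey the target site's robots.txt rules; with the setting left on, the crawler cannot retrieve any useful information from this site.

② Set a download delay, browser request headers, and proxy IPs.

③ Enable the item pipelines. Because the project scrapes both text and images, two pipelines are needed.

④ Set the image download path.

4. Define the items file: declare in items every piece of page information to be scraped.

5. Write the jdspider spider file.

6. Define the pipelines file: set up the pipelines that store the text and the images.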
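
For the text side of step 6, a pipeline can be a plain class. Scrapy calls open_spider, process_item, and close_spider on each pipeline object listed in ITEM_PIPELINES. The sketch below is an assumption for illustration, not the author's code: the class name JsonLinesTextPipeline and the file name items.jl are hypothetical. The image side is usually delegated to Scrapy's built-in ImagesPipeline, which is not shown.

```python
import json
from dataclasses import asdict, is_dataclass


class JsonLinesTextPipeline:
    """Hypothetical text pipeline: writes each item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Accept dataclass items as well as dict-like items.
        record = asdict(item) if is_dataclass(item) else dict(item)
        # ensure_ascii=False keeps Chinese product names readable in the file.
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

To enable a pipeline like this, its import path would be added to ITEM_PIPELINES in settings with a priority number, for example 'jingdong.pipelines.JsonLinesTextPipeline': 300 (the path is assumed here).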

Core idea: obtain, by whatever route works, every field declared in the items file.
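
The fields named in the requirements could be declared along the following lines. This sketch uses a standard-library dataclass, which Scrapy 2.2 and later accept as an item type; the original project more likely used scrapy.Item with scrapy.Field(), and every field name here is an assumption. The names image_urls and images are the default fields that Scrapy's ImagesPipeline reads from and writes to.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class ProductItem:
    """Hypothetical item holding the fields named in the requirements."""
    name: str = ""
    shop: str = ""
    score: str = ""
    # Maps a "color/size" key to the unit price at quantity 1.
    prices: Dict[str, str] = field(default_factory=dict)
    # Default input and output fields of Scrapy's ImagesPipeline.
    image_urls: List[str] = field(default_factory=list)
    images: List[dict] = field(default_factory=list)


# Placeholder values, not scraped data.
item = ProductItem(name="sample jacket", shop="sample shop")
item.prices["black/L"] = "199.00"
item.image_urls.append("https://example.com/1.jpg")
print(asdict(item))
```

Using default_factory gives each item its own dictionary and lists, so values collected for one product never leak into another.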

III. Project source code and design notes

1. settings

# -*- coding: utf-8 -*-

# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jingdong'

SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jingdong (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey the target site's robots.txt rules

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3    # wait 3 s between downloads so that rapid requests are not flagged as a malicious bot
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
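}
# These request headers mimic a normal browser visit. Defining headers for several
# different browsers and choosing among them at random can lower the chance of being blocked.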

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}
IMAGES_STORE = 'C:\\Users\\Administrator\\PycharmProjects\\jingdong\\jingdong\\pic'
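# Image download path. The backslashes are doubled because \U starts a Unicode escape
# sequence in a normal string literal (it is not a keyword); a raw string r'...' also works.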
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.
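
The note on DEFAULT_REQUEST_HEADERS suggests rotating among several browser headers. A minimal sketch of that idea as a downloader middleware follows. Scrapy calls process_request(request, spider) on each enabled downloader middleware; the class name, the USER_AGENTS list, and its two strings are illustrative assumptions rather than part of the original project.

```python
import random

# Illustrative User-Agent strings; replace them with current browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) "
    "Gecko/20100101 Firefox/68.0",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that sets a random User-Agent."""

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request
```

A middleware like this would be switched on through DOWNLOADER_MIDDLEWARES in settings, for example 'jingdong.middlewares.RandomUserAgentMiddleware': 543 (the path is assumed). Proxy rotation can follow the same pattern by setting request.meta['proxy'] instead of a header.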
