Scraping JD.com product information
Objective
Scrape each product's name, comment count, seller, and price, then run a simple analysis.
1. Get the initial search URL
Inspecting the address bar gives: https://search.jd.com/Search?keyword=笔记本电脑 京东自营&enc=utf-8
where keyword is the search term (here "笔记本电脑 京东自营", i.e. "laptop" plus "JD self-operated").
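The keyword contains Chinese characters and a space, so it has to be percent-encoded before it can go into a request URL (the browser does this silently). A minimal sketch using the standard library; the helper name is made up:

```python
from urllib.parse import quote

def build_search_url(keyword):
    # Percent-encode the keyword (Chinese characters and the space)
    # the same way the browser does before sending the request.
    return 'https://search.jd.com/Search?keyword={}&enc=utf-8'.format(quote(keyword))

url = build_search_url('笔记本电脑 京东自营')
```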
2. Analyze the search page and derive the XPath expressions
Product name: //div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li//div[@class="p-name p-name-type-2"]/a/em/text()
Comment count: //div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li//div[@class="p-commit"]/strong/a/text()
Seller: //div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li//div[@class="p-shop"]//a/text()
Price: //div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li//div[@class="p-price"]//i/text()
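Expressions like these can be sanity-checked without hitting the site by running them against a hand-made HTML fragment. A sketch using lxml (the same XPath engine Scrapy's selectors build on); the markup below is a simplified mock of JD's listing, not the real page:

```python
from lxml import html as lhtml

# Stripped-down mock of one <li> from JD's result list -- just enough
# markup to exercise the XPath expressions above.
page = '''
<div id="J_goodsList"><ul class="gl-warp clearfix"><li>
  <div class="p-name p-name-type-2"><a><em>ThinkPad X1</em></a></div>
  <div class="p-commit"><strong><a>2.5万+</a></strong></div>
  <div class="p-shop"><a>Lenovo JD flagship store</a></div>
  <div class="p-price"><i>8999.00</i></div>
</li></ul></div>
'''
doc = lhtml.fromstring(page)
base = '//div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li'
name = doc.xpath(base + '//div[@class="p-name p-name-type-2"]/a/em/text()')
price = doc.xpath(base + '//div[@class="p-price"]//i/text()')
```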
Parsing the initial response yields only 30 records; scrolling down loads another 30, so the remaining items are fetched via Ajax. To capture all 60 we inject a JavaScript snippet that scrolls to the bottom of the page before handing the HTML to the spider.
This is done in a downloader middleware.
3. Get the next-page link
Inspection shows the next page is: https://search.jd.com/Search?keyword=笔记本电脑 京东自营&enc=utf-8&page=3
page increases by 2 per listing page (only odd values are used).
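Because of that stride of 2, listing page n maps to the query parameter page = 2n - 1, so all page URLs can be generated up front. A small sketch with the base URL copied from above:

```python
base = 'https://search.jd.com/Search?keyword=笔记本电脑 京东自营&enc=utf-8'

# Listing page n uses page = 2*n - 1, so the first 25 pages
# correspond to page = 1, 3, 5, ..., 49.
urls = ['{}&page={}'.format(base, 2 * n - 1) for n in range(1, 26)]
```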
4. Store the data
Save the scraped items to MongoDB.
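A minimal item-pipeline sketch for this step, assuming pymongo is installed and MongoDB runs locally on the default port; the database and collection names ('jd', 'notebook') are assumptions, not taken from the original:

```python
class MongoPipeline(object):
    """Sketch of a Scrapy item pipeline that writes items to MongoDB."""

    def open_spider(self, spider):
        # Imported lazily so the module loads even without pymongo.
        import pymongo
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['jd']['notebook']

    def process_item(self, item, spider):
        # Items behave like dicts, so they convert directly to documents.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

To take effect, the class would be registered in settings.py under ITEM_PIPELINES.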
5. Simple data analysis
Plot the price distribution, then analyze sales per seller and per notebook.
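Once the items are pulled out of MongoDB into a DataFrame, the analysis can start from something like the following pandas sketch. The three rows are toy stand-ins, not scraped data, and the comment count is used as a rough proxy for sales volume:

```python
import pandas as pd

# Toy records standing in for documents exported from MongoDB.
rows = [
    {'name': 'A', 'shop': 'Shop1', 'price': 4999.0, 'comments': 12000},
    {'name': 'B', 'shop': 'Shop1', 'price': 6999.0, 'comments': 3000},
    {'name': 'C', 'shop': 'Shop2', 'price': 8999.0, 'comments': 25000},
]
df = pd.DataFrame(rows)

# Price distribution: bucket prices into bands (plot with df.hist() etc.).
bands = pd.cut(df['price'], bins=[0, 5000, 8000, 20000])

# Treat comment counts as a sales proxy and rank sellers by total.
sales_by_shop = df.groupby('shop')['comments'].sum().sort_values(ascending=False)
```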
Detailed code walkthrough
Create a Scrapy project named jd.
items.py
import scrapy

class JdItem(scrapy.Item):
    """Defines the fields scraped for each product."""
    name = scrapy.Field()
    price = scrapy.Field()
    shop = scrapy.Field()
    comments = scrapy.Field()
jd_spider.py
import scrapy
from scrapy import Request
from jd.items import JdItem

class JDSpider(scrapy.Spider):
    name = 'jd_notebook'
    allowed_domains = ['jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=笔记本电脑 京东自营&enc=utf-8']

    ## Crawl 25 listing pages (page takes the odd values 1, 3, ..., 49)
    def parse(self, response):
        url = response.url
        for page in range(1, 50, 2):
            new_url = url + '&page={}'.format(page)
            yield Request(new_url, callback=self.parse_item)

    ## Parse one listing page
    def parse_item(self, response):
        lists = response.xpath('//div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li')
        for li in lists:
            item = JdItem()
            item['name'] = ''.join(li.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()').extract())
            price = li.xpath('.//div[@class="p-price"]//i/text()').extract_first()
            comments = li.xpath('.//div[@class="p-commit"]/strong/a/text()').extract_first()
            if price is None or comments is None:  # skip incomplete entries
                continue
            item['price'] = float(price)
            item['comments'] = self.deal_comments(comments)
            item['shop'] = li.xpath('.//div[@class="p-shop"]//a/text()').extract_first()
            if item['price'] < 1000:  # drop accessories and other cheap noise
                continue
            yield item

    ## Normalize comment counts such as '2.5万+' to an integer
    def deal_comments(self, text):
        text = text.replace('+', '')
        if text[-1] == '万':  # '万' means ten thousand
            comments = int(float(text[:-1]) * 10000)
        else:
            comments = int(text)
        return comments
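JD renders comment counts as strings like '2.5万+' ('万' meaning ten thousand), which is why the deal_comments helper strips the '+' and expands the '万' suffix. The same logic as a standalone function, for a quick check:

```python
def deal_comments(text):
    # '2.5万+' -> 25000, '800+' -> 800, '36' -> 36
    text = text.replace('+', '')
    if text[-1] == '万':
        return int(float(text[:-1]) * 10000)
    return int(text)
```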
middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from scrapy.http.response.html import HtmlResponse
import time
class JdSpiderMiddleware(object):