A CrawlSpider-based Scrapy crawler for scraping JD.com product information

items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # containers for the scraped values
    shop = scrapy.Field()
    shoplink = scrapy.Field()
    price = scrapy.Field()
    comment = scrapy.Field()
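A `scrapy.Field()` simply declares a key the item may hold; the item itself behaves like a dict. A quick sketch of that behavior (the values are placeholders, not real scraped data):

```python
from jingdong.items import JingdongItem

item = JingdongItem()
item['title'] = ['Example product title']  # dict-style assignment
print(item['title'])                       # dict-style lookup
# item['color'] = 'red'                    # KeyError: field was never declared
```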

jd.py (the spider)

# -*- coding: utf-8 -*-
import re
import urllib.request

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from jingdong.items import JingdongItem


class JdSpider(CrawlSpider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['http://www.jd.com/']
    '''
    # Alternative entry point: start directly from a search-result page
    # with an explicit User-Agent header.
    def start_requests(self):
        ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'}
        yield Request('https://search.jd.com/Search?keyword=%E8%BF%9E%E8%A1%A3%E8%A3%99%E5%86%AC%E5%A5%B3&enc=utf-8', headers=ua)
    '''
    # Follow every link on the site and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        try:
            # Only product detail pages (item.jd.com/<id>.html) are parsed
            pat = r'item\.jd\.com/(.*?)\.html'
            match = re.search(pat, response.url)
            if not match:
                return None
            thisid = match.group(1)
            i = JingdongItem()
            title = response.xpath('//html/head/title/text()').extract()
            shop = response.xpath('//div[@class="name"]/a/@title').extract()
            shoplink = response.xpath('//div[@class="name"]/a/@href').extract()
            # Price and review data are loaded via JSONP side endpoints,
            # so they are fetched separately and parsed with regexes
            priceurl = 'https://p.3.cn/prices/mgets?callback=jQuery9030294&type=1&area=1_72_4137_0&pdtk=&pduid=378203029&pdpin=&pin=null&pdbp=0&skuIds=J_' + str(thisid)
            commenturl = 'https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv191&productId=' + str(thisid) + '&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1'
            pricedata = urllib.request.urlopen(priceurl).read().decode('utf-8', 'ignore')
            commentdata = urllib.request.urlopen(commenturl).read().decode('utf-8', 'ignore')
            price = re.findall(r'"p":"(.*?)"', pricedata)
            comment = re.findall(r'"goodRateShow":(.*?),', commentdata)
            # Return the item only when every field was actually found;
            # an empty item would break the MySQL pipeline below
            if title and shop and shoplink and price and comment:
                i['title'] = title
                i['shop'] = shop
                i['shoplink'] = shoplink
                i['price'] = price
                i['comment'] = comment
                return i
            return None
        except Exception as e:
            self.logger.error(e)
            return None
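The spider recognizes product pages by their URL and pulls price and review figures out of two JSONP side endpoints with regexes. A runnable offline sketch of that extraction logic; the JSONP bodies below are made up to illustrate the shape the regexes expect, not captured JD responses:

```python
import re

# Product-ID extraction, same pattern the spider uses
pat = r'item\.jd\.com/(.*?)\.html'
m = re.search(pat, 'https://item.jd.com/100012043978.html')
print(m.group(1))  # -> 100012043978

# Illustrative JSONP bodies (shape assumed from the regexes above)
pricedata = 'jQuery9030294([{"id":"J_100012043978","p":"59.00","m":"99.00"}]);'
commentdata = 'fetchJSON_comment98vv191({"productCommentSummary":{"goodRateShow":98,"poorRateShow":1}});'

print(re.findall(r'"p":"(.*?)"', pricedata))              # -> ['59.00']
print(re.findall(r'"goodRateShow":(.*?),', commentdata))  # -> ['98']
```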

pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class JingdongPipeline(object):
    def open_spider(self, spider):
        # One connection for the whole crawl instead of one per item
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    passwd='123456', db='dd', charset='utf8mb4')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            # zip stops at the shortest list, so mismatched field lengths
            # cannot raise an IndexError
            rows = zip(item['title'], item['shop'], item['shoplink'],
                       item['price'], item['comment'])
            # Parameterized query avoids SQL injection and quoting errors
            sql = ("insert into jd(title,shop,shoplink,price,comment) "
                   "values(%s,%s,%s,%s,%s)")
            cursor.executemany(sql, list(rows))
        self.conn.commit()
        return item
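The pipeline assumes a `jd` table already exists in the `dd` database and that the pipeline is enabled in the project settings; neither step is shown above. A minimal sketch of both, with the column types guessed from the INSERT statement:

```python
import pymysql

# One-off setup: create the table the pipeline writes to
# (schema assumed from the INSERT columns; everything stored as text)
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       passwd='123456', db='dd', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("""
        create table if not exists jd (
            title    varchar(255),
            shop     varchar(255),
            shoplink varchar(255),
            price    varchar(50),
            comment  varchar(50)
        )
    """)
conn.commit()
conn.close()
```

As the header comment says, the pipeline also has to be registered in the project's settings.py (the number is its priority among pipelines):

```python
# settings.py
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
ROBOTSTXT_OBEY = False  # assumption: jd.com's robots.txt blocks generic crawlers
```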

 

Scrapy is a powerful Python crawler framework that can be used to scrape information from all kinds of websites. Below is a simpler example of crawling JD product information with Scrapy:

1. Create a Scrapy project

Run the following command to create the project:

```
scrapy startproject jingdong
```

This creates a Scrapy project named "jingdong".

2. Generate a spider

Run the following command to generate a spider:

```
scrapy genspider jingdong_spider jd.com
```

This adds a spider named "jingdong_spider" to the project for crawling jd.com.

3. Write the spider code

Open "jingdong_spider.py" and add the following code:

```python
import scrapy

class JingdongSpider(scrapy.Spider):
    name = "jingdong"
    allowed_domains = ["jd.com"]
    start_urls = [
        "https://list.jd.com/list.html?cat=9987,653,655"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="gl-warp clearfix"]/li'):
            item = {}
            # extract_first() returns None instead of raising an IndexError
            # when a product tile is missing a name or price
            item['name'] = sel.xpath('div[@class="gl-i-wrap"]/div[@class="p-name"]/a/em/text()').extract_first()
            item['price'] = sel.xpath('div[@class="gl-i-wrap"]/div[@class="p-price"]/strong/i/text()').extract_first()
            yield item
```

This simple spider scrapes product names and prices from "https://list.jd.com/list.html?cat=9987,653,655" and yields each pair as a dictionary.

4. Run the spider

Run the following command to start the crawl:

```
scrapy crawl jingdong
```

Scrapy will crawl the JD product listings and print the results to the terminal.

This is only a minimal example; you can adapt the code to your own needs or to other sites.
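Instead of only printing to the terminal, Scrapy's built-in feed exporter can also write the yielded items straight to a file:

```
scrapy crawl jingdong -o products.json
```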
