Python crawler series (6): scraping Amazon data with the Scrapy framework

weixin_39694016

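Extracting the data with xpath() is fairly simple; the troublesome part is following links and crawling recursively. I lost a lot of time on this. Douban is so much nicer, its URLs are wonderfully regular. Amazon's URLs, on the other hand, are a mess... though it may just be that I don't understand URLs well enough yet.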
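To make the Amazon URL a little less opaque, here is a small sketch that takes apart the search URL this post crawls (the one built in pad_spider.py) using only the standard library. Reading the `rh` parameter as a chain of category filters is my assumption, and `listing_url` is a hypothetical helper, not something from the original project.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Page 2 of the listing, assembled from the same pieces the spider uses.
url = (
    "http://www.amazon.cn/s/ref=sr_pg_2"
    "?rh=n%3A2016116051%2Cn%3A!2016117051%2Cn%3A888465051%2Cn%3A106200071"
    "&page=2&ie=UTF8&qid=1408641827"
)

parts = urlsplit(url)
query = parse_qs(parts.query)  # parse_qs percent-decodes the values

# Assumption: "rh" is a comma-separated chain of "n:<id>" category filters.
category_chain = query["rh"][0].split(",")
page = int(query["page"][0])

print(parts.path)
print(category_chain)
print(page)


def listing_url(page_number, rh=query["rh"][0]):
    """Rebuild the listing URL for a given page (hypothetical helper)."""
    return (
        "http://www.amazon.cn/s/ref=sr_pg_%d?rh=%s&page=%d&ie=UTF8&qid=1408641827"
        % (page_number, quote(rh, safe="!"), page_number)
    )
```

Once decoded, the only parts that change from page to page are the number after `sr_pg_` and the `page` parameter, which is why the spider can generate every start URL with a simple loop.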

amazon
├── amazon
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── msic
│   │   ├── __init__.py
│   │   └── pad_urls.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── pad_spider.py
│       └── pad_spider.pyc
├── pad.xml
└── scrapy.cfg

(1)items.py

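from scrapy import Item, Field


class PadItem(Item):
    # Product id, read from the "name" attribute of each result div
    sno = Field()
    # Price string exactly as shown on the page, e.g. "￥3199.00"
    price = Field()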

(2)pad_spider.py

# -*- coding: utf-8 -*-
from scrapy import Spider

from amazon.items import PadItem


class PadSpider(Spider):
    name = "pad"
    # The listing pages live on amazon.cn, so that is the domain to allow.
    allowed_domains = ["amazon.cn"]
    start_urls = []

    # Fixed pieces of the search-result URL; only the page number changes.
    u1 = 'http://www.amazon.cn/s/ref=sr_pg_'
    u2 = '?rh=n%3A2016116051%2Cn%3A!2016117051%2Cn%3A888465051%2Cn%3A106200071&page='
    u3 = '&ie=UTF8&qid=1408641827'

    # Queue result pages 1 through 181.
    for i in range(181):
        url = u1 + str(i + 1) + u2 + str(i + 1) + u3
        start_urls.append(url)

    def parse(self, response):
        # One div per product on the result page.
        sites = response.xpath('//div[@class="rsltGrid prod celwidget"]')
        items = []
        for site in sites:
            item = PadItem()
            item['sno'] = site.xpath('@name').extract()[0]
            try:
                item['price'] = site.xpath('ul/li/div/a/span/text()').extract()[0]
            # An IndexError here means the product is a new release,
            # so the price is read from the other path instead.
            except IndexError:
                item['price'] = site.xpath('ul/li/a/span/text()').extract()[0]
            items.append(item)
        return items
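The try/except fallback in parse() is easier to see on a tiny example. The sketch below uses only the standard library (ElementTree understands a small XPath subset) on made-up markup that imitates the two layouts; the real Amazon pages are not well-formed XML, so this illustrates the lookup order only and is not a replacement for Scrapy's selectors. `first_text` and `extract_items` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup imitating the two product layouts.
SAMPLE = """
<html><body>
  <div class="rsltGrid prod celwidget" name="B00JWCIJ78">
    <ul><li><div><a><span>￥3199.00</span></a></div></li></ul>
  </div>
  <div class="rsltGrid prod celwidget" name="B00E907DKM">
    <ul><li><a><span>￥3079.00</span></a></li></ul>
  </div>
</body></html>
""".strip()


def first_text(node, paths):
    """Return the text of the first path that matches, or None."""
    for path in paths:
        found = node.find(path)
        if found is not None and found.text:
            return found.text.strip()
    return None


def extract_items(markup):
    root = ET.fromstring(markup)
    items = []
    for site in root.findall(".//div[@class='rsltGrid prod celwidget']"):
        items.append({
            "sno": site.get("name"),
            # Regular layout first, then the layout the post treats as "new release".
            "price": first_text(site, ["ul/li/div/a/span", "ul/li/a/span"]),
        })
    return items


print(extract_items(SAMPLE))
```

Unlike the spider, which relies on an IndexError to trigger the second lookup, this version tries each path in turn and returns None when neither matches, so one oddly formatted product does not abort the whole page.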

(3)settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for amazon project
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
# http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'amazon'

SPIDER_MODULES = ['amazon.spiders']
NEWSPIDER_MODULE = 'amazon.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'amazon (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

FEED_URI = 'pad.xml'
FEED_FORMAT = 'xml'
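On recent Scrapy releases (2.1 and later) the FEED_URI / FEED_FORMAT pair is deprecated in favour of a single FEEDS dictionary. A roughly equivalent fragment, assuming the same pad.xml output file, might look like this:

```python
# settings.py fragment for Scrapy 2.1 or newer: one FEEDS entry per output file.
FEEDS = {
    "pad.xml": {
        "format": "xml",
        "encoding": "utf8",
    },
}
```

The old two-setting form shown above is what the 2014 version of this project used; this fragment is only a sketch for anyone running the code on a newer install.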

(4)Result: pad.xml

B00JWCIJ78    ￥3199.00
B00E907DKM    ￥3079.00
B00L8R7HKA    ￥3679.00
B00IZ8W4F8    ￥3399.00
B00MJMW4BU    ￥4399.00
B00HV7KAMI    ￥3799.00
...
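The prices come out as display strings such as "￥3199.00". If you want to sort or compare them, something like the following sketch converts them to numbers first. `parse_price` is a hypothetical helper, and the thousands-separator handling is an assumption, since no such value appears in the output above.

```python
from decimal import Decimal


def parse_price(text):
    """Turn a display string such as '￥3199.00' into Decimal('3199.00')."""
    # Strip either form of the yen/yuan sign, plus any thousands separators.
    cleaned = text.strip().lstrip("￥¥").replace(",", "").strip()
    return Decimal(cleaned)


rows = [("B00JWCIJ78", "￥3199.00"), ("B00E907DKM", "￥3079.00")]
prices = {sno: parse_price(price) for sno, price in rows}

# Product id with the highest price among the sample rows.
print(max(prices, key=prices.get))
```

Decimal is used rather than float so that monetary values keep their exact two decimal places.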

(5)Saving the data to a database

...
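The post does not show how the data was actually stored, so the following is only one possible sketch: an item pipeline that writes each item into a local SQLite file. The class name, table name, and `pad.db` path are all hypothetical; the three method names follow Scrapy's item pipeline interface.

```python
import sqlite3


class SQLitePadPipeline:
    """Hypothetical pipeline: stores each scraped item in a SQLite table."""

    def __init__(self, db_path="pad.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pad (sno TEXT PRIMARY KEY, price TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per product if a page is crawled twice.
        self.conn.execute(
            "INSERT OR REPLACE INTO pad (sno, price) VALUES (?, ?)",
            (item["sno"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the crawl finishes.
        self.conn.close()
```

To try it, the class would go into amazon/pipelines.py and be switched on in settings.py with `ITEM_PIPELINES = {'amazon.pipelines.SQLitePadPipeline': 300}`.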

-- 2014-08-22 04:12:43
