Reading Notes (7): Advanced Scrapy


Earlier, when running the spider file, I ran into an error; after searching Baidu, it turned out the .py file in question had to be installed manually (inside the scrapy folder).


First, use the following command to generate a spider project named DDrun:

scrapy startproject DDrun

cd into the DDrun\DDrun\spiders directory, then run the following command to create a spider file:

scrapy genspider -t basic DDsipder dangdang.com
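
The basic template produces a spider skeleton roughly like the following (the exact contents vary a little between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class DdsipderSpider(scrapy.Spider):
    name = 'DDsipder'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        pass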

After creating the project and the spider file, first define the required item containers in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DdrunItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title stores the book title
    title = scrapy.Field()
    # link stores the product link
    link = scrapy.Field()
    # price stores the price
    price = scrapy.Field()
    # author stores the author
    author = scrapy.Field()
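
A Scrapy Item behaves like a dict restricted to the fields declared above, so values are set and read by key; a minimal sketch of how DdrunItem will be used (not part of the project files):

from DDrun.items import DdrunItem

item = DdrunItem()
item['title'] = 'some book title'
item['price'] = '¥82.80'
print(item['title'])    # fields are read back by key
print(dict(item))       # an Item converts to a plain dict, as the pipeline later does
# item['publisher'] = '...' would raise KeyError, because that field is not declared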

Next we edit the spider code in the spider file (DDsipder.py). Before writing it, let's analyze the HTML of the pages we want to crawl:

http://category.dangdang.com/cp01.16.00.00.00.00.html
#URL of page 1 of Dangdang's relationship-books category
http://category.dangdang.com/pg2-cp01.16.00.00.00.00.html
#URL of page 2 of the same category; comparing the two, page 1 is simply missing the "pg2-" segment
#Change page 1's URL to:
http://category.dangdang.com/pg1-cp01.16.00.00.00.00.html
#Opening this URL still shows page 1 of the category, so the pages differ only in the number after "pg"

'''
<a class="pic" title=" 男人来自火星,女人来自金星(1-4套装)(升级版) " ddclick="act=normalResult_picture&pos=23706011_0_2_p" name="itemlist-picture" dd_name="单品图片" href="http://product.dangdang.com/23706011.html" target="_blank">
'''
#In this snippet, the title attribute holds the book title and href holds the product link
#So the XPath expressions for title and link are //a[@class="pic"]/@title and //a[@class="pic"]/@href
#i.e. the title and href attribute values of every <a> tag with class="pic"


#<span class="search_now_price">¥82.80</span>
#The XPath for the price is //span[@class="search_now_price"]/text()
#i.e. the text content of every <span> tag with class="search_now_price"

'''
<a href="http://search.dangdang.com/?key2=约翰格雷&medium=01&category_path=01.00.00.00.00.00" name="itemlist-author" dd_name="单品作者" title="[美]约翰格雷 博士 著">约翰格雷</a>
'''
#The XPath for the author is //a[@dd_name="单品作者"]/text()
#i.e. the text content of every <a> tag with dd_name="单品作者"
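
These XPath expressions can be checked interactively in the Scrapy shell before the spider is written; for example (the exact results depend on the live page):

scrapy shell "http://category.dangdang.com/pg1-cp01.16.00.00.00.00.html"

>>> response.xpath('//a[@class="pic"]/@title').extract()[:3]
>>> response.xpath('//a[@class="pic"]/@href').extract()[:3]
>>> response.xpath('//span[@class="search_now_price"]/text()').extract()[:3]
>>> response.xpath('//a[@dd_name="单品作者"]/text()').extract()[:3]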

After working out the XPath expressions for the data we want, we can write the spider:

# -*- coding: utf-8 -*-
import scrapy
from DDrun.items import DdrunItem
from scrapy.http import Request


class DdsipderSpider(scrapy.Spider):
    name = 'DDsipder'
    allowed_domains = ['dangdang.com']
    start_urls = (r'http://category.dangdang.com/pg1-cp01.16.00.00.00.00.html',)

    def parse(self, response):
        item = DdrunItem()
        item['title'] = response.xpath(r"//a[@class='pic']/@title").extract()
        item['link'] = response.xpath(r"//a[@class='pic']/@href").extract()
        item['price'] = response.xpath(r"//span[@class='search_now_price']/text()").extract()
        item['author'] = response.xpath(r"//a[@dd_name='单品作者']/text()").extract()
        # return the item once all fields are extracted
        yield item
        # crawl all 19 pages automatically by looping over the page number
        for i in range(1,20):
            url = "http://category.dangdang.com/pg"+str(i)+"-cp01.16.00.00.00.00.html"
            # yield a Request for each page URL, with parse() as the callback
            yield Request(url,callback = self.parse)
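
Note that this parse() stores whole lists in a single item, and every call re-yields all 19 page URLs; Scrapy's built-in duplicate filter ensures each page is still downloaded only once. If you would rather yield one item per book, a variant of parse() along these lines (my sketch, not the original code) zips the four lists together:

    def parse(self, response):
        titles = response.xpath("//a[@class='pic']/@title").extract()
        links = response.xpath("//a[@class='pic']/@href").extract()
        prices = response.xpath("//span[@class='search_now_price']/text()").extract()
        authors = response.xpath("//a[@dd_name='单品作者']/text()").extract()
        # zip() stops at the shortest list, so a book missing a price or author is simply skipped
        for title, link, price, author in zip(titles, links, prices, authors):
            item = DdrunItem()
            item['title'] = title
            item['link'] = link
            item['price'] = price
            item['author'] = author
            yield item
        for i in range(1, 20):
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.16.00.00.00.00.html"
            yield Request(url, callback=self.parse)

With this version, the pipeline below would also have to write one line per item instead of looping over the lists.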

Next, open pipelines.py and write code to process the items we receive:

# -*- coding: utf-8 -*-
import codecs
import json
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DdrunPipeline(object):
    def __init__(self):
        # open the output file once when the pipeline is created
        self.file = codecs.open(r"F:\anacon\spider_scrapy\DDrun\dangdang2.json","wb",encoding = "utf-8")

    def process_item(self, item, spider):
        # each field of the item is a list; write one JSON object per book,
        # assuming the four lists all have the same length
        for j in range(0,len(item["title"])):
            name = item["title"][j]
            price = item["price"][j]
            author = item["author"][j]
            link = item["link"][j]
            first_dict = {"name":name,"price":price,"author":author,"link":link}
            i = json.dumps(dict(first_dict), ensure_ascii=False)
            line = i + '\n'
            self.file.write(line)
        return item

    def close_spider(self,spider):
        # close the file when the spider finishes
        self.file.close()
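
Because process_item() appends one JSON object per line, the output is a JSON Lines file rather than a single JSON document; it can be read back with something like this (a sketch that assumes the same output path as above):

import json

with open(r"F:\anacon\spider_scrapy\DDrun\dangdang2.json", encoding="utf-8") as f:
    books = [json.loads(line) for line in f if line.strip()]

print(len(books))
print(books[0]["name"], books[0]["price"], books[0]["author"])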

After writing the pipeline, enable it in settings.py: find the ITEM_PIPELINES block and remove the comment marks so it looks like this. The number (300 here) is the pipeline's priority; lower values run earlier, and values are conventionally kept in the 0-1000 range.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'DDrun.pipelines.DdrunPipeline': 300,
}

You can also edit the middleware file (middlewares.py) to rotate the user agent and the requesting IP:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import random
from DDrun.settings import IPPOOL
from DDrun.settings import UAPOOL
# in current Scrapy these middlewares live under scrapy.downloadermiddlewares
# (the old scrapy.contrib path has been removed)
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class IPPOOLS(HttpProxyMiddleware):
    def __init__(self,ip=''):
        self.ip = ip
    def process_request(self,request,spider):
        # pick a random proxy IP from the pool and attach it to this request
        thisip = random.choice(IPPOOL)
        print("Current proxy IP: "+thisip['ipaddr'])
        request.meta["proxy"] = "http://"+thisip["ipaddr"]

class UAPOOLS(UserAgentMiddleware):
    def __init__(self,ua=''):
        self.ua = ua
    def process_request(self,request,spider):
        # pick a random user agent from the pool for this request
        thisua = random.choice(UAPOOL)
        print("Current User-Agent: "+thisua)
        request.headers.setdefault('User-Agent',thisua)

In that case you also need to define the IP pool and the user-agent pool in settings.py and enable the corresponding middlewares:

IPPOOL = [
        {'ipaddr':'116.18.228.55'}
        ]

UAPOOL = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0',
        "Mozilla/5.0 (Windows NT 6.1; rv:48.0) Gecko/20100101 Firefox/48.0",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5"
        ]

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'DDrun.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    'DDrun.middlewares.IPPOOLS': 125,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
    'DDrun.middlewares.UAPOOLS': 1,
}

However, I ran into a problem: with this IP, requests to dangdang.com kept failing (pages would not download, connections failed, no response, and so on). In the end I disabled the proxy-IP middleware, and crawling worked normally again.
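
One simple way to switch the proxy off while keeping the user-agent rotation is to comment its two entries back out in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    # proxy rotation disabled: the free IP could not reach dangdang.com
    #'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    #'DDrun.middlewares.IPPOOLS': 125,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
    'DDrun.middlewares.UAPOOLS': 1,
}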

You also need to disable cookies and robots.txt compliance in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

Run the spider from cmd:

scrapy crawl DDsipder --nolog
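
As a side note, Scrapy's built-in feed export can also dump the scraped items straight to a file, without a custom pipeline, for example:

scrapy crawl DDsipder -o items.json --nolog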


Open the generated file in the output directory to check the scraped data.

Reposted from: https://my.oschina.net/u/3629884/blog/1506805
