读书笔记（7）——scrapy进阶

最新推荐文章于 2024-08-29 10:04:09 发布

weixin_34318956

最新推荐文章于 2024-08-29 10:04:09 发布

阅读量83

点赞数

文章标签： python 爬虫 json

原文链接：https://my.oschina.net/u/3629884/blog/1506805

版权

2019独角兽企业重金招聘Python工程师标准>>>

之前运行爬虫文件的时候碰到问题，通过查阅百度，需要在手动安装该.py文件（在scripy文件夹内）

首先使用一下命令生成一个名为DDrun的爬虫项目

scrapy startproject DDrun

cd到DDrun\DDrun\spiders目录下，然后输入一下命令，创建一个爬虫文件。

scrapy genspider -t basic DDsipder dangdang.com

创建爬虫项目及爬虫文件后，先在items.py编写所需要的容器：

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DdrunItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #定义title存储商品标题
    title = scrapy.Field()
    #定义link存储商品链接
    link = scrapy.Field()
    #定义price存储商品价格
    price = scrapy.Field()
    #定义author存储商品作者
    author = scrapy.Field()

再在爬虫文件（DDsipder.py）编辑爬虫代码，在这里我们首先分析要爬取的网页的HTML:

http://category.dangdang.com/cp01.16.00.00.00.00.html
#这是当当网两性书籍的第一页的网址
http://category.dangdang.com/pg2-cp01.16.00.00.00.00.html
#这是当当网两性书籍的第二页的网址，我们可以发现第一页和第二页主要少了一个"pg2-"
#将第一页的网址改成：
http://category.dangdang.com/pg1-cp01.16.00.00.00.00.html
#打开此网址，我们发现这就是当当网的两性书籍的第一页。所以可以判定，每一个只是pg后的值不一样

'''
<a class="pic" title=" 男人来自火星，女人来自金星（1-4套装）（升级版） " ddclick="act=normalResult_picture&pos=23706011_0_2_p" name="itemlist-picture" dd_name="单品图片" href="http://product.dangdang.com/23706011.html" target="_blank">
'''
#注意看网页的信息，title里的是书的名字，href里的是书的链接
#所以这里书名、书链接的xpath路径是 //a[@class="pic"]/@title、//a[@class="pic"]/@href
#所有class="pic"的a标签下的title属性对应的值，以及href对应的值


#<span class="search_now_price">¥82.80</span>
#书的价格的xpath的路径是 //span[@class="search_now_price"]/text()
#所有class="search_now_price"的span标签的内容文本

'''
<a href="http://search.dangdang.com/?key2=约翰格雷&medium=01&category_path=01.00.00.00.00.00" name="itemlist-author" dd_name="单品作者" title="[美]约翰格雷 博士　著">约翰格雷</a>
'''
#书的作者的xpath的路径为 //a[@dd_name="单品作者"]/text()
#所有的dd_name="单品作者"的a标签下的内容文本

分析完我们要的数据的xpath后，可以进行爬虫文件的编写：

# -*- coding: utf-8 -*-
import scrapy
from DDrun.items import DdrunItem
from scrapy.http import Request


class DdsipderSpider(scrapy.Spider):
    name = 'DDsipder'
    allowed_domains = ['dangdang.com']
    start_urls = (r'http://category.dangdang.com/pg1-cp01.16.00.00.00.00.html',)

    def parse(self, response):
        item = DdrunItem()
        item['title'] = response.xpath(r"//a[@class='pic']/@title").extract()
        item['link'] = response.xpath(r"//a[@class='pic']/@href").extract()
        item['price'] = response.xpath(r"//span[@class='search_now_price']/text()").extract()
        item['author'] = response.xpath(r"//a[@dd_name='单品作者']/text()").extract()
        #提取完后返回item
        yield item
        #通过循环自动爬取19页的数据
        for i in range(1,20):
            url = "http://category.dangdang.com/pg"+str(i)+"-cp01.16.00.00.00.00.html"
            #通过yield返回Request,并指定要爬取的网址和回调函数
            yield Request(url,callback = self.parse)

再接着打开piplines，编写代码对得到的item进行处理：

# -*- coding: utf-8 -*-
import codecs
import json
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DdrunPipeline(object):
    def __init__(self):
        self.file = codecs.open(r"F:\anacon\spider_scrapy\DDrun\dangdang2.json","wb",encoding = "utf-8")
        
    def process_item(self, item, spider):
        for j in range(0,len(item["title"])):
            name = item["title"][j]
            price = item["price"][j]
            author = item["author"][j]
            link = item["link"][j]
            first_dict = {"name":name,"price":price,"author":author,"link":link}
            i = json.dumps(dict(first_dict), ensure_ascii=False)
            line = i + '\n'
            self.file.write(line)
        return item
    def close_spider(self,spider):
        self.file.close()

编写完pipelines后，需要在settings.py内将piplines功能打开，找到代码，将注释去掉后如下。

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'DDrun.pipelines.DdrunPipeline': 300,
}

也可以在中间件middlewares内编辑代理用户和访问IP：

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import random
from DDrun.settings import IPPOOL
from DDrun.settings import UAPOOL
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class IPPOOLS(HttpProxyMiddleware):
    def __init__(self,ip=''):
        self.ip = ip
    def process_request(self,request,spider):
        thisip = random.choice(IPPOOL)
        print("当前使用的IP是："+thisip['ipaddr'])
        request.meta["proxy"] = r"http://"+thisip["ipaddr "]

class UAPOOLS(UserAgentMiddleware):
    def __init__(self,ua=''):
        self.ua = ua
    def process_request(self,request,spider):
        thisua = random.choice(UAPOOL)
        print("当前使用的用户代理是："+thisua)
        request.headers.setdefault('User-Agent',thisua)

那么这样的话也需要在settings.py内编辑IP和代理用户，以及打开对应的功能

IPPOOL = [
        {'ipaddr':'116.18.228.55'}
        ]

UAPOOL = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0',
        "Mozilla/5.0 (Windows NT 6.1; rv:48.0) Gecko/20100101 Firefox/48.0",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5"
        
        ]

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    #'DDrun.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':123,
    'DDrun.middlewares.IPPOOLS':125
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware':2,
    'DDrun.middlewares.UAPOOLS':1
}

但是发现一个问题，这个IP访问当当网不成功啊，没办法下载网页信息，链接失败，无响应等各种问题。最后把IP的功能关闭，就可以正常访问了。

还需要在settings.py里将cookie和遵守爬虫规则的选项关闭：

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

在cmd下运行爬虫：

scrapy crwal DDsipder --nolog

在目录下打开生成的文件

转载于:https://my.oschina.net/u/3629884/blog/1506805

weixin_34318956

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
读书笔记（7）——scrapy进阶

2019独角兽企业重金招聘Python工程师标准>>> ...
复制链接

扫一扫