Scrapy in Practice: Scraping Newspaper Names and Addresses

Goal: scrape the names and addresses of newspapers nationwide.

Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm

Purpose: practice scraping data with Scrapy.

 

Having covered the basics of Scrapy, let's write the simplest possible spider.

Target screenshot (image not reproduced here).

  1. Create the Scrapy project

$ cd ~/code/crawler/scrapyProject
$ scrapy startproject newSpapers
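
For reference, startproject generates a skeleton roughly like the following (the exact set of files varies slightly across Scrapy versions):

newSpapers/
    scrapy.cfg            # deploy configuration
    newSpapers/           # the project's Python package
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py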

  2. Generate the spider

$ cd newSpapers/
$ scrapy genspider nationalNewspaper news.xinhuanet.com 

  3. Define the item fields

$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()
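
A scrapy.Item behaves like a dict that only accepts the fields declared on the class. A quick illustrative check in a Python session (the publisher key is deliberately undeclared, to show the error):

>>> from newSpapers.items import NewspapersItem
>>> item = NewspapersItem()
>>> item['name'] = ['人民日报']   # fields hold lists, since .extract() returns lists
>>> item['name']
['人民日报']
>>> item['publisher'] = 'x'
Traceback (most recent call last):
  ...
KeyError: 'NewspapersItem does not support field: publisher'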

  4. Write the spider

$ cat spiders/nationalNewspaper.py
# -*- coding: utf-8 -*-
import scrapy
from newSpapers.items import NewspapersItem

class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

    def parse(self, response):
        # Row 2 of the big table holds the national newspapers; row 4
        # holds the local ones (selected here but not used further).
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
        # Each <a> wraps one paper: the name sits in a <strong>, the
        # e-paper URL in the href attribute.
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        items = []
        for each in tags_a_country:
            item = NewspapersItem()
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            items.append(item)
        return items
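
Before trusting XPath expressions like these, it is worth probing the page interactively. A minimal check with scrapy shell, mirroring the selectors used in parse() above:

$ scrapy shell 'http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm'
>>> row = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
>>> row.xpath('./td/table/tbody/tr/td/p/a/strong/text()').extract()[:3]

One caveat: a tbody element is often inserted by the browser and absent from the raw HTML that Scrapy downloads, so if a selector copied from browser dev tools returns an empty list in the shell, try removing tbody from the path.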

  5. Configure which pipeline handles the scraped items

$ cat settings.py
……
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {'newSpapers.pipelines.NewspapersPipeline': 100}
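
The number 100 is an ordering key, not a magic value: Scrapy runs the enabled pipelines in ascending order of these integers, conventionally chosen in the 0-1000 range. If we later added a second pipeline, say a hypothetical DuplicatesPipeline, it could be slotted in to run first:

ITEM_PIPELINES = {
    'newSpapers.pipelines.DuplicatesPipeline': 50,   # hypothetical, would run first
    'newSpapers.pipelines.NewspapersPipeline': 100,
}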

  6. Write the item pipeline

$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class NewspapersPipeline(object):
    def process_item(self, item, spider):
        filename = 'newspaper.txt'
        print('================')
        print(item)
        print('================')
        # Append one "name<TAB>url" line per scraped item.
        with open(filename, 'a', encoding='utf-8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item
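
With the pipeline registered, run the spider from inside the project directory; the argument is the name attribute defined in step 4:

$ cd ~/code/crawler/scrapyProject/newSpapers
$ scrapy crawl nationalNewspaper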

  7. Check the results

$ cat spiders/newspaper.txt 
人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报	http://www.economicdaily.com.cn/no1/
解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/
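
As an aside, Scrapy's built-in feed exports could produce a similar listing without any pipeline code; the output format is inferred from the file extension:

$ scrapy crawl nationalNewspaper -o newspapers.csv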

  


Reposted from: https://www.cnblogs.com/kongzhagen/p/6381306.html
