scrapy-1: Create a Project and Scrape

Create a new project

 

[root@localhost pytest]# scrapy startproject iteye886           
New Scrapy project 'iteye886', using template directory '/usr/lib64/python2.7/site-packages/scrapy/templates/project', created in:
    /root/pytest/iteye886

You can start your first spider with:
    cd iteye886
    scrapy genspider example example.com
[root@localhost pytest]# cd iteye886/
[root@localhost iteye886]# scrapy genspider myblog 886.iteye.com
Created spider 'myblog' using template 'basic' in module:
  iteye886.spiders.myblog
[root@localhost iteye886]# tree
.
├── iteye886
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       └── myblog.py
└── scrapy.cfg

2 directories, 10 files

Define the fields to collect

 

[root@localhost iteye886]# vim iteye886/items.py 

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Iteye886Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # fields we want to collect
    link = scrapy.Field()
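A `scrapy.Item` behaves like a dict, but it rejects any key that was not declared as a `Field()`. The stdlib stand-in below is an illustration of that behavior only, not real scrapy code:

```python
# Minimal stand-in for scrapy.Item's declared-fields behavior.
# Illustration only: the real scrapy.Item uses a metaclass and Field objects.
class FakeItem(dict):
    fields = ("title", "link")  # mirrors the fields declared in items.py

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)

item = FakeItem()
item["title"] = "centos7-python tab completion"
item["link"] = "/blog/2324590"
# item["author"] = "x"   # would raise KeyError, like an undeclared Field
```

This is why a typo such as `item['titel']` fails loudly at assignment time rather than silently producing an empty CSV column.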

Edit the spider code

 

[root@localhost iteye886]# vim iteye886/spiders/myblog.py 

# -*- coding: utf-8 -*-
import scrapy
from iteye886.items import Iteye886Item  # import the Item class


class MyblogSpider(scrapy.Spider):
    name = "myblog"
    allowed_domains = ["886.iteye.com"]
    start_urls = (
        'http://886.iteye.com/',  # note: no "www" prefix
    )

    def parse(self, response):
        # the <a> nodes under the post headings
        lis = response.xpath('//*[@id="main"]/div/div[1]/h3/a')
        for li in lis:
            item = Iteye886Item()  # one fresh item per link
            # query relative to the current node, not the whole response
            item['title'] = li.xpath('text()').extract()
            item['link'] = li.xpath('@href').extract()
            yield item
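The key point in `parse()` is querying each node relatively (`li.xpath('text()')`) instead of re-querying `response` inside the loop. The same per-node pattern can be sketched with the stdlib, using a hypothetical fragment shaped like the blog's listing page:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the page the spider crawls
html = (
    '<div id="main"><div><div>'
    '<h3><a href="/blog/2324590">post one</a></h3>'
    '<h3><a href="/blog/2324577">post two</a></h3>'
    '</div></div></div>'
)

root = ET.fromstring(html)
items = []
for a in root.iter("a"):  # iterate matched nodes, like `for li in lis`
    # extract text/href relative to this node, not the whole document
    items.append({"title": a.text, "link": a.get("href")})
```

Querying `root` (or, in scrapy, `response`) inside the loop would return the same first match on every iteration, which is exactly the bug fixed above.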

 

[root@localhost iteye886]# scrapy list
myblog
[root@localhost iteye886]# scrapy crawl myblog -o abc.csv
[root@localhost iteye886]# cat abc.csv 
link,title
/blog/2324590,centos7-python:交互界面tab补齐
/blog/2324577,scrapy-0:centos7安装scrapy
....
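The `-o abc.csv` feed exporter writes a header row followed by one row per yielded item, so the file can be read back with the stdlib `csv` module. The sample below mirrors that layout, reusing a link from the crawl output above with a placeholder title:

```python
import csv
import io

# Sample mirroring abc.csv's layout; the title is a placeholder
sample = "link,title\r\n/blog/2324590,some-post-title\r\n"
rows = list(csv.DictReader(io.StringIO(sample)))
# each row comes back as a dict keyed by the header columns
```

`DictReader` keys each row by the header, so downstream code can use `row["link"]` instead of positional indexes.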
