I have been learning Scrapy for a while, so today I'll wrap things up by scraping a joke feed. For Scrapy installation, see the previous post.
First, create the Scrapy project from the command prompt:
scrapy startproject myspider
scrapy genspider nhsq "neihanshequ.com"
With that, the project skeleton is in place:
T:.
│  scrapy.cfg
│
└─myspider
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py
            nhsq.py
These files are:
- scrapy.cfg: the project configuration file
- myspider/: the project's Python module; the code will be imported from here later
- myspider/items.py: the project's items file
- myspider/pipelines.py: the project's pipelines file
- myspider/settings.py: the project's settings file
- myspider/spiders/: the directory that holds the spiders
The page is loaded dynamically, so everything we want lives in the AJAX responses. Capture the traffic and you find the URL:
http://neihanshequ.com/joke/?is_json=1&app_name=neihanshequ_web&max_time=timestamp
Append a Unix timestamp as max_time to build the request URL.
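Before wiring this into Scrapy, it is worth sanity-checking the endpoint by hand. Here is a minimal sketch using the requests library (my choice here, not part of the original setup); it also assumes the jokes sit under data -> data, which is what the spider below relies on:

import time

import requests

# Build the joke-feed URL with the current Unix timestamp as max_time
url = ('http://neihanshequ.com/joke/?is_json=1'
       '&app_name=neihanshequ_web&max_time={}').format(int(time.time()))
result = requests.get(url).json()
# Expect a list of joke entries under data -> data
print(len(result.get('data', {}).get('data', [])))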
Define the item, i.e. the fields to scrape, in items.py:
import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    create_time = scrapy.Field()
    content = scrapy.Field()
    digg_count = scrapy.Field()
    favorite_count = scrapy.Field()
    comment_count = scrapy.Field()
    author = scrapy.Field()
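A scrapy.Item behaves like a dict with a fixed set of allowed keys, which is why the spider below can fill it field by field. A quick illustration (the values are made up):

item = MyspiderItem()
item['content'] = 'some joke text'
item['digg_count'] = 42
print(dict(item))  # {'content': 'some joke text', 'digg_count': 42}
# Assigning to a field that was not declared raises a KeyError:
# item['unknown'] = 1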
Open spiders/nhsq.py and write the spider code:
# -*- coding: utf-8 -*-
import json
import time

import scrapy

from myspider.items import MyspiderItem


class NhsqSpider(scrapy.Spider):
    name = 'nhsq'
    allowed_domains = ['neihanshequ.com']
    start_urls = ['http://neihanshequ.com/']

    def start_requests(self):
        # Built-in hook: Scrapy calls this first to kick off the crawl
        url = ('http://neihanshequ.com/joke/?is_json=1'
               '&app_name=neihanshequ_web&max_time={}').format(int(time.time()))
        print('url:', url)
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        data = result.get('data').get('data')
        for entry in data:
            group = entry.get('group')
            items = MyspiderItem()
            items['content'] = group.get('content')
            items['create_time'] = group.get('create_time')
            items['digg_count'] = group.get('digg_count')
            items['favorite_count'] = group.get('favorite_count')
            items['comment_count'] = group.get('comment_count')
            items['author'] = group.get('user').get('name')
            yield items
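This fetches a single page. If the endpoint pages by max_time the same way the first request does (an assumption on my part; the post does not confirm the pagination contract), a fragment like this at the end of parse() would follow the feed backwards in time:

        # Hypothetical next-page request; verify against real responses first.
        if data:
            oldest = min(int(e.get('group').get('create_time')) for e in data)
            next_url = ('http://neihanshequ.com/joke/?is_json=1'
                        '&app_name=neihanshequ_web&max_time={}').format(oldest)
            yield scrapy.Request(next_url, callback=self.parse)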
Store the items with a pipeline. In pipelines.py:
import json


class MyspiderPipeline(object):
    def __init__(self):
        # One JSON object per line (the JSON Lines format)
        self.file = open('duanzi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
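Since each item is written as one JSON object per line, the output is JSON Lines rather than a single JSON array. A small sketch of reading it back:

import json

with open('duanzi.json', encoding='utf-8') as f:
    jokes = [json.loads(line) for line in f if line.strip()]
print(jokes[0]['content'])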
Don't forget to enable the pipeline in settings.py:
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}
Run it from the command prompt:
scrapy crawl nhsq
A duanzi.json file will appear in the project directory.
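As an aside, for a simple dump like this Scrapy's built-in feed exports can replace the custom pipeline entirely; the standard -o option writes items straight to a file, and a .jl extension selects the JSON Lines exporter:

scrapy crawl nhsq -o duanzi.jl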