Using oschina as an example:
Create the project
$ scrapy startproject oschina
$ cd oschina
Configuration
Edit settings.py and add the following (mainly the User-Agent and the pipelines):
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
LOG_LEVEL = 'ERROR'
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 10
ITEM_PIPELINES = {
    'oschina.pipelines.OschinaPipeline': 300,
}
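The integer attached to each ITEM_PIPELINES entry decides the order in which pipelines run: Scrapy passes items through lower values first, and values are conventionally chosen between 0 and 1000. The sketch below only illustrates that ordering in plain Python, without importing Scrapy. The two extra pipeline paths and the helper pipeline_order are hypothetical names introduced for illustration; this project registers a single pipeline.

```python
# Hypothetical settings fragment: only OschinaPipeline exists in this tutorial,
# the other two paths are made up to show how the order values interact.
ITEM_PIPELINES = {
    "oschina.pipelines.OschinaPipeline": 300,
    "oschina.pipelines.HypotheticalExportPipeline": 800,
    "oschina.pipelines.HypotheticalCleanupPipeline": 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by their order value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


# Print the paths in the order an item would pass through them.
for path in pipeline_order(ITEM_PIPELINES):
    print(path)
```

With a single pipeline the value 300 has no practical effect, but leaving gaps between values makes it easy to slot in another stage later.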
Edit items.py so that it contains the following:
# -*- coding: utf-8 -*-
import scrapy
class OschinaItem(scrapy.Item):
    Link = scrapy.Field()
    LinkText = scrapy.Field()
Edit pipelines.py so that it contains the following:
# -*- coding: utf-8 -*-
import json
from scrapy.exceptions import DropItem
class OschinaPipeline(object):
    def __init__(self):
        self.file = open('result.jl', 'w')
        self.seen = set(