xcrawler: a light-weight web crawler framework based on Python's requests library

A light-weight web crawler framework: xcrawler

(Build badge: https://www.travis-ci.org/0xE8551CCB/xcrawler.svg?branch=feature-refactor-architecture; coverage badge: https://coveralls.io/repos/github/0xE8551CCB/xcrawler/badge.svg)

Introduction

xcrawler is a light-weight web crawler framework. Some of its design concepts are borrowed from the well-known framework Scrapy. I'm very interested in web crawling, but I'm still a newbie at web scraping; I built this project to learn more about the basics of web crawling and the Python language.

(Architecture diagram: http://blog.chriscabin.com/wp-content/uploads/2017/09/xcrawler-arch.png)

Features

Simple: extremely easy to customize your own spider;

Fast: multiple requests are spawned concurrently with the ThreadPoolDownloader or ProcessPoolDownloader;

Flexible: different scheduling strategies are provided -- FIFO/FILO/Priority based;

Extensible: write your own extensions to make your crawler much more powerful.
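The three scheduling strategies differ only in which queued request is crawled next. A minimal sketch with Python's standard `queue` module (not xcrawler's actual scheduler, whose internals are not shown here) illustrates the ordering; the URLs are placeholders:

```python
import queue

# FIFO: requests are crawled in the order they were enqueued.
fifo = queue.Queue()
# FILO (a.k.a. LIFO): the most recently enqueued request is crawled first.
filo = queue.LifoQueue()
# Priority: the request with the smallest priority value is crawled first.
prio = queue.PriorityQueue()

for url in ['/page1', '/page2', '/page3']:
    fifo.put(url)
    filo.put(url)

prio.put((2, '/page2'))
prio.put((1, '/page1'))
prio.put((3, '/page3'))

fifo_order = [fifo.get() for _ in range(3)]
filo_order = [filo.get() for _ in range(3)]
prio_order = [prio.get()[1] for _ in range(3)]

print(fifo_order)  # ['/page1', '/page2', '/page3']
print(filo_order)  # ['/page3', '/page2', '/page1']
print(prio_order)  # ['/page1', '/page2', '/page3']
```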

Install

Create a virtual environment for your project, then activate it:

```shell
virtualenv crawlenv
source crawlenv/bin/activate
```

Download and install this package:

```shell
pip install git+https://github.com/0xE8551CCB/xcrawler.git
```

Quick start

Define your own spider:

```python
from xcrawler import BaseSpider


class DoubanMovieSpider(BaseSpider):
    name = 'douban_movie'
    custom_settings = {}
    start_urls = ['https://movie.douban.com']

    def parse(self, response):
        # extract items from response
        # yield new requests
        # yield new items
        pass
```
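A `parse` body typically walks the response and yields items and follow-up requests. The sketch below only shows the extraction half, using the standard-library `html.parser`; the `Response` object and xcrawler's request-yielding API are not reproduced here, so the HTML input and the `class="movie"` markup are hypothetical:

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of every <a class="movie"> link."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == 'a' and ('class', 'movie') in attrs:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.titles.append(data.strip())


# Hypothetical response body, standing in for response.text:
html = '<a class="movie">Movie A</a><a class="movie">Movie B</a>'
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # ['Movie A', 'Movie B']
```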

Define your own extension (note the `logging` setup, which the snippet needs for `logger` to be defined):

```python
import logging

logger = logging.getLogger(__name__)


class DefaultUserAgentExtension(object):
    config_key = 'DEFAULT_USER_AGENT'

    def __init__(self):
        self._user_agent = ''

    def on_crawler_started(self, crawler):
        if self.config_key in crawler.settings:
            self._user_agent = crawler.settings[self.config_key]

    def process_request(self, request, spider):
        if not request or 'User-Agent' in request.headers or not self._user_agent:
            return request

        logger.debug('[{}]{} adds default user agent: '
                     '{!r}'.format(spider, request, self._user_agent))
        request.headers['User-Agent'] = self._user_agent
        return request
```
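To see the extension's behavior in isolation, the sketch below exercises it against hand-rolled stand-ins; `FakeRequest` and `FakeCrawler` are hypothetical minimal substitutes for xcrawler's real objects, and the extension class is copied so the sketch runs standalone:

```python
import logging

logger = logging.getLogger(__name__)


class DefaultUserAgentExtension(object):
    # Copied from the extension above so this sketch is self-contained.
    config_key = 'DEFAULT_USER_AGENT'

    def __init__(self):
        self._user_agent = ''

    def on_crawler_started(self, crawler):
        if self.config_key in crawler.settings:
            self._user_agent = crawler.settings[self.config_key]

    def process_request(self, request, spider):
        if not request or 'User-Agent' in request.headers or not self._user_agent:
            return request
        logger.debug('[{}]{} adds default user agent: '
                     '{!r}'.format(spider, request, self._user_agent))
        request.headers['User-Agent'] = self._user_agent
        return request


class FakeRequest:
    """Hypothetical stand-in for xcrawler's Request object."""

    def __init__(self, headers=None):
        self.headers = headers or {}


class FakeCrawler:
    """Hypothetical stand-in for the crawler passed to on_crawler_started."""
    settings = {'DEFAULT_USER_AGENT': 'xcrawler-bot/1.0'}


ext = DefaultUserAgentExtension()
ext.on_crawler_started(FakeCrawler())

# A request with no User-Agent header gets the default one.
tagged = ext.process_request(FakeRequest(), spider='douban_movie')
print(tagged.headers['User-Agent'])  # xcrawler-bot/1.0

# A request that already carries a User-Agent is left untouched.
untouched = ext.process_request(FakeRequest({'User-Agent': 'custom'}),
                                spider='douban_movie')
print(untouched.headers['User-Agent'])  # custom
```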

Define a pipeline to store scraped items (the original snippet called a nonexistent `writeline` method on the file object; the fixed version uses `write` and appends the newline explicitly, and imports `json`):

```python
import json


class JsonLineStoragePipeline(object):
    def __init__(self):
        self._file = None

    def on_crawler_started(self, crawler):
        path = crawler.settings.get('STORAGE_PATH', '')
        if not path:
            raise FileNotFoundError('missing config key: `STORAGE_PATH`')
        self._file = open(path, 'a+')

    def on_crawler_stopped(self, crawler):
        if self._file:
            self._file.close()

    def process_item(self, item, request, spider):
        if item and isinstance(item, dict):
            self._file.write(json.dumps(item) + '\n')
```
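The `.jl` extension used in the settings below refers to the JSON Lines format: one JSON document per line. A quick standard-library sketch of writing and reading it, independent of xcrawler (the file path and items are made up for illustration):

```python
import json
import os
import tempfile

items = [{'title': 'Movie A', 'rating': 9.0},
         {'title': 'Movie B', 'rating': 8.5}]

path = os.path.join(tempfile.mkdtemp(), 'hello.jl')

# Write: one JSON document per line, as the pipeline above does.
with open(path, 'a+') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# Read back: parse each line independently.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded == items)  # True
```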

Configure the crawler:

```python
# Crawler is assumed to be importable from the xcrawler package,
# like BaseSpider above.
from xcrawler import Crawler

settings = {
    'download_timeout': 16,
    'download_delay': .5,
    'concurrent_requests': 10,
    'storage_path': '/tmp/hello.jl',
    'default_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                          'AppleWebKit/603.3.8 (KHTML, like Gecko) Version'
                          '/10.1.2 Safari/603.3.8',
    'global_extensions': {0: DefaultUserAgentExtension},
    'global_pipelines': {0: JsonLineStoragePipeline}
}

crawler = Crawler('DEBUG', **settings)
crawler.crawl(DoubanMovieSpider)
```

Bingo, you are ready to go now:

```python
crawler.start()
```

License

xcrawler is licensed under the MIT license. Please feel free to use it, and happy crawling!
