# xcrawler: a lightweight web crawler framework
## Introduction

xcrawler is a lightweight web crawler framework. Some of its design concepts are borrowed from the well-known framework Scrapy. I'm very interested in web crawling, but I'm still a newbie to web scraping; I built this project to learn the basics of web crawling and the Python language.
## Features

- **Simple**: extremely easy to customize your own spider;
- **Fast**: multiple requests are spawned concurrently with the `ThreadPoolDownloader` or `ProcessPoolDownloader`;
- **Flexible**: different scheduling strategies are provided: FIFO, FILO, or priority-based;
- **Extensible**: write your own extensions to make your crawler much more powerful.
## Install

Create a virtual environment for your project, then activate it:

```bash
virtualenv crawlenv
source crawlenv/bin/activate
```

Download and install this package:

```bash
pip install git+https://github.com/0xE8551CCB/xcrawler.git
```
## Quick start

Define your own spider:

```python
from xcrawler import BaseSpider


class DoubanMovieSpider(BaseSpider):
    name = 'douban_movie'
    custom_settings = {}
    start_urls = ['https://movie.douban.com']

    def parse(self, response):
        # extract items from the response,
        # then yield new requests and new items
        pass
```
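To make the `parse` stub concrete, here is a minimal sketch of what a filled-in spider might yield. Note that `response.url` and `response.text` are assumptions about the response API rather than documented attributes, so verify them against the actual Response class:

```python
from xcrawler import BaseSpider


class DoubanMovieSpider(BaseSpider):
    name = 'douban_movie'
    start_urls = ['https://movie.douban.com']

    def parse(self, response):
        # Assumption: the response exposes `.url` and `.text`;
        # check the real Response class before relying on this.
        yield {'url': response.url, 'page_size': len(response.text)}
```

Plain dicts work as items here, which matches the `isinstance(item, dict)` check in the storage pipeline below.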
Define your own extension:

```python
import logging

logger = logging.getLogger(__name__)


class DefaultUserAgentExtension(object):
    config_key = 'DEFAULT_USER_AGENT'

    def __init__(self):
        self._user_agent = ''

    def on_crawler_started(self, crawler):
        if self.config_key in crawler.settings:
            self._user_agent = crawler.settings[self.config_key]

    def process_request(self, request, spider):
        # Leave the request untouched if it already carries a User-Agent
        # header or if no default user agent is configured.
        if not request or 'User-Agent' in request.headers or not self._user_agent:
            return request
        logger.debug('[{}]{} adds default user agent: '
                     '{!r}'.format(spider, request, self._user_agent))
        request.headers['User-Agent'] = self._user_agent
        return request
```
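An extension only needs to implement the hooks it cares about. As an illustrative sketch (the class below is hypothetical, but it uses only the `process_request` and `on_crawler_stopped` hooks already demonstrated in this README), an extension that counts outgoing requests could look like this:

```python
import logging

logger = logging.getLogger(__name__)


class RequestCounterExtension(object):
    """Hypothetical sketch: count outgoing requests using only hooks
    shown elsewhere in this README."""

    def __init__(self):
        self._count = 0

    def process_request(self, request, spider):
        if request is not None:
            self._count += 1
        return request

    def on_crawler_stopped(self, crawler):
        logger.info('sent %d requests in total', self._count)
```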
Define a pipeline to store scraped items:

```python
import json


class JsonLineStoragePipeline(object):
    def __init__(self):
        self._file = None

    def on_crawler_started(self, crawler):
        path = crawler.settings.get('STORAGE_PATH', '')
        if not path:
            raise FileNotFoundError('missing config key: `STORAGE_PATH`')
        self._file = open(path, 'a+')

    def on_crawler_stopped(self, crawler):
        if self._file:
            self._file.close()

    def process_item(self, item, request, spider):
        # Append each item as one JSON document per line (.jl format).
        if item and isinstance(item, dict):
            self._file.write(json.dumps(item) + '\n')
```
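Since the pipeline writes one JSON document per line, the output file can be read back with nothing but the standard library. A quick sketch (the path matches the `storage_path` setting used below):

```python
import json

with open('/tmp/hello.jl') as f:
    items = [json.loads(line) for line in f if line.strip()]

print(len(items), 'items scraped')
```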
Configure the crawler:

```python
from xcrawler import Crawler  # assumed import path, matching BaseSpider above

settings = {
    # Note: keys are written in lowercase here but looked up in uppercase
    # (e.g. 'STORAGE_PATH') by the pipeline; the crawler presumably
    # normalizes setting names.
    'download_timeout': 16,
    'download_delay': .5,
    'concurrent_requests': 10,
    'storage_path': '/tmp/hello.jl',
    'default_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                          'AppleWebKit/603.3.8 (KHTML, like Gecko) Version'
                          '/10.1.2 Safari/603.3.8',
    'global_extensions': {0: DefaultUserAgentExtension},
    'global_pipelines': {0: JsonLineStoragePipeline}
}

crawler = Crawler('DEBUG', **settings)
crawler.crawl(DoubanMovieSpider)
```

Bingo, you are ready to go now:

```python
crawler.start()
```
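The spider class earlier also declared an empty `custom_settings` dict. Presumably, by analogy with Scrapy's convention (this is an assumption, not verified against the source), it lets a single spider override the global settings; a hypothetical sketch:

```python
from xcrawler import BaseSpider


class SlowSpider(BaseSpider):
    name = 'slow_spider'
    start_urls = ['https://example.com']
    # Hypothetical: override the global `download_delay` for this spider
    # only; verify this behavior against the xcrawler source.
    custom_settings = {'download_delay': 2.0}

    def parse(self, response):
        pass
```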
## License

xcrawler is licensed under the MIT license. Please feel free to use it, and happy crawling!