
ABOUT

This is the base implementation of a full crawler that routes its download requests through a spacetime cache server.

CONFIGURATION

Step 1: Install dependencies

Make sure you have Python 3.6+ installed.

Check if pip is installed by opening a terminal/command prompt and typing the command python3 -m pip. This should print pip's help menu, listing all available pip commands. If it does not, install pip by following the instructions at https://pip.pypa.io/en/stable/installing/
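A quicker check is to ask pip for its version (assuming python3 is on your PATH):

```shell
# Prints something like "pip 21.x from ..."; an error here means
# pip is not installed for this Python interpreter.
python3 -m pip --version
```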

To install the dependencies for this project, run the following two commands after ensuring pip is installed for the version of Python you are using. Admin privileges might be required to execute the commands. Also make sure that the terminal is at the root folder of this project.

python -m pip install packages/spacetime-2.1.1-py3-none-any.whl

python -m pip install -r packages/requirements.txt

Use the following command to install BeautifulSoup4.

macOS: sudo python3 -m pip install bs4
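BeautifulSoup4 is used to parse downloaded pages and extract links for the crawler to follow. As an illustration of that kind of parsing, here is a standard-library-only sketch (the actual project would use bs4; LinkExtractor is a hypothetical helper, not part of the project code):

```python
# Sketch of link extraction from an HTML page using only the stdlib.
# The real scraper would do the equivalent with BeautifulSoup4.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="https://example.com">link</a></body></html>')
print(parser.links)  # ['https://example.com']
```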

Step 2: Configuring config.ini

Set the options in the config.ini file. The following configurations exist.

USERAGENT: Set the user agent to IR F19 uci-id1,uci-id2,uci-id3. It is important to set the user agent appropriately to receive credit for hitting our cache.

HOST: This is the host name of our caching server. Please set it as per spec.

PORT: This is the port number of our caching server. Please set it as per spec.

SEEDURL: The starting URL from which the crawler begins downloading.

POLITENESS: The time delay each thread must wait after each download.

SAVE: Th
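Taken together, a config.ini along these lines would cover the options above. All values below are placeholders for illustration, and the section names are an assumption; use the host, port, and user agent required by the spec:

```ini
[IDENTIFICATION]
# Replace the ids with your own UCI ids, as described above.
USERAGENT = IR F19 uci-id1,uci-id2,uci-id3

[CONNECTION]
# Placeholder values -- set HOST and PORT as per the spec.
HOST = cache.example.edu
PORT = 9000

[CRAWLER]
# Placeholder seed URL and a half-second politeness delay per thread.
SEEDURL = https://www.example.edu
POLITENESS = 0.5
```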
