ABOUT
This is the base implementation of a full crawler that uses a spacetime cache server to receive requests.
CONFIGURATION
Step 1: Install dependencies
If you do not have Python 3.6+:
Check whether pip is installed by opening a terminal/command prompt and running the command python3 -m pip. This should show the help menu listing the available pip commands. If it does not, install pip by following the instructions at https://pip.pypa.io/en/stable/installing/
To install the dependencies for this project run the following two commands after ensuring pip is installed for the version of python you are using. Admin privileges might be required to execute the commands. Also make sure that the terminal is at the root folder of this project.
python -m pip install packages/spacetime-2.1.1-py3-none-any.whl
python -m pip install -r packages/requirements.txt
Use the following command to install BeautifulSoup4 with pip:
MAC: sudo python3 -m pip install bs4
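After running the install commands, it can help to confirm that the packages are visible to the Python interpreter you intend to use. The following is a minimal sketch, not part of the project itself. It assumes the spacetime wheel provides a module named spacetime and that BeautifulSoup4 is imported as bs4; the helper name missing_modules is hypothetical.

```python
import importlib.util

# Assumed import names: the wheel is taken to provide "spacetime" and
# BeautifulSoup4 is imported as "bs4". Adjust the list if your setup differs.
REQUIRED_MODULES = ["spacetime", "bs4"]


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run the script with the same interpreter you used for the pip commands (for example python3), since packages installed for one Python version are not visible to another.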
Step 2: Configuring config.ini
Set the options in the config.ini file. The following configurations exist.
USERAGENT: Set the useragent to IR F19 uci-id1,uci-id2,uci-id3. It is important to set the useragent appropriately to get the credit for hitting our cache.
HOST: This is the host name of our caching server. Please set it as per spec.
PORT: This is the port number of our caching server. Please set it as per spec.
SEEDURL: The URL from which the crawler starts downloading.
POLITENESS: The time delay each thread must wait after each download.
SAVE: Th