Scrapy Day 1

CharlesConan

Tags: Scrapy

scrapy

scrapy startproject quotes_spider

scrapy genspider quotes quotes.toscrape.com

scrapy shell

fetch('http://quotes.toscrape.com/')

response.xpath('//h1')

scrapy shell 'http://quotes.toscrape.com/'

scrapy list

scrapy crawl quotes # the spider is defined in quotes.py

scrapy crawl quotes -o items.csv

scrapy crawl quotes -o items.json

scrapy crawl quotes -o items.xml

It is very important to be careful while scraping websites; otherwise, you might be banned.

1. In settings.py, enable the DOWNLOAD_DELAY option, or add the delay manually in your code by sleeping for a random number of seconds.

from time import sleep
import random

sleep(random.randrange(1,3))

2. In settings.py, enable the USER_AGENT option with a value like the following, or with any Chrome or Firefox user agent string. Defining a user agent lets you look more like a browser used by a real person rather than an automated robot.

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"

3. Find external proxies and rotate IP addresses while scraping. You can use the scrapy-proxies package for this purpose.

https://github.com/aivarsk/scrapy-proxies

4. For professional work, consider using ScrapingHub.com to host your scrapers - it offers a free limited plan.
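
Tips 1 and 2 can be collected in one place. Here is a hedged sketch of a settings.py fragment: the setting names are standard Scrapy settings, but the numeric value is only an illustration, and AUTOTHROTTLE_ENABLED is an optional extra that the tips above do not mention. Setting the delay here is usually preferable to calling sleep() inside a spider, because sleep() pauses the whole crawler process rather than one request.

```python
# settings.py (fragment); the delay value is illustrative, not a recommendation

# Base delay, in seconds, between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# When enabled (Scrapy's default), each wait is a random value between
# 0.5x and 1.5x DOWNLOAD_DELAY, so requests are not evenly spaced.
RANDOMIZE_DOWNLOAD_DELAY = True

# Optional assumption: let the AutoThrottle extension adapt the delay
# to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True

# Browser-like user agent string, copied from tip 2.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
```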

Before web scraping, it is highly recommended to check whether the website you want data from offers an API. Many large websites offer APIs to make data extraction a better experience for both parties. So first search Google for an API for the website; if you find one, you do not need to scrape the site. APIs typically return JSON objects, which map closely to Python dictionaries and from which data can be extracted using Python's json library.
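
To illustrate how closely JSON maps to Python dictionaries, here is a small sketch using the standard json module. The payload is made up for the example; a real API would return similar text in the body of an HTTP response, with its own field names.

```python
import json

# Hypothetical API response body (field names are assumptions for illustration)
payload = '{"quotes": [{"text": "Example quote", "author": "Jane Doe", "tags": ["life", "humor"]}]}'

# json.loads turns the JSON text into nested dicts and lists
data = json.loads(payload)

# From here on it is ordinary dictionary and list access
first = data["quotes"][0]
authors = [quote["author"] for quote in data["quotes"]]

print(first["text"], authors)
```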

Before web scraping, it is highly recommended that you read the Terms and Conditions of the website. Some websites explicitly prohibit web scraping without permission, or mention legal or copyright aspects related to the use of their data.

Before web scraping, employ common sense! Some web scraping or other robot activities are obviously illegal if they cause any direct or indirect damage to the company that owns the data or to its customers. It is a good idea to discuss the purpose of a web scraping project with your client before accepting it.

Before web scraping, prepare your code to be "polite": do not disable robots.txt compliance unnecessarily; space out your requests so that you do not hammer the site's server; and preferably run your spiders during the website's off-peak traffic hours.
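
In a Scrapy project, robots.txt compliance is controlled by the ROBOTSTXT_OBEY setting in settings.py. To see what such rules mean in practice, here is a small sketch that uses Python's standard urllib.robotparser outside of Scrapy. The robots.txt content, the "/private/" path, and the "my-spider" agent name are invented for the example and do not describe any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check individual URLs against the rules before requesting them
allowed = parser.can_fetch("my-spider", "http://quotes.toscrape.com/page/1/")
blocked = parser.can_fetch("my-spider", "http://quotes.toscrape.com/private/data")

# Delay, in seconds, that the site asks crawlers to leave between requests
delay = parser.crawl_delay("my-spider")

print(allowed, blocked, delay)
```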
