Scrapy Day 1

scrapy

scrapy startproject quotes_spider

scrapy genspider quotes quotes.toscrape.com

scrapy shell 

    fetch('http://quotes.toscrape.com/')

    response.xpath('//h1')

scrapy shell 'http://quotes.toscrape.com/'

scrapy list

scrapy crawl quotes # the spider is defined in quotes.py

scrapy crawl quotes -o items.csv

scrapy crawl quotes -o items.json

scrapy crawl quotes -o items.xml


It is very important to be careful while scraping websites; otherwise, you might be banned. 

1. In the file settings.py, activate the option DOWNLOAD_DELAY, or do it manually in your code by sleeping for a random number of seconds.

from time import sleep
import random

sleep(random.uniform(1, 3))  # pause for a random duration between 1 and 3 seconds

2. In the file settings.py, activate the option USER_AGENT and set it to something like the following, or to any current Chrome or Firefox user agent string. Defining a user agent lets you look more like a browser used by a real person rather than an automated robot.

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"

3. Find external proxies and rotate IP addresses while scraping. You can use the package scrapy-proxies for this purpose.

https://github.com/aivarsk/scrapy-proxies

4. For professional work, consider using ScrapingHub.com to host your scrapers - it offers a free limited plan.
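Points 1 and 2 above can be sketched as a settings.py fragment. The delay value is an illustrative assumption, not a recommendation; DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY and USER_AGENT are standard Scrapy settings, and the user agent string is the one quoted in point 2.

```python
# settings.py (fragment): the numeric value is an illustrative assumption

# Wait roughly 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# When enabled (Scrapy's default), the actual wait is DOWNLOAD_DELAY
# multiplied by a random factor between 0.5 and 1.5, so requests are
# not evenly spaced.
RANDOMIZE_DOWNLOAD_DELAY = True

# Identify as a desktop browser instead of Scrapy's default user agent.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
```

Setting the delay in settings.py is usually preferable to calling sleep() inside a spider, because Scrapy applies it to every request without blocking the rest of the crawler.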


Before web scraping, it is highly recommended to search for an API for the website you want to get data from. Most large websites offer APIs to make data extraction a better experience for both parties. So first search Google for an API for the website; if you find one, you do not need to scrape it. APIs typically return JSON objects, which are very similar to Python dictionaries, and from which data can be extracted using Python's json library.
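As a minimal sketch of how JSON maps onto Python dictionaries, the snippet below parses a response body with the standard json module. The payload and its field names (page, quotes, author, tags) are hypothetical; a real API will have its own structure.

```python
import json

# Hypothetical API response body; real field names will differ per API.
raw = (
    '{"page": 1, "quotes": ['
    '{"author": "Author A", "tags": ["life", "books"]}, '
    '{"author": "Author B", "tags": []}'
    ']}'
)

# json.loads turns the JSON text into nested dicts and lists.
data = json.loads(raw)

# From here on it is ordinary dictionary and list access.
authors = [quote["author"] for quote in data["quotes"]]
print(authors)
```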

Before web scraping, it is highly recommended that you read the Terms and Conditions of the website. Some websites explicitly prohibit web scraping without permission, or mention legal or copyright aspects related to the use of their data.

Before web scraping, employ common sense! Some web scraping or other robot activities are obviously illegal if they cause direct or indirect damage to the company that owns the data or to its customers. It is a good idea to discuss the purpose of a web scraping project with your client before accepting it.

Before web scraping, prepare your code to be "polite": do not unnecessarily disable robots.txt compliance for the website; space out your requests a bit so that you do not hammer the site's server; and it is better to run your spiders during the website's off-peak traffic hours.
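Inside a Scrapy project, robots.txt compliance is controlled by the ROBOTSTXT_OBEY setting. Outside Scrapy, a rough sketch of the same check can be written with the standard library's urllib.robotparser. The robots.txt content and the example.com URLs below are hypothetical; normally you would point the parser at the real file with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# The delay (in seconds) the site asks crawlers to respect, or None if unset.
delay = parser.crawl_delay("*")

print(is_allowed("http://example.com/page/2/"))
print(is_allowed("http://example.com/private/data"))
print(delay)
```

A spider could call is_allowed() before each request and sleep for the reported delay between requests.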
