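Viewing a website's robots protocol
Append /robots.txt to the site's root URL.
Example: https://www.baidu.com/robots.txt
User-agent: names the crawler(s) that the following rules apply to
Disallow: lists the paths that crawler is not allowed to access
Here "*" stands for all; for example, "User-agent: *" applies to every crawler
Not every website provides a robots protocol (robots.txt) file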
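As a minimal sketch of how the User-agent and Disallow fields are interpreted, the rules can be evaluated with Python's standard-library urllib.robotparser. The robots.txt content, the crawler names BadBot and MyCrawler, and the example.com URLs below are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# BadBot is barred from the whole site by "Disallow: /".
print(parser.can_fetch("BadBot", "https://example.com/index.html"))  # False

# Any other crawler falls under "User-agent: *".
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/a.html"))  # False
```

To check a live site instead, the same class offers set_url() followed by read(), which downloads the file from the given /robots.txt address before can_fetch() is called.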
[Practice Examples]
[Example 1] Crawling a JD.com product page
>>> import requests
>>> r = requests.get('https://item.jd.com/2967929.html')
>>> r.encoding = r.apparent_encoding  # decode using the encoding detected from the response body
>>> r.status_code
200
>>> r.text
'<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n <!-- shouji -->\n <meta http-equiv="Content-Type" content="text/html; charset=gbk" />\n <title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title>\n <meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/>\n <meta name="description" content="【华为荣耀8】京东JD.COM提供华为荣耀8正品行货,并包括HUAWEI荣耀8网购指南,以及华为荣耀8图片、荣耀8参数、荣耀8评论、荣耀8心得、荣耀8技巧等信息,网购华为荣耀8上京东,放心又轻松" window.showtouchurl = true;\n return;\n }\n\n if (/MOBILE/.test(userAgent) && /(MICROMESSENGER|QQ\\/)/.test(userAgent)) {
\n var paramIndex = location.href.indexOf("?");\n href="//item.jd.com/100004885513.html" target="_blank" clstag="pageclick|keycount|shop_link_124259979_51|1000000904"> \n\t\t\t\t\t\t\t\t\t\t\t\t<div class="user-chi-img">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img class="" src="//img10.360buyimg.com/cms/jfs/t1/71825/14/11654/10052/5d902323Ed//honor.jd.com" target="_blank" class="btn-def enter-shop J-enter-shop" clstag="shangpin|keycount|product|jindian2">\n <i class="sprite-enter"></i>\n <span>进店逛逛</span>\n </a>\n <a href="#none" class="btn-def follow-shop J-follow-shop" data-vid="1000000904" clstag="shangpin|keycount|product|guanzhu2">\n <i class="sprite-follow"> </i>\n <span>关注店铺</span>\n </a>\n </div>\n </div>\n </div>\n </div>\n </div>\n <div class="m m-aside hide" id="view-buy" clstag="shangpin|keycount|product|darenxuangou_1"></div>\n\n <div class="m m-aside" id="view-view" clstag="shangpin|keycount|product|seemore_1"></div>\n <div class="m m-aside" id="rank">\n <div class="mt">\n <h3>手机热销榜</h3>\n &l