Python: Web Scraping and Practice Examples (3)

Latest recommended article published on 2022-12-23 15:16:18

luli_ya

Tags: python https url

Article link: https://blog.csdn.net/luli_ya/article/details/104100020

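Viewing a site's robots protocol

Append /robots.txt to the site's root URL.
Example: https://www.baidu.com/robots.txt
User-agent: names the crawler(s) that the following rules apply to.
Disallow: lists the paths that crawler must not enter.
Here "*" stands for all (for example, "User-agent: *" applies to every crawler).
Not every site publishes a robots.txt file.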
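The rules above can also be checked programmatically. The following is a minimal sketch using the standard library's urllib.robotparser. The SAMPLE_ROBOTS text, the crawler name "MyCrawler", and the example.com URLs are made up for illustration and are not taken from any real site; for a live site you would download its /robots.txt first and pass that text in instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: Baiduspider
Disallow: /baidu

User-agent: *
Disallow: /
"""


def build_parser(robots_text):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# Baiduspider has its own rule block: only /baidu is off limits.
print(parser.can_fetch("Baiduspider", "https://www.example.com/index.html"))
print(parser.can_fetch("Baiduspider", "https://www.example.com/baidu"))

# Any other crawler falls under "User-agent: *", which disallows everything.
print(parser.can_fetch("MyCrawler", "https://www.example.com/index.html"))
```

Parsing from a string keeps the sketch runnable offline; RobotFileParser also offers set_url() and read() to fetch the file directly.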

【Practice Examples】

【Example 1】 Scraping a JD.com product page

>>> import requests
>>> r=requests.get('https://item.jd.com/2967929.html')
>>> r.encoding=r.apparent_encoding
>>> r.status_code
200
>>> r.text
'<!DOCTYPE HTML>\n<html lang="zh-CN">\n<head>\n    <!-- shouji -->\n    <meta http-equiv="Content-Type" content="text/html; charset=gbk" />\n    <title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title>\n    <meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/>\n ... [output truncated]
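The interactive session above assumes the request succeeds. As a hedged sketch, the same steps can be wrapped in a function that also handles failures. The name get_html_text and the 10-second timeout are choices made for this illustration, not part of the original post.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes.
        r.raise_for_status()
        # Use the encoding guessed from the body, as in the session above.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"Request failed: {exc}")
        return None
```

Calling get_html_text('https://item.jd.com/2967929.html') and printing only the first 1000 characters of the result avoids flooding the console with the whole page. Whether the site still returns the product page to a script is not guaranteed.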
