Python Web Crawlers: An Introduction

Checking robots.txt

Most websites define a robots.txt file that tells crawlers which restrictions apply when crawling the site.
For example: https://www.baidu.com/robots.txt
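
One way to honor these restrictions from Python is the standard library's urllib.robotparser module. The sketch below is only an illustration: the robots.txt content, the user-agent names, and the example.com URLs are made-up placeholders (not taken from any real site), chosen so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rp = build_parser(SAMPLE_ROBOTS_TXT)
base = "http://example.com"  # placeholder site, not from the article

# Ask whether a given user agent may fetch a given URL.
for agent, path in [("BadCrawler", "/"), ("GoodCrawler", "/"), ("GoodCrawler", "/trap")]:
    print(agent, path, rp.can_fetch(agent, base + path))

# Crawl-delay (in seconds) that applies to this user agent, if any.
print("crawl delay:", rp.crawl_delay("GoodCrawler"))
```

Against a live site, the same parser can be filled by calling rp.set_url() with the site's robots.txt URL followed by rp.read(), instead of parsing a string.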

Checking the Sitemap

The Sitemap file that a website provides helps crawlers locate the site's newest content without having to crawl every page.
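
As a rough sketch of how a crawler might read such a file, the code below extracts the page URLs from a sitemap in the standard sitemaps.org XML format. The sample document, its example.com URLs, and the function name extract_sitemap_urls are assumptions made for illustration. A real sitemap would first have to be downloaded, and may be an index that points to further sitemap files.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap content, used so the example runs offline.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/view/1</loc></url>
  <url><loc>http://example.com/view/2</loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    # Parse from bytes so the XML encoding declaration is handled by the parser.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


for url in extract_sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```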

Identifying the Technologies a Website Uses

Install the Python builtwith module, then call builtwith.parse() with the site's URL:

pip install builtwith
C:\Users\chenj>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import builtwith
>>> builtwith.parse('http://www.baidu.com')
{'javascript-frameworks': ['jQuery']}
>>> builtwith.parse('http://example.webscraping.com')
{'web-servers': ['Nginx'], 'web-frameworks': ['Web2py', 'Twitter Bootstrap'], 'programming-languages': ['Python'], 'javascript-frameworks': ['jQuery', 'Modernizr', 'jQuery UI']}
>>>

Finding the Website Owner

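Install the Python whois module (the package name on PyPI is python-whois), then look up the domain with whois.whois():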

pip install python-whois
C:\Users\chenj>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import whois
>>> print(whois.whois('csdn.net'))
{
  "domain_name": "CSDN.NET",
  "registrar": "NETWORK SOLUTIONS, LLC.",
  "whois_server": "whois.networksolutions.com",
  "referral_url": null,
  "updated_date": [
    "2017-03-10 00:52:46",
    "2018-02-09 01:43:52"
  ],
  "creation_date": "1999-03-11 05:00:00",
  "expiration_date": "2020-03-11 04:00:00",
  "name_servers": [
    "NS3.DNSV3.COM",
    "NS4.DNSV3.COM"
  ],
  "status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
  "emails": [
    "abuse@web.com",
    "Jiangtao@CSDN.NET"
  ],
  "dnssec": "unsigned",
  "name": "Beijing Chuangxin Lezhi Co.ltd",
  "org": "Beijing Chuangxin Lezhi Co.ltd",
  "address": "B3-2-1 ZHaowei Industry Park",
  "city": "Beijng",
  "state": "Beijing",
  "zipcode": "100016",
  "country": "CN"
}
>>>
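
In the output above, some fields hold a single value (creation_date), some hold a list (updated_date), and some are null (referral_url). Code that consumes a WHOIS record may want to normalize these cases. The helper below is a hypothetical sketch, not part of python-whois. It works on a plain dictionary copied from the output above, so it runs without a network lookup.

```python
def first_value(value):
    """Return the first item if value is a list or tuple, otherwise value itself."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


# A few fields copied from the WHOIS output shown above.
record = {
    "updated_date": ["2017-03-10 00:52:46", "2018-02-09 01:43:52"],
    "creation_date": "1999-03-11 05:00:00",
    "referral_url": None,
}

for field in ("updated_date", "creation_date", "referral_url"):
    print(field, "=", first_value(record[field]))
```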