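Investigating a website's background with Python libraries
Learn background information about a website, such as:
- Sitemap
- Site size
- Technology the site is built with
- Site owner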
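Sitemap
A Sitemap file published by a website helps crawlers locate the site's newest content without having to crawl every page. For more information, see the definition of the sitemap standard at http://www.sitemaps.org/protocol.html.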
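As a rough illustration of how a crawler might read such a file, the sketch below extracts the <loc> entries from a sitemap using only the standard library. The sample XML, the example.com URLs, and the function name sitemap_urls are assumptions made for illustration; a real sitemap would first be downloaded from the site, and a sitemap index file (which lists other sitemaps) is not handled here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol at sitemaps.org.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used in place of a real download.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2017-04-04</lastmod>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the page URLs listed in a <urlset> sitemap document."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

The number of URLs returned also gives a first, rough lower bound on how many pages the site has, provided the sitemap is complete and up to date.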
Estimating the size of a website
Identifying the technology a website uses
PC:~/Project/python$ sudo pip install builtwith
PC:~/Project/python$ python2.7
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import builtwith
>>> builtwith.parse('http://www.baidu.com')
{u'javascript-frameworks': [u'jQuery']}
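The session above shows that builtwith.parse returns a dictionary mapping a technology category to a list of names. A minimal sketch of turning that dictionary into readable lines follows; the helper names detect_technologies and summarize_technologies are made up for this example, and the sample dictionary simply repeats the output shown above. The library import is kept inside the function so that the formatting helper can be tried without the package or a network connection.

```python
def detect_technologies(url):
    """Query a live site; needs network access and `pip install builtwith`."""
    import builtwith  # imported lazily so the helper below works without it
    return builtwith.parse(url)


def summarize_technologies(result):
    """Format a {category: [names]} dictionary as sorted 'category: a, b' lines."""
    lines = []
    for category in sorted(result):
        names = ", ".join(sorted(result[category]))
        lines.append("{}: {}".format(category, names))
    return lines


if __name__ == "__main__":
    # Same shape as the dictionary printed in the session above.
    sample = {"javascript-frameworks": ["jQuery"]}
    for line in summarize_technologies(sample):
        print(line)
```

To run it against a real site, pass the result of detect_technologies('http://www.baidu.com') to summarize_technologies; the categories reported may differ from the session above, since sites change over time.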
Finding the owner of a website
PC:~/Project/python$ sudo pip install python-whois
Collecting python-whois
Downloading python-whois-0.6.5.tar.gz
Collecting future (from python-whois)
Downloading future-0.16.0.tar.gz (824kB)
100% |████████████████████████████████| 829kB 94kB/s
Installing collected packages: future, python-whois
Running setup.py install for future ... done
Running setup.py install for python-whois ... done
Successfully installed future-0.16.0 python-whois-0.6.5
PC:~/Project/python$ python2.7
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import whois
>>> print whois.whois('zhaozhoutea.com')
{
"updated_date": [
"2017-04-04 00:00:00",
"2017-04-04 10:29:46"
],
"status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
"name": "Talos Gabor",
"dnssec": "unsigned",
"city": "Budapest",
"expiration_date": [
"2018-04-04 00:00:00",
"2018-04-04 04:00:00"
],
"zipcode": "1022",
"domain_name": [
"ZHAOZHOUTEA.COM",
"zhaozhoutea.com"
],
"country": "HU",
"whois_server": "whois.onlinenic.com",
"state": "Pest megye",
"registrar": "Onlinenic Inc",
"referral_url": "http://www.onlinenic.com",
"address": "Herman Otto u. 25/A.",
"name_servers": [
"NS1.E-TIGER.NET",
"NS2.NS0.HU",
"ns1.e-tiger.net",
"ns2.ns0.hu"
],
"org": "Talos Gabor",
"creation_date": [
"2014-04-04 00:00:00",
"2014-04-04 04:00:00"
],
"emails": [
"onlinenic-enduser@onlinenic.com",
"mediacenter@mediacenter.hu"
]
}
>>>
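In the output above some fields hold a single value (status, registrar) while others hold a list (domain_name, creation_date), so code that consumes a WHOIS record has to cope with both shapes. The sketch below is one possible way to do that; first_value, summarize_whois and lookup_owner are hypothetical helper names, the chosen keys are just a subset of those printed above, and the record is assumed to behave like a dictionary. Note that the session above uses the Python 2 print statement, whereas this sketch uses print() so that it also runs on Python 3.

```python
def first_value(field):
    """Return the first item of a list-valued field, or the field itself."""
    if isinstance(field, (list, tuple)):
        return field[0] if field else None
    return field


def summarize_whois(record):
    """Pick a few ownership-related fields from a dict-like WHOIS record."""
    keys = ("domain_name", "registrar", "org", "country",
            "creation_date", "expiration_date")
    return {key: first_value(record.get(key)) for key in keys}


def lookup_owner(domain):
    """Query a live WHOIS server; needs network and `pip install python-whois`."""
    import whois  # imported lazily so the helpers above work without it
    return summarize_whois(whois.whois(domain))


if __name__ == "__main__":
    # A subset of the record printed in the session above.
    sample = {
        "domain_name": ["ZHAOZHOUTEA.COM", "zhaozhoutea.com"],
        "registrar": "Onlinenic Inc",
        "org": "Talos Gabor",
        "country": "HU",
        "creation_date": ["2014-04-04 00:00:00", "2014-04-04 04:00:00"],
        "expiration_date": ["2018-04-04 00:00:00", "2018-04-04 04:00:00"],
    }
    for key, value in summarize_whois(sample).items():
        print("{}: {}".format(key, value))
```

Taking the first item of a list is a simplification: when a registry and a registrar report different values, which one is correct depends on the domain, so the full list may be worth keeping.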