python
猎狐肥
python builtwith
`import builtwith` `res = builtwith.parse('https://zhidao.baidu.com/question/2073804096754701028.html')` `print res` Output: `{u'javascript-frameworks': [u'RequireJS', u'jQuery', u'RightJS'], u'web-serv`…
Original · 2017-01-14 15:28:24 · 511 views · 0 comments
downLoad
Downloading: guarding against page errors, 5xx error codes, and proxies. `def download_page(url, num_retries = 2, proxy=None, referer=None): page_buf = '' print 'downloading:', url try: # set http proxy if proxy:`…
Reposted · 2017-01-14 22:10:45 · 481 views · 0 comments
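The teaser for this post is cut off, so the original body of `download_page` is not recoverable here. As a rough illustration of what the summary names (retrying on 5xx errors, with an optional proxy), here is a minimal Python 3 sketch. It is not the post's code: the `_fetch` helper, the `fetch` parameter, and the `timeout` default are assumptions added so the retry logic can be tried without network access.

```python
import urllib.error
import urllib.request


def _fetch(url, proxy=None, referer=None, timeout=10):
    """Fetch url once (hypothetical helper); raises HTTPError on 4xx/5xx."""
    handlers = []
    if proxy:
        # Route both http and https traffic through the given proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(url)
    if referer:
        request.add_header("Referer", referer)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def download_page(url, num_retries=2, proxy=None, referer=None, fetch=_fetch):
    """Return the page body, or '' on failure; retry only on 5xx errors."""
    try:
        return fetch(url, proxy=proxy, referer=referer)
    except urllib.error.HTTPError as err:
        # 5xx means the server failed, so the same request may succeed later.
        if num_retries > 0 and 500 <= err.code < 600:
            return download_page(url, num_retries - 1, proxy, referer, fetch)
        return ""
    except urllib.error.URLError:
        # DNS failures, refused connections and similar: give up.
        return ""
```

4xx responses are not retried because the request itself is at fault and repeating it would not help. `HTTPError` is caught before `URLError` because it is a subclass of it.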
Sitemap crawler
`def crawl_sitemap(url): html = '' #download the sitemap file sitemap = download_page(url, 2) # extract the sitemap links links = re.findall('(.*?)',sitemap) #load each link`…
Reposted · 2017-01-14 22:16:30 · 785 views · 0 comments
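The regular expression in the teaser appears as `'(.*?)'`, which on its own would match empty strings rather than links; the surrounding markup was probably lost when the page was rendered. A minimal sketch of the same idea follows, assuming the sitemap uses the standard `<loc>` elements. The `<loc>` pattern, the `extract_sitemap_links` helper, and the `download` parameter are assumptions, not the post's code; `download` stands for any callable that takes a URL and returns its text, such as the listing's `download_page`.

```python
import re

# Assumption: links sit inside <loc>...</loc>, as in the sitemap protocol.
LOC_PATTERN = re.compile(r"<loc>(.*?)</loc>", re.DOTALL)


def extract_sitemap_links(sitemap_xml):
    """Return the URLs found in <loc> elements, in document order."""
    return [link.strip() for link in LOC_PATTERN.findall(sitemap_xml)]


def crawl_sitemap(url, download):
    """Download the sitemap at url, then every page it lists.

    download is a hypothetical callable: download(url) -> str.
    Returns a dict mapping each listed URL to its downloaded text.
    """
    sitemap = download(url)
    pages = {}
    for link in extract_sitemap_links(sitemap):
        pages[link] = download(link)
    return pages
```

Passing the downloader in as an argument keeps the crawler independent of how pages are fetched, so it can be tried against canned strings before pointing it at a live site.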