1.4.2 Python Sitemap Crawler (Updated Daily)

Latest recommended article published on 2020-12-04 07:35:21

weixin_30621919


Tags: crawler, python

Original link: http://www.cnblogs.com/xww115/p/10828446.html


# -*- coding: utf-8 -*-
'''
Created on May 6, 2019

@author: 薛卫卫
'''

import re
import urllib.error
import urllib.request

def download(url, user_agent="wswp", num_retries=2):
    """Download a URL and return the raw bytes, or None on failure."""
    print("Downloading:", url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        # retry only on 5xx server errors; a URLError without a code is not retried
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # download() returns None on failure, so there is nothing to parse
        return
    # leave the regular expression unchanged; instead decode the bytes
    # returned by urlopen().read() so the str pattern below can match
    sitemap = sitemap.decode('utf-8')
    # extract the sitemap links
    links = re.findall(r'<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...

crawl_sitemap("http://example.webscraping.com/sitemap.xml")
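The listing above pulls the links out with a regular expression. As an alternative, here is a minimal sketch that reads the `<loc>` entries with the standard-library XML parser, which works on the raw bytes directly and tolerates whitespace around the URL. The helper name `extract_links`, the `SITEMAP_NS` constant, and the sample sitemap are assumptions made for illustration; they are not part of the original post, and the sample URLs are made up rather than taken from the real site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol; assumed here because
# most real sitemaps declare it on the root <urlset> element.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_links(sitemap_bytes):
    """Return the text of every <loc> element, with or without the namespace."""
    root = ET.fromstring(sitemap_bytes)
    links = []
    for element in root.iter():
        # element.tag looks like "{namespace}loc" when a namespace is declared
        if element.tag in ("loc", SITEMAP_NS + "loc") and element.text:
            links.append(element.text.strip())
    return links

# Hypothetical sample data for an offline check (not fetched from the site)
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/1</loc></url>
  <url><loc> http://example.webscraping.com/view/2 </loc></url>
</urlset>"""

print(extract_links(sample))
```

Inside `crawl_sitemap`, such a helper could take the place of the `decode` and `re.findall` steps, since it accepts the bytes returned by `download` as they are.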

Reposted from: https://www.cnblogs.com/xww115/p/10828446.html
