# coding=utf-8
import time
import urllib.request
from bs4 import BeautifulSoup
t = time.time()
def scanpage(url, suburl):
    # Fetch `url`, collect every link whose href contains `suburl`,
    # then request each collected link and report its HTTP status code and response time.
    websiteurl = url
    t = time.time()
    n = 0
    html = urllib.request.urlopen(websiteurl).read()
    soup = BeautifulSoup(html, "lxml")
    Upageurls = {}
    # Gather all <a href="..."> tags and keep the matching hrefs, deduplicated.
    pageurls = soup.find_all("a", href=True)
    for links in pageurls:
        if suburl in links.get("href") and links.get("href") not in Upageurls:
            Upageurls[links.get("href")] = 0
    # Check each collected link once and time the request.
    for links in Upageurls.keys():
        print(n, links, end=' ')
        try:
            t2 = time.time()
            code = urllib.request.urlopen(links).getcode()
        except Exception:
            print("connect failed")
        else:
            t1 = time.time()
            print(code, ' elapsed time:', round(t1 - t2, 2))
        n += 1
    print("total is " + repr(n) + " links, total execution time: ", round(time.time() - t, 2), 's')


scanpage("http://news.baidu.com", "baidu.com")
Result: