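Task: write a Python program that uses regular expressions to extract all hyperlinks from an HTML page, strip the HTML tag elements, and write the result to a text file.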

清晨的光明

Article link: https://blog.csdn.net/kdongyi/article/details/89786176

import re
import urllib.request  # the request submodule must be imported explicitly
import os

def getHtml(url):
    # Download the page and return its raw bytes
    with urllib.request.urlopen(url) as page:
        html = page.read()
    return html

def getHref(html):
    text = []
    #html = "<script src=\" http://hm.baidu.com/h.js?3d8e7fc0de8a2a75f2ca3bfe128e6391\" type=\"text/javascript\"></script>"
    http_res = r"(?<=http://).+?(?=\")"
    https_res = r"(?<=https://).+?(?=\")"
    #res = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
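    # (?<=exp) positive lookbehind: matches a position that is preceded by exp
    # .   matches any single character except the newline \n; use \. to match a literal dot
    # +   matches the preceding subexpression one or more times, e.g. 'zo+' matches "zo" and "zoo" but not "z"; + is equivalent to {1,}
    # ?   on its own matches the preceding subexpression zero or one time ({0,1}); placed after + as in .+? it makes the match lazy, so it stops at the first following quote
    # (?=exp) positive lookahead: matches a position that is followed by exp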

    http_urls = re.findall(http_res, html.decode('utf-8'))
    for url in http_urls:
        print("http://"+url)
        text.append("http://"+url)
    https_urls = re.findall(https_res, html.decode('utf-8'))
    for url in https_urls:
        print("https://"+url)
        text.append("https://" + url)

    return text

if __name__ == "__main__":
    url = "http://tieba.baidu.com/"
    #url = input("Enter the URL to crawl: ")
    path = "txt"
    html = getHtml(url)
    text = getHref(html)

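    # Create the output folder if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)  # makedirs also creates any missing parent directories
        print("Directory created successfully!")
    # Text mode already translates "\n" to the platform line ending, so write "\n" rather than "\r\n"
    with open(path + '/url' + '.txt', 'w', encoding='utf-8') as file:
        for urls in text:
            file.write(urls + "\n")  # one URL per line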
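
The task statement also asks for the HTML tags to be removed, which the script above does not do. It collects every http:// and https:// URL that ends at a double quote, not only hyperlinks. The following is a minimal sketch of how both remaining parts might be handled with regular expressions. The names `extract_hrefs`, `strip_tags` and the sample page are hypothetical and not part of the original script, and regex-based tag stripping is only an approximation that can fail on unusual markup.

```python
import html as html_lib
import re

# Hypothetical helpers for illustration; they are not part of the original script.
HREF_RE = re.compile(r'''href\s*=\s*(["'])(.+?)\1''', re.IGNORECASE)
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_hrefs(page):
    """Return the value of every href="..." or href='...' attribute."""
    return [m.group(2) for m in HREF_RE.finditer(page)]


def strip_tags(page):
    """Remove script/style blocks, comments and tags, then tidy whitespace."""
    page = SCRIPT_STYLE_RE.sub("", page)   # drop code and CSS, not just their tags
    page = COMMENT_RE.sub("", page)
    page = TAG_RE.sub("", page)
    page = html_lib.unescape(page)         # turn entities such as &amp; back into &
    lines = (line.strip() for line in page.splitlines())
    return "\n".join(line for line in lines if line)


sample = """<html><head><style>p {color: red}</style></head>
<body><p>Hello <a href="https://example.com/a">A</a> &amp;
<a href='/b'>B</a></p><script>var x = "<b>";</script></body></html>"""

print(extract_hrefs(sample))
print(strip_tags(sample))
```

To use this with the script above, the decoded page (`html.decode('utf-8')`) could be passed to `strip_tags` and the returned string written to a second text file in the same way as the URL list.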