How to Extract URL Links from a String in Python

是晨星啊

Category: Python学习 (Python Learning)

Article link: https://blog.csdn.net/s1162276945/article/details/80738661

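This problem came up while I was writing an XPath-based crawler. The HTML I fetched with requests is a plain string, not JSON, so I had to extract the URLs from the string itself, which is much harder than working with JSON data. I found a solution by searching on Google.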

What’s the cleanest way to extract URLs from a string using Python?

https://stackoverflow.com/questions/520031/whats-the-cleanest-way-to-extract-urls-from-a-string-using-python/44936558#44936558?newreg=a1ad42438aea44d08f387154dbb6891d
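
To make the idea concrete, here is a minimal sketch of pulling URLs out of a string using only the standard library `re` module. It is not the approach from the linked answer. The pattern, the `find_urls` helper, and the `example.com` sample string are all assumptions made for illustration, and the pattern only recognizes URLs that start with `http://` or `https://`.

```python
import re

# Simplified pattern (an assumption of this sketch): an http/https scheme
# followed by characters that are not whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that often trails a URL in running text.
        url = match.group(0).rstrip(".,;:!?)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sample input mixing a page link, an image link, and a repeat.
sample = (
    '<a href="http://www.example.com/auto/news_1.htm">news</a> '
    '<img src="http://www.example.com/img/logo.png"> '
    'See http://www.example.com/auto/news_1.htm.'
)
print(find_urls(sample))
```

A hand-written pattern like this misses URLs written without a scheme (for example `www.example.com/page`), which is one reason to prefer a dedicated library for messy input.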

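The extracted hyperlinks include both image links and links to text pages (this behavior comes from urlextract.py; see the GitHub page https://github.com/lipoja/URLExtract for details). I only need the links to text pages, so the results have to be filtered.

Several ways to check whether a string contains a substring in Python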

https://blog.csdn.net/yl2isoft/article/details/52079960
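
As a rough illustration of the filtering step, the sketch below compares a plain substring test (`".htm" in u`) with a stricter check on the URL path using `urllib.parse`. The `is_page_url` helper, the `PAGE_SUFFIXES` tuple, and the sample URLs are hypothetical and would need adjusting for a real site.

```python
from urllib.parse import urlparse

# Hypothetical suffix list for this sketch; adjust it to the target site.
PAGE_SUFFIXES = (".htm", ".html")


def is_page_url(url):
    """Return True if the URL's path ends with one of PAGE_SUFFIXES."""
    path = urlparse(url).path.lower()
    return path.endswith(PAGE_SUFFIXES)


# Made-up URLs covering a page, an image, a misleading name, and a query string.
urls = [
    "http://www.example.com/auto/2018-06/19/c_1.htm",
    "http://www.example.com/img/logo.png",
    "http://www.example.com/download.htm.zip",
    "http://www.example.com/list.html?page=2",
]

# Substring test: simple, but it matches ".htm" anywhere in the URL.
substring_hits = [u for u in urls if ".htm" in u]
# Path-suffix test: ignores the query string and requires the path to end in .htm/.html.
suffix_hits = [u for u in urls if is_page_url(u)]

print(substring_hits)
print(suffix_hits)
```

The substring test is usually enough when the site's URLs are regular. The path-based check is worth the extra lines when file names such as `download.htm.zip` could slip through.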

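import requests
from lxml import etree
from lxml.etree import tostring
from urlextract import URLExtract


def get_url():
    # homepage, proxies, headers and data are not defined in this snippet;
    # they are expected to be set elsewhere in the script.
    # Open the output file in append mode with an explicit encoding.
    with open('../xinhuanet/汽车_新闻.txt', 'a', encoding='utf-8') as file:
        response = requests.get(homepage, proxies=proxies, headers=headers, params=data)
        print(response.status_code)  # 200
        html = etree.HTML(response.content)
        text = tostring(html).decode()  # serialize the parsed tree back to a string
        print(text)  # the content I was looking for does not show up here
        extractor = URLExtract()
        urls = extractor.find_urls(text, only_unique=True)
        # print(urls)
        pc_url = []
        for u in urls:
            # keep only links that contain ".htm" (page links); image links are skipped
            if ".htm" in u:
                pc_url.append(u)
                file.write(u + '\n')
    print(pc_url)
    return pc_url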
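
Since the goal is to separate page links from image links, another option is to read them from the tags they belong to instead of scanning the whole string. The following is a minimal sketch using the standard library `html.parser` module, not the method used above. The `LinkCollector` class and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> and <img src> values into separate lists."""

    def __init__(self):
        super().__init__()
        self.page_links = []
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.page_links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.image_links.append(attrs["src"])


# Hypothetical HTML fragment with two page links and one thumbnail image.
sample_html = (
    '<div><a href="http://www.example.com/auto/news_1.htm">'
    '<img src="http://www.example.com/img/thumb.jpg"></a>'
    '<a href="http://www.example.com/auto/news_2.htm">title</a></div>'
)

collector = LinkCollector()
collector.feed(sample_html)
collector.close()

print(collector.page_links)
print(collector.image_links)
```

This only sees URLs that appear in `href` and `src` attributes. URLs embedded in scripts or plain text would still need a string-based extractor.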