Extracting hyperlinks with regular expressions in Python: using re string matching to extract URL links - Python...

Latest recommended article published on 2021-07-05 14:01:41

weixin_39576018

Article tags: extracting hyperlinks with regular expressions in Python

I've been trying to extract URLs from a text file using the re module: any link that starts with http://, https://, or www.

The file contains plain text as well as HTML source code. The HTML part is easy because I can extract the links using BeautifulSoup, but the plain text seems to be more challenging.

I found this online, and it seems to be the best implementation of URL extraction. However, it fails on certain tags: specifically, it can't handle tags and includes them in the URL.

Any help is appreciated, because I'm not familiar with string matching at all myself.

Here are the patterns:

sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))

sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))
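A likely reason the tags end up inside the match: in the character class [$-_@.&+], the sequence $-_ is read as a range from "$" to "_", which also covers "<", ">", "/" and ":". The sketch below reproduces that on a made-up sample string and compares it with a tightened pattern. The `<br>` tag, the example.org address, and the names ORIGINAL, TIGHTENED, and extract_urls are all assumptions for illustration, not part of the original post.

```python
import re

# Pattern from the question. "$-_" inside the brackets is a range
# (0x24 to 0x5F), so it also matches "<" and ">".
ORIGINAL = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Hypothetical tightened pattern (an assumption, not the asker's code):
# accept either prefix, then stop at whitespace, angle brackets, or a double quote.
TIGHTENED = r'(?:https?://|www\.)[^\s<>"]+'


def extract_urls(text):
    """Return every substring of text that matches TIGHTENED."""
    return re.findall(TIGHTENED, text)


# Made-up sample with a tag glued to the end of a URL.
sample = "see http://www.website.com/science/<br>and www.example.org today"

print(re.findall(ORIGINAL, sample))  # the tag and the following word leak into the match
print(extract_urls(sample))          # the match stops at "<"
```

This sketch still keeps trailing punctuation such as a sentence-ending period, so the results may need further trimming depending on the input.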

Examples:

http://www.website.com/science/