Python regular expression for extracting URLs: extracting a URL from an HTML link

If you're only looking for one:

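import re

# s is the string to search
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    # group(1) is the captured URL; group(0) would also include the leading href= and quote
    print(match.group(1))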

If you have a long string and want every instance of the pattern in it:

import re

# findall returns only the captured group, so urls is a list of URL strings
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

where s is the string you're looking for matches in.
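As a quick way to try both calls, here is a minimal sketch. The sample string `s` and the name `HREF_RE` are made up for illustration; substitute your own HTML.

```python
import re

# Hypothetical sample input; any HTML string would do.
s = '<p><a href="http://example.com/a">A</a> and <a href=\'/b?x=1\'>B</a></p>'

# Same pattern as above, compiled once so it can be reused.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

# search() stops at the first match; group(1) is the captured URL.
first = HREF_RE.search(s)
if first:
    print(first.group(1))

# findall() returns the captured group for every match in the string.
urls = HREF_RE.findall(s)
print(', '.join(urls))
```

Note that relative links such as `/b?x=1` come back exactly as written in the HTML; the pattern does not resolve them against a base URL.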

A quick explanation of the regexp bits:

r'...' is a “raw” string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially — in a raw string a \ is just a \. In a regular string you’d have to do \\ every time, and that gets old in regexps.)

“href=[\'"]?” says to match “href=”, possibly followed by a ' or ". “Possibly” because it’s hard to say how horrible the HTML you’re looking at is, and the quotes aren’t strictly required.

Enclosing the next bit in “()” says to make it a “group”, which means to split it out and return it separately to us. It’s just a way to say “this is the part of the pattern I’m interested in.”

“[^\'" >]+” says to match any characters that aren’t ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
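To see how those pieces behave, the sketch below runs the same pattern over three made-up tags (double-quoted, single-quoted, and unquoted) and prints the whole match next to the captured group. The `samples` list is purely illustrative.

```python
import re

pattern = re.compile(r'href=[\'"]?([^\'" >]+)')

# Hypothetical tags: double-quoted, single-quoted, and unquoted attribute values.
samples = [
    '<a href="http://example.com/x">',
    "<a href='/relative/path'>",
    '<a href=page.html>',
]

for tag in samples:
    m = pattern.search(tag)
    # group(0) is the whole match, including href= and any opening quote;
    # group(1) is only the parenthesised part, i.e. the URL.
    print(repr(m.group(0)), '->', repr(m.group(1)))
```

Because a space ends the match, a quoted URL that contains a literal space would be cut short; that is one of the trade-offs of not matching a full URL.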

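The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher-level external requirement. It also doesn't help you toward your goal of learning regular expressions, which I assume this particular HTML-parsing project is just one part of.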

It's pretty easy to do:

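from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)

# html_to_parse is the HTML string; name the parser explicitly
soup = BeautifulSoup(html_to_parse, 'html.parser')
# href=True keeps only <a> tags that actually have an href attribute
for tag in soup.find_all('a', href=True):
    print(tag['href'])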

Once you've installed BeautifulSoup, anyway.
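If the extra install is the sticking point, roughly the same loop can be sketched with the standard library's `html.parser` module. This is an alternative to BeautifulSoup, not what the answer above used; the class name `HrefCollector` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Hypothetical sample input; the middle <a> has no href and is skipped.
html_to_parse = (
    '<a href="http://example.com/a">A</a>'
    '<a name="x">no link</a>'
    "<a href='/b'>B</a>"
)

collector = HrefCollector()
collector.feed(html_to_parse)
collector.close()

for href in collector.hrefs:
    print(href)
```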
