Extracting the web links from a metalink file with Python

Latest recommended article published on 2022-11-22 11:31:57

fumingf1


Tags: Python RE metalink regular-expression link extraction

Article link: https://blog.csdn.net/fumingf1/article/details/79691455


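I wanted to download TED talks, but what I got was a file in metalink format. No tool I could find supports downloading from it any more, so I extracted the links myself with Python.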

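(1) Problem

The original file contains several thousand entries with the structure shown below. The goal is to pull out every string that runs from https to .mp4 and write them all to a list file.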

<files>
<file name="Bren Brown - The power of vulnerability.mp4">
<resources>
<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>
</resources>
</file>
<file name="Isabel Behncke - Evolutions gift of play from bonobo apes to humans.mp4">
<resources>
<url type="http">https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4</url>
</resources>

</file>
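Because the entries are XML, the URLs could also be read with an XML parser instead of a regular expression. This is not the approach taken in this post, only a minimal sketch of the alternative. It assumes the file is well-formed and declares no XML namespace, as in the excerpt above; a real metalink file may declare one, in which case the tag names would need the namespace prefix. The closing `</files>` tag and the name `extract_urls` are added for illustration.

```python
import xml.etree.ElementTree as ET

# Sample taken from the excerpt above; the closing </files> tag is added
# so that the snippet is well-formed XML.
SAMPLE = """<files>
<file name="Bren Brown - The power of vulnerability.mp4">
<resources>
<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>
</resources>
</file>
<file name="Isabel Behncke - Evolutions gift of play from bonobo apes to humans.mp4">
<resources>
<url type="http">https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4</url>
</resources>
</file>
</files>"""


def extract_urls(xml_text):
    """Return the text of every <url> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # iter("url") walks the whole tree, so the nesting depth does not matter.
    return [url.text.strip() for url in root.iter("url") if url.text]


urls = extract_urls(SAMPLE)
for u in urls:
    print(u)
```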

(2) The original solution found online

https://zhidao.baidu.com/question/560038575.html

results=re.findall("(?isu)(http\://[a-zA-Z0-9\.\?/&\=\:]+)")
open("urls.txt","wb").write("\r\n".joint(results))
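The snippet above does not run as quoted. A small check of two of its problems, using only the standard `re` module: `re.findall` is called without the text to search, and the pattern looks for `http://`, which never occurs in an `https://` link. The variable names here are illustrative.

```python
import re

# Same pattern as the quoted snippet, written as a raw string.
pattern = r"(?isu)(http://[a-zA-Z0-9.?/&=:]+)"

# Problem 1: findall needs both a pattern and the string to search.
try:
    re.findall(pattern)
    missing_argument_error = None
except TypeError as exc:
    missing_argument_error = exc

# Problem 2: "http://" is not a substring of an "https://" URL,
# so the pattern finds nothing in the TED links.
sample = "https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4"
http_only_matches = re.findall(pattern, sample)

print(type(missing_argument_error).__name__, http_only_matches)
```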

(3) The result after debugging:

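import re

# Read the metalink file as text; UTF-8 is an assumption about the file's encoding.
with open("TEDEN.TXT", "r", encoding="utf-8") as source:
    s = source.read()

# First attempt: the character class lacked "_" and "-", so URLs were cut short.
# results = re.findall(r"(?isu)(https://[a-zA-Z0-9.?/&=:]+)", s)
results = re.findall(r"(?isu)(https://[a-zA-Z0-9 _\-.?/&=:]+)", s)

# newline="" keeps the explicit "\r\n" separators exactly as written.
with open("OUTPUT.txt", "w", encoding="utf-8", newline="") as handle:
    handle.write("\r\n".join(results))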
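To see why the character class had to grow, here is a small comparison of the two patterns on one line from the sample file. The names `narrow` and `wide` are only for illustration, and the inline flags are left out because they do not affect this input.

```python
import re

line = '<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>'

# Without "_" and "-" in the character class the match stops at the first "_".
narrow = re.findall(r"https://[a-zA-Z0-9.?/&=:]+", line)

# With them, the match runs up to the "<" of the closing tag.
wide = re.findall(r"https://[a-zA-Z0-9 _\-.?/&=:]+", line)

print(narrow)
print(wide)
```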

(4) Contents of the output file:

https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4
https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4

......

Debugging succeeded.

(5) Review

I learned what the symbols in re mean and how to write a regular expression that matches the format you want.
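For reference, the pieces of the pattern can be spelled out with `re.VERBOSE`, which allows comments inside the expression. This is an annotated sketch, not the exact pattern from the script: the literal space is dropped from the character class and the inline `(?isu)` flags are replaced by their named equivalents (`i` is `re.IGNORECASE`, `s` is `re.DOTALL`; `u` is already the default for text patterns in Python 3). The name `URL_PATTERN` is illustrative.

```python
import re

URL_PATTERN = re.compile(
    r"""
    (                          # group: findall returns what this captures
      https://                 # the literal prefix every link starts with
      [a-zA-Z0-9_\-.?/&=:]+    # one or more characters allowed in the link
    )
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)

text = '<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>'
found = URL_PATTERN.findall(text)
print(found)
```

Inside the square brackets most punctuation needs no backslash; only the hyphen is escaped here so that it is not read as a range.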

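I also learned that the string method is .join; the .joint in the original snippet is a typo, and no such method exists.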
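A quick illustration of the point about `.join`: the separator string calls `join` on a list of strings, while `joint` is simply not an attribute of `str`. The list contents are placeholders.

```python
results = ["first-url", "second-url"]

# The separator goes on the left, the list of strings is the argument.
joined = "\r\n".join(results)

# str has no method called "joint", so calling it raises AttributeError.
try:
    "\r\n".joint(results)
    joint_error = None
except AttributeError as exc:
    joint_error = exc

print(repr(joined))
print(type(joint_error).__name__)
```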