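I wanted to download TED talks, but what I got was a metalink-format file, and I could not find a tool that still supports downloading from it. So I extracted the URLs myself with Python.
(1) The problem
The original file contains several thousand entries with the structure below. The goal is to pull out every string that runs from https to mp4 and collect them into a list file.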
<files>
<file name="Bren Brown - The power of vulnerability.mp4">
<resources>
<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>
</resources>
</file>
<file name="Isabel Behncke - Evolutions gift of play from bonobo apes to humans.mp4">
<resources>
<url type="http">https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4</url>
</resources>
</file>
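The task described above (take everything from https up to mp4) can be sketched directly as a regular expression. This is only a minimal sketch, not the approach used later in this post: the names SAMPLE, MP4_URL and extract_mp4_urls are made up for illustration, and SAMPLE is a shortened copy of the structure above with a closing </files> tag added so that it is complete.

```python
import re

# Hypothetical sample that mirrors the structure shown above
SAMPLE = """<files>
<file name="Bren Brown - The power of vulnerability.mp4">
<resources>
<url type="http">https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4</url>
</resources>
</file>
<file name="Isabel Behncke - Evolutions gift of play from bonobo apes to humans.mp4">
<resources>
<url type="http">https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4</url>
</resources>
</file>
</files>"""

# Start at "https://", take as few characters as possible that are not
# "<", ">", a double quote, or whitespace, and stop at the first ".mp4"
MP4_URL = re.compile(r'https://[^<>"\s]+?\.mp4', re.IGNORECASE)


def extract_mp4_urls(text):
    """Return every https...mp4 string found in text, in document order."""
    return MP4_URL.findall(text)


urls = extract_mp4_urls(SAMPLE)
for url in urls:
    print(url)
```

The name attributes also end in .mp4, but they do not contain https://, so the pattern skips them.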
(2) The original solution found online (as copied here it does not run; the debugged version is in (3))
https://zhidao.baidu.com/question/560038575.html
results=re.findall("(?isu)(http\://[a-zA-Z0-9\.\?/&\=\:]+)")
open("urls.txt","wb").write("\r\n".joint(results))
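To see why the two lines above fail, here is a small sketch that triggers each problem separately under Python 3. The variable names findall_error, joint_error and https_hits are only for illustration.

```python
import re

# The pattern from the snippet above, written as a raw string
pattern = r"(?isu)(http\://[a-zA-Z0-9\.\?/&\=\:]+)"

# 1) re.findall needs the text to search as its second argument
try:
    re.findall(pattern)
    findall_error = None
except TypeError as exc:
    findall_error = type(exc).__name__

# 2) str has no method called "joint"; the method is "join"
try:
    "\r\n".joint(["a", "b"])
    joint_error = None
except AttributeError as exc:
    joint_error = type(exc).__name__

# 3) "http\://" requires ":" right after "http", so https URLs never match
https_hits = re.findall(pattern, "https://download.ted.com/talks/x.mp4")

print(findall_error, joint_error, https_hits)
```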
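(3) Result after debugging:
import re

# Read the metalink file as text (assumes it is UTF-8 or plain ASCII)
with open("TEDEN.TXT", encoding="utf-8") as source:
    s = source.read()
# First attempt: the class lacks "_" and "-", so URLs were cut short
# results = re.findall(r"(?i)https://[a-zA-Z0-9.?/&=:]+", s)
results = re.findall(r"(?i)https://[a-zA-Z0-9 _\-.?/&=:]+", s)
with open("OUTPUT.txt", "w", encoding="utf-8", newline="") as handle:
    handle.write("\r\n".join(results))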
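The difference between the first attempt and the final pattern is easiest to see on a single line of the input. A minimal sketch, with the variable names narrow, wide, narrow_hits and wide_hits chosen only for this illustration:

```python
import re

url = "https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4"
line = '<url type="http">' + url + "</url>"

# Character class without "_" and "-" (the first attempt)
narrow = r"(?i)https://[a-zA-Z0-9.?/&=:]+"
# Character class widened with space, "_" and "-" (the final version)
wide = r"(?i)https://[a-zA-Z0-9 _\-.?/&=:]+"

narrow_hits = re.findall(narrow, line)
wide_hits = re.findall(wide, line)

print(narrow_hits)  # the match ends just before the first "_"
print(wide_hits)    # the match runs up to the "<" of </url>
```

Because "<" is not in either character class, the closing </url> tag ends the match without any extra handling.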
(4) Contents of the output file:
https://download.ted.com/talks/BreneBrown_2010X-low-en.mp4
https://download.ted.com/talks/IsabelBehnckeIzquierdo_2011U-low-en.mp4
...
Debugging succeeded.
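(5) Retrospective
I learned what the symbols in re mean and how to write a regular expression that matches the format I want.
I also learned how to use .join, and that .joint in the snippet from (2) is a typo: str has no such method.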
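As a recap of the regex symbols and of str.join, the pattern can be written in verbose mode with a comment on each part. This is a sketch under my own assumptions: URL_PATTERN, text, hits and joined are illustrative names, the example.com URLs are placeholders, and the space has been left out of the character class so that a match ends at whitespace.

```python
import re

URL_PATTERN = re.compile(
    r"""
    https://                # literal text; ":" and "/" need no escaping
    [a-zA-Z0-9_\-.?/&=:]+   # one or more characters from this set;
                            # "." and "?" are literal inside [...]
    """,
    re.IGNORECASE | re.VERBOSE,
)

text = "a https://example.com/x_1-y.mp4 b https://example.com/z.mp4"
hits = URL_PATTERN.findall(text)

# str.join is called on the separator and takes the list as its argument
joined = "\r\n".join(hits)
print(joined)
```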