A shell script to extract hyperlink URLs from an HTML file

Latest recommended article published on 2021-06-16 10:28:58

weixin_34197488

Tags: shell awk python

Original link: https://yq.aliyun.com/articles/526874

sed -n '/<a /p' html.txt | sed 's#<a \([^>]*\)>#--SYN--\1--FIN--#g; s/<//g; s/>//g' | \
sed 's/--SYN--/</g; s/--FIN--[^<]*</></g; s/[^<]*</</; s/--FIN--.*/>/;' | \
sed "s#<[^>]*href=\([^a-zA-Z>]*http://[^ >]*\)[^>]*># @\1@#g; s/<[^>]*>//g; s/'//g; s/\"//g; s/@/ /g" > url.txt

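This extracts http://domain/path/to/html.html from <a href="http://domain/path/to/html.html">.
In other words:
1. Only nodes whose HTML tag is a are matched.
2. The value selected is that of the href attribute.
3. The href value must start with http://, so relative paths are not supported.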
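The three rules above can be checked on a small sample. The sketch below is not the article's sed pipeline. It is a cross-check built on `grep -o`, which GNU and BSD grep provide but POSIX does not require. The sample content is hypothetical, and the sketch assumes double-quoted href values.

```shell
# Hypothetical sample input; the file name html.txt follows the article.
cat > html.txt <<'EOF'
<p>intro <a href="http://domain/path/to/html.html">absolute</a></p>
<a href="/relative/page.html">relative</a>
<img src="http://domain/logo.png">
EOF

# Rule 1: keep only <a ...> tags (the <img> tag is dropped).
# Rule 2: keep only the href attribute.
# Rule 3: require the value to start with http:// (the relative link is dropped).
expected=$(grep -o '<a [^>]*>' html.txt \
  | grep -o 'href="http://[^"]*"' \
  | sed 's/^href="//; s/"$//')

echo "$expected"
```

Only the absolute link in the first line should survive all three filters.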

Written as a sed script, this can be expressed as:

# this script is used to dig href URLs out of an HTML file
s/<a \([^>]*\)>/--SYN--\1--FIN--/g;
s/[><]//g;
s/--FIN--/>/g;
s/--SYN--/</g;
s/^\(.*\)$/>\1</;
s/>[^<]*</></g;
s#<[^>]*href=[^a-zA-Z>]*\(http://[^ >]*\)[^>]*>#@\1@#g;
s/<[^>]*>//g;
s/@@/\ /g;
s/[><'"@]//g;
/^ *$/d;
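One way to try the script is to save it to a file and run it with `sed -f`. In the sketch below the file names href.sed and sample.html and the sample content are hypothetical. The script body assumes that the grouping parentheses are the escaped `\(` and `\)` of basic regular expressions.

```shell
# Save the sed script (hypothetical file name: href.sed).
cat > href.sed <<'EOF'
s/<a \([^>]*\)>/--SYN--\1--FIN--/g;
s/[><]//g;
s/--FIN--/>/g;
s/--SYN--/</g;
s/^\(.*\)$/>\1</;
s/>[^<]*</></g;
s#<[^>]*href=[^a-zA-Z>]*\(http://[^ >]*\)[^>]*>#@\1@#g;
s/<[^>]*>//g;
s/@@/\ /g;
s/[><'"@]//g;
/^ *$/d;
EOF

# Hypothetical sample: two absolute links on one line, one relative link,
# and one line without any link.
cat > sample.html <<'EOF'
<p><a href="http://a.example/1.html">one</a> and <a href='http://b.example/2.html'>two</a></p>
<a href="/relative.html">relative</a>
<p>no links here</p>
EOF

urls=$(sed -f href.sed sample.html)
echo "$urls"
```

URLs found on the same input line come out on one output line, separated by a space. Lines that yield no http:// link are deleted by the final `/^ *$/d`.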

sed script 2:

:a;
h;
s@^[^<]*<a\s*[^>]*\s*href\s*=\s*['"]*\(http://[^> "']*\)[^>]*>.*@\1@p;
g;
s@<[a-zA-Z/][a-zA-Z]*[^>]*>@@;
t a;
/<[a-zA-Z\/][a-zA-Z]*[^>]*$/{N; b a; };
d;
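Script 2 appears to process one tag at a time: it prints the URL if the first tag in the pattern space is an `<a>` with an http:// href, removes that tag, and loops. When a tag is still open at the end of a line, it pulls in the next line with `N`. The sketch below exercises that on a tag split across two lines. The file names href2.sed and sample2.html, the sample content, and the `-n` flag are assumptions, and `\s` requires GNU sed.

```shell
# Save script 2 (hypothetical file name: href2.sed), one command per line.
cat > href2.sed <<'EOF'
:a
h
s@^[^<]*<a\s*[^>]*\s*href\s*=\s*['"]*\(http://[^> "']*\)[^>]*>.*@\1@p
g
s@<[a-zA-Z/][a-zA-Z]*[^>]*>@@
t a
/<[a-zA-Z\/][a-zA-Z]*[^>]*$/{
N
b a
}
d
EOF

# Hypothetical sample: the first <a> tag is split across two lines.
cat > sample2.html <<'EOF'
<p>see <a class="x"
   href="http://a.example/1.html">one</a></p>
<a href='http://b.example/2.html'>two</a> <a href="/rel.html">rel</a>
EOF

# -n is an assumption: it keeps N from printing leftovers at end of input.
urls=$(sed -n -f href2.sed sample2.html)
echo "$urls"
```

Unlike the first script, this one prints each URL on its own line and copes with tags that span lines. The relative link is still skipped.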

This article is reposted from chengxuyonghu's 51CTO blog. Original link: http://blog.51cto.com/6226001001/1612987. To repost it, please contact the original author.
