Scraper problem log: extracting images with regex

Yummy936

Published 2023-02-05 13:27:54

Tags: editor, html, web scraping, python

Article link: https://blog.csdn.net/Yummy936/article/details/128889637

Background: extract the image links from one page of a Baidu Tieba thread.

    ex='<div id="post_content_146765581359" class="d_post_content j_d_post_content' \
       ' " style="display:;">.*?(<img class="BDE_Image" src=".*?" ' \
       'size=.*?" height=.*?>)</div>'

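With the parentheses placed around the whole tag, (<img class="BDE_Image" src=".*?" size=.*?" height=.*?>), the captured result is the entire img tag, so the image link still has to be extracted from it afterwards. That includes a lot of irrelevant content; I only want the link after src.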

    ex='<div id="post_content_146765581359" class="d_post_content j_d_post_content' \
       ' " style="display:;">.*?<img class="BDE_Image" src="(.*?)" ' \
       'size=.*?</div>'

With the parentheses only around src="(.*?)", great, now only a single image link gets extracted.
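A likely reason for the single result: the pattern starts with the opening div tag, which occurs only once, so re.findall finds one match and returns one captured value for it. The sketch below illustrates this on a made-up fragment; the sample HTML, the example.com URLs, and the variable names are assumptions for illustration, not the real Tieba page.

```python
import re

# Hypothetical sample standing in for the real page HTML.
sample = (
    '<div id="post_1" class="content">'
    '<img class="BDE_Image" src="https://example.com/a.jpg" size="100" height="50">'
    '<img class="BDE_Image" src="https://example.com/b.jpg" size="200" height="60">'
    '</div>'
)

# Anchored to the div: the div occurs once, so there is one match
# and therefore one captured src, no matter how many images follow.
anchored = '<div id="post_1" class="content">.*?<img class="BDE_Image" src="(.*?)" size=.*?</div>'
one_link = re.findall(anchored, sample, re.S)

# Not anchored to the div: every <img> tag is a separate match.
per_image = '<img class="BDE_Image" src="(.*?)"'
all_links = re.findall(per_image, sample, re.S)

print(one_link)
print(all_links)
```

The second pattern returns one entry per img tag, but on a full page it would also pick up images outside the target div, which is why narrowing to the div first is still useful.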

    # ex='<div id="post_content_146765581359" class="d_post_content j_d_post_content' \
    #    ' " style="display:;">.*?<img class="BDE_Image" src="(.*?)" ' \
    #    'size=.*?" height=.*?><img class="BDE_Image" src="(.*?)" ' \
    #    'size=.*?" height=.*?><img class="BDE_Image" src="(.*?)" ' \
    #    'size=.*?" height=.*?><img class="BDE_Image" src="(.*?)" ' \
    #    'size=.*?" height=.*?></div>'

Copying <img class="BDE_Image" src="(.*?)" size=.*?" height=.*?> several times like this only works if I can somehow know in advance how many image links there are. Speechless.

So how should this regular expression be written so that it concisely pulls out all the image links on the page in one go...

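Heh, solved it: first use findall to grab the code inside the div, then run findall again on that result with the pattern for a single img tag. That way one img regex is enough.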

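    import re

    # page_data is assumed to hold the HTML text of the page, fetched earlier.
    # Step 1: capture everything from the first image to the end of the post div.
    ex1 = ('<div id="post_content_146765581359" class="d_post_content j_d_post_content'
           ' " style="display:;">.*?(<img class="BDE_Image" src=".*?" size=.*?" height=.*?></div>)')
    # Join the matches into one string instead of calling __str__() on the list.
    img_src_list1 = ''.join(re.findall(ex1, page_data, re.S))

    # Step 2: pull the src of every image out of that fragment.
    ex2 = '<img class="BDE_Image" src="(.*?)" size=.*?" height=.*?>'
    img_src_list2 = re.findall(ex2, img_src_list1, re.S)
    print(img_src_list2)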
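The same two-pass idea can be tried without fetching anything. The following is only a minimal sketch: the helper name extract_image_links, the sample HTML, and the example.com URLs are made up for illustration, and it assumes the post div contains no nested div.

```python
import re

def extract_image_links(html, post_id):
    """Return the src of every BDE_Image inside the post div with the given id."""
    # Pass 1: grab the inner HTML of the post div.
    # Assumes there is no nested </div> inside the post content.
    div_pattern = r'<div id="post_content_%s"[^>]*>(.*?)</div>' % re.escape(post_id)
    fragments = re.findall(div_pattern, html, re.S)

    # Pass 2: one short img pattern, applied to each fragment.
    img_pattern = r'<img class="BDE_Image" src="(.*?)"'
    links = []
    for fragment in fragments:
        links.extend(re.findall(img_pattern, fragment))
    return links

# Hypothetical sample standing in for the real page HTML.
sample_html = (
    '<div id="post_content_1" class="d_post_content" style="display:;">'
    'text<img class="BDE_Image" src="https://example.com/1.jpg" size="1" height="2">'
    '<img class="BDE_Image" src="https://example.com/2.jpg" size="3" height="4">'
    '</div>'
    '<div id="post_content_2" class="d_post_content" style="display:;">'
    '<img class="BDE_Image" src="https://example.com/3.jpg" size="5" height="6">'
    '</div>'
)

links = extract_image_links(sample_html, "1")
print(links)
```

Only the images of the requested post are returned; the image in the second div is left out because the first pass has already narrowed the text down to one div.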