[Small crawler share] Downloading every image from the threads on a Tieba forum's front page

Latest recommended article published on 2021-08-27 00:43:39

今天开始学python

Article link: https://blog.csdn.net/pikapika_chu/article/details/96794208

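While writing this I ran into two small gaps in my knowledge, and each one cost me more than two hours. That is the downside of learning by watching video tutorials and then building a project, entirely self-taught with nobody to ask. Reading the official documentation myself would probably have gone better.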

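The two gaps were:

Using XPath: import it at the top of the file (from lxml import etree), then parse the HTML string with etree.HTML() before running any query on it.
Fetching the image links: I tried f.write and url_open and could not get either to work no matter what I changed. In the end a single function did the job: urllib.request.urlretrieve()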
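
A likely reason the f.write attempts failed is that an image is binary data: the url_open helper in the program decodes every response as UTF-8 text, which does not work for image bytes. The sketch below compares urlretrieve with a manual read-and-write in binary mode. It is only an illustration under stated assumptions: the helper names are made up for this example, and a local file served through a file:// URL stands in for a remote image so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_with_urlretrieve(url, dest):
    """Let urllib copy the response straight into a file."""
    urllib.request.urlretrieve(url, dest)


def download_with_write(url, dest):
    """Manual equivalent: read raw bytes and write them in binary mode."""
    with urllib.request.urlopen(url) as response:
        data = response.read()       # bytes, deliberately not decoded
    with open(dest, "wb") as f:      # "wb" (binary), not "w" (text)
        f.write(data)


# Stand-in for a remote image (assumption: any non-text bytes will do).
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.bin"
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0xD8, 0x00, 0x80])
source.write_bytes(payload)
url = source.as_uri()                # file:// URL pointing at the stand-in

download_with_urlretrieve(url, workdir / "a.bin")
download_with_write(url, workdir / "b.bin")

# Treating the same bytes as UTF-8 text fails, which is what happens
# when image data goes through a helper that calls .decode("utf-8").
try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print((workdir / "a.bin").read_bytes() == payload)
print((workdir / "b.bin").read_bytes() == payload)
print(decoded_ok)
```

Both approaches produce the same file. urlretrieve is simply the shorter one, because it handles the binary write itself.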

The program's source code follows:

import os
import urllib.request
from lxml import etree


# What this program does: saves the images from every thread on the front page of the 涵吧 forum

def url_open(url):
    # Fetch a page and return its body decoded as UTF-8 text
    request = urllib.request.Request(url)
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8")

    return html

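def find_tzurl(html):
    
    # Two things to remember with XPath: import etree at the top of the file,
    # and parse the HTML string with etree.HTML() before querying it
    html = etree.HTML(html)
    ids = html.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/@href')
    #print(ids)
    #print(set(ids))
    tz_urls =[]                 # full URL of each thread ("tz" = 帖子, thread)
    for each in ids:
        # each href is a site-relative path, so prepend the host
        furl = 'http://tieba.baidu.com' + each
        tz_urls.append(furl)
    #print(tz_urls)
    return tz_urls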

def save_img(url):
    html = url_open(url)
    html = etree.HTML(html)
    tztitle = html.xpath('//div[@class="core_title core_title_theme_bright"]/h1/@title')    # the thread's title
    imgurls = html.xpath('//div/img[@class="BDE_Image"]/@src')      # URL of every image in the thread
    folder = "e:/pic/" + tztitle[0]
    # makedirs also creates e:/pic itself when it is missing
    os.makedirs(folder, exist_ok=True)
    print(tztitle)
    

    for each in imgurls:
        print(each)
        filename = each.split('/')[-1]

        # Important: an image is binary data, so download it with urlretrieve
        # rather than through url_open, which decodes the response as text
        urllib.request.urlretrieve(each, folder + '/' + filename)
   
def main():  

    url = 'http://tieba.baidu.com/f?kw=%E5%BC%A0%E9%9F%B6%E6%B6%B5&ie=utf-8&fr=wwwt'
    html = url_open(url)
    tz_urls = find_tzurl(html)          # the URL of every thread linked from the front page
    print(tz_urls[1])
    for each in tz_urls:
        save_img(each)
    
if __name__ == '__main__':
    main()
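
The XPath step in find_tzurl can be tried on its own with a small inline HTML string, which makes the "parse first, then query" order easier to see. This is a minimal sketch, not Tieba's real markup: the sample HTML, its class name, and the paths are invented for illustration, and it assumes lxml is installed.

```python
from lxml import etree

# Hypothetical markup, loosely modelled on a thread list (not copied from Tieba).
sample_html = """
<html><body>
  <div class="threadlist_title"><a href="/p/111">First thread</a></div>
  <div class="threadlist_title"><a href="/p/222">Second thread</a></div>
  <div class="other"><a href="/ignored">Not a thread</a></div>
</body></html>
"""

# Step 1: parse the string into an element tree.
tree = etree.HTML(sample_html)

# Step 2: query the tree. @class must match the attribute value exactly,
# so the div with class "other" is skipped.
hrefs = tree.xpath('//div[@class="threadlist_title"]/a/@href')
print(hrefs)

# Turn the site-relative paths into full URLs, as find_tzurl does.
full_urls = ['http://tieba.baidu.com' + h for h in hrefs]
print(full_urls)

# A plain str has no .xpath() method, which is why step 1 cannot be skipped.
print(hasattr(sample_html, "xpath"))
```

Because @class="..." is an exact string comparison, the trailing space inside the class value used in find_tzurl matters: the query only matches if the page's attribute contains that same space.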
