Baidu Tieba Crawler


Nick12138_2017


Article link: https://blog.csdn.net/Nick12138_2017/article/details/79247234


Environment: VS2017 + Python 3.6
Third-party library: BeautifulSoup 4.6.0

Thread crawled: https://tieba.baidu.com/p/3954777778?see_lz=1&pn=1
This is my first crawler, so the code is a bit messy; please bear with me.
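
The thread URL above carries two query parameters: see_lz=1 restricts the thread to the original poster's posts, and pn selects the page. As a minimal sketch (the helper name build_page_url and its only_poster parameter are hypothetical and not part of the original script), each page URL can be built from the bare thread address with the standard library's urllib.parse, so that every page is derived from the same base address:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_page_url(thread_url, page, only_poster=True):
    """Return the URL of one page of a thread (hypothetical helper)."""
    parts = urlsplit(thread_url)
    query = {"pn": page}
    if only_poster:
        query = {"see_lz": 1, "pn": page}
    # Any query string already on thread_url is replaced, not extended.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


if __name__ == "__main__":
    print(build_page_url("https://tieba.baidu.com/p/3954777778", 2))
```

Because the query string is rebuilt on every call, passing a URL that already contains pn does not produce a second pn parameter.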

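"""Download the images from a Baidu Tieba thread ("only show the original poster" mode)."""

import os
import re  # regular expressions
from urllib.error import HTTPError
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

SAVE_DIR = "D:/pics/"  # directory the images are saved to
POST_ID_PATTERN = re.compile(r"\d{10,12}")  # pattern used to find a post_id


def get_post_id(postlist):
    """Return the post_id of every floor (post) on the page."""
    post_ids = []
    for child in postlist:
        try:
            data_field = child['data-field']  # attribute string that contains the post_id
        except (KeyError, TypeError):
            # Tags without the attribute raise KeyError; text nodes raise TypeError.
            continue
        matcher = POST_ID_PATTERN.search(data_field)  # search the attribute for the post_id
        if matcher:
            post_ids.append(matcher.group(0))
    return post_ids


def download_pictures(page, bsObj):
    """Download every image on one page of the thread."""
    pic_list = bsObj.find_all("img", {"class": "BDE_Image"})  # tags that hold the image URLs
    os.makedirs(SAVE_DIR, exist_ok=True)  # urlretrieve does not create missing directories
    for i, pic in enumerate(pic_list):  # i counts the images on this page
        file_name = SAVE_DIR + str(page) + '_' + str(i) + '.jpg'  # path the image is saved to
        urlretrieve(pic['src'], file_name)  # download the image
        print("Downloaded image " + str(i + 1) + " on page " + str(page))  # progress message


def get_total_page(url):
    """Return the number of pages in the thread."""
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    info = bsObj.find_all("span", {"class": "red"})  # tags that hold the page-count information
    total_page = int(info[-1].get_text())  # the last of them holds the page count
    return total_page


def all_pages(url, total_page):
    """Download the images from every page of the thread."""
    page = 1  # current page of the thread
    while page <= total_page:
        # Build each page URL from the base URL; appending to url itself
        # would pile up one "&pn=" parameter per iteration.
        page_url = url + "&pn=" + str(page)
        try:
            html = urlopen(page_url)  # request the page
        except HTTPError:  # stop if the page does not exist
            break
        else:
            bsObj = BeautifulSoup(html, "html.parser")  # create a BeautifulSoup object

            # children of the tag that wraps the list of floors
            postlist = bsObj.find("div", {"class": "p_postlist"}).children

            post_ids = get_post_id(postlist)  # post_id of each floor (not used further here)
            download_pictures(page, bsObj)  # download the images on this page
        page += 1  # move on to the next page


# url = input("Enter the URL of the Baidu Tieba thread to crawl (e.g. https://tieba.baidu.com/p/3954777778)")
# url += "?see_lz=1"
url = "https://tieba.baidu.com/p/3954777778?see_lz=1"

total_page = get_total_page(url)
all_pages(url, total_page)  # download the images from every page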
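
The regular expression in get_post_id returns the first run of 10 to 12 digits in the data-field attribute, which is the post_id only if no other long number comes before it. As a hedged alternative, if the attribute holds JSON, the value could be read by key instead. This is an assumption: the helper name post_id_from_data_field, the sample string, and its author/content/post_id layout are all invented for illustration and should be checked against the real page.

```python
import json


def post_id_from_data_field(data_field):
    """Return the post_id from a JSON data-field string, or None (hypothetical helper)."""
    try:
        data = json.loads(data_field)
    except (TypeError, ValueError):
        return None  # not a string, or not valid JSON
    if not isinstance(data, dict):
        return None
    content = data.get("content")  # assumed key, see the note above
    if not isinstance(content, dict):
        return None
    post_id = content.get("post_id")
    return None if post_id is None else str(post_id)


if __name__ == "__main__":
    # Invented sample; the real attribute may be laid out differently.
    sample = '{"author": {"user_id": 1234567890}, "content": {"post_id": 98765432101}}'
    print(post_id_from_data_field(sample))
```

In this sample a plain digit search would pick up the user_id first, while the lookup by key returns the post_id.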

Figure: the tag that holds the page count
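
get_total_page reads the page count from the last span tag whose class is red. The same lookup can be tried offline on a small fragment. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages; the class name RedSpanCollector, the function total_pages, and the HTML fragment are invented for illustration and are not copied from Tieba.

```python
from html.parser import HTMLParser


class RedSpanCollector(HTMLParser):
    """Collect the text of every <span class="red"> (nested spans are not handled)."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._inside = False  # True while inside a matching span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "red" in classes:
            self._inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data


def total_pages(html):
    """Return the integer in the last <span class="red"> of the fragment."""
    parser = RedSpanCollector()
    parser.feed(html)
    parser.close()
    if not parser.texts:
        raise ValueError("no span with class 'red' found")
    return int(parser.texts[-1])


if __name__ == "__main__":
    fragment = '<li><span class="red">120</span> replies, <span class="red">5</span> pages</li>'
    print(total_pages(fragment))
```

Raising ValueError when no matching tag exists makes a layout change on the page fail loudly instead of surfacing later as an IndexError.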
