This post shares how to scrape images and basic thread information from Baidu Tieba with Python. The techniques are not especially advanced, but beyond serving as my own notes I'm happy to share them with fellow learners, and I welcome corrections or better approaches from more experienced readers; staying humble and eager to learn is my motto!
Goal:
Enter the name of a tieba (forum), a start page, and an end page, and the program automatically scrapes all images and basic thread information within that page range.
The complete code is at the end!
First, let's analyze the URL pattern across different pages of a tieba.
Taking the 周杰伦 (Jay Chou) tieba as an example:
Page 1: https://tieba.baidu.com/f?kw=周杰伦&ie=utf-8&pn=0
Page 2: https://tieba.baidu.com/f?kw=周杰伦&ie=utf-8&pn=50
Page 3: https://tieba.baidu.com/f?kw=周杰伦&ie=utf-8&pn=100
(Note: you have to page through the forum once before the address bar shows URLs in this form.)
We can see that the only part of the URL that changes from page to page is the trailing pn parameter, so changing pn gives us any page we want; likewise, kw is the tieba name.
You may wonder why I showed screenshots above instead of pasting the URLs directly. It's because a copied URL actually looks like this:
https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&ie=utf-8&pn=0
Notice that kw is not 周杰伦 (the tieba name) but a percent-encoded string, so we need to URL-encode the tieba name ourselves.
So the URL rule is:
pn = (page number - 1) * 50
kw = the URL-encoded tieba name
The &ie=utf-8 part can be dropped without changing the result, so from here on we assume the URL has &ie=utf-8 removed.
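As a quick check, here is a minimal sketch of how urllib.parse.urlencode produces the encoded kw value and how pn is derived; the encoded string matches the one in the copied URL above:

from urllib import parse

ret = parse.urlencode({'kw': '周杰伦'})
print(ret)                    # kw=%E5%91%A8%E6%9D%B0%E4%BC%A6
page = 3
pn = (page - 1) * 50          # 100
print('https://tieba.baidu.com/f?' + ret + '&pn=' + str(pn))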
def tiebaSpider(url):
    name = input('请输入贴吧名字:')
    beginPage = int(input('请输入起始页:'))
    endPage = int(input('请输入结束页:'))
    kw = {'kw': name}
    ret = parse.urlencode(kw)          # URL-encode the tieba name
    print(ret)
    url = url + ret + '&pn='
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50           # pn value for this page
        fullurl = url + str(pn)
        print(fullurl)
        html = loadPage(fullurl)       # fetch the page HTML
        filename = name + '吧第%s页.html' % page
        writePage(html, filename)      # save the raw HTML locally
        tiebaInfo(html)                # extract thread info and images
The url passed in here is the base tieba URL: https://tieba.baidu.com/f?
ret is the URL-encoded tieba name.
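For example, entering 周杰伦 with start page 1 and end page 1 builds the fullurl https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&pn=0 and saves that page as 周杰伦吧第1页.html.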
def loadPage(url):
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
    # req = request.Request(url, headers=headers)   # build a request with headers
    response = request.urlopen(url)   # send the request and get the response object
    html = response.read()            # read the response body (bytes)
    return html

def writePage(html, filename):
    html = html.decode('utf-8')       # decode the bytes into a str
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    print('正在下载%s·····' % filename)
loadPage fetches the page's HTML, and writePage writes that HTML to a local .html file.
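If the bare urlopen call is ever blocked or returns an unexpected page, one option, hinted at by the commented-out lines, is to attach a browser User-Agent header. Here is a minimal sketch of such a loadPage variant (the timeout is my own addition):

def loadPage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/75.0.3770.142 Safari/537.36'}
    req = request.Request(url, headers=headers)   # attach the header to the request
    response = request.urlopen(req, timeout=10)   # fail fast if the server hangs
    return response.read()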
def tiebaInfo(html):
    # parse the HTML document
    content = etree.HTML(html)
    print(content)
    # match the corresponding data fields with XPath expressions
    title_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/text()")
    link_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
    replies_list = content.xpath("//div[@class='t_con cleafix']/div/span/text()")
    writer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[1]/div[2]/span[1]/@title")
    introduce_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div/div/text()")
    lastResponer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[1]/@title")
    lastResponTime_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[2]/text()")
    for title, link, replies, writer, introduce, lastResponer, lastResponTime in zip(
            title_list, link_list, replies_list, writer_list,
            introduce_list, lastResponer_list, lastResponTime_list):
        fulllink = 'https://tieba.baidu.com' + link          # build the full thread URL
        info = ' 标题:%s\n 链接:%s\n 回复数:%s\n 楼主名:%s\n 最后回复人:%s\n 最后回复时间:%s\n 简介:%s\n ' % (
            title, fulllink, replies, writer, lastResponer, lastResponTime, introduce)
        print(info)
        loadImage(fulllink)              # download the images in this thread
        filename = 'tiebaInfo'
        writeInfo(info, filename)        # append the info to the tiebaInfo file
We pass in the HTML fetched above, parse it, and use XPath rules to extract the fields: title, thread link, reply count, original poster, last replier, last reply time, and a short summary.
loadImage() then fetches each thread link and extracts the image URLs, writeInfo() appends the text information to a file, and writeImage() saves the images to disk.
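To make the XPath step concrete, here is a tiny self-contained sketch using a made-up HTML snippet (not Baidu's real markup) that shows how etree.HTML, xpath, and zip pair up the extracted lists:

from lxml import etree

snippet = '''
<div class="t_con cleafix"><div><div><div><a href="/p/111">First thread</a></div></div></div></div>
<div class="t_con cleafix"><div><div><div><a href="/p/222">Second thread</a></div></div></div></div>
'''
doc = etree.HTML(snippet)
titles = doc.xpath("//div[@class='t_con cleafix']/div/div/div/a/text()")
links = doc.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
for title, link in zip(titles, links):
    print(title, 'https://tieba.baidu.com' + link)
# First thread https://tieba.baidu.com/p/111
# Second thread https://tieba.baidu.com/p/222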
def writeInfo(info, filename):
    with open(filename, 'a', encoding='utf-8') as f:   # append mode: every thread goes into the same file
        f.write(info)

def loadImage(url):
    '''Extract the image URLs from a thread page'''
    html = loadPage(url)                # fetch the thread page
    content = etree.HTML(html)          # parse the HTML document
    imgUrl_list = content.xpath("//img[@class='BDE_Image']/@src")
    for imgUrl in imgUrl_list:
        print(imgUrl)
        writeImage(imgUrl)

def writeImage(url):
    '''Save an image to disk'''
    img = loadPage(url)                 # download the image bytes
    global i
    i += 1
    filename = str(i) + '.jpg'          # number the images with the global counter
    with open('E:\\Pycharm\\workSpace\\day2\\image\\%s' % filename, 'wb') as f:
        f.write(img)
    print('正在下载%s图片' % filename)
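Note that the open() call in writeImage assumes the E:\Pycharm\workSpace\day2\image folder already exists. If it might not, one way to guard against that is a variant like the following sketch (os.makedirs with exist_ok=True creates the folder only when it is missing):

import os

IMG_DIR = 'E:\\Pycharm\\workSpace\\day2\\image'

def writeImage(url):
    '''Save an image to disk, creating the folder first if needed'''
    img = loadPage(url)                     # download the image bytes
    os.makedirs(IMG_DIR, exist_ok=True)     # create the folder if it does not exist
    global i
    i += 1
    filename = str(i) + '.jpg'
    with open(os.path.join(IMG_DIR, filename), 'wb') as f:
        f.write(img)
    print('正在下载%s图片' % filename)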
The complete code:
from urllib import request, parse
from lxml import etree

i = 0    # global counter used to number the downloaded images

def loadPage(url):
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
    # req = request.Request(url, headers=headers)   # build a request with headers
    response = request.urlopen(url)   # send the request and get the response object
    html = response.read()            # read the response body (bytes)
    return html

def writePage(html, filename):
    html = html.decode('utf-8')       # decode the bytes into a str
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    print('正在下载%s·····' % filename)

def tiebaSpider(url):
    name = input('请输入贴吧名字:')
    beginPage = int(input('请输入起始页:'))
    endPage = int(input('请输入结束页:'))
    kw = {'kw': name}
    ret = parse.urlencode(kw)          # URL-encode the tieba name
    print(ret)
    url = url + ret + '&pn='
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50           # pn value for this page
        fullurl = url + str(pn)
        print(fullurl)
        html = loadPage(fullurl)       # fetch the page HTML
        filename = name + '吧第%s页.html' % page
        writePage(html, filename)      # save the raw HTML locally
        tiebaInfo(html)                # extract thread info and images

def writeInfo(info, filename):
    with open(filename, 'a', encoding='utf-8') as f:   # append mode: every thread goes into the same file
        f.write(info)

def loadImage(url):
    '''Extract the image URLs from a thread page'''
    html = loadPage(url)                # fetch the thread page
    content = etree.HTML(html)          # parse the HTML document
    imgUrl_list = content.xpath("//img[@class='BDE_Image']/@src")
    for imgUrl in imgUrl_list:
        print(imgUrl)
        writeImage(imgUrl)

def writeImage(url):
    '''Save an image to disk'''
    img = loadPage(url)                 # download the image bytes
    global i
    i += 1
    filename = str(i) + '.jpg'          # number the images with the global counter
    with open('E:\\Pycharm\\workSpace\\day2\\image\\%s' % filename, 'wb') as f:
        f.write(img)
    print('正在下载%s图片' % filename)

def tiebaInfo(html):
    # parse the HTML document
    content = etree.HTML(html)
    print(content)
    # match the corresponding data fields with XPath expressions
    title_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/text()")
    link_list = content.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
    replies_list = content.xpath("//div[@class='t_con cleafix']/div/span/text()")
    writer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[1]/div[2]/span[1]/@title")
    introduce_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div/div/text()")
    lastResponer_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[1]/@title")
    lastResponTime_list = content.xpath("//div[@class='t_con cleafix']/div[2]/div[2]/div[2]/span[2]/text()")
    for title, link, replies, writer, introduce, lastResponer, lastResponTime in zip(
            title_list, link_list, replies_list, writer_list,
            introduce_list, lastResponer_list, lastResponTime_list):
        fulllink = 'https://tieba.baidu.com' + link          # build the full thread URL
        info = ' 标题:%s\n 链接:%s\n 回复数:%s\n 楼主名:%s\n 最后回复人:%s\n 最后回复时间:%s\n 简介:%s\n ' % (
            title, fulllink, replies, writer, lastResponer, lastResponTime, introduce)
        print(info)
        loadImage(fulllink)              # download the images in this thread
        filename = 'tiebaInfo'
        writeInfo(info, filename)        # append the info to the tiebaInfo file

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/f?'
    tiebaSpider(url)
Now let's test it:
We'll scrape pages 2 to 5 of the 周杰伦 tieba.
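Based on the pn rule, the four fullurl values printed for this run would be:

https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&pn=50
https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&pn=100
https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&pn=150
https://tieba.baidu.com/f?kw=%E5%91%A8%E6%9D%B0%E4%BC%A6&pn=200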
Run results: the downloaded images, the tiebaInfo file, and the saved HTML pages.
At this point we can scrape the content and images of any page range of a specified tieba.
If anything can be optimized or improved, please point it out; I'm happy to learn.
This is my first write-up, so please bear with any rough edges!