Analysis of Problems Encountered While Scraping a Baidu Tieba Page

Latest recommended article published 2020-09-20 01:16:18

hellenlee22

Article link: https://blog.csdn.net/hellenlee22/article/details/89474801

I picked a Baidu Tieba page to scrape:

#coding=utf-8
import requests
import os
import re

url='http://tieba.baidu.com/p/2460150866'

os.chdir(r'C:\Users\Administrator\Desktop\baidudata111')  # switch the working directory to a folder on the desktop

response=requests.get(url)
html=response.content
html=html.decode('utf-8')

# Variant 1: partially wrong, some of the matches are incorrect
reg=r'src="(.*?.jpg)" pic_ext="jpeg"'
# Variant 2: every match is wrong
#reg=r'src="(.+?)\.jpg" pic_ext'
#baidudata1
# Variant 3: correct
#reg = r'src="(.+?\.jpg)" pic_ext'
# Variant 4: correct
#imglist=re.findall('<img pic_type="0" class="BDE_Image" src="(http.*?jpg)".*?>',html.replace('\\',''),re.S)
#imglist=re.findall('<img pic_type="0" class="BDE_Image" src="(http.*?jpg)".*?>',html,re.S)
pattern1=re.compile(reg)
imglist=re.findall(pattern1,html)
#print(imglist)
print(len(imglist))

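# Download every matched image and save it as 0.jpg, 1.jpg, ...
for x, imgurl in enumerate(imglist):
    print(imgurl)

    bresponse=requests.get(imgurl)
    bhtml=bresponse.content

    # "with" closes the file even if the write fails
    with open('./%s.jpg' % x,'wb') as f:
        f.write(bhtml)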

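Analysis of the first variant's error
Some of the captured links are wrong. A match only ends once it reaches pic_ext="jpeg", and because . also matches quotes and spaces, the lazy .*? keeps expanding past the end of a src attribute until that text appears. For example, the first match merges an image tagged pic_ext="bmp" with the jpeg image after it and returns the two as a single "link".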
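To make the merge easier to see, here is a minimal sketch that runs variant 1 against a tiny piece of markup. The HTML is made up for illustration (the example.com URLs are hypothetical), and it assumes that the image tagged pic_ext="bmp" still has a src ending in .jpg, which the post does not show.

```python
import re

# Hypothetical markup, not copied from the real page. Assumption: the image
# tagged pic_ext="bmp" still has a src that ends in .jpg.
html = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" pic_ext="bmp">'
    '<img class="BDE_Image" src="http://example.com/b.jpg" pic_ext="jpeg">'
)

# Variant 1 from the script.
pattern1 = r'src="(.*?.jpg)" pic_ext="jpeg"'
matches = re.findall(pattern1, html)

# The engine starts at the first src=" and cannot finish at a.jpg, because
# what follows there is pic_ext="bmp". The lazy .*? therefore grows across
# the tag boundary until b.jpg" pic_ext="jpeg" satisfies the rest of the
# pattern, so two tags collapse into one captured string.
print(len(matches))
for m in matches:
    print(m)
```

Under these assumptions the two tags come back as one match whose text begins with the first URL and ends with the second, which is not a usable link.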
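Second variant:
Only the part of the link before .jpg is captured. Those truncated links cannot be opened, so none of the saved images can be opened either.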
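The difference comes down to where the parentheses sit. A small sketch on a made-up tag (the example.com URL is hypothetical) compares variant 2 with variant 3:

```python
import re

# Hypothetical tag, for illustration only.
html = '<img class="BDE_Image" src="http://example.com/b.jpg" pic_ext="jpeg">'

# Variant 2: \.jpg sits outside the parentheses, so it is matched
# but not captured, and the extension is lost.
truncated = re.findall(r'src="(.+?)\.jpg" pic_ext', html)

# Variant 3: \.jpg is inside the group, so the extension is kept.
complete = re.findall(r'src="(.+?\.jpg)" pic_ext', html)

print(truncated)
print(complete)
```

When a pattern contains one group, re.findall returns only what that group captured, which is why the position of \.jpg decides whether the result is a complete URL.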

The third and fourth variants match correctly.
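As a rough check, the sketch below runs variants 3 and 4 on made-up markup (hypothetical example.com URLs, with the same assumption that a bmp-tagged image has a .jpg src). It also tries a stricter pattern that is my own suggestion rather than something from the script: [^"] cannot match a quote, so the capture cannot leave the src attribute.

```python
import re

# Hypothetical markup carrying the attributes that variants 3 and 4 expect.
html = (
    '<img pic_type="0" class="BDE_Image" src="http://example.com/a.jpg" pic_ext="bmp">'
    '<img pic_type="0" class="BDE_Image" src="http://example.com/b.jpg" pic_ext="jpeg">'
)

# Variant 3: stops at the first .jpg" pic_ext, whatever the pic_ext value is.
variant3 = re.findall(r'src="(.+?\.jpg)" pic_ext', html)

# Variant 4: anchors on the whole <img ...> tag instead of pic_ext.
variant4 = re.findall(
    '<img pic_type="0" class="BDE_Image" src="(http.*?jpg)".*?>', html, re.S
)

# Suggested alternative (not from the post): forbid quotes inside the capture.
strict = re.findall(r'src="([^"]+\.jpg)"', html)

print(variant3)
print(variant4)
print(strict)
```

The stricter pattern would also pick up any other src ending in .jpg on the page, so on real Tieba markup it may need the class="BDE_Image" prefix from variant 4 to stay limited to post images.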

As a beginner I keep running into an unusual number of problems, one after another. It really drives me crazy.
