[Python] A few small problems I ran into while learning Python web scraping

我在看图

This is an original article by the blogger and may not be reposted without the blogger's permission.

Article link: https://blog.csdn.net/twk121109281/article/details/90231979

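I have been learning Python recently and wrote a few simple scraper examples. The problems below cost me several hours.

Environment: Python 3

Problem 1: AttributeError: 'module' object has no attribute 'urlopen'

Source code

Error

Cause

It turned out that I was running Python 3.4.3. After some searching I found the cause: in Python 3.x, urlopen is no longer available directly on urllib; it lives in urllib.request.

Result

Changing the code to use urllib.request fixes it.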
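The original snippet is not reproduced here, so the following is only a minimal sketch of the kind of change described above. The helper name `fetch` is hypothetical, and the `data:` URL is used purely so the example runs without a network connection.

```python
import urllib.request
from urllib.request import urlopen

# Python 2 style, which raises AttributeError on Python 3:
#     urllib.urlopen(url)
# Python 3 style: urlopen is defined in the urllib.request submodule.


def fetch(url):
    """Return the raw bytes of the resource at url (hypothetical helper)."""
    with urlopen(url) as page:
        return page.read()


# Offline demo: a data: URL carries its own payload, so no network is needed.
payload = fetch('data:text/plain,hello')
print(type(payload), payload)
```

Note that `read()` returns bytes, not str, which is what leads to the next two problems.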

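Problem 2: TypeError: must be str, not bytes

Source code

Error

Cause

The file was opened in the wrong mode. The content returned by read() is bytes, but a file opened with 'w' is in text mode and only accepts str.

Result

After changing open(shoplistfile,'w') and open(shoplistfile,'r') to open(shoplistfile,'wb') and open(shoplistfile,'rb') respectively, it worked.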
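The difference between the two modes can be reproduced without any network access. This is a small sketch, not the original code: the `data` literal stands in for what `urlopen(url).read()` returns, and the temporary file path is an assumption made for the demo.

```python
import os
import tempfile

# Stand-in for urlopen(url).read(), which returns bytes.
data = b'<html>example</html>'

# Hypothetical output file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Text mode ('w') only accepts str, so writing bytes raises TypeError.
try:
    with open(path, 'w') as f:
        f.write(data)
    text_mode_failed = False
except TypeError:
    text_mode_failed = True

# Binary mode ('wb') accepts bytes and stores them unchanged.
with open(path, 'wb') as f:
    f.write(data)

# Reading back in binary mode ('rb') returns the same bytes.
with open(path, 'rb') as f:
    round_trip = f.read()

print(text_mode_failed, round_trip == data)
```

An alternative, if text is really what you want on disk, is to decode the bytes first and keep the file in text mode with an explicit encoding.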

Final source code

With a little wrapping:

# -*- coding: UTF-8 -*-
from urllib.request import urlopen

# Write raw bytes to a file
def write_file(data, filename='test.txt'):
    # 'wb' because the data returned by read() is bytes, not str
    with open(filename, 'wb') as page_file:  # the with block closes the file for us
        page_file.write(data)

# Fetch the page content as bytes
def get_html(url):
    with urlopen(url) as page:
        return page.read()

url1 = 'https://baike.baidu.com/item/url'
htmlfile = get_html(url1)

write_file(htmlfile)

Result screenshot

That completes my first small scraper written for Python 3.

Trying to download the images in a web page

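Problem 3: can't use a string pattern on a bytes-like object

Source code

Error

Cause

Searching online showed that this is a type mismatch between text and bytes. The regular expression is a str (Unicode) pattern, while calling read() on the file-like object returned by urlopen() gives a bytes object. A str pattern cannot be applied to bytes.

Result

Decode the bytes before matching:

imglist = reg_img.findall(htmlfile.decode('utf-8'))  # decode to str, then match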
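The mismatch and both ways around it can be shown on a small in-memory sample. This is a sketch under assumptions: the `html_bytes` literal and the example.com URLs are made up for illustration, while the pattern is the one used in this article.

```python
import re

# Made-up sample standing in for the bytes returned by urlopen(url).read().
html_bytes = (b'<img src="http://example.com/a.jpg">'
              b'<img src="http://example.com/b.jpg">')

reg_img = re.compile(r'src="([^\s]+?\.jpg)')  # str pattern

# A str pattern applied to bytes raises TypeError (Problem 3).
try:
    reg_img.findall(html_bytes)
    mixed_types_failed = False
except TypeError:
    mixed_types_failed = True

# Fix used in this article: decode the bytes to str first.
imglist = reg_img.findall(html_bytes.decode('utf-8'))

# Alternative: keep the data as bytes and use a bytes pattern instead.
reg_img_bytes = re.compile(rb'src="([^\s]+?\.jpg)')
imglist_bytes = reg_img_bytes.findall(html_bytes)

print(imglist)
print(imglist_bytes)
```

Decoding is usually the more convenient choice here because the matched URLs come back as str and can be passed on directly. It does assume the page is really UTF-8 encoded.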

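Problem 4: urllib.urlretrieve is not available in Python 3

Cause

urllib differs between Python 2 and Python 3: in Python 3 the function has to be accessed through the .request submodule.

Result

Change the call as follows:

urllib.request.urlretrieve(img, '%s.png' %x)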
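The loop around this call can be sketched as a small function. The helper name `download_all` is hypothetical, and the demo uses a local `file://` URL with fake content so that it runs offline; with real image URLs the call is the same.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download_all(urls, dest_dir):
    """Save each URL as <index>.jpg under dest_dir (hypothetical helper)."""
    # urlretrieve does not create missing directories, so create it first.
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for x, url in enumerate(urls):
        target = os.path.join(dest_dir, '%s.jpg' % x)
        urllib.request.urlretrieve(url, target)  # Python 3 location of urlretrieve
        saved.append(target)
    return saved


# Offline demo: a local file served through a file:// URL.
work = tempfile.mkdtemp()
source = Path(work) / 'source.jpg'
source.write_bytes(b'fake image bytes')

saved = download_all([source.as_uri()], os.path.join(work, 'src'))
print(saved)
```

Creating the target directory up front avoids a FileNotFoundError when the output folder does not exist yet.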

Result screenshot

Source code

# -*- coding: UTF-8 -*-
import os
import re
import urllib.request
from urllib.request import urlopen

# Write raw bytes to a file
def write_file(data, filename='test.txt'):
    # 'wb' because the data returned by read() is bytes, not str
    with open(filename, 'wb') as page_file:  # the with block closes the file for us
        page_file.write(data)

# Fetch the page content as bytes
def get_html(url):
    with urlopen(url) as page:
        return page.read()

url1 = 'http://tieba.baidu.com/p/6127945649'
reg = r'src="([^\s]+?\.jpg)'  # capture the src value of each .jpg image
reg_img = re.compile(reg)  # compile once so the pattern object can be reused

htmlfile = get_html(url1)
#write_file(htmlfile)

imglist = reg_img.findall(htmlfile.decode('utf-8'))  # decode to str, then match
#print('imglist :', imglist)

os.makedirs('./src', exist_ok=True)  # urlretrieve does not create the target directory
for x, img in enumerate(imglist):
    print('img :', img)
    urllib.request.urlretrieve(img, './src/%s.jpg' % x)