Problems encountered while writing a web crawler in Python 3, and how to fix them

山鬼谣me

Article link: https://blog.csdn.net/u013066244/article/details/53120731

cannot use a string pattern on a bytes-like object

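This error came up for me in the following code:

re.findall(pattern, data)

If data is of type bytes when this line runs, the error is raised, because a str pattern can only be matched against a str.
We can check the type by changing the code above to

print(type(data))
re.findall(pattern, data)

Printed result:

<class 'bytes'>

So before calling re.findall() we have to convert data to the str type. The fix: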

re.findall(pattern, data.decode('utf-8'))

The decode and encode methods convert in these directions:

        decode                  encode
bytes ---------> str (unicode) ---------> bytes
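To make the error and the fix concrete, here is a minimal sketch. The sample HTML and the variable names are made up for illustration; only re.findall, decode and encode come from the text above.

```python
import re

data = b'<h2>hello</h2>'       # bytes, like the value returned by read()
pattern = r'<h2>(.*?)</h2>'    # an ordinary str pattern

# Mixing a str pattern with bytes data raises TypeError
try:
    re.findall(pattern, data)
    message = None
except TypeError as err:
    message = str(err)
print(message)

# decode: bytes -> str, after which the str pattern can be applied
text = data.decode('utf-8')
matches = re.findall(pattern, text)
print(matches)

# encode: str -> bytes, which gives back the original data
round_trip = text.encode('utf-8')
print(round_trip == data)
```

The first print shows the error message from the heading of this section, and the last one confirms that encode reverses decode.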

References:

http://blog.csdn.net/moodytong/article/details/8136258
http://blog.csdn.net/riyao/article/details/3629910

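The second reference says that in `python3` the argument type of findall was changed to `chart-like`, that is, str.
I want to add a note here: I checked the official documentation, and the argument was already a string in python2, so the parameter type itself did not change. What differs in python3 is that str and bytes are separate types: read() returns bytes, and the pattern and the searched data must both be str or both be bytes.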
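As a small illustration of that last point, the re module also accepts a bytes pattern, as long as the data is bytes too. This is only a sketch with made-up sample data; decoding first, as shown earlier, is usually more convenient because the results are then ordinary strings.

```python
import re

data = b'name="_xsrf" value="abc123"'   # hypothetical sample, as bytes

# bytes pattern against bytes data: the results are bytes
found_bytes = re.findall(rb'value="(.*?)"', data)
print(found_bytes)

# str pattern against str data: the results are str
found_str = re.findall(r'value="(.*?)"', data.decode('utf-8'))
print(found_str)
```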

'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

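This happens because, first, your header contains

'Accept-Encoding':' gzip, deflate'

and second, decode('utf-8') is called directly on the result of read(), like this:

data = op.read().decode('utf-8')
# the data from op.read() is still compressed, so calling decode() on it raises the exception above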

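The Accept-Encoding header tells the server that the client can accept compressed data. The server then compresses large responses before sending them back,
and a browser decompresses them locally after receiving them. The error occurs because your program never decompresses the response.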
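The byte 0x8b in the message is a strong hint: a gzip stream always starts with the two bytes 0x1f 0x8b. The sketch below reproduces the error without any network access by compressing a sample string locally (the sample text is made up).

```python
import gzip

# Simulate a gzip-compressed response body
raw = gzip.compress('hello crawler'.encode('utf-8'))
print(raw[:2])   # the gzip magic number

# Decoding the compressed bytes directly fails at the second byte (0x8b)
try:
    raw.decode('utf-8')
    message = None
except UnicodeDecodeError as err:
    message = str(err)
print(message)

# Decompressing first makes the decode succeed
text = gzip.decompress(raw).decode('utf-8')
print(text)
```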

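Here is the key point: the answer usually given online is to delete Accept-Encoding.

In my view, since we already know the cause is that the data was not decompressed, we can simply decompress it; there is no need to delete the header.
Besides, for a site like Zhihu, the data you crawl cannot be read at all without decompressing it.
So the correct approach is to decompress the data first. Here is my decompression code: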

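# decompress
import gzip


def ungzip(data):
    try:
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression finished')
    except OSError:
        # gzip.decompress raises OSError when the data is not in gzip format
        print('Not compressed, nothing to decompress')
    return data.decode('utf-8')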

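My reading code then has to be written like this:

data = op.read()
# data = op.read().decode('utf-8')
# Never write it like this: the data from op.read() is still compressed, so calling decode() raises the exception above
data = ungzip(data)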
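Since the header above also advertises deflate, one possible refinement is to look at the Content-Encoding header of the response instead of trying gzip and catching the failure. The following is only a sketch under that assumption; decompress_body is a hypothetical helper name and is not part of the original code.

```python
import gzip
import zlib


def decompress_body(raw, content_encoding):
    """Decompress raw bytes according to a Content-Encoding header value."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            # Standard case: deflate data wrapped in a zlib header
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the bytes unchanged
    return raw


payload = 'hello'.encode('utf-8')
print(decompress_body(gzip.compress(payload), 'gzip'))
print(decompress_body(zlib.compress(payload), 'deflate'))
print(decompress_body(payload, None))
```

With the op object from above, and assuming it is the response returned by opener.open(), the call could look like `decompress_body(op.read(), op.headers.get('Content-Encoding')).decode('utf-8')`.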

a bytes-like object is required, not 'list'

First, here is the full code:

# -*- coding:utf-8 -*-
import re
import urllib
import urllib.request
import gzip
import http.cookiejar
import io
import sys
import string
# gb18030
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
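# decompress


def ungzip(data):
    try:
        print('Decompressing...')
        data = gzip.decompress(data)
        print('Decompression finished')
    except OSError:
        # gzip.decompress raises OSError when the data is not in gzip format
        print('Not compressed, nothing to decompress')
    return data.decode('utf-8')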

# get the xsrf token


def getXSRF(data):
    cer = re.compile('name="_xsrf" value="(.*)"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

# build an opener that carries the request headers


def getOpener(head):
    # deal with the Cookies
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

# save


def saveFile(data):
    data = data.encode('utf-8')  # str -> bytes, because the file is opened in binary mode
    save_path = 'E:\\temp.out'
    f_obj = open(save_path, 'wb')  # 'wb' is the open mode: binary, writable
    f_obj.write(data)
    f_obj.close()


# request header values
header = {
    'Connection': 'Keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Accept-Encoding': 'gzip,deflate',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Host': 'www.qiushibaike.com'
}


# page = 1
url = 'http://www.qiushibaike.com/hot/'
# get the opener with the request headers
# opener = getOpener(header)
# op = opener.open(url)
# data = op.read()
# data = ungzip(data)  # decompress
# _xsrf = getXSRF(data.decode())


try:
    opener = getOpener(header)
    op = opener.open(url)
    data = op.read()

    data = ungzip(data)
    # op = urllib.request.urlopen(url)
    strRex = ('<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div class="articleGender.*?">(.*?)</div>' +
              '.*?<div class="content">(.*?)</div>(.*?)<div class="stats.*?class="number">(.*?)</i>')
    pattern = re.compile(strRex, re.S)
    print(type(data))
    items = re.findall(pattern, data)

    for item in items:
        print(item[0] + item[1] + item[2] + item[3])
    #     print(item)
    print(items)
    # saveFile(''.join(str(e) for e in items))  # the correct call
    saveFile(items)  # the failing call discussed below
except Exception as e:
    print(e)

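The error above is caused by executing this line:

saveFile(items)

In the saveFile function, `f_obj = open(save_path, 'wb')` uses the mode "wb",
which opens the file for writing in binary mode. What I passed in, items, is a list, so the write fails.

After looking through some material, I found that the b flag can be dropped, so the line becomes:

f_obj = open(save_path, 'w')

After running the code again, it now demanded a str. In other words, when we do not open the file in binary (b) mode,
write() expects a str by default.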
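The difference between the two modes can be checked with a short experiment. This sketch writes to a temporary directory rather than the E:\temp.out path used above, and the sample items are made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.out')
items = [('a', 'b'), ('c', 'd')]   # a list of tuples, like the findall result

# Writing a list fails in both modes, with a different message in each
errors = {}
for mode, kwargs in (('wb', {}), ('w', {'encoding': 'utf-8'})):
    with open(path, mode, **kwargs) as f:
        try:
            f.write(items)
        except TypeError as err:
            errors[mode] = str(err)
print(errors['wb'])
print(errors['w'])

# Binary mode accepts bytes
with open(path, 'wb') as f:
    f.write('text'.encode('utf-8'))

# Text mode accepts str
with open(path, 'w', encoding='utf-8') as f:
    f.write('text')
```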
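The solution: first convert the data to be written into a str, then convert that str into bytes inside saveFile.
We still open the file in binary mode, and in the saveFile function we add

data = data.encode('utf-8')

The encode method converts a str into bytes.
A list or tuple has to be converted to a str with the `''.join()` method.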
At first I wrote

saveFile(''.join(items))

and it then reported:

sequence item 0: expected str instance, tuple found

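It is saying that when join took the first element of the sequence it expected a str, but found a tuple;
this means we first have to iterate over the list and convert each tuple to a str.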

''.join(str(e) for e in items)

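Read this line from right to left: it first iterates over items and converts each element to a str with str().
Each element is a tuple, so the result is a sequence of strings, which `''.join()` then concatenates into a single str.
Also note that a tuple converted with str() prints the same as the tuple itself, but type(str(e)) shows the difference.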

s = ('a', 'b', 'c')
print(str(s))
print(s)
print(type(str(s)))
print(type(s))

The printed result is:

('a', 'b', 'c')
('a', 'b', 'c')
<class 'str'>
<class 'tuple'>

So the fix is to change saveFile(items) to:

saveFile(''.join(str(e) for e in items))
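The sketch below shows what that expression produces for a small made-up list of tuples, and also a possible alternative that joins the fields with explicit separators so the saved file is easier to read. The alternative is only a suggestion and assumes every field is already a str, which holds for the groups returned by re.findall on str data.

```python
items = [('Alice', 'hello'), ('Bob', 'world')]   # hypothetical findall result

# Joining the tuples directly reproduces the error from the text
try:
    ''.join(items)
    message = None
except TypeError as err:
    message = str(err)
print(message)

# The fix from the text: convert each tuple with str() before joining
flat = ''.join(str(e) for e in items)
print(flat)

# Alternative: one record per line, fields separated by tabs
readable = '\n'.join('\t'.join(fields) for fields in items)
print(readable)
```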

Finally, here are the conversions between list, tuple and str

list() converts a str or a tuple into a list
tuple() converts a str or a list into a tuple

>>> s = "xxxxx"
>>> list(s)
['x', 'x', 'x', 'x', 'x']
>>> tuple(s)
('x', 'x', 'x', 'x', 'x')
>>> tuple(list(s))
('x', 'x', 'x', 'x', 'x')
>>> list(tuple(s))
['x', 'x', 'x', 'x', 'x']

Converting a list or tuple of strings back into a single string requires the join function

>>> "".join(tuple(s))
'xxxxx'
>>> "".join(list(s))
'xxxxx'
>>> str(tuple(s))
"('x', 'x', 'x', 'x', 'x')"

Note: with the sublimeREPL plugin for Sublime Text 3, the outer double quotes are not displayed. The same applies to the outputs above.

References:

http://blog.csdn.net/sruru/article/details/7803208
http://stackoverflow.com/questions/5618878/how-to-convert-list-to-string
http://piziyin.blog.51cto.com/2391349/568426
http://stackoverflow.com/questions/33054527/python-3-5-typeerror-a-bytes-like-object-is-required-not-str