Notes on Problems Encountered While Learning Python Web Scraping

Latest recommended article published on 2022-07-26 11:33:02

CRISTIANO Xusanduo

Category: python  Tags: python error

Article link: https://blog.csdn.net/shi_xin/article/details/88798444

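These are notes on problems I ran into while learning web scraping from videos and websites. Most of them arise because the tutorials are old, so the code, syntax, or target sites have since changed and need adjusting. The notes are brief, and of course the fixes may stop working again in six months or a year.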

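1. Youdao Translate
Fix: Youdao Translate returns 'errorCode': 50
The request URL captured from the Youdao Translate result page is: http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
Scraping with this URL does not work; it returns an error. The URL has to be modified before a normal result comes back.
The change is as follows (drop the _o from translate_o):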

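import urllib.parse
import urllib.request

tobe_translate = input('Please input your word:')
# tobe_translate = '你好'  # hard-code the word during development to avoid typing it on every run
data = {
    'i': tobe_translate,
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '15534990752679',
    'sign': '8068ceaab29dca41031a3695a052208a',
    'ts': '1553499075267',
    'bv': '22c4e55facde8e7a20b16e256e9fdfa1',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_REALTlME',
    'typoResult': 'false'}

# Convert data into the type urllib.request needs (URL-encoded bytes)
data = urllib.parse.urlencode(data).encode('utf-8')

# Send the request (note: translate, not translate_o)
youdaofanyi = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
response = urllib.request.urlopen(youdaofanyi, data)
# Print the raw JSON text returned by the server
print(response.read().decode('utf-8'))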
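
The fix above can be sketched as three small helpers that run without touching the network. This is only a rough sketch: the names fix_youdao_url, build_request and is_error_50 are hypothetical and not part of my repository, and the JSON strings passed to is_error_50 are made-up samples modelled on the 'errorCode': 50 reply, not captured responses.

```python
import json
import urllib.parse
import urllib.request

# URL as captured from the result page (quoted earlier in this post).
CAPTURED_URL = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'


def fix_youdao_url(url):
    """Drop the "_o" from "translate_o" (hypothetical helper)."""
    return url.replace('/translate_o?', '/translate?', 1)


def build_request(url, form):
    """Build, but do not send, a POST request carrying the form fields."""
    # urllib.request expects the POST body as bytes, hence urlencode + encode.
    body = urllib.parse.urlencode(form).encode('utf-8')
    return urllib.request.Request(url, data=body)


def is_error_50(response_text):
    """Return True if a JSON reply carries errorCode 50 (hypothetical helper)."""
    return json.loads(response_text).get('errorCode') == 50


fixed_url = fix_youdao_url(CAPTURED_URL)
# Only two of the form fields are used here to keep the sketch short.
request = build_request(fixed_url, {'i': '你好', 'doctype': 'json'})
print(fixed_url)
print(request.get_method())  # a Request with a data payload is sent as POST
print(is_error_50('{"errorCode": 50}'))
```

Passing the request to urllib.request.urlopen would actually send it; whether the server still accepts the fixed salt/sign/ts values is something I cannot promise.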

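The complete source code is available at https://github.com/shixin398/Python3
2. Scraping ooxx girl pictures
ooxx applies an anti-scraping measure to the picture URLs. Fetching the page directly with Python yields: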

<img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">Ly93czEuc2luYWltZy5jbi9tdzYwMC82MjMwNmVlYWx5MWcxaGdpMmllOXhqMjB1MDEzeTdwby5qcGc=</span></p>

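To get the real URL, look at how the page itself handles this: open the ooxx picture page in Chrome, press F12, select the Network tab, search for the relevant function, and read its implementation (reference analysis: https://www.cnblogs.com/sjfeng1987/p/9221920.html).
The concrete implementation is in the GitHub repository (https://github.com/shixin398/Python3/blob/master/spider/ooxx/real_xxoo.py); if you like it, please star it while you are there.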

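import re
from base64 import urlsafe_b64decode


def find_pic_url(url):
    # url_open is a helper defined elsewhere in the script (not shown here); it returns bytes
    html = url_open(url).decode('utf-8')
    # print(html)

    # Extract the hash of every picture on the current page.
    # Non-greedy match; the trailing '=' is not required, because base64 text
    # has no padding when the encoded data length is a multiple of 3
    pic_hash = re.findall(r'<span class="img-hash">(.*?)</span>', html)

    # Building the picture URL: the ooxx engineers wrote a pile of code that is really just a decoy.
    # The URL is simply base64_decode(d), where d is pic_hash
    pic_list = []
    for each in pic_hash:
        temp = urlsafe_b64decode(each)
        # The decoded value is bytes (<class 'bytes'>), for example
        # b'//ws1.sinaimg.cn/mw600/6e6f0cd7gy1g1h9t4wzbmj218w0u0jxp.jpg'.
        # Decode it to str directly instead of slicing the b'...' repr from str(temp)
        temp_str = temp.decode('utf-8')
        # Prepend the scheme to form an http URL, otherwise urllib cannot open it
        temp_list = 'http:' + temp_str
        # print(temp_list)
        pic_list.append(temp_list)
    # print(pic_list)
    return pic_list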
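
To try the decoding step without fetching the site, here is a small offline sketch run against the img-hash span quoted earlier in this post. The function name extract_pic_urls is hypothetical (it is not in my repository), and it takes HTML text as input instead of a URL. It also shows why converting the bytes with str() leaves the b'...' wrapper that then has to be sliced off, whereas .decode() returns the plain text.

```python
import re
from base64 import urlsafe_b64decode

# The img-hash span from the sample page output quoted earlier in this post.
SAMPLE_HTML = (
    '<span class="img-hash">'
    'Ly93czEuc2luYWltZy5jbi9tdzYwMC82MjMwNmVlYWx5MWcxaGdpMmllOXhqMjB1MDEzeTdwby5qcGc='
    '</span>'
)


def extract_pic_urls(html):
    """Hypothetical offline variant: decode every img-hash found in html."""
    hashes = re.findall(r'<span class="img-hash">(.*?)</span>', html)
    urls = []
    for each in hashes:
        # The decoded bytes are a scheme-less path such as //host/dir/name.jpg
        path = urlsafe_b64decode(each).decode('utf-8')
        urls.append('http:' + path)
    return urls


raw = urlsafe_b64decode(re.findall(r'img-hash">(.*?)</span>', SAMPLE_HTML)[0])
print(type(raw))            # the decoder returns bytes
print(str(raw)[:4])         # str() keeps the b' prefix of the repr
print(raw.decode('utf-8'))  # decode() gives the plain text
print(extract_pic_urls(SAMPLE_HTML))
```

Whether the site still serves hashes in this form is not something this sketch can check; it only exercises the regex and the base64 step.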

Some further points to note are written up in the README.