Scraping Taobao product listings with the requests and re libraries (a modification of Prof. Song Tian's crawler from China University MOOC)

阿瞒oman

Published 2020-03-09 21:38:48

Tags: python, regular expressions, curl

Article link: https://blog.csdn.net/Omann/article/details/104759719

The Taobao product-page scraping example from China University MOOC no longer works, because Taobao has since updated its site.

For example, only the table header is printed:
[Image: https://img-blog.csdnimg.cn/20200309195430123.png]
This is the problem I ran into while following along with Prof. Song Tian's code.
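
A minimal offline sketch of why only the header shows up, assuming the page returned to the crawler no longer contains any `view_price` fields (for example, because it is a login page). The sample strings `search_page` and `login_page` and the helper `format_table` are hypothetical and exist only for illustration; they are not Taobao's actual responses.

```python
import re

PRICE_PATTERN = re.compile(r'"view_price":"[\d.]*"')

# Hypothetical page fragments, made up for illustration only.
search_page = '{"raw_title":"backpack","view_price":"89.00"}'
login_page = '<html><title>login</title></html>'


def format_table(html):
    """Return the output lines: a header plus one row per matched price."""
    lines = ["{:4}\t{:8}".format("No.", "Price")]
    matches = PRICE_PATTERN.findall(html)
    for count, match in enumerate(matches, start=1):
        # The match looks like "view_price":"89.00"; keep the value only.
        price = match.split(":")[1].strip('"')
        lines.append("{:4}\t{:8}".format(count, price))
    return lines


print("\n".join(format_table(search_page)))  # header and one row
print("\n".join(format_table(login_page)))   # header only
```

When the pattern matches nothing, the loop body never runs, so the header is the only line left to print.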

The original code is as follows:

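import requests
import re


def getHTMLText(url):
    # Fetch the page and return its text; return an empty string on any failure.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""


def parsePage(ilt, html):
    try:
        # Each match looks like "view_price":"12.50" or "raw_title":"some title".
        plt = re.findall(r'\"view_price\":\"[\d+\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Take the part after the colon and evaluate it to strip the quotes.
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])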
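
A hedged sketch of how the parsing step might be completed. The text does not show the loop body beyond the two `eval` lines, so the `append` call, the use of `json.loads` in place of `eval`, the snake_case name `parse_page`, and the sample `html` string are all assumptions, not the author's code.

```python
import json
import re


def parse_page(ilt, html):
    """Append [price, title] pairs found in html to the list ilt."""
    prices = re.findall(r'"view_price":"[\d.]*"', html)
    titles = re.findall(r'"raw_title":".*?"', html)
    # zip stops at the shorter list, so a missing title cannot raise IndexError.
    for price_field, title_field in zip(prices, titles):
        # Split on the first colon only, in case the title itself contains one.
        # json.loads decodes the quoted string without executing it (unlike eval).
        price = json.loads(price_field.split(":", 1)[1])
        title = json.loads(title_field.split(":", 1)[1])
        ilt.append([price, title])


# Hypothetical fragment in the shape the regular expressions expect.
html = '{"raw_title":"school bag: blue","view_price":"59.90"}'
items = []
parse_page(items, html)
print(items)
```

Splitting with `split(":", 1)` keeps titles that contain a colon intact, whereas `split(':')[1]` would cut them at the second colon.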
