Python Web Crawler Study Notes (6)

Latest recommended article published 2024-07-23 14:36:35

柠檬汽水橘子汁

Categories: Python, Crawler. Tags: python

Article link: https://blog.csdn.net/sinat_39665351/article/details/105231567

Course: "Python Web Crawling and Information Extraction" (Python网络爬虫与信息提取), MOOC, Beijing Institute of Technology

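Targeted crawling of Taobao product information
- Feasibility check for a targeted crawler: the robots protocol (robots.txt)
- Program structure:
  1. Submit the search request
  2. Extract the information from each result page
  3. Print the information to the screen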
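
The feasibility check can be sketched with the standard-library `urllib.robotparser` module. This is only an illustration: the helper name `is_allowed` and the rules in `SAMPLE_ROBOTS` are made up for the example and are not Taobao's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from a list of lines (no network access)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only, not the real file of any site.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /search",
]

print(is_allowed(SAMPLE_ROBOTS, "*", "https://example.com/search?q=ball"))
print(is_allowed(SAMPLE_ROBOTS, "*", "https://example.com/about"))
```

To check a live site, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the site's robots.txt before `can_fetch()` is called.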
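Steps (following the reference blog post):
1. Log in to Taobao.
2. In Chrome, press F12 and select the Network tab.
3. Enter any product keyword and click search.
4. Find the search?q=… request, then right-click => Copy => Copy as cURL (bash).
5. Open https://curl.trillworks.com/#python and paste the copied command into the "curl command" box on the left to get the equivalent Python requests code. Copy the headers dictionary from that output and paste it into the getHtmlText(url) function.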
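
What the converter in step 5 does for the headers can be approximated locally. The sketch below assumes the copied command uses `-H 'name: value'` flags, which is what "Copy as cURL (bash)" normally produces. The function name `headers_from_curl` and the sample command are hypothetical, and other curl options are simply ignored.

```python
import shlex


def headers_from_curl(curl_command):
    """Collect the -H/--header values of a curl command into a dict."""
    # Join lines that were split with a trailing backslash, then tokenize like a shell.
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    headers = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")  # split on the first colon only
            headers[name.strip().lower()] = content.strip()
    return headers


# Made-up command for illustration; a real one carries many more headers.
SAMPLE_CURL = (
    "curl 'https://s.taobao.com/search?q=demo' "
    "-H 'accept-language: zh-CN,zh;q=0.9' "
    "-H 'cookie: k1=v1; k2=v2' --compressed"
)

header = headers_from_curl(SAMPLE_CURL)
print(header)
```

The resulting dictionary has the same shape as the `header` variable passed to `requests.get(url, headers=header)` in the script below.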

# Video 43: crawling Taobao product information
import requests
import re


def getHtmlText(url):  # fetch the page
    try:
        header = {
            'authority': 's.taobao.com',
            'cache-control': 'max-age=0',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
            'sec-fetch-user': '?1',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'same-origin',
            'sec-fetch-mode': 'navigate',
            'referer': 'https://blog.csdn.net/Guanhai1617/article/details/104120581',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cookie': 'paste your own cookie here',
        }  
        r = requests.get(url, headers=header)
        r.raise_for_status()
        r.encoding = r.apparent_encoding

        return r.text
    except requests.RequestException:  # network error or non-2xx status
        print("Request failed")
        return ""


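def parsePage(ilist, html):  # parse the fetched page
    try:
        plt = re.findall(r'"view_price":"\d+\.\d*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)  # *? is a non-greedy (minimal) match
        print(len(plt))  # number of prices found on this page
        for p, t in zip(plt, tlt):  # zip stops at the shorter list, so no index error
            price = float(p.split('"')[3])  # float() instead of eval(): never eval scraped text
            title = t.split('"')[3]
            ilist.append([title, price])
    except (IndexError, ValueError):
        print("Parsing failed")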


def printGoodsList(ilist, num):  # print the product information
    print("=====================================================================================================")
    tplt = "{0:<3}\t{1:<30}\t{2:>6}"
    print(tplt.format("No.", "Product name", "Price"))  # table header
    count = 0  # counter for printed rows
    for g in ilist:
        count += 1  # row number of the product
        if count <= num:
            print(tplt.format(count, g[0], g[1]))
    print("=====================================================================================================")


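def main():  # main function
    goods = "篮球"  # search keyword ("basketball")
    depth = 1  # number of result pages to crawl
    start_url = "https://s.taobao.com/search?q=" + goods  # base search URL
    infoList = []  # collected results
    num = 20  # maximum number of rows to print
    for i in range(depth):  # handle each page separately
        try:
            # Each page is selected by the query parameter s, an offset in multiples of 44
            url = start_url + '&s=' + str(44 * i)
            html = getHtmlText(url)  # fetch the page with a GET request
            parsePage(infoList, html)  # parse the page
        except Exception:
            continue  # if one page fails, move on to the next

    printGoodsList(infoList, num)  # all results are stored in infoList


if __name__ == "__main__":
    main()  # run the program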
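
The page-offset and regex logic can be tried without a network connection or a cookie. The following is a minimal sketch, assuming the page embeds items as `"raw_title":"…"` and `"view_price":"…"` pairs as in the script above. `build_page_urls`, `parse_items`, and `SAMPLE_HTML` are illustrative names and data, not part of the course code.

```python
import re
from urllib.parse import urlencode


def build_page_urls(keyword, depth, page_size=44):
    """Build one search URL per page; s is the item offset of each page."""
    base = "https://s.taobao.com/search"
    return [base + "?" + urlencode({"q": keyword, "s": page_size * i})
            for i in range(depth)]


def parse_items(html):
    """Extract (title, price) pairs using capture groups instead of split()."""
    prices = re.findall(r'"view_price":"(\d+\.\d*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)  # non-greedy match
    return [(title, float(price)) for title, price in zip(titles, prices)]


# Made-up fragment in the assumed format, for illustration only.
SAMPLE_HTML = ('{"raw_title":"Basketball A","view_price":"59.00"},'
               '{"raw_title":"Basketball B","view_price":"128.50"}')

print(build_page_urls("ball", 2))
print(parse_items(SAMPLE_HTML))
```

Using `urlencode` also percent-encodes non-ASCII keywords such as 篮球, and the capture groups return the values directly, so the `split('"')[3]` step is not needed.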
