Scraping Tmall shop data with Python and Selenium browser automation

退伍老兵励志成为大牛的艰辛历程

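I ran this on Windows 10, using VS Code as the editor. When you scrape a Tmall shop, logging in normally triggers a verification step. My approach is to skip the Tmall login altogether by reusing cookies exported with a Chrome extension.
First, download chromedriver.exe and put it in the directory where Python is installed. I won't go into the details here; a quick search will turn up instructions. You also need the selenium module.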
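
Putting chromedriver.exe in the Python install directory works when that directory is on PATH, since that is one of the places the driver gets looked up. A minimal sketch for checking this before starting the browser, assuming the default executable name `chromedriver` (the helper name `find_driver` is made up for this illustration):

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # shutil.which performs the usual PATH lookup; on Windows it also
    # tries the extensions listed in PATHEXT, such as .exe.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver is not on PATH; Chrome may fail to start")
    else:
        print("chromedriver found at", path)
```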

# Import the modules used by the script
from selenium import webdriver as wb

from bs4 import BeautifulSoup
import csv
import pyautogui
import PIL
import time
import json
# pyautogui.PAUSE = 0.5

class Taobao:
    def __init__(self):
        self.url = 'https://chenguang.tmall.com/search.htm'
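        # The next three lines are required
        self.options = wb.ChromeOptions()
        # Exclude the 'enable-automation' switch (the original note calls this "developer mode")
        self.options.add_experimental_option('excludeSwitches', ['enable-automation'])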
        self.browser = wb.Chrome(options=self.options)
        self.data = []
        self.doc = {}

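Now for the key part: this code skips the login verification. Use a Chrome extension to export the cookies of a logged-in session, paste them into a txt file, and put that file in the workspace. The code parses the file with json.loads, so it has to contain a JSON list of cookie objects.

Here is the code: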

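    def get_data(self):
        self.browser.maximize_window()  # maximize the window so screen coordinates stay correct
        # The key part starts here: it skips the login.
        # A page on the target domain has to be open before cookies can be added.
        self.browser.get(self.url)
        self.browser.delete_all_cookies()
        # Read the exported cookies; the with block closes the file afterwards
        with open('cookie.txt') as f1:
            cookie = json.load(f1)
        for i in cookie:
            self.browser.add_cookie(i)  # inject each cookie

        print('Cookies injected')

        time.sleep(2)  # wait two seconds

        self.get_shop()
        self.get_fanye()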
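
Cookie exports from browser extensions often carry extra keys or values that `add_cookie` may reject. The following is a minimal sketch of cleaning the entries before injecting them. It assumes the export stores the expiry time as a float under `expirationDate` and may use `sameSite` values other than `Strict`, `Lax`, or `None`; both are assumptions about the export format, not something the original post states. The helper names and the sample cookie are made up.

```python
import json

# Keys that a WebDriver cookie can carry; anything else is dropped.
ALLOWED_KEYS = {"name", "value", "path", "domain", "secure",
                "httpOnly", "expiry", "sameSite"}
VALID_SAME_SITE = {"Strict", "Lax", "None"}


def normalize_cookie(raw):
    """Convert one exported cookie dict into a plain WebDriver-style dict."""
    cookie = {k: v for k, v in raw.items() if k in ALLOWED_KEYS}
    # Assumption: the extension exports the expiry as a float "expirationDate".
    if "expiry" not in cookie and "expirationDate" in raw:
        cookie["expiry"] = int(raw["expirationDate"])
    # Remove sameSite unless it is one of the three accepted values.
    if cookie.get("sameSite") not in VALID_SAME_SITE:
        cookie.pop("sameSite", None)
    return cookie


def load_cookies(text):
    """Parse the JSON text of cookie.txt and normalize every entry."""
    return [normalize_cookie(c) for c in json.loads(text)]


if __name__ == "__main__":
    # Hypothetical sample entry, not a real Tmall cookie.
    sample = ('[{"name": "t", "value": "abc", "domain": ".tmall.com", '
              '"expirationDate": 1700000000.5, "sameSite": "unspecified", '
              '"storeId": "0"}]')
    print(load_cookies(sample))
```

Inside `get_data`, the loop would then pass each entry returned by `load_cookies` to `self.browser.add_cookie`.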

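With the cookies injected, we can start scraping. I use XPath, but BeautifulSoup works as well. One thing to note: BeautifulSoup does not talk to the browser, so you have to fetch page_source first and parse that. Here is the code.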

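    def get_shop(self):  # open the shop search page and collect the data
        self.browser.get(self.url)
        time.sleep(5)
        # find_element_by_xpath was removed in Selenium 4.3;
        # find_element('xpath', ...) works in both 3.x and 4.x
        self.browser.find_element('xpath', '//a[@atpanel=",d,,,shopsearch,3,shopfilter,682114580"]').click()  # sort by sales
        time.sleep(5)
        self.get_pictures()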
    def get_pictures(self):  # extract the item data
        # find_elements('xpath', ...) replaces find_elements_by_xpath, which Selenium 4.3 removed
        lists = self.browser.find_elements('xpath', '//div[@class="item5line1"]/dl[contains(@class,"item") or @class="item last"]')
        time.sleep(4)
        print(lists)

        # Loop over the items, skipping the last 10 matches, and append each one to the list created earlier
        for li in lists[:-10]:
            # When searching inside an element, the XPath has to start with a dot
            picture = li.find_element('xpath', './/dt[@class="photo"]/a[@class="J_TGoldData"]/img').get_attribute('src')
            name = li.find_element('xpath', './/dd[@class="detail"]/a[@class="item-name J_TGoldData"]').text
            price = li.find_element('xpath', './/dd[@class="detail"]/div/div[@class="cprice-area"]').text
            xiaoliang = li.find_element('xpath', './/dd[@class="detail"]/div/div[@class="sale-area"]/span[@class="sale-num"]').text  # sales count
            time.sleep(2)
            print(picture)
            print(name)
            print(price)
            print(xiaoliang)
            self.data.append([name, price, xiaoliang, picture])
        print(self.data)
        # # time.sleep(3)


        # Second approach: parse the page source with BeautifulSoup
        # button = self.browser.page_source
        # # print(button)
        # soup = BeautifulSoup(button,'html.parser')
        # merchandise_news = soup.find_all('div',class_="item5line1")
        # print(merchandise_news)
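
The script imports `csv`, but the code shown never writes the collected rows to disk. One possible way to do that is sketched below, assuming each row has the layout appended to `self.data` (name, price, sales count, image URL). The helper name `save_rows` and the header labels are made up for this illustration.

```python
import csv

# Column labels matching the row layout [name, price, xiaoliang, image URL].
HEADER = ["name", "price", "xiaoliang", "picture"]


def save_rows(rows, path):
    """Write rows shaped like self.data to a CSV file with a header line."""
    # utf-8-sig adds a BOM so that Excel on Windows detects the encoding;
    # newline="" stops the csv module from writing blank lines on Windows.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
```

After all pages have been scraped, a call such as `save_rows(self.data, 'tmall_items.csv')` would write everything in one go; the file name here is only an example.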

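Next comes pagination. I had checked the total number of pages beforehand, so I simply hard-coded the page range in a for loop. I got lazy here, haha. The code: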

    def get_fanye(self):
        for i in range(2,10):  # pages 2 to 9
            # This URL was copied after sorting the results by sales
            self.browser.get('https://chenguang.tmall.com/search.htm?spm=a1z10.3-b-s.w4011-14939493465.411.446b3a16dEtqdy&search=y&orderType=hotsell_desc&pageNo={}&tsearch=y#anchor'.format(i))
            time.sleep(5)
            self.get_pictures()
            print('Page ' + str(i) + ' scraped')
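
The page URLs differ only in the `pageNo` query parameter, so they can also be built from their parts instead of one long format string. This is a rough sketch, assuming the `spm` parameter and the `#anchor` fragment can be left out without changing the result, which has not been verified against the live site. The function names are made up.

```python
from urllib.parse import urlencode

BASE = "https://chenguang.tmall.com/search.htm"


def page_url(page_no, order="hotsell_desc"):
    """Build the search URL for one result page, sorted by sales by default."""
    params = {"search": "y", "orderType": order,
              "pageNo": page_no, "tsearch": "y"}
    return BASE + "?" + urlencode(params)


def page_urls(first, last):
    """Return the URLs for pages first..last, both inclusive."""
    return [page_url(n) for n in range(first, last + 1)]


if __name__ == "__main__":
    for url in page_urls(2, 9):  # the same pages as range(2, 10)
        print(url)
```

Keeping the page range in one place also makes it easier to change once the total page count is read from the page rather than hard-coded.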