Scraping Taobao Product Information with selenium

南晨Inc

Category: Web scraping

Article link: https://blog.csdn.net/qq_36035111/article/details/104581534


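When shopping on Taobao, you often want to compare product prices, the number of buyers who have paid, and so on. Checking listings one by one is very time-consuming, so this post uses a crawler to collect Taobao product information.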

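First, a brief introduction to selenium. selenium was originally a tool for web automation, but because it drives a real browser, just as a real user would, it is also commonly used for scraping. selenium can locate elements on a web page directly, such as a page-number input box or a confirm button, and operate on them, for example by clicking them or clearing their contents.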

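The search steps are as follows:

1. Search for a keyword. (Because the keyword can simply be appended to the search URL, there is no need to type it into the search box with selenium.)

2. On the results page, type the target page number into the page-jump input box in the pager and click the confirm button. Repeat this step for each page to be scraped.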
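Step 1 can be sketched on its own. The snippet below is a minimal illustration that assumes the same search URL used in this post; the helper name `build_search_url` is made up for this example. `quote` percent-encodes characters that are not safe in a URL, which matters for Chinese keywords and for keywords containing spaces.

```python
from urllib.parse import quote

# Base search URL taken from the code in this post
SEARCH_URL = "https://s.taobao.com/search?q="


def build_search_url(keyword: str) -> str:
    """Return the search URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes non-ASCII characters and spaces as UTF-8 bytes
    return SEARCH_URL + quote(keyword)


print(build_search_url("macbook"))      # ASCII keywords pass through unchanged
print(build_search_url("macbook pro"))  # the space becomes %20
print(build_search_url("手机"))          # Chinese text becomes %E6%89%8B%E6%9C%BA
```

The resulting string is what gets passed to `browser.get(...)`.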

The code for the index page is as follows:

browser = webdriver.Chrome()
# Explicit wait: each wait.until() call waits up to 10 seconds
wait = WebDriverWait(browser, 10)
# Search keyword
keyword = "macbook"

# Index page
def index_page(page, retries=3):
    print("Scraping page", page)
    try:
        # Build the search URL by appending the encoded keyword
        url = "https://s.taobao.com/search?q=" + quote(keyword)
        browser.get(url)
        # Page 1 needs no jump (doing it anyway works, but is unnecessary)
        if page > 1:
            # wait.until(): wait for the element matching the selector to load
            input_page = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, '.m-page.g-clearfix div.form > input')))
            submit = wait.until(ec.element_to_be_clickable((By.CSS_SELECTOR, '.m-page.g-clearfix div.form > span.btn.J_Submit')))
            # Clear the input box
            input_page.clear()
            # Type the page number
            input_page.send_keys(str(page))
            # Click the confirm button
            submit.click()
        # Wait until the active pager item shows the requested page number
        wait.until(ec.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait for the products to load
        wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        # Extract the product information
        # get_products()
    except TimeoutException:
        print("Timed out on page", page)
        # Retry the page a limited number of times rather than recursing forever
        if retries > 0:
            index_page(page, retries - 1)

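While a Taobao page is loading, the script has to wait until the relevant elements have appeared; otherwise the lookups fail and the page cannot be scraped correctly. Once all the products have loaded, their information can be extracted.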
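The idea behind an explicit wait is simple polling: check a condition repeatedly until it holds or a deadline passes. The sketch below is a plain-Python illustration of that idea, not Selenium's actual implementation; `wait_until`, `WaitTimeout`, and the stand-in condition are hypothetical names made up for this example.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only "appears" on the third check
checks = {"count": 0}


def element_loaded():
    checks["count"] += 1
    return "item" if checks["count"] >= 3 else None


print(wait_until(element_loaded, timeout=2, poll_interval=0.01))  # prints "item"
```

In the scraper, `WebDriverWait(browser, 10)` plays the role of `wait_until`, the `ec.*` conditions play the role of `element_loaded`, and `TimeoutException` corresponds to `WaitTimeout`.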

Code for extracting the product information:

def get_products():
    # Parse the rendered page source with pyquery
    html = browser.page_source
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        # Price
        price = item.find('.price').text().replace('\n','')
        # Number of buyers; this field may be blank, in which case use "0人付款" (0 buyers)
        deal = item.find('.deal-cnt').text()
        if deal == '':
            deal = "0人付款"
        # Product title; the raw text contains many spaces and line breaks
        title = item.find('.title').text().strip().replace('\n','')
        # Shop name
        shop = item.find('.shop').text()
        # Shop location
        location = item.find('.location').text()
        # Write one row to the CSV file
        writer.writerow([price,deal,title,shop,location])
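The `deal` field is stored as text, which makes it awkward to compare buyer counts across products. The sketch below converts it to a number. It assumes the text looks like `0人付款`, `1000+人付款`, or `1.5万+人付款` (with 万 meaning ten thousand); only the first form appears in this post, so the other two are assumptions, and `parse_deal_count` is a hypothetical helper name.

```python
import re


def parse_deal_count(text: str) -> int:
    """Convert buyer-count text to an integer (hypothetical helper).

    A trailing '+' means "at least", so the result is a lower bound.
    Text without any digits is treated as 0.
    """
    # A number with an optional decimal part, optionally followed by 万 (x10,000)
    match = re.search(r"(\d+(?:\.\d+)?)(万?)", text)
    if match is None:
        return 0
    number = float(match.group(1))
    if match.group(2):
        number *= 10_000
    return round(number)


print(parse_deal_count("0人付款"))      # 0
print(parse_deal_count("1000+人付款"))  # 1000
print(parse_deal_count("1.5万+人付款")) # 15000
```

One option is to write this number as an extra column next to the original text, so the raw value is still available if the assumed format turns out to be wrong.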

Full code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException
from urllib.parse import quote
from pyquery import PyQuery as pq
import csv

browser = webdriver.Chrome()
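# Explicit wait: each wait.until() call waits up to 10 seconds
wait = WebDriverWait(browser, 10)
# Search keyword
keyword = "macbook"

# Index page
def index_page(page, retries=3):
    print("Scraping page", page)
    try:
        # Build the search URL by appending the encoded keyword
        url = "https://s.taobao.com/search?q=" + quote(keyword)
        browser.get(url)

        # Page 1 needs no jump
        if page > 1:
            # wait.until(): wait for the element matching the selector to load
            input_page = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, '.m-page.g-clearfix div.form > input')))
            submit = wait.until(ec.element_to_be_clickable((By.CSS_SELECTOR, '.m-page.g-clearfix div.form > span.btn.J_Submit')))
            # Clear the input box
            input_page.clear()
            # Type the page number
            input_page.send_keys(str(page))
            # Click the confirm button
            submit.click()
        # Wait until the active pager item shows the requested page number
        wait.until(ec.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        # Wait for the products to load
        wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        # Extract the product information
        get_products()
    except TimeoutException:
        print("Timed out on page", page)
        # Retry the page a limited number of times rather than recursing forever
        if retries > 0:
            index_page(page, retries - 1)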

def get_products():
    # Parse the rendered page source with pyquery
    html = browser.page_source
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        price = item.find('.price').text().replace('\n','')
        # The buyer count may be blank; default to "0人付款" (0 buyers)
        deal = item.find('.deal-cnt').text()
        if deal == '':
            deal = "0人付款"
        title = item.find('.title').text().strip().replace('\n','')
        shop = item.find('.shop').text()
        location = item.find('.location').text()

        # writer is the module-level csv writer created in the main block
        writer.writerow([price,deal,title,shop,location])

if __name__ == "__main__":
    # newline="" stops the csv module from writing blank lines on Windows
    with open("taobao.csv", "a", encoding="utf-8", newline="") as file:
        writer = csv.writer(file)

        # Scrape pages 1 to 4
        for i in range(1, 5):
            index_page(i)
    # Close the browser once scraping is finished
    browser.quit()
    print("Done")

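The result: running the script appends one row per product (price, number of buyers, title, shop name, shop location) to taobao.csv.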
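Because the file is opened in append mode and no header row is written, repeated runs pile up rows with unlabeled columns. One possible refinement, sketched below, writes a header only when the file is new or empty. The helper `open_csv_writer`, the column names in `FIELDS`, and the sample rows are all made up for this illustration.

```python
import csv
import os
import tempfile

# Hypothetical column names matching the order used in writer.writerow(...)
FIELDS = ["price", "deal", "title", "shop", "location"]


def open_csv_writer(path):
    """Open `path` for appending; write the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    file = open(path, "a", encoding="utf-8", newline="")
    writer = csv.writer(file)
    if is_new:
        writer.writerow(FIELDS)
    return file, writer


# Usage: simulate two separate runs appending to the same file (made-up rows)
path = os.path.join(tempfile.mkdtemp(), "taobao.csv")
sample_rows = [
    ["¥100.00", "5人付款", "demo item A", "demo shop A", "杭州"],
    ["¥200.00", "0人付款", "demo item B", "demo shop B", "上海"],
]
for row in sample_rows:
    file, writer = open_csv_writer(path)
    writer.writerow(row)
    file.close()

with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])    # the header appears once, at the top
print(len(rows))  # 3: one header plus two data rows
```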
