Scraping the 某东 Mall with Selenium

LIUZY615

Article link: https://blog.csdn.net/liu_ziyue/article/details/104513166

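Selenium is a tool for testing web applications. It runs directly in the browser, just as a real user would operate it.
I won't go over installing and configuring selenium and the webdriver again; I use chromedriver.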

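After downloading chromedriver, add it to your PATH environment variable (the exact steps are easy to look up). Its version must also match the version of your browser. If the configuration still fails, don't fight it: just pass the absolute path to the driver.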
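
To make the version requirement concrete, here is a minimal sketch of a check you could run before starting the browser. It assumes that agreeing on the major version (the first number) is what matters; the function names `major_version` and `versions_match` are made up for this illustration, and the version strings are example values only.

```python
def major_version(version: str) -> int:
    """Return the leading number of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Example values only: read the real ones from chrome://version
# and from the output of `chromedriver --version`.
print(versions_match("114.0.5735.90", "114.0.5735.16"))  # True
print(versions_match("114.0.5735.90", "113.0.5672.63"))  # False
```

If the check fails, downloading the driver build that corresponds to your browser is usually quicker than debugging the resulting startup errors.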

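import re
import time

import pymongo
from lxml import etree
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Absolute path to chromedriver (Selenium 3 style call).
# Selenium 4 expects webdriver.Chrome(service=Service(path)) instead.
driver = webdriver.Chrome(r'C:****\chromedriver.exe')
# Explicit waits give up after 50 seconds.
wait = WebDriverWait(driver, 50)
# Local MongoDB instance; the 'JD' database is created on first write.
client = pymongo.MongoClient('localhost', 27017)
db = client['JD']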

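The approach and code are as follows.

Open the mall's home page, type the desired product into the search box, and click the search button on the right to trigger the search.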
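Next, locate the sales button and click it.

Once the results are sorted by sales, scroll to the bottom of the page and read the total number of pages.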

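The buttons used in the steps above can all be found in the Elements panel of the F12 developer tools. I use XPath here, which is very capable at parsing pages. I also recommend a Chrome extension for XPath called XPath Helper: it is very handy and lets you test, right in the page, whether an XPath expression is written correctly. Chrome's built-in XPath search works too, but its drawback is that it is overly verbose.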
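
An XPath expression can also be tried offline against a saved fragment of markup. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, to keep the example dependency-free; with lxml the same idea would go through `etree.HTML(html).xpath(...)`. The markup is a hypothetical, simplified stand-in for the sort bar, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is much larger and may differ.
SAMPLE = """
<div id="J_filter">
  <div class="f-line top">
    <div class="f-sort">
      <a>Overall</a>
      <a>Sales</a>
      <a>Reviews</a>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)
# Second <a> inside the sort bar, mirroring the lookup of the sales button.
sales = root.find(".//div[@class='f-line top']//div[@class='f-sort']/a[2]")
print(sales.text)  # Sales
```

Note that `a[2]` means "the second `a` child of its parent", so the expression breaks as soon as the site reorders the sort options.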
Here is my code for this part:

def Auto_BaseHtml(word):
    # '某东url' is a placeholder for the mall's home page URL
    driver.get('某东url')
    try:
        # Locate the search box
        input_wd = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#key")))
        # Locate the search button
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.search-m > div.form > button")))
        input_wd.send_keys(word)
        button.click()
        # Locate the sales sort button and click it
        xiaoliang = wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='f-line top']//div[@class='f-sort']//a[2]")))
        xiaoliang.click()
        # Read the total number of pages and return it as an int
        pages = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#J_bottomPage > span.p-skip > em > b")))
        return int(pages.text)
    except TimeoutException:
        # Retry with the same keyword (note: there is no upper limit on retries)
        return Auto_BaseHtml(word)
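
Retrying by calling the function again on every timeout never gives up if the page stays broken. One possible alternative is a bounded retry, sketched below. The `retry` helper and the `flaky` stand-in are hypothetical names for this illustration; with Selenium you would pass `exceptions=(TimeoutException,)` and a function that lets the timeout propagate instead of catching it.

```python
import time


def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            last_error = error
            print(f"Attempt {attempt} of {attempts} failed: {error!r}")
            time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Stand-in for a page load that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page did not load")
    return 5


print(retry(flaky, attempts=5))  # two failure messages, then 5
```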

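Next comes pagination, which is simple: click "next page" and scroll to the bottom of the page. Note that after clicking "next page" you should pause briefly before scrolling, in case the page has not finished loading.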

def next_page(page_max):
    # page_max = 5
    page_max = int(page_max)  # accept the page count as text or as an int
    for page in range(1, page_max + 1):
        print("Scraping page {0} ...".format(page))
        # Run JavaScript that scrolls the browser to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        html = driver.page_source
        Parse_Data(html)
        if page == page_max:
            print("Reached the last page")
            driver.quit()
            exit()
        # Click "next page"
        page_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@id='J_bottomPage']//a[@class='pn-next']")))
        page_button.click()
        time.sleep(2)
        # Check that the page actually changed; wait.until raises
        # TimeoutException on failure rather than returning False
        try:
            wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page + 1)))
        except TimeoutException:
            print("Failed to turn the page!")
            driver.quit()
            exit()
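
A fixed pause after scrolling is either too short on a slow connection or wasted time on a fast one. As a possible refinement, the sketch below keeps scrolling until the page height stops growing. `scroll_to_bottom` is a hypothetical helper, and `FakeDriver` only imitates the one WebDriver method used here (`execute_script`) so the example runs without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded items time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height


class FakeDriver:
    """Stand-in for a WebDriver: the page grows twice, then stays at 3000 px."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.reads = 0

    def execute_script(self, script):
        if script.startswith("return"):
            height = self.heights[min(self.reads, len(self.heights) - 1)]
            self.reads += 1
            return height
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))  # 3000
```

With a real driver you would call `scroll_to_bottom(driver)` before reading `driver.page_source`.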

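Next is page parsing. I again use XPath, and the parsed record looks like the following. I won't post the XPath expressions themselves, because this exercise was written in September 2019 and 某东 has probably restructured its pages since then.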

datas = {
    "name": phone.strip(),
    "price": int(float(price[0])),
    "collor": collor,
    "shop": shop,
    "commiters": commiter,
    "AD": AD[0].strip(),
    "link": link[0].strip()
}
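
XPath queries return lists, so indexing such as `price[0]` raises `IndexError` whenever a field is missing from an item. A small sketch of defensive helpers follows; the names `first_text` and `parse_price` are hypothetical, and the sample values are invented for illustration.

```python
def first_text(values, default=""):
    """Return the first item of an XPath result list, stripped, or a default when empty."""
    return values[0].strip() if values else default


def parse_price(values):
    """Convert a price string such as '1999.00' to whole units, or None when missing."""
    text = first_text(values)
    return int(float(text)) if text else None


print(parse_price(["1999.00"]))  # 1999
print(parse_price([]))           # None
print(first_text(["  Official store \n"]))  # Official store
```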

For the database I use MongoDB. There is no table to create up front, so you can just start writing.

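def To_Mongo(data):
    # data must be a non-empty list of dicts, as insert_many requires
    collection = db['phones']
    try:
        collection.insert_many(data)
        print("Write succeeded!")
    except (pymongo.errors.PyMongoError, TypeError) as e:
        print("Write failed:", e)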
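
A page that yields no items would hand `insert_many` an empty list, which pymongo rejects. The sketch below guards against that and reports how many documents were written. `save_batch` is a hypothetical helper, and `FakeCollection` imitates only the `insert_many` call so the example runs without a MongoDB server.

```python
from types import SimpleNamespace


def save_batch(collection, documents):
    """Insert a list of dicts; return how many were written (0 for an empty list)."""
    if not documents:
        # pymongo's insert_many rejects an empty list, so skip the call.
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


class FakeCollection:
    """Stand-in for a pymongo collection that keeps documents in memory."""

    def __init__(self):
        self.stored = []

    def insert_many(self, documents):
        self.stored.extend(documents)
        return SimpleNamespace(inserted_ids=list(range(len(documents))))


phones = FakeCollection()
print(save_batch(phones, [{"name": "A", "price": 1999}, {"name": "B", "price": 2999}]))  # 2
print(save_batch(phones, []))  # 0
```

With the real database you would pass `db['phones']` as the collection.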