Multi-process scraping of Taobao search pages with Python + Selenium


Kosmoo


Article link: https://blog.csdn.net/zwq912318834/article/details/81189422


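1. Functional description

Given a list of keywords, search Taobao for matching products and scrape each product's details from the search results, including title, price, sales volume, and place of origin. The data is stored in MongoDB, and multiple processes are used to improve scraping throughput.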
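The multi-process idea described above can be sketched roughly as follows. This is only an illustration, not the article's own implementation: `build_tasks`, `scrape_task`, and `run_parallel` are hypothetical names, the split into (keyword, page) tasks is an assumption, and the worker is a placeholder that returns a label instead of driving a browser.

```python
import multiprocessing


def build_tasks(keywords, pages_per_keyword):
    """Expand each keyword into one (keyword, page) task per result page."""
    return [(kw, page)
            for kw in keywords
            for page in range(1, pages_per_keyword + 1)]


def scrape_task(task):
    """Placeholder worker (assumption): a real worker would start its own
    browser here, load the search page, and store the parsed products."""
    keyword, page = task
    return f"{keyword}:page{page}"


def run_parallel(tasks, processes=2):
    """Distribute the tasks over a pool of worker processes."""
    with multiprocessing.Pool(processes=processes) as pool:
        # pool.map returns the results in the same order as the tasks
        return pool.map(scrape_task, tasks)


if __name__ == "__main__":
    tasks = build_tasks(["动漫周边", "水果沙拉"], pages_per_keyword=2)
    print(run_parallel(tasks))
```

Each worker process should create its own browser instance, because a WebDriver session cannot be shared between processes. The `if __name__ == "__main__"` guard matters on Windows, where child processes re-import the main module.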

2. Environment

OS: Windows 7
MongoDB 3.4.6
Python 3.6.1
IDE: PyCharm
Chrome browser installed (63.0.3239.132 (Official Build), 32-bit)
Selenium 3.7.0
chromedriver v2.34 configured

3. Code


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

import pymongo
import time
import datetime

import re
import multiprocessing

import lxml.html
import lxml.etree

# ---------- 1. Configuration ------------
# Search keyword list
keySearchWords = {
    "动漫": [1, "动漫周边"],  # anime: anime merchandise
    "水果": [1, "水果沙拉"],  # fruit: fruit salad
}

# Initialize the database connection
client = pymongo.MongoClient("127.0.0.1:27017")
db = client["taobao"]
db_coll = db["productInfo"]

# Maximum number of retries for a single page
retryMax = 8

chrome_options = webdriver.ChromeOptions()
# Optionally disable image loading to speed up page scraping
# prefs = {"profile.managed_default_content_settings.images": 2}
# chrome_options.add_experimental_option("prefs", prefs)

# Optionally enable headless mode: no browser window, for better speed and stability
# chrome_options.add_argument('--headless')
# chrome_options.add_argument('--disable-gpu')


# ---------- 2. Parse page content ------------
# Extract the main information for each product on the search result (list) page
def getProductMainInfo(htmlSource):
    try:
        resultTree = lxml.etree.HTML(htmlSource)
        # fix_html = lxml.html.tostring(resultTree, pretty_print=True)
        # print(f"htmlSource = {htmlSource}")

        productLst = resultTree.xpath("//div[@class='m-itemlist']//div[contains(@class, 'J_MouserOnverReq')]")
        print(f"productLst = {productLst}")
        productInfoLst = []
        for product in productLst:
            productInfo = {}

            # Unique identifier
            dataNid = product.xpath(".//div[contains(@class,'ctx-box')]//div[contains(@class, 'title')]/a/@data-nid")
            if len(dataNid) > 0:
                productInfo['dataNid'] = dataNid[0]
            else:
                productInfo['dataNid'] = 0
            productInfo['_id'] = productInfo['dataNid']

            taobaoCategory = product.xpath("@data-category")
            if len(taobaoCategory) > 0:
                productInfo['taobaoCategory'] = taobaoCategory[0]
            else:
                productInfo['taobaoCategory'] = 'unknown'

            rank = product.xpath("@data-index")
            if len(rank) > 0:
                productInfo['rank'] = rank[0]
            else:
                productInfo['rank'] = 0

            imgSrc = product.xpath(".//div[@class='pic']/a//img/@src")
            if len(imgSrc) > 0:
                productInfo['imgSrc'] = imgSrc[0]
            else:
                productInfo['imgSrc'] = ''

            title = product.xpath(".//div[contains(@class,'ctx-box')]//div[contains(@class, 'title')]/a/text()")
            productInfo['title'] = ''
            if len(title) > 0:
                for elem in title:
                    productInfo['title'] += elem.strip()

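            detailUrl = product.xpath(".//div[contains(@class,'title')]//a/@href")
            # NOTE: the original listing is cut off at this point. The lines below are
            # an assumed minimal completion so that the function parses and returns.
            productInfo['detailUrl'] = detailUrl[0] if len(detailUrl) > 0 else ''

            productInfoLst.append(productInfo)
        return productInfoLst
    except Exception as err:
        print(f"getProductMainInfo failed: {err}")
        return []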
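
The parser above repeats the same pattern for every field: run an XPath query, take the first match if there is one, otherwise fall back to a default. A minimal sketch of how that pattern could be factored out is shown here. `first_or_default` and `join_text_nodes` are hypothetical helper names, not part of the original code, and the plain lists in the usage lines stand in for the lists that lxml's `xpath()` returns.

```python
def first_or_default(values, default):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    return values[0] if values else default


def join_text_nodes(nodes):
    """Concatenate text nodes after stripping surrounding whitespace."""
    return "".join(node.strip() for node in nodes)


# Usage with plain lists standing in for XPath results
productInfo = {}
productInfo['rank'] = first_or_default(["3"], 0)       # a match was found
productInfo['imgSrc'] = first_or_default([], '')       # no match, use the default
productInfo['title'] = join_text_nodes(["\n  Fruit ", "salad bowl  "])
print(productInfo)
```

With helpers like these, each field in `getProductMainInfo` would take one line instead of a four-line if/else block, and the defaults stay visible at the call site.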

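Because the parser uses `dataNid` as the MongoDB `_id`, scraping the same product twice with a plain insert would raise a duplicate key error. One way to handle this, sketched below under assumptions, is an upsert through pymongo's `replace_one(filter, replacement, upsert=True)`. `save_product` is a hypothetical helper, and `InMemoryCollection` is only a stand-in so the example runs without a database; in the script the real `db_coll` would be passed instead.

```python
def save_product(collection, product):
    """Insert the product, or overwrite the stored copy that has the same _id."""
    return collection.replace_one({"_id": product["_id"]}, product, upsert=True)


class InMemoryCollection:
    """Tiny stand-in for a pymongo collection, used only to demo the call."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter["_id"]
        # Replace an existing document, or create it when upsert is requested
        if key in self.docs or upsert:
            self.docs[key] = dict(replacement)


coll = InMemoryCollection()
save_product(coll, {"_id": "1001", "title": "first scrape"})
save_product(coll, {"_id": "1001", "title": "second scrape"})
print(len(coll.docs), coll.docs["1001"]["title"])
```

Note that the parser falls back to `dataNid = 0` when the attribute is missing, so all such products would share `_id` 0 and overwrite one another; skipping those records may be the safer choice.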