Scraping laptop detail information with python3 + selenium

Latest recommended article published 2023-05-25 16:14:04

张先生r


Categories: web scraping, python3. Tags: selenium, python, chrome

Article link: https://blog.csdn.net/weixin_49738000/article/details/111941112


Scraping a shopping site with python3 + selenium

Preparation

# Packages used
selenium  # web automation testing tool
urllib  # URL-encodes the query parameters
xlwt  # writes Excel (.xls) files
time  # pauses while data loads

If selenium is not installed, install it as follows:

Linux:

 sudo pip3 install selenium

Windows:

 python -m pip install selenium
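
Before going further, it can help to confirm that the packages listed above are importable. The following is a minimal sketch; `missing_packages` is a hypothetical helper name, not part of the original script. Note that xlwt is also a third-party package, so if it shows up as missing it can presumably be installed with pip in the same way as selenium.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The four packages the article relies on.
missing = missing_packages(["selenium", "xlwt", "urllib", "time"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages are available")
```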

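Browser driver: download it in advance
chromedriver: for Google Chrome; download the version that matches your browser
geckodriver: for Firefox
Add the driver to the system PATH
1.1) Windows: copy the extracted executable into the Scripts directory of the Python installation
To find the Python installation directory on Windows (cmd command line):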
```
where python
```
1.2) Linux: copy the extracted file into the /usr/bin directory
```
sudo cp chromedriver /usr/bin/
```
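
To check that the copy worked, one option is to ask Python whether the driver can be found on PATH. This is only a sketch: `find_driver` is a hypothetical helper, and it relies on the standard library's `shutil.which`, which searches PATH the same way the shell does.

```python
import shutil


def find_driver(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


for driver in ("chromedriver", "geckodriver"):
    path = find_driver(driver)
    print(driver, "->", path if path else "not found on PATH")
```

Only the driver for the browser you actually use needs to be found.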

Analysis

url

The URL in the address bar after a search is very long, so let's trim it down.
Testing shows that the URL follows this pattern:

https://search.jd.com/Search?keyword=<search keyword>
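
A keyword that contains non-ASCII characters or spaces has to be percent-encoded before it is appended to this URL. As a small sketch of that step, using `urllib.parse.quote` as the full script does (`build_search_url` and `SEARCH_PREFIX` are hypothetical names introduced here for illustration):

```python
from urllib import parse

SEARCH_PREFIX = "https://search.jd.com/Search?keyword="


def build_search_url(keyword):
    """Percent-encode the keyword and append it to the search URL prefix."""
    return SEARCH_PREFIX + parse.quote(keyword)


print(build_search_url("笔记本"))
print(build_search_url("gaming laptop"))
```

`quote` encodes a space as `%20`; `parse.quote_plus` would produce `+` instead.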

xpath

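Getting the XPaths we need
1. Press F12 to open the developer tools, click the element picker in the top-left corner, and click the element we are looking for (here, the search box).
2. Right-click the highlighted line of HTML and choose Copy -> Copy XPath.
This gives the XPath of the search box: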

//*[@id="key"]

The XPath of the search button is obtained the same way:

//*[@id="search"]/div/div[2]/button

Getting the XPath of the link to a single product's detail page
XPath of the link for a single laptop:

//*[@id="J_goodsList"]/ul/li[1]/div/div[1]/a

Compare the XPaths of two different items:

//*[@id="J_goodsList"]/ul/li[2]/div/div[1]/a
//*[@id="J_goodsList"]/ul/li[1]/div/div[1]/a

The only difference is the index on li, so dropping it gives an XPath that matches every item:

//*[@id="J_goodsList"]/ul/li/div/div[1]/a

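XPath syntax (worth reading up on first)

//	selects matching nodes anywhere in the document below the current node, regardless of their position
@	selects an attribute

Here we select the href attribute: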

//*[@id="J_goodsList"]/ul/li/div/div[1]/a/@href
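
To see what the generalized XPath selects, here is a small offline sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (it cannot return `@href` directly), so the attribute is read from each matched element instead. The `SAMPLE` markup and the `extract_links` helper are invented for illustration and are far simpler than the real results page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the search results markup.
SAMPLE = """
<html><body>
<div id="J_goodsList"><ul>
  <li><div><div><a href="https://item.jd.com/1.html">A</a></div><div>x</div></div></li>
  <li><div><div><a href="https://item.jd.com/2.html">B</a></div><div>y</div></div></li>
</ul></div>
</body></html>
"""


def extract_links(markup):
    """Return the href of every product link matched by the generalized XPath."""
    root = ET.fromstring(markup.strip())
    # No index on li, so every list item is matched; div[1] is the first div child.
    anchors = root.findall(".//*[@id='J_goodsList']/ul/li/div/div[1]/a")
    return [a.get("href") for a in anchors]


links = extract_links(SAMPLE)
print(links)
```

Selenium's element lookups likewise return elements rather than attribute nodes, which is presumably why the full script leaves off `/@href` and calls `get_attribute('href')` on each element.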

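The data we need falls into several cases depending on the product's state: a regular JD price, a pre-sale, or not yet released.

Use a different XPath for each case to get the data: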

# try guards against an error when the element does not exist
# 'xpath' is the value of By.XPATH; find_element_by_xpath was removed in Selenium 4.3
# Model
            model = self.driver.find_element('xpath', '//div[@class="sku-name"]').text

            # Price
            try:
                price = self.driver.find_element('xpath', '//div[@class="dd"]/span[@class="p-price"]').text
            except Exception:
                # Default to 0000 when absent
                price = '0000'

            # Deposit
            try:
                dj_price = self.driver.find_element(
                    'xpath', '//div[@class="dd"]/span[@class="p-price dj-price"]').text
            except Exception:
                dj_price = '0000'

            # Pre-sale price
            try:
                ys_price = self.driver.find_element('xpath', '//div[@class="dd"]/span[@class="p-price ys-price"]').text
            except Exception:
                ys_price = '0000'
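
The three try/except blocks above follow the same pattern: look the element up, and fall back to '0000' when the lookup fails. One possible way to express that pattern once is sketched below. `text_or_default`, `fake_find_text`, and `FAKE_PAGE` are hypothetical names; the fake lookup stands in for a real Selenium call so the sketch runs without a browser.

```python
def text_or_default(find_text, xpath, default="0000"):
    """Return find_text(xpath), or `default` if the lookup raises."""
    try:
        return find_text(xpath)
    except Exception:
        return default


# Stand-in for a product page that only shows a regular price.
FAKE_PAGE = {'//div[@class="dd"]/span[@class="p-price"]': "￥4999.00"}


def fake_find_text(xpath):
    """Mimic an element lookup: raises KeyError when the element is absent."""
    return FAKE_PAGE[xpath]


price = text_or_default(fake_find_text, '//div[@class="dd"]/span[@class="p-price"]')
ys_price = text_or_default(fake_find_text, '//div[@class="dd"]/span[@class="p-price ys-price"]')
# Strip the currency sign, as the full script does before saving.
print(price.replace("￥", ""), ys_price.replace("￥", ""))
```

With a real driver, `find_text` could be something like `lambda xp: driver.find_element('xpath', xp).text`, assuming a Selenium version where `find_element` accepts the locator strategy as its first argument.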

Full code

# -*- coding: UTF-8 -*-

"""
Written for
python3.6

selenium + chromedriver + Chrome browser
selenium + geckodriver + Firefox browser

Either combination works

"""
from selenium import webdriver
from urllib import parse
import xlwt
import time


class JdSpider:
    def __init__(self):
        self.url = 'https://www.jd.com/'  # home page url
        self.word_url = 'https://search.jd.com/Search?keyword='  # search results url prefix
        # Headless browser
        # self.options=webdriver.FirefoxOptions()
        # self.options.add_argument('--headless')
        # self.driver=webdriver.Firefox(options=self.options)

        # Browser with a visible window
        # self.driver = webdriver.Chrome()
        # Create the browser object
        self.driver = webdriver.Firefox()
        # Enter the keyword to search for
        self.word = input('Search for:>>')

        # Create the workbook
        self.f = xlwt.Workbook()
        # Create a sheet named after the keyword
        self.sheet = self.f.add_sheet(self.word, cell_overwrite_ok=True)
        # Rows of: model, price, deposit, pre-sale price
        self.data = []
        # Set the width of each column
        self.col1 = self.sheet.col(0)
        self.col1.width = 260 * 100
        self.col2 = self.sheet.col(1)
        self.col2.width = 260 * 20

        self.col3 = self.sheet.col(2)
        self.col3.width = 260 * 20

        self.col4 = self.sheet.col(3)
        self.col4.width = 260 * 20

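    def get_html(self, url):
        # Open the page
        self.driver.get(url=url)

    def crawl(self):
        self.get_html(url=self.url)

        # 1. Find the search box
        # 2. Type the search keyword
        # ('xpath' is the value of By.XPATH; find_element_by_xpath was removed in Selenium 4.3)
        self.driver.find_element('xpath', '//*[@id="key"]').send_keys(self.word)
        # 3. Find the search button and click it
        self.driver.find_element('xpath', '//*[@id="search"]/div/div[2]/button').click()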

    def scroll_to_bottom(self):
        # This script returns the current total height of the page
        js = "return document.body.scrollHeight"
        # The scroll position starts at 0
        height = 0
        # Current total height of the page
        new_height = self.driver.execute_script(js)

        while height < new_height:

            # Scroll down towards the bottom of the page in steps
            for i in range(height, new_height, 200):
                self.driver.execute_script('window.scrollTo(0, {})'.format(i))
                # Give lazily loaded data time to arrive
                time.sleep(0.5)
            height = new_height
            time.sleep(2)
            new_height = self.driver.execute_script(js)

    def search(self):

        self.crawl()
        # The keyword must be percent-encoded before it goes into the url
        url_word = parse.quote(self.word)
        # Build the url
        self.get_html(url=self.word_url + url_word)

        # Scroll down the page
        self.scroll_to_bottom()

        li_list = self.driver.find_elements('xpath', '//*[@id="J_goodsList"]/ul/li/div/div[1]/a')

        # Collect the detail-page urls
        second_list = []
        for page_url in li_list:
            # Read the url
            second_url = page_url.get_attribute('href')
            second_list.append(second_url)

        self.item(second_list)

    # Process the detail pages
    def item(self, second_list):
        # Visit each detail page
        for item_url in second_list:
            self.get_html(url=item_url)
            # Extract the data we need
            # The XPaths here were obtained with Copy XPath
            # Model
            model = self.driver.find_element('xpath', '//div[@class="sku-name"]').text

            # Price
            try:
                price = self.driver.find_element('xpath', '//div[@class="dd"]/span[@class="p-price"]').text
            except Exception:
                # Default to 0000 when absent
                price = '0000'

            # Deposit
            try:
                dj_price = self.driver.find_element(
                    'xpath', '//div[@class="dd"]/span[@class="p-price dj-price"]').text
            except Exception:
                dj_price = '0000'

            # Pre-sale price
            try:
                ys_price = self.driver.find_element('xpath', '//div[@class="dd"]/span[@class="p-price ys-price"]').text
            except Exception:
                ys_price = '0000'

            # Strip the ￥ sign from the values
            data_list = [model, price.replace('￥', ''), dj_price.replace('￥', ''), ys_price.replace('￥', '')]

            self.data.append(data_list)
        # Save
        self.save()

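    def save(self):

        # Print the collected data
        print(self.data)

        # The first row holds the column names
        row0 = ['Model', 'Price', 'Deposit', 'Pre-sale']
        # Write the column names into the first row
        for i in range(0, len(row0)):
            self.sheet.write(0, i, row0[i])  # row 0, column i

        # The remaining data is written the same way

        for i in range(len(self.data)):
            for j in range(len(self.data[i])):
                # Row 0 holds the column names, so the row index here is i+1
                self.sheet.write(i + 1, j, self.data[i][j])  # write data[i][j] to row i+1, column j

        # Save the file
        self.f.save(f'{self.word}.xls')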

    def run(self):
        '''Run the spider'''
        self.search()

        # Close the browser
        self.driver.quit()


# Program entry point
if __name__ == '__main__':
    jd = JdSpider()
    jd.run()
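
The loop in scroll_to_bottom is easier to follow when separated from the browser. The sketch below reproduces only its control flow: measure the page height, scroll there in 200-pixel steps, then measure again, since lazily loaded items may have made the page taller. `scroll_positions` and its `heights` argument are hypothetical; `heights` stands in for successive readings of `document.body.scrollHeight`.

```python
def scroll_positions(heights, step=200):
    """Yield the scroll offsets the loop would visit.

    heights: successive page heights, standing in for repeated
    document.body.scrollHeight readings (first reading first).
    """
    readings = iter(heights)
    height = 0
    new_height = next(readings, 0)
    while height < new_height:
        # Scroll from the previous bottom towards the current bottom in steps.
        for offset in range(height, new_height, step):
            yield offset
        height = new_height
        # Measure again; if nothing new loaded, the height is unchanged and the loop ends.
        new_height = next(readings, height)


# The page is 500px tall at first, grows to 900px after lazy loading, then stops growing.
positions = list(scroll_positions([500, 900, 900]))
print(positions)
```

The loop stops as soon as a new measurement is no larger than the previous one, which is why the script sleeps before measuring again: without the pause, content that is still loading would not yet be reflected in the height.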


To wrap up: did this article help you? Leave your questions or thoughts in the comments and let's discuss. See you next time!
