[Python] Basic web scraping practice (using JD wap as an example): selenium + detail pages

Latest recommended article published on 2022-10-16 10:42:13

冰淇淋和慕斯蛋糕

Category: Python code examples. Tags: python, web scraping

Article link: https://blog.csdn.net/qq_45721997/article/details/125032564


This is only a simple practice exercise.

A JD (wap) search results page lists 60 products. The plan is to scrape only the first 15 pages, roughly 900 products.
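
As a small illustration of the page count above, the 15 search URLs could be enumerated as in the sketch below. The URL pattern is the one this post uses in main(); the names BASE_URL and build_page_urls are made up for this example and are not part of the original code.

```python
import string
from urllib.parse import quote

# Assumption: same search URL pattern as in main() of this post.
BASE_URL = "https://search.jd.com/Search?keyword={keyword}&page={page}"


def build_page_urls(keyword, first_page=1, last_page=15):
    """Return percent-encoded search URLs for pages first_page..last_page."""
    urls = []
    for page in range(first_page, last_page + 1):
        raw = BASE_URL.format(keyword=keyword, page=page)
        # safe=string.printable leaves ASCII characters untouched, so only
        # the non-ASCII keyword is percent-encoded.
        urls.append(quote(raw, safe=string.printable))
    return urls


urls = build_page_urls("笔记本电脑")
# 15 pages at 60 products per page is about 900 products.
print(len(urls), len(urls) * 60)
print(urls[0])
```

Encoding the keyword this way keeps the rest of the URL readable while making it safe to hand to a browser driver or an HTTP client.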

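Roughly, the steps are:
1. Open the search home page: decide on the keyword and get the base URL
2. Extract the key information
2.1 Use selenium to simulate scrolling down in a browser so that the complete page is loaded
2.2 Locate the product name, price, product link and other information
3. Follow the link to the individual product page, then locate and store the detailed information
4. Store everything in a DataFrame and export it to Excel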
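
Step 2.1 relies on scrolling so that lazily loaded products appear in the page source. The code in this post scrolls to the bottom once; a slightly more defensive variant, sketched below, keeps scrolling until the page height stops growing. The function name scroll_until_stable and its parameters are assumptions made for this example, and the only driver method it uses is execute_script.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with an execute_script(js) method, for example
    a selenium WebDriver. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give lazily loaded items time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the page is complete
        last_height = new_height
    return rounds
```

With a real browser this would be called right after driver.get(...) and before reading driver.page_source; max_rounds guards against pages that never stop growing.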

import string
from bs4 import BeautifulSoup
import re  
import urllib.request, urllib.error 
from urllib.parse import quote
from selenium import webdriver 
import time
import pandas as pd

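# Fetch a search results page
def getUrl(Url):
    # Request header; not used by selenium in this function.
    head = {}
    head["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    page_url = quote(Url, safe=string.printable)  # percent-encode the non-ASCII keyword
    print('current page url:', page_url)

    # A Firefox driver (geckodriver) is required; search online (e.g. Baidu) for how to set it up.
    # Note: executable_path is accepted by Selenium 3 and early Selenium 4 releases;
    # newer Selenium 4 releases expect a Service object passed as service=... instead.
    path1 = 'your/path/geckodriver.exe'
    driver = webdriver.Firefox(executable_path=path1)
    driver.get(page_url)
    driver.maximize_window()
    time.sleep(2)

    # Scroll to the bottom so the lazily loaded products are rendered
    js = 'window.scrollTo(0,document.body.scrollHeight)'
    driver.execute_script(js)
    time.sleep(2)
    # Scroll back to the top
    js = 'window.scrollTo(0, 0)'
    driver.execute_script(js)
    html = driver.page_source
    driver.quit()  # close the browser so one window is not left open per page
    return html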

# The function that fetches a single product page with a plain request (getItemUrl, called below) is omitted here.

# Extract the data
# Note: findStore, findPrice, findDetail (regular expressions) and getItemUrl
# are used below but are not defined in this listing.
def getData(baseUrl, page):
    dataFrame_total = pd.DataFrame(columns=['url', 'store', 'price', 'detail'], index=range(0, 1000))
    count = 0  # row index of the current product in dataFrame_total
    for i in range(1, int(page)):
        url = baseUrl + str(i)
        html = getUrl(url)
        soup = BeautifulSoup(html, "html.parser")

        for item in soup.find_all("li", class_="gl-item"):  # each product on the page
            data = []
            item = str(item)
            # product name etc. ...
            # product URL
            Src = re.findall('<div class="p-name p-name-type-2">\n<a.*?href="(.*?)" onclick=.*?>', item)
            item_url = 'https:' + Src[0]  # clickable link to the product page
            data.append(item_url)

            # store
            store = re.findall(findStore, item)[0]
            print('store:', store)
            data.append(store)

            # price
            try:
                price = findPrice.findall(item)
                print('price:', price[0])  # price of this product
                data.append(price[0])
            except IndexError:
                # fall back to the alternative <i>price</i> markup
                print('PRICE FOR WRONG:', item)
                price = re.findall('<i>(.*?)</i>', item)
                data.append(price[0])

            # fetch the information on the detail page
            html_item = getItemUrl(item_url)
            time.sleep(2)
            item_soup = BeautifulSoup(html_item, "html.parser")
            # print('details')
            ul_info = item_soup.find_all("ul", class_='parameter2')[0]

            ul_info = str(ul_info)
            detail = findDetail.findall(ul_info)
            data.append(str(detail))  # store the list of matches as text so it fits in one cell

            dataFrame_total.iloc[count, 0:4] = data

            # Some products have irregular markup; for example the price often appears as
            # <i>price</i>, unlike the usual form, which easily raises an error. To keep
            # things simple this just uses try/except; the cases could be handled in more detail.
            try:
                item_name = re.findall(r'<li.*?>商品名称：(.*?)</li>', ul_info)  # product name
                dataFrame_total.loc[count, '商品名称'] = item_name[0]
            except IndexError:
                dataFrame_total.loc[count, '商品名称'] = ''
            try:
                item_weight = re.findall(r'<li.*?>商品毛重：(.*?)</li>', ul_info)  # gross weight
                print('item_weight:', item_weight)
                dataFrame_total.loc[count, '商品毛重'] = item_weight[0]
            except IndexError:
                dataFrame_total.loc[count, '商品毛重'] = ''
            try:
                screen_size = re.findall(r'<li.*?>屏幕尺寸：(.*?)</li>', ul_info)  # screen size
                print('screen_size:', screen_size)
                dataFrame_total.loc[count, '屏幕尺寸'] = screen_size[0]
            except IndexError:
                dataFrame_total.loc[count, '屏幕尺寸'] = ''

            try:
                refresh = re.findall(r'<li.*?>.*?刷新率.*?：(.*?)</li>', ul_info)  # refresh rate
                print('refresh:', refresh)
                dataFrame_total.loc[count, '刷新率'] = refresh[0]
            except IndexError:
                dataFrame_total.loc[count, '刷新率'] = ''

            try:
                InterMemory = re.findall(r'<li.*?>.*?内存.*?：(.*?)</li>', ul_info)  # memory
                print('InterMemory:', InterMemory)
                dataFrame_total.loc[count, '内存'] = InterMemory[0]
            except IndexError:
                dataFrame_total.loc[count, '内存'] = ''

            # some fields are omitted here; they all follow the same pattern

            count += 1  # move on to the next row

        time.sleep(1)
    return dataFrame_total  # return the table once all pages have been processed

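def main():
    baseUrl = "https://search.jd.com/Search?keyword=" + '笔记本电脑' + "&page="

    page = 16  # getData iterates over range(1, page), i.e. pages 1 to 15

    dataFrame_total = getData(baseUrl, page)
    dataFrame_total.to_excel('datap4p15.xlsx')  # needs an Excel writer such as openpyxl


if __name__ == '__main__':
    main()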
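
The listing calls getItemUrl but leaves it out. The block below is only a guess at what such a request-based helper might look like, using urllib.request (already imported in the listing). The timeout parameter, the USER_AGENT constant and the choice to return an empty string on failure are assumptions of this sketch, not the author's code.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
)


def getItemUrl(item_url, timeout=10):
    """Hypothetical stand-in for the omitted function.

    Fetches one product page and returns its HTML as text,
    or '' if the request fails.
    """
    req = urllib.request.Request(item_url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # Use the charset announced by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except OSError as exc:
        # urllib.error.URLError and socket timeouts are both OSError subclasses.
        print("request failed:", item_url, exc)
        return ""
```

Returning an empty string keeps the crawl going when a single product page fails, but the caller then has to cope with a page that contains no parameter list.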

Part of the result is shown in the figure below.
(image: part of the resulting table)
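
The repeated try/except blocks in getData all do the same thing: apply one regular expression to the parameter list and store the match or an empty string. One possible way to condense them is a dictionary of patterns, sketched below. The patterns are copied from the listing; the names FIELD_PATTERNS and extract_fields and the sample HTML are made up for this example, and re.search is used here in place of re.findall so that only the first match is kept.

```python
import re

# Column name -> pattern, copied from the try/except blocks in getData.
FIELD_PATTERNS = {
    "商品名称": r"<li.*?>商品名称：(.*?)</li>",      # product name
    "商品毛重": r"<li.*?>商品毛重：(.*?)</li>",      # gross weight
    "屏幕尺寸": r"<li.*?>屏幕尺寸：(.*?)</li>",      # screen size
    "刷新率": r"<li.*?>.*?刷新率.*?：(.*?)</li>",   # refresh rate
    "内存": r"<li.*?>.*?内存.*?：(.*?)</li>",       # memory
}


def extract_fields(ul_html, patterns=FIELD_PATTERNS):
    """Return {column: first captured match, or '' if the field is absent}."""
    row = {}
    for column, pattern in patterns.items():
        match = re.search(pattern, ul_html)
        row[column] = match.group(1) if match else ""
    return row


# Made-up sample of a product parameter list, for illustration only.
sample = (
    '<ul class="parameter2">'
    '<li title="A">商品名称：示例笔记本</li>'
    '<li title="B">商品毛重：2.5kg</li>'
    '<li title="C">屏幕尺寸：15.6英寸</li>'
    '</ul>'
)
print(extract_fields(sample))
```

Adding another field then only needs one more dictionary entry, and the returned dictionary can be written into the DataFrame row column by column.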
