[Python crawler] Building a simple JD.com scraper with a GUI and a progress bar

天涯望月羊

Tags: python, crawler, experience sharing

Article link: https://blog.csdn.net/starry_sheep/article/details/103740078

Building a simple crawler with a GUI

Introduction
Full code
Shortcomings
Closing

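Introduction

During finals I was suddenly assigned a crawler as homework. I had only learned a little basic Python syntax before and knew nothing about crawlers, so at first it felt daunting. After reading a few introductory blog posts on crawling, I got a rough idea of how it works.

It took about a day to write the core crawler code, and I kept tweaking it afterwards. Then I felt a bare script did not look very nice, so I skimmed a Tkinter tutorial and ended up with something like this:
[Figure: screenshot of the finished GUI]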

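Libraries used

# For building the GUI
import tkinter as tk
import tkinter.messagebox
# Needed for crawling
import requests
from bs4 import BeautifulSoup
# For saving the crawled data to a csv file
import csv
# For random delays between requests; requesting too fast risks an IP ban
import time
import random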

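Next I thought about how to crawl. My plan was: change the keyword in the URL of JD's search result page to get the search results, collect the list of all products on the first page, loop over them to fetch each product's comment data, store each product's data in a list, and finally save the list to a csv file.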
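As a small illustration of the first step (changing the keyword in the search URL), here is a minimal sketch. The helper name build_search_url is hypothetical and not part of the original program. The odd page numbers follow the full code later in the post, which requests page=1, 3, 5, ... for the first, second, third visible page; that mapping is taken from the code, not from any documented API.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://search.jd.com/Search"


def build_search_url(keyword, ui_page):
    """Build the search URL for the ui_page-th visible result page (1-based).

    Hypothetical helper: the post builds the same URL with an f-string.
    """
    if ui_page < 1:
        raise ValueError("ui_page must be >= 1")
    params = {
        "keyword": keyword,        # urlencode percent-encodes non-ASCII keywords
        "enc": "utf-8",
        "page": 2 * ui_page - 1,   # 1, 3, 5, ... as in range(1, 2*pageNum, 2)
    }
    return f"{SEARCH_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(build_search_url("laptop", n))
```

Encoding the keyword explicitly keeps the URL valid even when the keyword contains spaces or Chinese characters.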

Setting the request headers

headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
        'Cookie':'__jdu=1575851007; shshshfpa=4478fa62-1038-a220-4758-a8f8fb33a94c-1565097369; shshshfpb=x4hKWK42Csod6TvzZu84ImA%3D%3D; areaId=12; ipLoc-djd=12-988-47821-0; user-key=32b86a54-cde1-4b0f-8e7f-dad6f1b932f7; cn=0; TrackID=1c4B8tjX-r2XhASx8iu7e5adXdhpDZI0u3dMCCodDdou1g-RjW53S1eGqwicdXj6h1vYP5bxG9VVna9UGOmbDmPqyTXCHzZxT4IXxSmGpt6I; pinId=hFytr_e6V50nLHoBcZu_IrV9-x-f3wj7; pin=jd_7b68875ae5a80; unick=jd_7b68875ae5a80; _tp=Hzv76TmStPIKW%2FuZQ7veOZR7pHLLLv2C9lI6Xf24Dlg%3D; _pst=jd_7b68875ae5a80; unpl=V2_ZzNtbUMFQhFyXU9deRlYUWIGFFwSA0cUcQxFXXofW1I1ABIPclRCFX0URldnGlkUZwcZX0dcQhFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zElEUQXALRlIuTQ9XYlATXUpeQRJwCxYGc0lfBjIDGghyZ0AVRQhHZHsdWgFuBxJcQ1FzJXI4dmR%2fG1kDYgoiXHJWc1chVE9deRBcACoDFltGXkcVdAlAZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|a7fe4217debc4b01983642ac9a22d19d_0_0c056d88315e4470ae505528076fc21c|1577172844712; PCSYCityID=CN_320000_320500_320583; __jda=122270672.1575851007.1560329455.1577160883.1577172845.9; __jdc=122270672; 3AB9D23F7A4B3C9B=ZXRLLNONSMWITKSVYZS6PQJHXX7TJJVTFRAME4IE435S7TKFIZYO7S4HIG46CHUCDDT3P6XZ4CYML7RFLRDLB44REY; shshshfp=04eaf5816ad00cfbdea0bd85ee561794; __jdb=122270672.5.1575851007|9.1577172845; shshshsID=2875abb181079938819e4abd65980d36_3_1577173924061'
    }

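The request headers can be found in the Network tab of DevTools. I only copied the User-Agent and Cookie values. Free IP proxy pools turned out to be unreliable, so I did not set one up, and I did not randomize the request headers either.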

getUrlContent

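def getUrlContent(url,headers):
    # Fetch the page at url; return response.content (bytes), or None on failure
    #print('Fetching page source')
    try:
        response = requests.get(url,headers = headers)
        #print('Status code: ' + str(response.status_code))
        if(response.status_code == 200):
            # Note: encoding only affects response.text, not the bytes in response.content
            response.encoding = 'utf-8'
            return response.content
        #print('Failed to fetch page, status code: ' + str(response.status_code))
        return None
    except requests.RequestException:
        return None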

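Analyzing the page structure

On JD's search result page, all products sit in a single ul, each product is one li, and the li carries the product id as an attribute. So the first step is to get the product list:
[Figure: DevTools view of the ul/li structure of the search result page]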

r = getUrlContent(url,headers=headers)
page = BeautifulSoup(r,'lxml')
itemList = page.find_all('li',class_='gl-item')

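The BeautifulSoup step could just as well go inside the getUrlContent function defined earlier, but I forgot to change it and only noticed after handing in the assignment.

Now we have the li elements of all products on this page. Next, loop over them and extract the product data from each li.

Getting the product id, URL, name, and price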

goodsList = []
for i in itemList:
    # Sleep for a random interval between requests
    time.sleep(random.random() * 3)
    thing_id = i['data-sku'] # product id
    thing_url = i.find('div',attrs={'class':'p-name p-name-type-2'}).find('a')['href'] # product URL
    if('http' not in thing_url ):
        thing_url = 'https:' + thing_url
    thing_name = i.find('div',attrs={'class':'p-name p-name-type-2'}).find('em').text # product name
    thing_price = i.find('div',attrs={'class':'p-price'}).find('i').text # product price
    if(thing_price == ''):
        # Fall back to the data-price attribute of the strong tag
        try:
            thing_price = i.find('div',attrs={'class':'p-price'}).find('strong')['data-price']
            print(thing_price)
        except Exception:
            thing_price = 'price unavailable'

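Some products have no price in the i tag, so I looked around the li again and found a strong tag whose data-price attribute also holds the price. The product URLs also come in inconsistent formats: some already start with "https:" and some do not, so I added a simple check.

That gives us part of the product data (id, URL, name, price).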
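The two checks described above (completing a URL that lacks a scheme, and falling back to a second price source) can be sketched without any network access. The helper names normalize_url and pick_price are hypothetical; the original code does the same work inline with an 'http' substring test and a try block.

```python
from urllib.parse import urljoin


def normalize_url(href, base="https://search.jd.com/"):
    """Turn a scheme-relative href such as //item.jd.com/1.html into a full URL.

    urljoin leaves hrefs that are already absolute unchanged.
    """
    return urljoin(base, href)


def pick_price(primary_text, fallback_attr):
    """Return the first non-empty price candidate, else a placeholder string."""
    for candidate in (primary_text, fallback_attr):
        if candidate and candidate.strip():
            return candidate.strip()
    return "price unavailable"


if __name__ == "__main__":
    print(normalize_url("//item.jd.com/100.html"))
    print(pick_price("", "1999.00"))
```

Using urljoin avoids the corner case where a URL without a scheme happens to contain the substring "http" somewhere else in its path.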

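Getting the product's review data

Next comes the comment count and positive-review rate of each product. Looking at the Network tab, I found a request that returns the review summary for each product:
[Figure: DevTools Network panel showing the review summary request]
Opening it gives this URL:
https://club.jd.com/comment/productCommentSummaries.action?referenceIds={product id}
This URL returns JSON data.

So the response is parsed as JSON first, which makes the fields easier to read: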

......
commentURL = f"https://club.jd.com/comment/productCommentSummaries.action?referenceIds={thing_id}"
comment_count,comment_GoodRate = get_item_json(commentURL)
......
def get_item_json(url):
    # Get the comment count and positive-review rate of a single product
    try:
        itemJson = requests.get(url).json()
        result = itemJson['CommentsCount']
        for i in result:
            return i['CommentCountStr'],i['GoodRateShow']
        # Empty list: return the placeholder pair instead of None,
        # otherwise the caller's tuple unpacking would fail
        return 'error','error'
    except Exception:
        return 'error','error'

That gives us the comment count and positive-review rate of each product.
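To make the JSON step easier to follow offline, here is a minimal sketch that separates parsing from downloading. The function name parse_comment_summary is hypothetical, and the sample payload is made up for illustration; only the three field names (CommentsCount, CommentCountStr, GoodRateShow) come from the code above.

```python
import json


def parse_comment_summary(payload):
    """Extract (comment count string, positive rate) from an already parsed response.

    Returns ('error', 'error') when the expected fields are missing,
    mirroring the placeholder used in the post.
    """
    if not isinstance(payload, dict):
        return "error", "error"
    summaries = payload.get("CommentsCount") or []
    if not summaries:
        return "error", "error"
    first = summaries[0]
    return first.get("CommentCountStr", "error"), first.get("GoodRateShow", "error")


if __name__ == "__main__":
    # Made-up sample with the same field names as the real response
    sample = json.loads(
        '{"CommentsCount": [{"CommentCountStr": "2000+", "GoodRateShow": 98}]}'
    )
    print(parse_comment_summary(sample))
```

Keeping the parser free of network calls means it can be checked against a saved response before running the whole crawl.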

Getting the product's comments

Then fetch the comments of each product in turn:

......
# Get the shop name and the related comments
shopName,MainComments = get_item_comments(thing_url,headers)
......
def get_item_comments(url,headers):
    # Get the comments of a single product
    try:
        r = getUrlContent(url,headers)
        soup = BeautifulSoup(r,'lxml')
        shopName = soup.find('div',attrs={'class':'J-hove-wrap EDropdown fr'}).find('a')['title']
        commentList = soup.find('div',attrs={'id':'hidcomment'}).find_all('div',class_='item')
        comments = ''
        for i in commentList:
            buyDay = i.find('div',class_='date-buy').text[4:] # skip the first 4 characters of the text
            mainComment = i.find('a').text
            oneComment = f'''Purchase date:{buyDay}
                Main comment:{mainComment}
                ---------------------------------------------------------
                '''
            comments += oneComment
        return shopName,comments
    except Exception:
        # Page without the expected comment block: try to get at least the shop name
        try:
            r = getUrlContent(url,headers)
            # r is already bytes (getUrlContent returns response.content), so pass it directly
            soup = BeautifulSoup(r,'lxml')
            shopName = soup.find('div',attrs={'class':'J-hove-wrap EDropdown fr'}).find('a')['title']
            return shopName,'No comments yet'
        except Exception:
            return 'No shop info','No comments yet'

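Some product pages have a different structure, so I used try to tell them apart and keep the program from crashing.

This only works for pages whose URL looks like https://item.jd.com/{product id}.html. Getting comments from other formats such as https://item.jd.hk/{product id}.html would need more detailed handling.
The comments are concatenated into one string and returned together with the shop name.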
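The string concatenation can also be sketched on its own. Note that a triple-quoted f-string inside an indented loop keeps the indentation of its continuation lines as part of the text; building each entry from explicit "\n" pieces avoids that. The function name format_comments and the separator length are illustrative choices, not taken from the original program.

```python
SEPARATOR = "-" * 40


def format_comments(entries):
    """Join (purchase_date, comment_text) pairs into one printable string.

    Returns a placeholder when there are no comments.
    """
    blocks = [
        f"Purchase date: {date}\nMain comment: {text}\n{SEPARATOR}"
        for date, text in entries
    ]
    return "\n".join(blocks) if blocks else "No comments yet"


if __name__ == "__main__":
    demo = [("2019-12-01", "Arrived quickly"), ("2019-12-05", "Works as described")]
    print(format_comments(demo))
```

Collecting the pieces in a list and joining once is also cheaper than repeated += on a growing string, although with a handful of comments per product the difference is negligible.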

Saving the data to csv

After all that, the data of all products is saved to a csv file:

......
url = f'https://search.jd.com/Search?keyword={key}&enc=utf-8&page={count}'
data_list = find_things(window,canvas,int((count+1)/2),url,headers=headers)
writeCSV(key,data_list)
......
def writeCSV(thingName,dataList):
    # Create a csv file and write the product list into it
    now = time.strftime("%Y-%m-%d", time.localtime()) # current date, used in the file name
    with open(f'GoodsList_{thingName}_{now}.csv','w',newline='',encoding='utf-8') as f:
        writer = csv.writer(f)
        tableHeader = ('ID','URL','Name','Price','Comment count','Positive rate','Shop','Comments')
        writer.writerow(tableHeader)
        for rowInfor in dataList:
            writer.writerow(rowInfor)
        print('Product list saved')
        tk.messagebox.showinfo(title=theName, message='Product list saved')
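For experimenting with the csv step in isolation, here is a minimal round-trip sketch. The helper names write_rows and read_rows and the demo row are made up. The utf-8-sig encoding is a swapped-in choice, not what the post uses: it writes a byte order mark, which spreadsheet programs such as Excel generally use to detect UTF-8, so it may help if Chinese text shows up garbled when the file is opened there.

```python
import csv
import os
import tempfile

HEADER = ("ID", "URL", "Name", "Price", "Comment count", "Positive rate", "Shop", "Comments")


def write_rows(path, rows, encoding="utf-8-sig"):
    """Write the header and all rows; newline='' lets the csv module handle line ends."""
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_rows(path, encoding="utf-8-sig"):
    """Read the file back as a list of rows (lists of strings)."""
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    demo = [["100", "https://item.jd.com/100.html", "Demo item", "9.90",
             "10+", "99%", "Demo shop", "line one\nline two"]]
    path = os.path.join(tempfile.gettempdir(), "GoodsList_demo.csv")
    write_rows(path, demo)
    print(read_rows(path)[1][-1])
```

The multi-line comment field survives the round trip because csv.writer quotes fields that contain line breaks.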

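Full code

That is roughly the crawling process. The next step was adding the GUI. I only looked at tkinter the night before the deadline, so I do not understand it very well yet and will just post my code. The progress bar had me stuck for a while; the rest should be understandable if you try writing it yourself.

Here is the full code (I am new to this, so there is a lot I am unsure about; I would appreciate it if experienced readers could point out what can be improved):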


import tkinter as tk
import tkinter.messagebox
import requests 
import csv
import time
import random
import os
from bs4 import BeautifulSoup

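def getUrlContent(url,headers):
    # Fetch the page at url; return response.content (bytes), or None on failure
    #print('Fetching page source')
    try:
        response = requests.get(url,headers = headers)
        #print('Status code: ' + str(response.status_code))
        if(response.status_code == 200):
            # Note: encoding only affects response.text, not the bytes in response.content
            response.encoding = 'utf-8'
            return response.content
        #print('Failed to fetch page, status code: ' + str(response.status_code))
        return None
    except requests.RequestException:
        return None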

def find_things(window,canvas,pageCount,url,headers):
    # Fetch the products on one search result page
    global nowProgress
    global totalProgress
    r = getUrlContent(url,headers=headers)
    page = BeautifulSoup(r,'lxml')
    itemList = page.find_all('li',class_='gl-item')
    goodsList = []
    for i in itemList:
        # Sleep for a random interval between requests
        time.sleep(random.random() * 3)
        thing_id = i['data-sku']
        thing_url = i.find('div',attrs={'class':'p-name p-name-type-2'}).find('a')['href']
        if('http' not in thing_url ):
            thing_url = 'https:' + thing_url
        thing_name = i.find('div',attrs={'class':'p-name p-name-type-2'}).find('em').text
        thing_price = i.find('div',attrs={'class':'p-price'}).find('i').text
        if(thing_price == ''):
            # Fall back to the data-price attribute of the strong tag
            try:
                thing_price = i.find('div',attrs={'class':'p-price'}).find('strong')['data-price']
                print(thing_price)
            except Exception:
                thing_price = 'price unavailable'

        commentURL = f"https://club.jd.com/comment/productCommentSummaries.action?referenceIds={thing_id}"
        comment_count,comment_GoodRate = get_item_json(commentURL)

        print(thing_id)
        shopName,MainComments = get_item_comments(thing_url,headers)

        goodsList.append([thing_id,thing_url,thing_name,str(thing_price),comment_count,str(comment_GoodRate) + '%',shopName,MainComments])

        # Update the progress once for every product added
        nowProgress += 1.0
        countDown.set(countDown.get() - 3)
        currentName.set(thing_name[:40] + '...')
        progress(window,canvas,nowProgress/totalProgress)
    return goodsList
        
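def get_item_json(url):
    # Get the comment count and positive-review rate of a single product
    try:
        itemJson = requests.get(url).json()
        result = itemJson['CommentsCount']
        for i in result:
            return i['CommentCountStr'],i['GoodRateShow']
        # Empty list: return the placeholder pair instead of None,
        # otherwise the caller's tuple unpacking would fail
        return 'error','error'
    except Exception:
        return 'error','error'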

def get_item_comments(url,headers):
    # Get the comments of a single product (by string concatenation)
    try:
        r = getUrlContent(url,headers)
        soup = BeautifulSoup(r,'lxml')
        shopName = soup.find('div',attrs={'class':'J-hove-wrap EDropdown fr'}).find('a')['title']
        commentList = soup.find('div',attrs={'id':'hidcomment'}).find_all('div',class_='item')
        comments = ''
        for i in commentList:
            buyDay = i.find('div',class_='date-buy').text[4:] # skip the first 4 characters of the text
            mainComment = i.find('a').text
            oneComment = f'''Purchase date:{buyDay}
                Main comment:{mainComment}
                ---------------------------------------------------------
                '''
            comments += oneComment
        return shopName,comments
    except Exception:
        # Page without the expected comment block: try to get at least the shop name
        try:
            r = getUrlContent(url,headers)
            # r is already bytes (getUrlContent returns response.content), so pass it directly
            soup = BeautifulSoup(r,'lxml')
            shopName = soup.find('div',attrs={'class':'J-hove-wrap EDropdown fr'}).find('a')['title']
            return shopName,'No comments yet'
        except Exception:
            return 'No shop info','No comments yet'

def writeCSV(thingName,dataList):
    # Create a csv file and write the product list into it
    now = time.strftime("%Y-%m-%d", time.localtime()) # current date, used in the file name
    with open(f'GoodsList_{thingName}_{now}.csv','w',newline='',encoding='utf-8') as f:
        writer = csv.writer(f)
        tableHeader = ('ID','URL','Name','Price','Comment count','Positive rate','Shop','Comments')
        writer.writerow(tableHeader)
        for rowInfor in dataList:
            writer.writerow(rowInfor)
        print('Product list saved')
        tk.messagebox.showinfo(title=theName, message='Product list saved')

def progress(window,canvas,progressCent):
    # Fill the progress bar up to the fraction progressCent (0.0 to 1.0)
    print(stepTime)
    # Bar color; the index stepTime is picked at random in clearProgress
    colorList = ['green','yellow','red','blue','gold','darkBlue','pink','purple','skyBlue']
    fill_line = canvas.create_rectangle(1.5, 1.5, 0, 23, width=0, fill=colorList[stepTime])
    canvas.coords(fill_line, (0, 0, progressCent*500, 60))
    window.update()

def clearProgress(window,canvas):
    # Reset the progress bar by painting it white
    global nowProgress
    global stepTime
    nowProgress = 0.0
    stepTime = random.randint(0,8)
    fill_line = canvas.create_rectangle(1.5, 1.5, 0, 23, width=0, fill="white")
    canvas.coords(fill_line, (0, 0, 500, 60))
    window.update()

def ok():
    # Validate the keyword and page count, then start crawling
    key = keyword.get()
    pageNum = pageCount.get()
    if(key == ''):
        tk.messagebox.showinfo(title=theName, message='Please enter a product keyword')
    else:
        if(str(pageNum).isdigit() == False):
            tk.messagebox.showinfo(title=theName, message='Please enter a valid page count')
        elif(str(pageNum).isdigit() and int(pageNum) <= 0):
            tk.messagebox.showinfo(title=theName, message='Please enter a valid page count')
        else:
            global nowProgress
            global totalProgress
            clearProgress(window,canvas)
            pageNum = int(pageNum)
            # Assume 30 products per page and 3 seconds per product
            totalProgress = float(pageNum * 30)
            countDown.set(int(totalProgress * 3))
            progress(window,canvas,nowProgress/totalProgress)
            allList = []
            # The search URL uses page=1, 3, 5, ... for successive result pages
            for count in range(1,2*pageNum,2):
                url = f'https://search.jd.com/Search?keyword={key}&enc=utf-8&page={count}'
                data_list = find_things(window,canvas,int((count+1)/2),url,headers=headers)
                allList += data_list
            writeCSV(key,allList)

def cancel():
    # Close the window; destroy() ends mainloop (tk._exit is an internal helper)
    window.destroy()

def openFoloder():
    # Open the directory containing this script (Windows only)
    os.startfile(os.path.dirname(os.path.abspath(__file__)))

totalProgress = 0.0
nowProgress = 0.0
stepTime = 0

headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
        'Cookie':'__jdu=1575851007; shshshfpa=4478fa62-1038-a220-4758-a8f8fb33a94c-1565097369; shshshfpb=x4hKWK42Csod6TvzZu84ImA%3D%3D; areaId=12; ipLoc-djd=12-988-47821-0; user-key=32b86a54-cde1-4b0f-8e7f-dad6f1b932f7; cn=0; TrackID=1c4B8tjX-r2XhASx8iu7e5adXdhpDZI0u3dMCCodDdou1g-RjW53S1eGqwicdXj6h1vYP5bxG9VVna9UGOmbDmPqyTXCHzZxT4IXxSmGpt6I; pinId=hFytr_e6V50nLHoBcZu_IrV9-x-f3wj7; pin=jd_7b68875ae5a80; unick=jd_7b68875ae5a80; _tp=Hzv76TmStPIKW%2FuZQ7veOZR7pHLLLv2C9lI6Xf24Dlg%3D; _pst=jd_7b68875ae5a80; unpl=V2_ZzNtbUMFQhFyXU9deRlYUWIGFFwSA0cUcQxFXXofW1I1ABIPclRCFX0URldnGlkUZwcZX0dcQhFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zElEUQXALRlIuTQ9XYlATXUpeQRJwCxYGc0lfBjIDGghyZ0AVRQhHZHsdWgFuBxJcQ1FzJXI4dmR%2fG1kDYgoiXHJWc1chVE9deRBcACoDFltGXkcVdAlAZHopXw%3d%3d; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|a7fe4217debc4b01983642ac9a22d19d_0_0c056d88315e4470ae505528076fc21c|1577172844712; PCSYCityID=CN_320000_320500_320583; __jda=122270672.1575851007.1560329455.1577160883.1577172845.9; __jdc=122270672; 3AB9D23F7A4B3C9B=ZXRLLNONSMWITKSVYZS6PQJHXX7TJJVTFRAME4IE435S7TKFIZYO7S4HIG46CHUCDDT3P6XZ4CYML7RFLRDLB44REY; shshshfp=04eaf5816ad00cfbdea0bd85ee561794; __jdb=122270672.5.1575851007|9.1577172845; shshshsID=2875abb181079938819e4abd65980d36_3_1577173924061'
    }  

# GUI
window = tk.Tk()
# Let the program start even if the icon file is missing
try:
    window.iconbitmap('JDicon.ico')
except Exception:
    print('icon file not found')

theName = 'JD Mall Crawler v0.2'
keyword = tk.StringVar()
pageCount = tk.StringVar()
countDown = tk.IntVar()
currentName = tk.StringVar()

window.title(theName)

window.geometry('665x250')

tk.Label(window, text='Product keyword to search for:').place(x=50, y=10)
tk.Entry(window,width=25,font=('Arial', 14),textvariable=keyword).place(x=50,y=35)

tk.Label(window, text='Number of pages to search:').place(x=50, y=60)
tk.Entry(window,width=25,font=('Arial', 14),textvariable=pageCount).place(x=50,y=85)
pageCount.set("1")

try:
    imgCanvas = tk.Canvas(window, height=138, width=250)
    # Note: tk.PhotoImage reads GIF/PGM/PPM (and PNG since Tk 8.6); a real JPEG
    # raises TclError here and ends up in the except branch
    logo = tk.PhotoImage(file='JD.jpg')
    imgCanvas.create_image(125, 0, anchor='n',image=logo)
    imgCanvas.place(x=350,y=10)
except Exception:
    print('JD image not found')

tk.Label(window, text='Progress:', ).place(x=50, y=150)
canvas = tk.Canvas(window, width=500, height=22, bg="white")
canvas.place(x=110, y=150)

tk.Label(window, text='ETA (s):', ).place(x=50, y=180)
tk.Label(window, textvariable=countDown).place(x=125, y=180)
tk.Label(window, text='Current:', ).place(x=50, y=210)
tk.Label(window, textvariable=currentName).place(x=110, y=210)

tk.Button(window, text='OK',  width=10, height=1, command=ok).place(x=150,y=115)
tk.Button(window, text='Cancel',width=10, height=1, command=cancel).place(x=250,y=115)
tk.Button(window, text='Open folder',width=10,height=1,command=openFoloder).place(x=50,y=115)

window.update()
window.mainloop()
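The progress bookkeeping in the code above is spread over three globals (nowProgress, totalProgress, countDown). As an illustration only, the same arithmetic can be gathered into one small class. The class name ProgressEstimate and its methods are hypothetical; the defaults of 30 items per page and 3 seconds per item are the assumptions used in ok(). The page_done correction is an addition that the original program does not have.

```python
class ProgressEstimate:
    """Track crawl progress under fixed per-page and per-item assumptions."""

    def __init__(self, pages, items_per_page=30, seconds_per_item=3):
        if pages < 1:
            raise ValueError("pages must be >= 1")
        self.items_per_page = items_per_page
        self.seconds_per_item = seconds_per_item
        self.total = pages * items_per_page  # assumed total, corrected per page
        self.done = 0

    def item_done(self):
        """Call once after each product has been processed."""
        self.done += 1

    def page_done(self, actual_items):
        """Replace the assumed size of a finished page with its real size."""
        self.total += actual_items - self.items_per_page

    @property
    def fraction(self):
        """Progress between 0.0 and 1.0, never above 1.0."""
        if self.total <= 0:
            return 1.0
        return min(self.done / self.total, 1.0)

    @property
    def remaining_seconds(self):
        """Estimated seconds left, never negative."""
        return max(self.total - self.done, 0) * self.seconds_per_item


if __name__ == "__main__":
    est = ProgressEstimate(pages=2)
    est.item_done()
    print(int(est.fraction * 500), est.remaining_seconds)  # bar width in pixels, seconds
```

Multiplying fraction by the canvas width (500 in the GUI above) gives the right edge of the filled rectangle.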

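Shortcomings

Shortcomings:
1. Because the logic fetches product data page by page, the total for the progress bar is hard to compute. I simply assumed 30 products per page, so totalProgress is the page count times 30. The remaining time likewise assumes about 3 seconds per product. I ran out of time and left it as is; I will improve it when I have time.
2. The responsibilities of some functions are not separated clearly.
3. I did not consider what to do when getUrlContent returns None.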
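For the third point, one possible way to deal with a None result is sketched below: retry a few times, then treat the page as empty instead of handing None to the parser. This is only a sketch under stated assumptions. The names fetch_with_retry and parse_page are hypothetical, and fetch stands for any callable that returns bytes or None, as getUrlContent does.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times.

    Returns the first non-None result, or None if every attempt failed.
    """
    for attempt in range(retries):
        content = fetch(url)
        if content is not None:
            return content
        if attempt < retries - 1:
            time.sleep(delay)  # wait before the next attempt
    return None


def parse_page(content, parse):
    """Return parse(content), or an empty list when the download failed."""
    if content is None:
        return []
    return parse(content)


if __name__ == "__main__":
    # Fake fetcher that always fails, to show the fallback path
    result = fetch_with_retry(lambda url: None, "https://example.com", retries=2, delay=0)
    print(parse_page(result, lambda content: [content]))
```

With this shape, a failed page contributes zero products to the list and the crawl continues with the next page.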