Scraping Web Page Images with Python

Latest recommended article published on 2024-07-22 17:25:13

ytusdc

Category: Python. Tags: image scraping

Article link: https://blog.csdn.net/ytusdc/article/details/78652019

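I needed images as training data, so I scraped home-decoration photos from the to8to (土巴兔) site, one decoration style at a time, starting from a style listing page such as http://xiaoguotu.to8to.com/list-h3s13i0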

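Every image on that page links to an image set (a gallery), and what I want is every image inside each set. The listing is also paginated, as you can see by scrolling to the bottom of the page.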

So I need to fetch the HTML of every listing page, extract the links to all image sets from that HTML, and download every set.

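1. In Chrome, press F12 to open DevTools and select the arrow icon in the top-left corner; clicking anything on the page then jumps to its node in the Elements panel. Doing this for the page-2 link shows that the pages follow a regular pattern: page 2 is at http://xiaoguotu.to8to.com/list-h3s13i0p2, and its tag in the Elements panel carries /list-h3s13i0p2 (not visible on the current page; turn to another page and then inspect the page-2 link). The URLs of the other pages can therefore be built by simple concatenation. The total page count can also be obtained from the pager; note that when there are only a few pages, the last page has no class="last" marker. With that, we can iterate over the URLs of all listing pages.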
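The concatenation rule in step 1 can be sketched as a small helper. This is only an illustration: the function name `build_page_urls` is hypothetical, and it assumes the last pager link has the form `/<slug>pN`, where N is the total page count.

```python
def build_page_urls(base_url, last_href):
    """Expand the last pager href (e.g. "/list-h3s13i0p3") into all page URLs."""
    slug = last_href.strip("/")
    prefix, sep, last = slug.rpartition("p")
    if not sep or not last.isdigit():
        # No "pN" suffix: treat the listing as a single page.
        return [base_url.rstrip("/") + "/" + slug]
    total = int(last)
    root = base_url.rstrip("/") + "/" + prefix
    # Page 1 has no suffix; pages 2..N append "p2" .. "pN".
    return [root if n == 1 else f"{root}p{n}" for n in range(1, total + 1)]


if __name__ == "__main__":
    for page_url in build_page_urls("http://xiaoguotu.to8to.com/", "/list-h3s13i0p3"):
        print(page_url)
```

Stripping the slashes before joining avoids a double slash when the base URL already ends in `/`.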

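2. The links to the image sets on each listing page are found the same way as in step 1. Locate an image set in the Elements panel: each div there is one set, so all sets can be traversed. After clicking into a set, you can see that its address is http://xiaoguotu.to8to.com/c10038739.html, and the number in it matches the value highlighted on the div (the oldcid attribute used in the code). We can therefore assemble the URL of each image-set page ourselves.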
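As a rough sketch of step 2, the set numbers can also be collected by attribute rather than by position in the page. Note that this swaps in the standard-library `html.parser` in place of the lxml XPath used in the full script, and that the names `OldcidCollector` and `set_urls` are made up for this illustration; it assumes each set is a `div` carrying an `oldcid` attribute.

```python
from html.parser import HTMLParser


class OldcidCollector(HTMLParser):
    """Collect the oldcid attribute of every div that carries one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            oldcid = dict(attrs).get("oldcid")
            if oldcid:
                self.ids.append(oldcid)


def set_urls(listing_html, base="http://xiaoguotu.to8to.com/"):
    """Return the image-set page URL for each oldcid found in the listing HTML."""
    parser = OldcidCollector()
    parser.feed(listing_html)
    return [f"{base}c{cid}.html" for cid in parser.ids]
```

Matching on the attribute instead of a positional path such as `body/div[2]/div[4]/div[2]/div` may survive small layout changes better, at the cost of also picking up any unrelated div that happens to carry `oldcid`.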

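3. Open an image-set link and inspect it the same way to locate the images. This page flips through the images one at a time, but a careful search of the HTML turns up a group of img elements; copying their links shows that these are the addresses of the full-size images, so we simply collect those links and download the images.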
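Step 3 can be sketched as "collect every `img` with both `src` and `alt`, then decide where each file goes". This is an illustration under assumptions: it uses the standard-library `html.parser` rather than lxml, the names `ImgCollector` and `plan_downloads` are hypothetical, and it looks at every `img` in the HTML it is given, so the caller would need to pass only the relevant fragment of the page.

```python
import os
import re
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect (src, alt) for every img tag that has both attributes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            if attr.get("src") and attr.get("alt"):
                self.images.append((attr["src"], attr["alt"]))


def plan_downloads(set_html, folder):
    """Return (image_url, target_path) pairs without downloading anything."""
    parser = ImgCollector()
    parser.feed(set_html)
    plan = []
    for index, (src, alt) in enumerate(parser.images):
        # Replace characters that are not allowed in Windows file names.
        safe = re.sub(r'[\\/:*?"<>|]', "_", alt)
        plan.append((src, os.path.join(folder, f"{safe}{index}.jpg")))
    return plan
```

Each pair can then be passed to `urllib.request.urlretrieve(src, target)`. Separating planning from downloading makes the file-naming logic easy to check offline.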

Finally, the code:

# coding=utf-8
import os
import time
import urllib.request

from lxml import etree
from selenium import webdriver
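def get_html(url, filename):
    """Open url in Firefox and save the rendered page source to filename."""
    # I first tried urllib.request, but the HTML it returned was malformed in
    # places, so the XPath queries below could not locate the elements.
    # Firefox is used instead (this requires geckodriver.exe to be installed).
    # Opening a browser for every page is slow; I may look into the cause later.
    browser = webdriver.Firefox()
    try:
        browser.get(url)
        html_source = browser.page_source
    finally:
        browser.quit()  # close the browser and end the driver session
    # Make sure the output folder (e.g. "html/") exists before writing.
    os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html_source)
    time.sleep(0.1)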
# Walk through every listing page and process the image sets on each one.
def getChildHtml(filename):
    with open(filename, "r", encoding="utf-8") as f:
        html = etree.HTML(f.read())
    # The second-to-last pager link points at the last page ("/list-...pN"),
    # so N is the total number of pages.
    startpage_name = None
    total_page = 1
    for action_item in html.xpath("body/div[2]/div[5]/div[1]/a[last()-1]"):
        if "href" in action_item.attrib:
            lastpage_name = action_item.attrib["href"]
            p_index = lastpage_name.rfind("p")
            # Drop the leading "/" so baseurl + childpage has no double slash.
            startpage_name = lastpage_name[:p_index].lstrip("/")
            total_page = int(lastpage_name[p_index + 1:])
    if startpage_name is None:
        print(filename + ": pager link not found")
        return

    for index in range(1, total_page + 1):
        if index == 1:  # page 1 is special: it has no "pN" suffix
            childpage = startpage_name
        else:
            childpage = startpage_name + "p" + str(index)
        child_url = baseurl + childpage
        childfile_html = "html/" + childpage + ".html"
        get_html(child_url, childfile_html)
        SaveChildhtml(childfile_html)
# Open every image set linked from one listing page.
def SaveChildhtml(filename):
    with open(filename, "r", encoding="utf-8") as f:
        html = etree.HTML(f.read())
    # Each child div is one image set; its oldcid attribute is the set number.
    for child_html in html.xpath("body/div[2]/div[4]/div[2]/div"):
        if "oldcid" in child_html.attrib:
            htmlname = child_html.attrib["oldcid"]
            image_url = "http://xiaoguotu.to8to.com/c" + htmlname + ".html"
            image_filename = "html/" + htmlname + ".html"
            get_html(image_url, image_filename)
            getimg(image_filename)
    time.sleep(0.1)

# Download every image from one image-set page.
def getimg(filename):
    with open(filename, "r", encoding="utf-8") as f:
        html = etree.HTML(f.read())

    # The alt text of this image names the output folder.
    # Duplicate folder names are not handled for now.
    img_savepath = SavePath  # fall back to SavePath if no folder name is found
    savefold_items = html.xpath("body/div[4]/div[1]/div[2]/div[1]/div[1]/div[1]/img")
    for savefold_item in savefold_items:
        if "src" in savefold_item.attrib and "alt" in savefold_item.attrib:
            img_savepath = os.path.join(SavePath, savefold_item.attrib["alt"])
    os.makedirs(img_savepath, exist_ok=True)

    index = 0
    for action_item in html.xpath("body/div[3]/img"):
        if "src" in action_item.attrib and "alt" in action_item.attrib:
            img_src = action_item.attrib["src"]
            img_name = action_item.attrib["alt"]
            target = os.path.join(img_savepath, img_name + str(index) + ".jpg")
            try:
                urllib.request.urlretrieve(img_src, target)
            except Exception as err:
                # Report the failure and continue with the next image.
                print(filename + " error: " + str(err))
        index = index + 1
        time.sleep(0.1)

savename = "html/text.html"
# url = "http://xiaoguotu.to8to.com/list-h3s9i0p3"

SavePath = "D:\\img\\简约\\"  # where images are saved ("简约" is the minimalist style)
# url = "http://xiaoguotu.to8to.com/c10037632.html"
baseurl = "http://xiaoguotu.to8to.com/"
url = "http://xiaoguotu.to8to.com/list-h3s13i0"
get_html(url, savename)
getChildHtml(savename)
# getimg(savename)
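The script above starts a new Firefox instance for every page, which its own comments describe as slow. One possible way to reduce that cost, sketched here under assumptions, is to load all pages through a single browser session. The helper name `fetch_all` is hypothetical; it relies only on the `get` method and `page_source` attribute that Selenium WebDriver objects provide.

```python
def fetch_all(driver, urls):
    """Load each URL in the same browser session and return {url: html}."""
    pages = {}
    for url in urls:
        driver.get(url)
        pages[url] = driver.page_source
    return pages


# Intended usage with Selenium (not executed here):
#     from selenium import webdriver
#     driver = webdriver.Firefox()
#     try:
#         pages = fetch_all(driver, ["http://xiaoguotu.to8to.com/list-h3s13i0"])
#     finally:
#         driver.quit()
```

Passing the driver in as a parameter keeps the browser's lifetime in one place and lets the function be exercised with a stand-in object that has no real browser behind it.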

Note: I only needed the data, not a fully automated pipeline, so many paths in the code are hard-coded. Change them to suit your own needs.

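This code scrapes only the images of the minimalist (简约) style. For any other style, manually change SavePath to a folder for that style and change the starting listing URL for that style.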
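To avoid editing two values by hand for each style, the style name, starting URL, and save folder could be tied together in one table. This is a minimal sketch: `STYLES` and `style_job` are hypothetical names, and the table contains only the one listing slug that appears in this article; slugs for other styles would have to be looked up on the site and added.

```python
import os

BASE_URL = "http://xiaoguotu.to8to.com/"

# Style name -> listing slug. Only this entry is taken from the article.
STYLES = {"简约": "list-h3s13i0"}


def style_job(style, save_root="D:\\img"):
    """Return (start_url, save_folder) for one decoration style."""
    slug = STYLES[style]
    return BASE_URL + slug, os.path.join(save_root, style)


if __name__ == "__main__":
    start_url, folder = style_job("简约")
    print(start_url, folder)
```

The returned pair corresponds to the `url` and `SavePath` variables set at the bottom of the script.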

The style name used in SavePath could also be read from the Elements panel, but I did not bother to write that part.