Downloading images with a Python crawler

artemisrj

Article link: https://blog.csdn.net/artemisrj/article/details/42434941

Full code: http://download.csdn.net/detail/artemisrj/8330275

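This time I was helping someone collect material from http://ieeexplore.ieee.org.

They needed the figures from a particular journal. Conveniently, this journal puts all of its papers' figures online, but opening the links one by one was too tedious, so I downloaded everything.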

The crawler code is mainly based on the following blogger's post.

http://www.cnblogs.com/chenkun24/archive/2012/10/06/2713348.html

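1. First I need the list of all published papers, so I grabbed the link to each issue from the past issues index. Using the browser's Inspect Element, I found the div that contains all the links and saved that div to a local file.

Then I wrote a regular expression: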

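def readyearlink(url):
    # Despite its name, url is the path of the locally saved HTML file.
    text=''
    for line in open(url):
        text=text+line
    
    # Group 1: the year from the ul id; group 2: everything inside that ul.
    # Without re.DOTALL, . does not match a newline, so each ul must sit
    # on a single line in the saved file.
    reg="<ul id=\"pi-(.*?)\" style=\".*?\">(.*?)</ul>"
    thedata=re.compile(reg).findall(text)
    return thedata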

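This function processes a local file (the parameter is still called url because I never renamed it). reg is the regular expression. (.*?) is very handy: it matches any string, as short as possible. Wrap it in parentheses for the parts you want to extract, and leave the parentheses off (.*?) for parts that vary but that you do not need.

The function takes the path of the local file and returns two pieces of data per match: the year, and the HTML holding the links to that year's issues.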
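As a minimal sketch of how the captured and uncaptured (.*?) behave, the snippet below runs the same pattern on a made-up fragment. The sample HTML and the name extract_years are assumptions for illustration, not the real page. It is written for Python 3, and it adds re.DOTALL (not in the original) so that a ul spanning several lines still matches.

```python
import re

# Hypothetical stand-in for the saved "past issues" div; not the real page.
SAMPLE = '''<ul id="pi-2008" style="display:none">
<li> <a href="/xpl/tocresult.jsp?isnumber=4384585&punumber=2945">  Vol: 14   Issue: 1    </a> </li>
</ul>
<ul id="pi-2007" style="display:none"><li>older issues</li></ul>'''

# Group 1 captures the year, the bare .*? skips the style value,
# group 2 captures the body of the list.
YEAR_UL = re.compile(r'<ul id="pi-(.*?)" style=".*?">(.*?)</ul>', re.DOTALL)

def extract_years(text):
    """Return (year, inner_html) pairs, one per <ul id="pi-YEAR">."""
    return YEAR_UL.findall(text)

years = extract_years(SAMPLE)
print([year for year, _ in years])
```

Because both groups are non-greedy, each match stops at the first closing </ul> rather than swallowing every list up to the last one.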

Then, for each year, I extract its issues and process them:

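def dealissue(data):
    # data is one (year, ul_html) pair returned by readyearlink.
    year=data[0]
    print year
    # Group 1: the issue number (isnumber); group 2: the issue label.
    reg="<li> <a href=\".*?isnumber=(.*?)&.*?Issue:\s*(.*?)\s*</a> </li>"
    thedata=re.compile(reg).findall(data[1])
    print thedata
    for plink in thedata:
        # One directory per issue, e.g. <year>_<issue>; this also resets strs.
        setdir(year+'_'+plink[1])
        link='http://ieeexplore.ieee.org'+'/xpl/tocresult.jsp?isnumber='+plink[0]+'&punumber=2945'
        deallist(link,plink[0])
        # Write the index collected by deallist into the issue's directory.
        global strs
        f=open('readme.csv','wb')
        f.write(strs)
        f.close()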

<li> <a href="/xpl/tocresult.jsp?isnumber=4384585&punumber=2945">  Vol: 14   Issue: 1    </a> </li>

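This function handles one ul at a time. Each ul holds one year's issues, and the line above shows the format of each entry. I mainly extract each issue's number (4384585 in the line above) and which issue it is (1 in the line above).

thedata stores these two pieces of information, and setdir sets up the directory for that year and issue.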
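To check the issue pattern against the sample line above, here is a small Python 3 sketch. The name parse_issues is a hypothetical helper; the pattern is the one from dealissue, written as a raw string.

```python
import re

ISSUE_LI = re.compile(
    r'<li> <a href=".*?isnumber=(.*?)&.*?Issue:\s*(.*?)\s*</a> </li>'
)

def parse_issues(ul_html):
    """Return (isnumber, issue_label) pairs found in one year's <ul> body."""
    return ISSUE_LI.findall(ul_html)

# The sample entry quoted in the text.
line = ('<li> <a href="/xpl/tocresult.jsp?isnumber=4384585&punumber=2945">'
        '  Vol: 14   Issue: 1    </a> </li>')
print(parse_issues(line))  # [('4384585', '1')]
```

The \s* on both sides of the second group is what strips the padding around the issue label.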

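def setdir(year):
    # year is really the directory name, e.g. <year>_<issue>.
    path="D:/down/"+year
    if not os.path.exists(path):
        os.makedirs(path)
    # Make it the working directory so later files can use bare names.
    os.chdir(path)
    os.getcwd()  # return value is unused, so this call has no effect
    # Start a fresh index string for this issue.
    global strs
    strs=""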

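I wanted one directory per issue. Once the current working directory is set, files can be saved using just their file names.

strs accumulates each paper's number (along with its title and link) so that papers are easy to look up; it is written to readme.csv inside each issue's directory.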
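Changing the working directory is a process-wide side effect, so an alternative is to build full paths instead. The sketch below is one way to do that in Python 3. The name issue_dir, its base argument, and the temporary directory in the usage lines are assumptions for illustration and are not part of the original script.

```python
import os
import tempfile

def issue_dir(base, year, issue):
    """Create base/<year>_<issue> if needed and return its path."""
    path = os.path.join(base, year + '_' + issue)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path

# Usage with a throwaway base directory instead of D:/down.
base = tempfile.mkdtemp()
target = issue_dir(base, '2008', '1')
index_path = os.path.join(target, 'readme.csv')
with open(index_path, 'w') as f:
    f.write('')  # placeholder; the real rows would come from deallist
print(os.path.basename(target))  # 2008_1
```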

deallist processes a single issue.

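def deallist(link,inum):
    global strs
    conn=urllib.urlopen(link)
    nn=conn.read()
    # A 292-byte response is treated as a failure
    # (presumably the size of the site's error page).
    if len(nn)==292:
        return False
    else:
        # Group 1: the article number; group 2: the article title.
        reg="<span id=\"art-abs-title-(.*?)\">(.*?)</span>"
        articles=re.compile(reg).findall(nn)
        for article in articles:
            title=article[1]
            articleNum=article[0]
            # One index row per paper: number, title, link.
            strs=strs+articleNum+','+title+','+'http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber='+articleNum+'\n'
            paperpic(articleNum,inum)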

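deallist takes an issue's link and number, opens the issue's list of papers through that link, extracts each paper's title, number, and so on, and then uses them to download the figures.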
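Article titles can contain commas, which would split a row built by plain string concatenation. As a hedged alternative, the Python 3 sketch below extracts the same two fields and lets the csv module handle the quoting. The sample HTML, the article numbers, and the helper names articles_from_html and rows_to_csv are made up for illustration.

```python
import csv
import io
import re

TITLE_SPAN = re.compile(r'<span id="art-abs-title-(.*?)">(.*?)</span>')
LINK_PREFIX = 'http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber='

def articles_from_html(html):
    """Return (article_number, title) pairs from an issue's contents page."""
    return TITLE_SPAN.findall(html)

def rows_to_csv(articles):
    """Build CSV text with one row per article: number, title, link."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    for number, title in articles:
        writer.writerow([number, title, LINK_PREFIX + number])
    return buf.getvalue()

# Hypothetical fragment; the numbers and titles are not real articles.
sample = ('<span id="art-abs-title-1000001">Fast, Simple Rendering</span>'
          '<span id="art-abs-title-1000002">Graph Layout</span>')
print(rows_to_csv(articles_from_html(sample)))
```

The first title contains a comma, so the writer wraps it in double quotes and the row still has exactly three fields when read back.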

def paperpic(articleNum,inum):
    print 'article'+articleNum
    # Common prefix of every figure URL for this article.
    link='http://ieeexplore.ieee.org/ielx5/2945/'+inum+'/'+articleNum+'/html/img/'+articleNum+'-fig-'
    i=1
    # Try figure 1, 2, 3, ... and stop at the first one that fails.
    while True:
        if savepic(link+str(i)+"-large.gif",articleNum+"_"+str(i)):
            i=i+1
            continue
        else:
            break

paperpic downloads the figures of one paper.
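The loop relies on figures being numbered 1, 2, 3, ... with no gaps, and it stops at the first number that fails. A minimal Python 3 sketch of that probing pattern follows. The names figure_url and download_figures are hypothetical, and the fetch callback is an assumption introduced so the loop can be exercised without touching the network.

```python
def figure_url(inum, article_num, index):
    """Build the large-figure URL using the pattern from paperpic."""
    return ('http://ieeexplore.ieee.org/ielx5/2945/' + inum + '/' +
            article_num + '/html/img/' + article_num +
            '-fig-' + str(index) + '-large.gif')

def download_figures(inum, article_num, fetch):
    """Call fetch(url, name) for figure 1, 2, ... until it returns False.

    Returns how many figures were fetched successfully.
    """
    index = 1
    while fetch(figure_url(inum, article_num, index),
                article_num + '_' + str(index)):
        index += 1
    return index - 1

# Stand-in for savepic: pretend only figures 1 to 3 exist.
seen = []
def fake_fetch(url, name):
    seen.append(name)
    return '-fig-4-' not in url

count = download_figures('4384585', '1000001', fake_fetch)
print(count, seen)
```

Passing the fetch function in as an argument keeps the numbering logic separate from the download itself, so either part can be changed or tested alone.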

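def savepic(url,name):
    conn=urllib.urlopen(url)
    nn=conn.read()
    # A 292-byte response is treated as "no such figure".
    if len(nn)==292:
        return False
    else:
        # Note: the URL points to a .gif, but the file is saved as .jpg.
        f=open(name+'.jpg','wb')
        f.write(nn)
        f.close()
        print name
        return True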
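savepic stores data fetched from a .gif URL under a .jpg name. One hedged way to keep the extension honest is to look at the first bytes of the payload. The Python 3 sketch below does that. The helpers guess_extension and save_image are hypothetical, and the byte strings in the usage lines are fabricated minimal headers, not real images.

```python
import os
import tempfile

def guess_extension(data):
    """Guess an image extension from the leading magic bytes."""
    if data[:6] in (b'GIF87a', b'GIF89a'):
        return '.gif'
    if data[:3] == b'\xff\xd8\xff':
        return '.jpg'
    if data[:8] == b'\x89PNG\r\n\x1a\n':
        return '.png'
    return None  # unknown payload, e.g. an HTML error page

def save_image(data, directory, name):
    """Write data to directory/name.<ext>; return the path, or None."""
    ext = guess_extension(data)
    if ext is None:
        return None
    path = os.path.join(directory, name + ext)
    with open(path, 'wb') as f:
        f.write(data)
    return path

out_dir = tempfile.mkdtemp()
saved = save_image(b'GIF89a' + b'\x00' * 10, out_dir, '1000001_1')
skipped = save_image(b'<html>not found</html>', out_dir, '1000001_2')
print(saved, skipped)
```

Returning None for an unrecognized payload could also stand in for the fixed length check, since an error page would not start with an image signature.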

The full program is as follows:

import urllib
import re
import os

strs=''

def savepic(url,name):
    conn=urllib.urlopen(url)
    nn=conn.read()
    if len(nn)==292:
        return False
   
    else:
        f=open(name+'.jpg','wb')
        f.write(nn)
        f.close()
        print name 
        return True

def paperpic(articleNum,inum):
    print 'article'+articleNum
    link='http://ieeexplore.ieee.org/ielx5/2945/'+inum+'/'+articleNum+'/html/img/'+articleNum+'-fig-'
    i=1
    while True:
        if savepic(link+str(i)+"-large.gif",articleNum+"_"+str(i)):
            i=i+1
            continue
        else:
            break
        
def readyearlink(url):
    text=''
    for line in open(url):
        text=text+line
    
    reg="<ul id=\"pi-(.*?)\" style=\".*?\">(.*?)</ul>"
    thedata=re.compile(reg).findall(text)
    return thedata
    
def setdir(year):
    path="D:/down/"+year
    if not os.path.exists(path):
        os.makedirs(path)
    os.chdir(path)
    os.getcwd()
    global strs
    strs=""
    
def deallist(link,inum):
    global strs
    conn=urllib.urlopen(link)
    nn=conn.read()
    if len(nn)==292:
        return False
    else:
        reg="<span id=\"art-abs-title-(.*?)\">(.*?)</span>"
        articles=re.compile(reg).findall(nn)
        for article in articles:
            title=article[1]
            articleNum=article[0]
            strs=strs+articleNum+','+title+','+'http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber='+articleNum+'\n'
            paperpic(articleNum,inum)
         
def dealissue(data):
    year=data[0]
    print year
    reg="<li> <a href=\".*?isnumber=(.*?)&.*?Issue:\s*(.*?)\s*</a> </li>"
    thedata=re.compile(reg).findall(data[1])
    print thedata
    for plink in thedata:
        setdir(year+'_'+plink[1])
        link='http://ieeexplore.ieee.org'+'/xpl/tocresult.jsp?isnumber='+plink[0]+'&punumber=2945'
        deallist(link,plink[0])
        global strs
        f=open('readme.csv','wb')
        f.write(strs)
        f.close()
        
    
url="D:\\down/h.html"
ll=readyearlink(url)
for item in ll:
    dealissue(item)

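What I learned from this image-scraping exercise:

urllib.urlopen loads a web page.

(.*?) and .*? match any string; \s matches a whitespace character, and \s* matches zero or more whitespace characters.

A global variable must be declared with global inside the def that assigns to it.

Write the code layer by layer, with each function wrapping the next.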
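The point about global can be seen in a few lines. This Python 3 sketch uses made-up names (log, reset_log, append_line, line_count): a function that rebinds a module-level name needs a global declaration, while one that only reads it does not.

```python
log = ''

def reset_log():
    global log          # needed: this function assigns to the module-level name
    log = ''

def append_line(line):
    global log          # needed: += rebinds the name
    log += line + '\n'

def line_count():
    # Reading only, so no global declaration is required.
    return log.count('\n')

reset_log()
append_line('1000001,Some title')
append_line('1000002,Another title')
print(line_count())  # 2
```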